[Computer Architecture] Pipelining

Previously in CS211

  • Iron law of performance:
    • time/program = insts/program × cycles/inst × time/cycle (worked example after this list)
  • Classic 5-stage RISC pipeline
  • Structural, data, and control hazards
  • Structural hazards handled with interlock or more hardware
  • Data hazards include RAW, WAR, WAW
    • Handle data hazards with interlock, bypass, or speculation
  • Control hazards (branches, interrupts) are the most difficult, as they change which instruction comes next
    • Branch prediction commonly used
  • Precise traps: stop cleanly on one instruction, all previous instructions completed, no following instructions have changed architectural state
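
A quick worked instance of the iron law, with made-up numbers: a \(10^9\)-instruction program at 1.5 cycles/inst on a 2 GHz machine (0.5 ns/cycle) takes
\[\frac{time}{program} = 10^9 \times 1.5 \times 0.5\,\text{ns} = 0.75\,\text{s}\]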
Trap

An external or internal event that needs to be processed by another (system) program. The event is usually unexpected or rare from the program's point of view.

Handler

  • Save \(EPC\) before enabling interrupts, to allow nested interrupts
    • needs an instruction to move \(EPC\) into a GPR
    • needs a way to mask further interrupts
  • Needs to read a status register that indicates the cause of the trap
  • Needs an indirect jump instruction, \(ERET\), to return from the handler

Synchronous Trap

  • A synchronous trap is caused by an exception on a particular instruction
  • In general, the instruction cannot be completed and needs to be restarted after the exception has been handled
    • This requires undoing the effect of one or more partially executed instructions
  • In the case of a system call trap, the instruction is considered to have completed
    • A system call is a special jump instruction that involves a change to a privileged mode

Speculating on Exceptions

  • Prediction mechanism
    • Exceptions are rare, so simply predicting "no exceptions" is very accurate!
  • Check prediction mechanism
    • Exceptions are detected at the end of the instruction execution pipeline; special hardware for various exception types
  • Recovery mechanism
    • Only write architectural state at the commit point, so partially executed instructions can be thrown away after an exception
    • Launch the exception handler after flushing the pipeline
  • Bypassing allows following instructions to use uncommitted instruction results

Compile for Tianhe

without parallel

environment

module load zlib/1.2.6
module load Intel_compiler/13.0
module load autoconf/2.69

Install procedure: automake, libtool & autoconf.
Then install omp 4.0.5.

compile omp

export HDF5=${HOME}/CESM/hdf5-intel
export PATH=$HDF5/bin:$PATH
export LD_LIBRARY_PATH=$HDF5/lib:$LD_LIBRARY_PATH
export INCLUDE=$HDF5/include:$INCLUDE
export CPPFLAGS=-I$HDF5/include
export LDFLAGS=-L$HDF5/lib

netcdf-c

CC=icc CXX=icpc FC=ifort ./configure --prefix=${HOME}/CESM/netcdf-c-intel --enable-parallel-tests --disable-dap

with parallel

The machine's own (pre-installed) version cannot be used.
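
The notes do not record the exact parallel configure invocation; a sketch consistent with the summary below (mpicc from the Tianhe MPI module, debug build mode, the hdf5-intel-parallel prefix; the flags are my assumption):

CC=/THL4/software/mpi/mpi-intel2013/bin/mpicc ./configure \
    --prefix=${HOME}/CESM/hdf5-intel-parallel \
    --enable-parallel --enable-build-mode=debug
make -j8 && make install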

            SUMMARY OF THE HDF5 CONFIGURATION
            =================================

General Information:
-------------------
                   HDF5 Version: 1.12.0
                  Configured on: Tue Sep 15 00:26:24 CST 2020
                  Configured by: shlwang@ln8
                    Host system: x86_64-unknown-linux-gnu
              Uname information: Linux ln8 2.6.32-358.11.1.2.ky3.1.x86_64 #1 SMP Mon Jul 8 13:05:58 CST 2013 x86_64 x86_64 x86_64 GNU/Linux
                       Byte sex: little-endian
             Installation point: ${HOME}/CESM/hdf5-intel-parallel

Compiling Options:
------------------
                     Build Mode: debug
              Debugging Symbols: yes
                        Asserts: yes
                      Profiling: no
             Optimization Level: debug

Linking Options:
----------------
                      Libraries: static, shared
  Statically Linked Executables:
                        LDFLAGS: -L${HOME}/CESM/hdf5-intel/lib
                     H5_LDFLAGS:
                     AM_LDFLAGS:
                Extra libraries: -lz -ldl -lm
                       Archiver: ar
                       AR_FLAGS: cr
                         Ranlib: ranlib

Languages:
----------
                              C: yes
                     C Compiler: /THL4/software/mpi/mpi-intel2013/bin/mpicc ( MPICH version 3.0.4 built with icc version 13.0.0 (gcc version 4.7.0 compatibility))
                       CPPFLAGS: -I${HOME}/CESM/hdf5-intel/include
                    H5_CPPFLAGS: -D_GNU_SOURCE -D_POSIX_C_SOURCE=200809L   -UNDEBUG -DH5AC_DEBUG -DH5B2_DEBUG -DH5CX_DEBUG -DH5D_DEBUG -DH5F_DEBUG -DH5HL_DEBUG -DH5I_DEBUG -DH5O_DEBUG -DH5S_DEBUG -DH5ST_DEBUG -DH5T_DEBUG -DH5Z_DEBUG -DH5_DEBUG_API
                    AM_CPPFLAGS:
                        C Flags:
                     H5 C Flags:   -std=c99 -Wall -Wcheck  -g  -O0
                     AM C Flags:
               Shared C Library: yes
               Static C Library: yes


                        Fortran: no

                            C++: no

                           Java: no


Features:
---------
                   Parallel HDF5: yes
Parallel Filtered Dataset Writes: yes
              Large Parallel I/O: yes
              High-level library: yes
                Build HDF5 Tests: yes
                Build HDF5 Tools: yes
                    Threadsafety: no
             Default API mapping: v112
  With deprecated public symbols: yes
          I/O filters (external): deflate(zlib)
                             MPE:
                   Map (H5M) API: no
                      Direct VFD: no
              (Read-Only) S3 VFD: no
            (Read-Only) HDFS VFD: no
                         dmalloc: no
  Packages w/ extra debug output: AC,B2,CX,D,F,HL,I,O,S,ST,T,Z
                     API tracing: yes
            Using memory checker: no
 Memory allocation sanity checks: yes
          Function stack tracing: no
       Strict file format checks: yes
    Optimization instrumentation: yes

Compiled MPI-IO for HDF5

Compiled PnetCDF for HDF5

CC=icc CXX=icpc FC=ifort ./configure --prefix=${HOME}/CESM/netcdf-c-intel --enable-parallel-tests --disable-dap

# NetCDF C Configuration Summary
==============================

# General
-------
NetCDF Version:         4.7.4
Dispatch Version:       2
Configured On:          Tue Sep 15 00:51:19 CST 2020
Host System:            x86_64-pc-linux-gnu
Build Directory:        ${HOME}/CESM2/netcdf-c-4.7.4
Install Prefix:         ${HOME}/CESM/netcdf-c-intel-parallel

# Compiling Options
-----------------
C Compiler:             /THL4/software/mpi/mpi-intel2013/bin/mpicc
CFLAGS:
CPPFLAGS:               -I${HOME}/CESM/hdf5-intel-parallel/include
LDFLAGS:                -L${HOME}/CESM/hdf5-intel-parallel/lib
AM_CFLAGS:
AM_CPPFLAGS:
AM_LDFLAGS:
Shared Library:         yes
Static Library:         yes
Extra libraries:        -lhdf5_hl -lhdf5 -lm -lz

# Features
--------
NetCDF-2 API:           yes
HDF4 Support:           no
HDF5 Support:           yes
NetCDF-4 API:           yes
NC-4 Parallel Support:  yes
PnetCDF Support:        no
DAP2 Support:           no
DAP4 Support:           no
Byte-Range Support:     no
Diskless Support:       yes
MMap Support:           no
JNA Support:            no
CDF5 Support:           yes
ERANGE Fill Support:    no
Relaxed Boundary Check: yes
SZIP Support:           no
SZIP Write Support:     no
Parallel Filters:       yes

Modify the XML (credit: MrXun_)

$ vim Macros.make
# Modify the following
MPIFC := mpiif90
MPICC := mpiicc
MPICXX := mpiicxx
# Add the following after the endif
NETCDF_PATH := <your NetCDF install path>
HDF5_PATH := <your HDF5 install path>

SLIBS += -L${NETCDF_PATH}/lib -lnetcdff -lnetcdf -L${HDF5_PATH}/lib -lhdf5_hl -lhdf5 -ldl -lm -lz -lcurl -L${MKLROOT}/lib/intel64 -lmkl_rt -lpthread -ldl
LDFLAGS += -L${NETCDF_PATH}/lib -lnetcdff -lnetcdf -L${HDF5_PATH}/lib -lhdf5_hl -lhdf5 -ldl -lm -lz -lcurl -L${MKLROOT}/lib/intel64 -lmkl_rt -lpthread -ldl

# Comment out the code below
#ifeq ($(MPILIB),mvapich2)
#SLIBS := $(SLIBS) #-mkl=cluster
#endif
#ifeq ($(MPILIB),mpich2)
#SLIBS := $(SLIBS) #-mkl=cluster
#endif
#ifeq ($(MPILIB),mpt)
#SLIBS := $(SLIBS) #-mkl=cluster
#endif
#ifeq ($(MPILIB),openmpi)
#SLIBS := $(SLIBS) #-mkl=cluster
#endif
#ifeq ($(MPILIB),mpich)
#SLIBS := $(SLIBS) #-mkl=cluster
#endif
#ifeq ($(MPILIB),mvapich)
#SLIBS := $(SLIBS) #-mkl=cluster
#endif
#ifeq ($(MPILIB),impi)
#SLIBS := $(SLIBS) #-mkl=cluster
#endif
#ifeq ($(MPILIB),mpi-serial)
#SLIBS := $(SLIBS) #-mkl
#endif

How to create a case

./create_newcase --case /data/gpfs01/$USER/cases/g.e20.G1850ECO.f19_g17.test --compset G1850ECO --res f19_g17 --run-unsupported   (the case path, highlighted in blue in the original notes, should be changed to your own)
Use the xmlchange command to modify settings, e.g.:
./xmlchange NTASKS=12
./xmlchange STOP_OPTION=nmonths,STOP_N=5
Then setup, build, and submit the case:
./case.setup
./case.build
./case.submit

[Computer Architecture] Microcode, Instructions, ISA

lab0

Sniper: how instructions are processed; obtain a trace.
Submit the report on Gradescope.

HW1

2 weeks, 4 questions

Paper reading

one compulsory and one optional

ISA

Stack and accumulator

Stack vs. Accumulator (implicit) vs. GPR,
e.g. C \(\leftarrow\) A + B

| Stack  | Accumulator | GPR (register-memory) | GPR (load-store) |
|--------|-------------|-----------------------|------------------|
| Push A | Load A      | Load R1, A            | Load R1, A       |
| Push B | Add B       | Add R3, R1, B         | Load R2, B       |
| Add    | Store C     | Store R3, C           | Add R3, R1, R2   |
| Pop C  |             |                       | Store R3, C      |

Stack: no registers, but a stack

  • Pros
    • Simple model of expression evaluation (Reverse Polish Notation)
    • Short instructions, i.e., push, pop, etc.
  • Cons
    • Stack can't be randomly accessed
    • Stack is accessed on every operation, so it becomes a bottleneck

Accumulator: one register, i.e., the accumulator

  • Pros
    • Short instructions
  • Cons
    • The accumulator is only temporary storage, so memory traffic is high

CISC vs. RISC

  • CISC
    • Complex instruction set computer
    • Rep: x86
  • RISC
    • Reduced instruction set computer
    • Reps: RISC-V, MIPS, SPARC
  • Main features of RISC, in contrast to CISC
    • A large number of registers and a highly regular instruction pipeline, allowing a low number of clock cycles per instruction (CPI) for high throughput
      • SPARC and RISC-V both have 32 general-purpose integer registers
      • x86 has 8 general-purpose integer registers
    • Uniform instruction format
    • Load-store architecture
      • Only load and store instruction can access memory

Hardwired vs. Microcoded

  • Microcoded control (PLA)
    • Implemented using ROMs/RAMs
    • Indirect next_state function: “here’s how to compute next state”
    • Slower ... but can do complex instructions
    • Multi-cycle execution (of control)
  • Hardwired control
    • Implemented using logic (“hardwired” can’t re-program)
    • Direct next_state function: “here is the next state”
    • Faster ... for simple instructions (speed is function of complexity)
    • Single-cycle execution (of control)

Why Microcode

How ISAs went from RISC to CISC, and then back to RISC

Control vs. Datapath

A processor can be split into the datapath, where numbers are stored and arithmetic operations are computed, and the control, which sequences operations on the datapath.

Single-Bus Datapath for Microcoded RISC-V


Microinstructions written as register transfers:

  • MA:=PC means RegSel=PC; RegW=0; RegEn=1; MALd=1
  • B:=Reg[rs2] means RegSel=rs2; RegW=0; RegEn=1; BLd=1
  • Reg[rd]:=A+B means ALUop=Add; ALUEn=1; RegSel=rd; RegW=1
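
A toy sketch (my own illustration, not the course's datapath): each register-transfer µinstruction is just a named bundle of control-signal settings.

# Hypothetical control-signal bundles for the three µinstructions above
RT = {
    "MA:=PC":       dict(RegSel="PC",  RegW=0, RegEn=1, MALd=1),
    "B:=Reg[rs2]":  dict(RegSel="rs2", RegW=0, RegEn=1, BLd=1),
    "Reg[rd]:=A+B": dict(ALUop="Add",  ALUEn=1, RegSel="rd", RegW=1),
}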

Microcode Sketches

Instruction Fetch:

  • MA,A:=PC
  • PC:=A+4
  • wait for memory
  • IR:=Mem
  • dispatch on opcode

Pure ROM Implementation

  • Instruction fetch sequence: 3 common steps
  • ~12 instruction groups
  • Each group takes ~5 steps (1 for dispatch)
  • Total steps 3 + 12*5 = 63, which needs 6 bits of µPC
  • Opcode is 5 bits, ~18 control signals
  • Total size = \(2^{6+5+2}\times(6+18)=2^{13}\times 24\) bits \(\approx 24\) KiB!
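
Spelling out the arithmetic: \(2^{13}\) rows \(\times\) 24 bits/row \(= 196{,}608\) bits \(= 24{,}576\) bytes \(= 24\) KiB.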

Reduce Control Store Size

  • Reduce ROM height (#address bits)
    • Use external logic to combine input signals
    • Reduce #states by grouping opcodes
  • Reduce ROM width (#data bits)
    • Restrict µPC encoding (next, dispatch, wait on memory, …)
    • Encode control signals (vertical µcoding, nanocoding)

From microcoded to hardcoded -> hardwired control

Horizontal vs. Vertical µCode

  • Horizontal µcode has wider µinstructions
    • Multiple parallel operations per µinstruction
    • Fewer microcode steps per macroinstruction
    • Sparser encoding ⇒ more bits
  • Vertical µcode has narrower µinstructions
    • Typically a single datapath operation per µinstruction
      • separate µinstruction for branches
    • More microcode steps per macroinstruction
    • More compact ⇒ fewer bits, but gives up the parallelism of horizontal µcode
  • Nanocoding
    • Tries to combine best of horizontal and vertical µcode

WCS (writable control store): programmable microcode

ROP

Return-Oriented Programming: a buffer-overflow attack technique.

Turing complete: programs written in the system can express a solution to any computable problem (though with no guarantee on runtime or memory).

[Database] SQL

Admin
Grading

  • Homework: 20%
  • Quizzes: 10%
  • Course project: 25%
  • Midterm: 20%
  • Final exam: 25%

Reason to have a Database

Utility

  • Data processing backs essentially every app
  • Databases of one form or another back most apps
  • The principles taught in this class back nearly everything in computing

Centrality

  • Data is at the center of modern society.
  • Unprecedented in its nature and significance
    • Particular and voluminous
    • Often asymmetric
      • low value in isolation, high value when aggregated
    • Difficult to protect
  • The infrastructure determines what’s possible

[Database] SQL Cont.

Recitation - Room 1A106 - Monday, periods 8-9

Relational databases

  • A container stores a collection of relations
    • Relation
      • Schema - fixed
      • Instance - changes often
        • A multiset
    • Tuple
    • Attribute
    • DDL
      • e.g., the Sailors table
    • DML
      • With GROUP BY, every selected column must appear in the GROUP BY list or be inside an aggregate function, as in the query below
SELECT S.dept, AVG(S.gpa), COUNT(*)
FROM Students S
WHERE S.gender = 'F'
GROUP BY S.dept
HAVING COUNT(*) >= 2
ORDER BY S.dept;

Distinct Aggregate

SELECT COUNT(DISTINCT S.name)
FROM Students S
WHERE S.dept = 'CS';

→ 2 (the number of distinct names)

SELECT DISTINCT COUNT(S.name)
FROM Students S
WHERE S.dept = 'CS';

→ 10 (COUNT of all names; DISTINCT applies to the single-row result, so it has no effect)

e.g.

SELECT S.name, AVG(S.gpa)
FROM Students S
GROUP BY S.dept;

Invalid: each dept group contains multiple names. S.name must appear in GROUP BY or be wrapped in an aggregate (coercion) function.

SUMMARY for SQL1

  • Relational model has well-defined query semantics
  • Modern SQL extends the "pure" relational model (some extra goodies for duplicate rows and non-atomic types... more in the next lecture)
  • Typically, many ways to write a query
    • DBMS figures out a fast way to execute a query, regardless of how it is written.

DML: multiple tables

SELECT [DISTINCT] <column expression list>
FROM <single table>
[WHERE <predicate>]
[GROUP BY <column list>
  [HAVING <predicate>]]
[ORDER BY <column list>]
[LIMIT <integer>];

SELECT returns a collection (multiset) of rows.

Join queries

SELECT [DISTINCT] <column expression list>
FROM <table1 [AS t1], ..., tableN [AS tn]>
[WHERE <predicate>]
[GROUP BY <column list>
  [HAVING <predicate>]]
[ORDER BY <column list>];

Cross (Cartesian) product

All pairs of tuples, concatenated

Use WHERE to filter, e.g., the info of sailors who have reservation records:

SELECT S.sid, S.sname, R.bid
FROM Sailors AS S, Reserves AS R
WHERE S.sid = R.sid -- foreign key matches primary key

Conceptually: form all pairs, then filter rows, then project columns.

AS provides column names and table aliases

self-join

AS can also name columns of the product.

Aliases are a must in this case (the same table appears twice).

Arithmetic Expressions
SELECT S.age, S.age-5 AS age1, 2*S.age AS age2
FROM Sailors AS S
WHERE 2*S.age = S.age - 1

SQL calculator

SELECT
  log(1000) AS three,
  exp(ln(2)) AS two,
  cos(0) AS one,
  ln(2*3) = ln(2) + ln(3) AS sanity;

Returns a single row with four columns.

string comparisons

Old School SQL (Like)

SELECT S.sname
FROM Sailors S
WHERE S.sname LIKE 'B_%'

Standard Regular Expressions

SELECT S.sname
FROM Sailors S 
WHERE S.sname ~ 'B.*'

combining Predicates

Subtle connections between:

  • Boolean logic in WHERE (i.e., AND, OR)
  • Traditional Set operations (i.e. INTERSECT, UNION)

Sid’s of sailors who reserved a red OR a green boat.

SELECT R.sid
FROM Boats B, Reserves R 
WHERE R.bid=B.bid AND (B.color='red' OR B.color='green')

Sid’s of sailors who reserved a red AND a green boat: combine the two single-color queries with INTERSECT (the UNION ALL shown below gives the OR version).

SELECT R.sid
FROM Boats B, Reserves R
WHERE R.bid=B.bid AND B.color='red'

UNION ALL

SELECT R.sid
FROM Boats B, Reserves R
WHERE R.bid=B.bid AND B.color='green'

UNION ALL ~ OR set operation
INTERSECT ~ AND set operation (see the sketch below)
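
Written out with INTERSECT (same tables as above), the AND query becomes:

SELECT R.sid
FROM Boats B, Reserves R
WHERE R.bid = B.bid AND B.color = 'red'
INTERSECT
SELECT R.sid
FROM Boats B, Reserves R
WHERE R.bid = B.bid AND B.color = 'green';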

set semantics

  • Set: a collection of distinct elements
  • Standard ways of manipulating / combining sets
    • Union
    • Intersect
    • Except
  • Treat tuples within a relation as elements of a set

Set operations default to set semantics (use ALL for multiset semantics)

Relational tables

|          | SQL                      | Relation  |
|----------|--------------------------|-----------|
| Elements | set/multiset             | set       |
| Ordering | can be ordered (ORDER BY)| unordered |

A table is just an interpretation of the relation.

Relational Division

\(A = \langle x, y \rangle\)
\(B = \langle y \rangle\)
\(A / B = \langle x \rangle\)

Relational division: "Find sailors who've reserved all boats." Said differently: "sailors for whom there is no counterexample of a missing boat."

SELECT S.sname
FROM Sailors S
WHERE NOT EXISTS (
  SELECT B.bid
  FROM Boats B
  WHERE NOT EXISTS (
    SELECT R.bid
    FROM Reserves R
    WHERE R.bid = B.bid AND R.sid = S.sid));

ARGMAX

Retrieving the max together with its other attributes:

SELECT *
FROM Sailors S
WHERE S.rating >= ALL
  (SELECT S2.rating FROM Sailors S2);

SELECT *
FROM Sailors S
WHERE S.rating =
  (SELECT MAX(S2.rating) FROM Sailors S2);

Both are okay; the second is not stable and does not produce ties.

INNER JOIN

Reduces the ANDs needed with a raw cross product:

SELECT s.*, r.bid
FROM Sailors s, Reserves r
WHERE s.sid = r.sid
  AND ...

SELECT s.*, r.bid
FROM Sailors s INNER JOIN Reserves r ON s.sid = r.sid
WHERE ...

Join variants:

SELECT <column expression list>
FROM table_name
  [INNER | NATURAL | {LEFT | RIGHT | FULL} [OUTER]] JOIN table_name
  ON <qualification_list>
WHERE ...;

The shaded options above mark the available join variants.

Left Outer Join

SELECT s.sid, s.sname, r.bid
FROM Sailors2 s LEFT OUTER JOIN Reserves2 r
ON s.sid = r.sid;

Unmatched rows are padded with NULL; the row count matches Sailors.

SELECT r.sid, b.bid, b.bname
FROM Reserves2 r RIGHT OUTER JOIN Boats2 b
ON r.bid = b.bid

Unmatched rows are padded with NULL; the row count matches Boats.

NATURAL & INNER

SELECT s.sid, s.sname, r.bid
FROM Sailors s, Reserves r
WHERE s.sid = r.sid AND s.age > 20;

SELECT s.sid, s.sname, r.bid
FROM Sailors s INNER JOIN Reserves r ON s.sid = r.sid
WHERE s.age > 20;

SELECT s.sid, s.sname, r.bid
FROM Sailors s NATURAL JOIN Reserves r
WHERE s.age > 20;

These are equivalent: NATURAL JOIN equi-joins on all same-named columns, and inner joins only output rows where the join condition holds.

FULL OUTER JOIN

Returns all rows from both tables: sailor columns are set to NULL where no matching sailor exists, and boat columns (b.bname, b.bid) are set to NULL where no matching boat exists.

Views

Named Queries

  • Makes development simpler
  • Often used for security
  • Not Materialized
CREATE VIEW Redcount AS
SELECT B.bid, COUNT(*) AS scount
FROM Boats2 B, Reserves2 R
WHERE R.bid = B.bid AND B.color = 'red'
GROUP BY B.bid;

Subqueries in FROM

Like a “view on the fly”!

SELECT bname, scount
FROM Boats2 B,
  (SELECT B.bid, COUNT(*)
   FROM Boats2 B, Reserves2 R
   WHERE R.bid = B.bid AND B.color = 'red'
   GROUP BY B.bid) AS Reds(bid, scount)
WHERE Reds.bid = B.bid AND scount < 10;

Common table expression (WITH)

another view on the fly

WITH Reds(bid, scount) AS
  (SELECT B.bid, COUNT(*)
   FROM Boats2 B, Reserves2 R
   WHERE R.bid = B.bid AND B.color = 'red'
   GROUP BY B.bid)
SELECT bname, scount
FROM Boats2 B, Reds
WHERE Reds.bid = B.bid AND scount < 10;

ARGMAX GROUP BY

The sailor with the highest rating per age

WITH maxratings(age, maxrating) AS
  (SELECT age, max(rating)
   FROM Sailors
   GROUP BY age)
SELECT S.*
FROM Sailors S, maxratings m
WHERE S.age = m.age
  AND S.rating = m.maxrating;

NULL values

  • Field values are sometimes unknown
    – SQL provides a special value NULL for such situations.
    – Every data type can be NULL
  • The presence of null complicates many issues. E.g.:
    – Selection predicates (WHERE)
    – Aggregation
  • But NULLs come naturally from outer joins

NULL op x is NULL
Explicit NULL checks

SELECT * FROM sailors WHERE rating IS NOT NULL;

If the WHERE predicate evaluates to NULL, the row is not output.

NULL and Aggregation

How to format with Python logging.Formatter

I'm currently trying to center the log-level field in a Python logger, so the output looks like:

[    test_log    ][    DEBUG]  test (color_logger.py:66)
[    test_log    ][     INFO]  test (color_logger.py:67)
[    test_log    ][  WARNING]  test (color_logger.py:68)
[    test_log    ][    ERROR]  test (color_logger.py:69)
[    test_log    ][ CRITICAL]  test (color_logger.py:70)

But it currently looks like:

[__main__][DEBUG]  test (color_logger.py:67)
[__main__][INFO]  test (color_logger.py:68)
[__main__][WARNING]  test (color_logger.py:69)
[__main__][ERROR]  test (color_logger.py:70)
[__main__][CRITICAL]  test (color_logger.py:71)

There are two problems:

  • funcName is used instead of name

  • Need to handle right-alignment and centering

Solution

import logging

BLACK, RED, GREEN, YELLOW, BLUE, MAGENTA, CYAN, WHITE = range(8)

#The background is set with 40 plus the number of the color, and the foreground with 30
#These are the sequences needed to get colored output
RESET_SEQ = "\033[0m"
COLOR_SEQ = "\033[1;%dm"
BOLD_SEQ = "\033[1m"

def formatter_message(message, use_color = True):
    if use_color:
        message = message.replace("$RESET", RESET_SEQ).replace("$BOLD", BOLD_SEQ)
    else:
        message = message.replace("$RESET", "").replace("$BOLD", "")
    return message

COLORS = {
    'WARNING': YELLOW,
    'INFO': WHITE,
    'DEBUG': BLUE,
    'CRITICAL': YELLOW,
    'ERROR': RED
}

class ColoredFormatter(logging.Formatter):
    def __init__(self, msg, use_color = True):
        logging.Formatter.__init__(self, msg)
        self.use_color = use_color

    def format(self, record):
        levelname = record.levelname
        if self.use_color and levelname in COLORS:
            levelname_color = COLOR_SEQ % (30 + COLORS[levelname]) + levelname + RESET_SEQ
            record.levelname = levelname_color
        return logging.Formatter.format(self, record)



# Custom logger class with multiple destinations
class ColoredLogger(logging.Logger):
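    # Note: .center(20) pads the literal "%(funcName)s" placeholder to width 20,
    # i.e., it pads the format string itself, not the substituted value;
    # "%(levelname)20s" right-aligns the level name in a 20-character field.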
    FORMAT = "[$BOLD" + "%(funcName)s".center(20," ")+"$RESET]["+ "%(levelname)20s" +"]  %(message)s ($BOLD%(filename)s$RESET:%(lineno)d)"
    COLOR_FORMAT = formatter_message(FORMAT, True)
    def __init__(self, name):
        logging.Logger.__init__(self, name, logging.DEBUG)                

        color_formatter = ColoredFormatter(self.COLOR_FORMAT)

        console = logging.StreamHandler()
        console.setFormatter(color_formatter)

        self.addHandler(console)
        return
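
A minimal usage sketch (my own, not from the original answer):

logging.setLoggerClass(ColoredLogger)
logger = logging.getLogger("test_log")
logger.debug("test")
logger.warning("test")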

On std::sort: is the open-source version we can read the fastest version that actually runs?

https://godbolt.org/z/jxoYx7

We can clearly see that these all compile to a `call` into the library. The source I've read includes the introsort-based gnu libcxx and llvm's bucket-sort version, but Linux distros ship prebuilt binaries, and every company patches its own code. dhz's question caught me out today; my C++ is still not good enough.

zlw once asked me about this: std::sort is very fast, yet you can't see the source it actually runs, which didn't match my understanding. I've never debugged glibc, and the libcxx.so source may differ from what ships. Who knows?

It would be great if glibc could inject some odd tricks like that.

The Linux team's motto: make our guest quicker.

Parallel Computing Final Project: FineTunedHashML, Spring 2020

The final repo is stored on GitLab; we added some hash functions to the original algorithm.

Dear Professor,
I'm Yiwei Victor Yang, 2018533218, from your parallel computing class.
My teammate and I are struggling to pick a project. I proposed porting a CPU LSH deep-learning system (https://github.com/keroro824/HashingDeepLearning/tree/master/SLIDE), a technique newly proposed by Rice University and Intel that uses the AVX-512 instruction set and huge pages in locality-sensitive hashing, which is the project's main overhead (https://arxiv.org/pdf/1903.03129.pdf).

SLIDE is a project that reimplements backpropagation on the CPU. After LSH pre-processing, the paper proposes maximum inner product search (MIPS), which mainly does sparse matrix operations: sparse backpropagation and gradient updates.

After updating the weights stored in CSR format (SpMV), the paper does batch computation with OpenMP.

As the research in the paper shows, the main overhead is the pre-processing on the CPU, with AVX-512 and huge-page cache/heap optimizations enabled, which is around 30% faster than a raw Tesla V100 GPU. If we port that part (dynamic hashing) to the GPU, which the Intel authors say is not applicable, I think there may be a 50 percent chance it beats the current CPU version.
As one of the paper's references puts it, a similar system gained roughly a 2x speedup after porting to CUDA, which conflicts with that claim: https://github.com/src-d/minhashcuda

Our proposal is to compare raw TensorFlow on GPU, SLIDE on AVX-512, and SLIDE on GPU, using the dataset mentioned in the paper.

My question is: do you think porting SLIDE's dynamic cuckoo hashing from the CPU can achieve a real speedup over the original version? And if we attempt the port and get little speedup, will we still get the score?

Thanks a lot!

[RL] Monte-Carlo Methods

Monte-Carlo RL

  1. MC learn directly from episodes of experience.
  2. MC is model-free. No knowledge of MDP transitions or rewraeds.
  3. MC learns fro complete episodes. no bootstrapping
    1. Here Bootstrapping means Using one or more estimated values in the update step fro the same kind of estimated value.. like in \(SARSA(\lambda)\) or \(Q(\lambda)\)
    2. The single-step TD is to utilize the bootstrap by using a mix of results from different length trajectories.
  4. MC uses the simplest possible idea: value = mean return
  5. 2 ways:
    1. model-free: no model necessary and still attains optimality
    2. Simulated: needs only a simulation, not a full model
  6. Caveat: can only apply MC to episodic MDPs
    1. All episodes must terminate

policy evaluation

The goal is to learn \(v_\pi\) from episodes of experience under policy \(\pi\):
\[S_1, A_1, R_2, \ldots, S_k \sim \pi\]
and the return is the total discounted reward: \(G_t = R_{t+1} + \gamma R_{t+2} + \cdots + \gamma^{T-1} R_T\).

The value function is the expected return: \(v_\pi(s) = \mathbb{E}_{\pi}[G_t \mid S_t = s]\).
P.S. Monte-Carlo policy evaluation uses the empirical mean return rather than the expected return.

Depending on which visits to a state count within an episode, the methods divide into Every-Visit and First-Visit; both converge asymptotically.

First-visit

proof of convergence


In fact it's the running mean \(V(s) = S(s)/N(s)\), computed backward over the episode, that updates \(V(s)\): \(N(s)\) counts visits and \(S(s)\) accumulates returns.
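
A minimal first-visit MC evaluation sketch (my own toy code, not the course's): an episode is a list of (state, reward) pairs, the reward being received on leaving the state.

from collections import defaultdict

def first_visit_mc(episodes, gamma=1.0):
    N = defaultdict(int)      # visit counts  N(s)
    S = defaultdict(float)    # return sums   S(s)
    V = {}
    for episode in episodes:
        # compute returns backward: G_t = R_{t+1} + gamma * G_{t+1}
        G, returns = 0.0, []
        for s, r in reversed(episode):
            G = r + gamma * G
            returns.append((s, G))
        returns.reverse()
        seen = set()
        for s, G in returns:  # only the first visit to s counts
            if s not in seen:
                seen.add(s)
                N[s] += 1
                S[s] += G
                V[s] = S[s] / N[s]
    return V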

blackjack (21 points)

Sutton's book shows the resulting blackjack value-function plots (figure not reproduced here).

Every-Visit Monte-Carlo Policy Evaluation

The difference is that every visit to a state within an episode is used.

Incremental Mean

The mean \(\mu_1, \mu_2, \ldots\) of a sequence can be computed incrementally:
\[\mu_k = \mu_{k-1} + \frac{1}{k}\left(x_k - \mu_{k-1}\right)\]

Incremental MC updates
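
The standard incremental form: after each episode, for each state \(S_t\) with return \(G_t\),
\[N(S_t) \leftarrow N(S_t) + 1\]
\[V(S_t) \leftarrow V(S_t) + \frac{1}{N(S_t)}\left(G_t - V(S_t)\right)\]
In non-stationary problems, track a running mean instead: \(V(S_t) \leftarrow V(S_t) + \alpha\left(G_t - V(S_t)\right)\).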

Monte-Carlo Estimation of Action Values

Backup Diagram for Monte-Carlo

Similar to bandits: an explore/exploit dilemma, except the return includes the entire rest of the episode; the only choice considered is the one at each state, and MC doesn't bootstrap (unlike DP).

Time required to estimate one state does not depend on the total number of states.

Temporal-Difference Learning

Bootstrapping

Saying: to lift yourself up by your own bootstraps, again and again, which is impossible.

Modern definition: a re-sampling technique that draws from the same sample again and again, which has statistical meaning.
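
A tiny illustration of the statistical sense (my own example, not from the lecture):

import random

data = [2.0, 3.5, 1.0, 4.2, 2.8]
boot_means = []
for _ in range(1000):
    resample = [random.choice(data) for _ in data]  # draw with replacement
    boot_means.append(sum(resample) / len(resample))
# the spread of boot_means estimates the sampling variability of the mean
print(min(boot_means), max(boot_means))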

intro

  1. TD methods learn directly from episodes of experience
  2. TD is model-free: no knowledge of MDP transitions/rewards
  3. TD learns from incomplete episodes, by bootstrapping
  4. TD updates a guess towards a guess

TD\((n)\): the number represents how many steps we look ahead.

driving home example

TD is more flexible, since MC has to wait for the final outcome before updating. On-policy vs. off-policy.




MC updates each estimate toward the actual final outcome.


We can construct the estimated MDP and a corresponding example for the AB example.

Comparison

  1. TD can learn before knowing the final outcome
    1. TD can learn online after every step (less memory & peak computation)
    2. MC must wait until end of episode before return is known
  2. TD can learn without the final outcome
    1. TD can learn from incomplete sequences
    2. MC can only learn from complete sequences
    3. TD works in continuing (non-terminating) environments
    4. MC only works in episodic (terminating) environments

Result: TD performs better on the random-walk example.


Batch MC and TD

Add a step-size parameter \(\alpha\).

Unified comparison


  • Bootstrapping: update involves an estimate
    • MC does not bootstrap
    • DP bootstraps
    • TD bootstraps
  • Sampling: the update samples an expectation
    • MC samples
    • DP does not sample
    • TD samples

n-step TD

n-Step Return: consider the following n-step returns for \(n = 1, 2, \ldots, \infty\):

\[
\begin{aligned}
n &= 1 \quad (TD) & G_t^{(1)} &= R_{t+1} + \gamma V(S_{t+1}) \\
n &= 2 & G_t^{(2)} &= R_{t+1} + \gamma R_{t+2} + \gamma^2 V(S_{t+2}) \\
& \;\;\vdots & & \;\;\vdots \\
n &= \infty \quad (MC) & G_t^{(\infty)} &= R_{t+1} + \gamma R_{t+2} + \ldots + \gamma^{T-1} R_T
\end{aligned}
\]

Define the n-step return (the general case):
\[G_t^{(n)} = R_{t+1} + \gamma R_{t+2} + \ldots + \gamma^{n-1} R_{t+n} + \gamma^n V(S_{t+n})\]

n-step temporal-difference learning:
\[V(S_t) \leftarrow V(S_t) + \alpha\left(G_t^{(n)} - V(S_t)\right)\]
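
A runnable sketch of n-step TD prediction (my own toy code: a 7-state random walk with reward 1 at the right terminal, true values \(V(s) = s/6\)):

import random

def n_step_td(V, n=2, alpha=0.1, gamma=1.0, episodes=1000):
    for _ in range(episodes):
        s = 3                            # start in the middle of states 0..6
        states, rewards = [s], [0.0]     # rewards[t] is R_t (reward on entering S_t)
        T = float('inf')
        t = 0
        while True:
            if t < T:
                s2 = s + random.choice([-1, 1])          # random-walk policy
                r = 1.0 if s2 == 6 else 0.0
                states.append(s2)
                rewards.append(r)
                if s2 in (0, 6):
                    T = t + 1                            # episode terminates
                s = s2
            tau = t - n + 1                              # time whose state gets updated
            if tau >= 0:
                G = sum(gamma ** (i - tau - 1) * rewards[i]
                        for i in range(tau + 1, min(tau + n, T) + 1))
                if tau + n < T:
                    G += gamma ** n * V[states[tau + n]]  # bootstrap from V
                V[states[tau]] += alpha * (G - V[states[tau]])
            if tau == T - 1:
                break
            t += 1
    return V

V = n_step_td({s: 0.0 for s in range(7)})
print({s: round(v, 2) for s, v in V.items()})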

References

  1. What exactly is bootstrapping in reinforcement learning?
  2. First-visit Monte Carlo policy from WSU