[Computer Architecture] Pipelining

Previously in CS211

  • Iron law of performance:
    • time/program = insts/program × cycles/inst × time/cycle (worked example after this list)
  • Classic 5-stage RISC pipeline
  • Structural, data, and control hazards
  • Structural hazards handled with interlock or more hardware
  • Data hazards include RAW, WAR, WAW
    • Handle data hazards with interlock, bypass, or speculation
  • Control hazards (branches, interrupts) are the most difficult, as they change which instruction comes next
    • Branch prediction commonly used
  • Precise traps: stop cleanly on one instruction, all previous instructions completed, no following instructions have changed architectural state
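
A quick worked instance of the iron law, with made-up numbers: a \(10^9\)-instruction program at 1.5 cycles/inst on a 2 GHz machine (0.5 ns/cycle) takes
\[\frac{time}{program} = 10^9 \times 1.5 \times 0.5\,\text{ns} = 0.75\,\text{s}\]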
Trap

An external or internal event that needs to be processed by another (system) program. The event is usually unexpected or rare from the program's point of view.

Handler

  • Save \(EPC\) before enabling interrupts, to allow nested interrupts
    • needs an instruction to move \(EPC\) into a GPR
    • needs a way to mask further interrupts
  • Needs to read a status register that indicates the cause of the trap
  • Needs an indirect jump instruction, \(ERET\), to return from the handler

Synchronous Trap

  • A synchronous trap is caused by an exception on a particular instruction
  • In general, the instruction cannot be completed and needs to be restarted after the exception has been handled
    • This requires undoing the effect of one or more partially executed instructions
  • In the case of a system call trap, the instruction is considered to have completed
    • A system call is a special jump instruction that involves a change to a privileged mode

Speculating on Exceptions

  • Prediction mechanism
    • Exceptions are rare, so simply predicting "no exceptions" is very accurate!
  • Check prediction mechanism
    • Exceptions are detected at the end of the instruction execution pipeline; special hardware for various exception types
  • Recovery mechanism
    • Only write architectural state at the commit point, so partially executed instructions can be thrown away after an exception
    • Launch the exception handler after flushing the pipeline
  • Bypassing allows following instructions to use uncommitted instruction results

Compile for Tianhe

without parallel

environment

module load zlib/1.2.6
module load Intel_compiler/13.0
module load autoconf/2.69

Install procedure: automake, libtool & autoconf.
Then install omp 4.0.5.

compile omp

export HDF5=${HOME}/CESM/hdf5-intel
export PATH=$HDF5/bin:$PATH
export LD_LIBRARY_PATH=$HDF5/lib:$LD_LIBRARY_PATH
export INCLUDE=$HDF5/include:$INCLUDE
export CPPFLAGS=-I$HDF5/include
export LDFLAGS=-L$HDF5/lib

netcdf-c

CC=icc CXX=icpc FC=ifort ./configure --prefix=${HOME}/CESM/netcdf-c-intel --enable-parallel-tests --disable-dap

with parallel

The machine's own (pre-installed) version cannot be used.
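
The notes do not record the exact parallel configure invocation; a sketch consistent with the summary below (mpicc from the Tianhe MPI module, debug build mode, the hdf5-intel-parallel prefix; the flags are my assumption):

CC=/THL4/software/mpi/mpi-intel2013/bin/mpicc ./configure \
    --prefix=${HOME}/CESM/hdf5-intel-parallel \
    --enable-parallel --enable-build-mode=debug
make -j8 && make install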

            SUMMARY OF THE HDF5 CONFIGURATION
            =================================

General Information:
-------------------
                   HDF5 Version: 1.12.0
                  Configured on: Tue Sep 15 00:26:24 CST 2020
                  Configured by: shlwang@ln8
                    Host system: x86_64-unknown-linux-gnu
              Uname information: Linux ln8 2.6.32-358.11.1.2.ky3.1.x86_64 #1 SMP Mon Jul 8 13:05:58 CST 2013 x86_64 x86_64 x86_64 GNU/Linux
                       Byte sex: little-endian
             Installation point: ${HOME}/CESM/hdf5-intel-parallel

Compiling Options:
------------------
                     Build Mode: debug
              Debugging Symbols: yes
                        Asserts: yes
                      Profiling: no
             Optimization Level: debug

Linking Options:
----------------
                      Libraries: static, shared
  Statically Linked Executables:
                        LDFLAGS: -L${HOME}/CESM/hdf5-intel/lib
                     H5_LDFLAGS:
                     AM_LDFLAGS:
                Extra libraries: -lz -ldl -lm
                       Archiver: ar
                       AR_FLAGS: cr
                         Ranlib: ranlib

Languages:
----------
                              C: yes
                     C Compiler: /THL4/software/mpi/mpi-intel2013/bin/mpicc ( MPICH version 3.0.4 built with icc version 13.0.0 (gcc version 4.7.0 compatibility))
                       CPPFLAGS: -I${HOME}/CESM/hdf5-intel/include
                    H5_CPPFLAGS: -D_GNU_SOURCE -D_POSIX_C_SOURCE=200809L   -UNDEBUG -DH5AC_DEBUG -DH5B2_DEBUG -DH5CX_DEBUG -DH5D_DEBUG -DH5F_DEBUG -DH5HL_DEBUG -DH5I_DEBUG -DH5O_DEBUG -DH5S_DEBUG -DH5ST_DEBUG -DH5T_DEBUG -DH5Z_DEBUG -DH5_DEBUG_API
                    AM_CPPFLAGS:
                        C Flags:
                     H5 C Flags:   -std=c99 -Wall -Wcheck  -g  -O0
                     AM C Flags:
               Shared C Library: yes
               Static C Library: yes


                        Fortran: no

                            C++: no

                           Java: no


Features:
---------
                   Parallel HDF5: yes
Parallel Filtered Dataset Writes: yes
              Large Parallel I/O: yes
              High-level library: yes
                Build HDF5 Tests: yes
                Build HDF5 Tools: yes
                    Threadsafety: no
             Default API mapping: v112
  With deprecated public symbols: yes
          I/O filters (external): deflate(zlib)
                             MPE:
                   Map (H5M) API: no
                      Direct VFD: no
              (Read-Only) S3 VFD: no
            (Read-Only) HDFS VFD: no
                         dmalloc: no
  Packages w/ extra debug output: AC,B2,CX,D,F,HL,I,O,S,ST,T,Z
                     API tracing: yes
            Using memory checker: no
 Memory allocation sanity checks: yes
          Function stack tracing: no
       Strict file format checks: yes
    Optimization instrumentation: yes

Compiled MPI-IO for HDF5

Compiled PnetCDF for HDF5

CC=icc CXX=icpc FC=ifort ./configure --prefix=${HOME}/CESM/netcdf-c-intel --enable-parallel-tests --disable-dap

# NetCDF C Configuration Summary
==============================

# General
-------
NetCDF Version:         4.7.4
Dispatch Version:       2
Configured On:          Tue Sep 15 00:51:19 CST 2020
Host System:            x86_64-pc-linux-gnu
Build Directory:        ${HOME}/CESM2/netcdf-c-4.7.4
Install Prefix:         ${HOME}/CESM/netcdf-c-intel-parallel

# Compiling Options
-----------------
C Compiler:             /THL4/software/mpi/mpi-intel2013/bin/mpicc
CFLAGS:
CPPFLAGS:               -I${HOME}/CESM/hdf5-intel-parallel/include
LDFLAGS:                -L${HOME}/CESM/hdf5-intel-parallel/lib
AM_CFLAGS:
AM_CPPFLAGS:
AM_LDFLAGS:
Shared Library:         yes
Static Library:         yes
Extra libraries:        -lhdf5_hl -lhdf5 -lm -lz

# Features
--------
NetCDF-2 API:           yes
HDF4 Support:           no
HDF5 Support:           yes
NetCDF-4 API:           yes
NC-4 Parallel Support:  yes
PnetCDF Support:        no
DAP2 Support:           no
DAP4 Support:           no
Byte-Range Support:     no
Diskless Support:       yes
MMap Support:           no
JNA Support:            no
CDF5 Support:           yes
ERANGE Fill Support:    no
Relaxed Boundary Check: yes
SZIP Support:           no
SZIP Write Support:     no
Parallel Filters:       yes

Modify the XML (credit: MrXun_)

$ vim Macros.make
# Modify the following
MPIFC := mpiif90
MPICC := mpiicc
MPICXX := mpiicxx
# Add the following after the endif
NETCDF_PATH := <your NetCDF install path>
HDF5_PATH := <your HDF5 install path>

SLIBS += -L${NETCDF_PATH}/lib -lnetcdff -lnetcdf -L${HDF5_PATH}/lib -lhdf5_hl -lhdf5 -ldl -lm -lz -lcurl -L${MKLROOT}/lib/intel64 -lmkl_rt -lpthread -ldl
LDFLAGS += -L${NETCDF_PATH}/lib -lnetcdff -lnetcdf -L${HDF5_PATH}/lib -lhdf5_hl -lhdf5 -ldl -lm -lz -lcurl -L${MKLROOT}/lib/intel64 -lmkl_rt -lpthread -ldl

# Comment out the code below
#ifeq ($(MPILIB),mvapich2)
#SLIBS := $(SLIBS) #-mkl=cluster
#endif
#ifeq ($(MPILIB),mpich2)
#SLIBS := $(SLIBS) #-mkl=cluster
#endif
#ifeq ($(MPILIB),mpt)
#SLIBS := $(SLIBS) #-mkl=cluster
#endif
#ifeq ($(MPILIB),openmpi)
#SLIBS := $(SLIBS) #-mkl=cluster
#endif
#ifeq ($(MPILIB),mpich)
#SLIBS := $(SLIBS) #-mkl=cluster
#endif
#ifeq ($(MPILIB),mvapich)
#SLIBS := $(SLIBS) #-mkl=cluster
#endif
#ifeq ($(MPILIB),impi)
#SLIBS := $(SLIBS) #-mkl=cluster
#endif
#ifeq ($(MPILIB),mpi-serial)
#SLIBS := $(SLIBS) #-mkl
#endif

How to create a case

./create_newcase --case /data/gpfs01/$USER/cases/g.e20.G1850ECO.f19_g17.test --compset G1850ECO --res f19_g17 --run-unsupported   (the case path, highlighted in blue in the original notes, should be changed to your own)
Use the xmlchange command to modify settings, e.g.:
./xmlchange NTASKS=12
./xmlchange STOP_OPTION=nmonths,STOP_N=5
Then setup, build, and submit the case:
./case.setup
./case.build
./case.submit

[Computer Architecture] Microcode, Instructions, ISA

lab0

Sniper: how instructions are processed; obtain a trace.
Submit the report on Gradescope.

HW1

2 weeks, 4 questions

Paper reading

one compulsory and one optional

ISA

Stack and accumulator

Stack vs. Accumulator (implicit) vs. GPR,
e.g. C \(\leftarrow\) A + B

| Stack  | Accumulator | GPR (register-memory) | GPR (load-store) |
|--------|-------------|-----------------------|------------------|
| Push A | Load A      | Load R1, A            | Load R1, A       |
| Push B | Add B       | Add R3, R1, B         | Load R2, B       |
| Add    | Store C     | Store R3, C           | Add R3, R1, R2   |
| Pop C  |             |                       | Store R3, C      |

Stack: no registers, but a stack

  • Pros
    • Simple model of expression evaluation (Reverse Polish Notation)
    • Short instructions, i.e., push, pop, etc.
  • Cons
    • Stack can't be randomly accessed
    • Stack is accessed on every operation, so it becomes a bottleneck

Accumulator: one register, i.e., the accumulator

  • Pros
    • Short instructions
  • Cons
    • The accumulator is only temporary storage, so memory traffic is high

CISC vs. RISC

  • CISC
    • Complex instruction set computer
    • Rep: x86
  • RISC
    • Reduced instruction set computer
    • Reps: RISC-V, MIPS, SPARC
  • Main features of RISC, in contrast to CISC
    • A large number of registers and a highly regular instruction pipeline, allowing a low number of clock cycles per instruction (CPI) for high throughput
      • SPARC and RISC-V both have 32 general-purpose integer registers
      • x86 has 8 general-purpose integer registers
    • Uniform instruction format
    • Load-store architecture
      • Only load and store instruction can access memory

Hardwired vs. Microcoded

  • Microcoded control (PLA)
    • Implemented using ROMs/RAMs
    • Indirect next_state function: “here’s how to compute next state”
    • Slower ... but can do complex instructions
    • Multi-cycle execution (of control)
  • Hardwired control
    • Implemented using logic (“hardwired” can’t re-program)
    • Direct next_state function: “here is the next state”
    • Faster ... for simple instructions (speed is function of complexity)
    • Single-cycle execution (of control)

Why Microcode

How ISAs went from RISC to CISC, and then back to RISC

Control vs. Datapath

A processor can be split into the datapath, where numbers are stored and arithmetic operations are computed, and the control, which sequences operations on the datapath.

Single-Bus Datapath for Microcoded RISC-V


Microinstructions written as register transfers:

  • MA:=PC means RegSel=PC; RegW=0; RegEn=1; MALd=1
  • B:=Reg[rs2] means RegSel=rs2; RegW=0; RegEn=1; BLd=1
  • Reg[rd]:=A+B means ALUop=Add; ALUEn=1; RegSel=rd; RegW=1
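
A toy sketch (my own illustration, not the course's datapath): each register-transfer µinstruction is just a named bundle of control-signal settings.

# Hypothetical control-signal bundles for the three µinstructions above
RT = {
    "MA:=PC":       dict(RegSel="PC",  RegW=0, RegEn=1, MALd=1),
    "B:=Reg[rs2]":  dict(RegSel="rs2", RegW=0, RegEn=1, BLd=1),
    "Reg[rd]:=A+B": dict(ALUop="Add",  ALUEn=1, RegSel="rd", RegW=1),
}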

Microcode Sketches

Instruction Fetch:

  • MA,A:=PC
  • PC:=A+4
  • wait for memory
  • IR:=Mem
  • dispatch on opcode

Pure ROM Implementation

  • Instruction fetch sequence: 3 common steps
  • ~12 instruction groups
  • Each group takes ~5 steps (1 for dispatch)
  • Total steps 3 + 12*5 = 63, which needs 6 bits of µPC
  • Opcode is 5 bits, ~18 control signals
  • Total size = \(2^{6+5+2}\times(6+18)=2^{13}\times 24\) bits \(\approx 24\) KiB!
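
Spelling out the arithmetic: \(2^{13}\) rows \(\times\) 24 bits/row \(= 196{,}608\) bits \(= 24{,}576\) bytes \(= 24\) KiB.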

Reduce Control Store Size

  • Reduce ROM height (#address bits)
    • Use external logic to combine input signals
    • Reduce #states by grouping opcodes
  • Reduce ROM width (#data bits)
    • Restrict µPC encoding (next, dispatch, wait on memory, …)
    • Encode control signals (vertical µcoding, nanocoding)

From microcoded to hardcoded -> hardwired control

Horizontal vs. Vertical µCode

  • Horizontal µcode has wider µinstructions
    • Multiple parallel operations per µinstruction
    • Fewer microcode steps per macroinstruction
    • Sparser encoding ⇒ more bits
  • Vertical µcode has narrower µinstructions
    • Typically a single datapath operation per µinstruction
      • separate µinstruction for branches
    • More microcode steps per macroinstruction
    • More compact ⇒ fewer bits, but gives up the parallelism of horizontal µcode
  • Nanocoding
    • Tries to combine best of horizontal and vertical µcode

WCS (writable control store): programmable microcode

ROP

Return-Oriented Programming: a buffer-overflow attack technique.

Turing complete: programs written in the system can express a solution to any computable problem (though with no guarantee on runtime or memory).

[Database] SQL

Admin
Grading

  • Homework: 20%
  • Quizzes: 10%
  • Course project: 25%
  • Midterm: 20%
  • Final exam: 25%

Reason to have a Database

Utility

  • Data processing backs essentially every app
  • Databases of one form or another back most apps
  • The principles taught in this class back nearly everything in computing

Centrality

  • Data is at the center of modern society.
  • Unprecedented in its nature and significance
    • Particular and voluminous
    • Often asymmetric
      • low value in isolation, high value when aggregated
    • Difficult to protect
  • The infrastructure determines what’s possible

[Database] SQL Cont.

Recitation - Room 1A106 - Monday, periods 8-9

Relational databases

  • A container stores a collection of relations
    • Relation
      • Schema - fixed
      • Instance - changes often
        • A multiset
    • Tuple
    • Attribute
    • DDL
      • e.g., the Sailors table
    • DML
      • With GROUP BY, every selected column must appear in the GROUP BY list or be inside an aggregate function, as in the query below
SELECT S.dept, AVG(S.gpa), COUNT(*)
FROM Students S
WHERE S.gender = 'F'
GROUP BY S.dept
HAVING COUNT(*) >= 2
ORDER BY S.dept;

Distinct Aggregate

SELECT COUNT(DISTINCT S.name)
FROM Students S
WHERE S.dept = 'CS';

→ 2 (the number of distinct names)

SELECT DISTINCT COUNT(S.name)
FROM Students S
WHERE S.dept = 'CS';

→ 10 (COUNT of all names; DISTINCT applies to the single-row result, so it has no effect)

e.g.

SELECT S.name, AVG(S.gpa)
FROM Students S
GROUP BY S.dept;

Invalid: each dept group contains multiple names. S.name must appear in GROUP BY or be wrapped in an aggregate (coercion) function.

SUMMARY for SQL1

  • Relational model has well-defined query semantics
  • Modern SQL extends the "pure" relational model (some extra goodies for duplicate rows and non-atomic types... more in the next lecture)
  • Typically, many ways to write a query
    • DBMS figures out a fast way to execute a query, regardless of how it is written.

DML: multiple tables

SELECT [DISTINCT] <column expression list>
FROM <single table>
[WHERE <predicate>]
[GROUP BY <column list>
  [HAVING <predicate>]]
[ORDER BY <column list>]
[LIMIT <integer>];

SELECT returns a collection (multiset) of rows.

Join queries

SELECT [DISTINCT] <column expression list>
FROM <table1 [AS t1], ..., tableN [AS tn]>
[WHERE <predicate>]
[GROUP BY <column list>
  [HAVING <predicate>]]
[ORDER BY <column list>];

Cross (Cartesian) product

All pairs of tuples, concatenated

Use WHERE to filter, e.g., the info of sailors who have reservation records:

SELECT S.sid, S.sname, R.bid
FROM Sailors AS S, Reserves AS R
WHERE S.sid = R.sid -- foreign key matches primary key

Conceptually: form all pairs, then filter rows, then project columns.

AS provides column names and table aliases

self-join

AS can also name columns of the product.

Aliases are a must in this case (the same table appears twice).

Arithmetic Expressions
SELECT S.age, S.age-5 AS age1, 2*S.age AS age2
FROM Sailors AS S
WHERE 2*S.age = S.age - 1

SQL calculator

SELECT
  log(1000) AS three,
  exp(ln(2)) AS two,
  cos(0) AS one,
  ln(2*3) = ln(2) + ln(3) AS sanity;

Returns a single row with four columns.

string comparisons

Old School SQL (Like)

SELECT S.sname
FROM Sailors S
WHERE S.sname LIKE 'B_%'

Standard Regular Expressions

SELECT S.sname
FROM Sailors S 
WHERE S.sname ~ 'B.*'

combining Predicates

Subtle connections between:

  • Boolean logic in WHERE (i.e., AND, OR)
  • Traditional Set operations (i.e. INTERSECT, UNION)

Sid’s of sailors who reserved a red OR a green boat.

SELECT R.sid
FROM Boats B, Reserves R 
WHERE R.bid=B.bid AND (B.color='red' OR B.color='green')

Sid’s of sailors who reserved a red AND a green boat: combine the two single-color queries with INTERSECT (the UNION ALL shown below gives the OR version).

SELECT R.sid
FROM Boats B, Reserves R
WHERE R.bid=B.bid AND B.color='red'

UNION ALL

SELECT R.sid
FROM Boats B, Reserves R
WHERE R.bid=B.bid AND B.color='green'

UNION ALL ~ OR set operation
INTERSECT ~ AND set operation (see the sketch below)
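
Written out with INTERSECT (same tables as above), the AND query becomes:

SELECT R.sid
FROM Boats B, Reserves R
WHERE R.bid = B.bid AND B.color = 'red'
INTERSECT
SELECT R.sid
FROM Boats B, Reserves R
WHERE R.bid = B.bid AND B.color = 'green';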

set semantics

  • Set: a collection of distinct elements
  • Standard ways of manipulating / combining sets
    • Union
    • Intersect
    • Except
  • Treat tuples within a relation as elements of a set

Set operations default to set semantics (use ALL for multiset semantics)

Relational tables

|          | SQL                      | Relation  |
|----------|--------------------------|-----------|
| Elements | set/multiset             | set       |
| Ordering | can be ordered (ORDER BY)| unordered |

A table is just an interpretation of the relation.

Relational Division

\(A = \langle x, y \rangle\)
\(B = \langle y \rangle\)
\(A / B = \langle x \rangle\)

Relational division: "Find sailors who've reserved all boats." Said differently: "sailors for whom there is no counterexample of a missing boat."

SELECT S.sname
FROM Sailors S
WHERE NOT EXISTS (
  SELECT B.bid
  FROM Boats B
  WHERE NOT EXISTS (
    SELECT R.bid
    FROM Reserves R
    WHERE R.bid = B.bid AND R.sid = S.sid));

ARGMAX

Retrieving the max together with its other attributes:

SELECT *
FROM Sailors S
WHERE S.rating >= ALL
  (SELECT S2.rating FROM Sailors S2);

SELECT *
FROM Sailors S
WHERE S.rating =
  (SELECT MAX(S2.rating) FROM Sailors S2);

Both are okay; the second is not stable and does not produce ties.

INNER JOIN

Reduces the ANDs needed with a raw cross product:

SELECT s.*, r.bid
FROM Sailors s, Reserves r
WHERE s.sid = r.sid
  AND ...

SELECT s.*, r.bid
FROM Sailors s INNER JOIN Reserves r ON s.sid = r.sid
WHERE ...

Join variants:

SELECT <column expression list>
FROM table_name
  [INNER | NATURAL | {LEFT | RIGHT | FULL} [OUTER]] JOIN table_name
  ON <qualification_list>
WHERE ...;

The shaded options above mark the available join variants.

Left Outer Join

SELECT s.sid, s.sname, r.bid
FROM Sailors2 s LEFT OUTER JOIN Reserves2 r
ON s.sid = r.sid;

Unmatched rows are padded with NULL; the row count matches Sailors.

SELECT r.sid, b.bid, b.bname
FROM Reserves2 r RIGHT OUTER JOIN Boats2 b
ON r.bid = b.bid

Unmatched rows are padded with NULL; the row count matches Boats.

NATURAL & INNER

SELECT s.sid, s.sname, r.bid
FROM Sailors s, Reserves r
WHERE s.sid = r.sid AND s.age > 20;

SELECT s.sid, s.sname, r.bid
FROM Sailors s INNER JOIN Reserves r ON s.sid = r.sid
WHERE s.age > 20;

SELECT s.sid, s.sname, r.bid
FROM Sailors s NATURAL JOIN Reserves r
WHERE s.age > 20;

These are equivalent: NATURAL JOIN equi-joins on all same-named columns, and inner joins only output rows where the join condition holds.

FULL OUTER JOIN

Returns all rows from both tables: sailor columns are set to NULL where no matching sailor exists, and boat columns (b.bname, b.bid) are set to NULL where no matching boat exists.

Views

Named Queries

  • Makes development simpler
  • Often used for security
  • Not Materialized
CREATE VIEW Redcount AS
SELECT B.bid, COUNT(*) AS scount
FROM Boats2 B, Reserves2 R
WHERE R.bid = B.bid AND B.color = 'red'
GROUP BY B.bid;

Subqueries in FROM

Like a “view on the fly”!

SELECT bname, scount
FROM Boats2 B,
  (SELECT B.bid, COUNT(*)
   FROM Boats2 B, Reserves2 R
   WHERE R.bid = B.bid AND B.color = 'red'
   GROUP BY B.bid) AS Reds(bid, scount)
WHERE Reds.bid = B.bid AND scount < 10;

Common table expression (WITH)

another view on the fly

WITH Reds(bid, scount) AS
  (SELECT B.bid, COUNT(*)
   FROM Boats2 B, Reserves2 R
   WHERE R.bid = B.bid AND B.color = 'red'
   GROUP BY B.bid)
SELECT bname, scount
FROM Boats2 B, Reds
WHERE Reds.bid = B.bid AND scount < 10;

ARGMAX GROUP BY

The sailor with the highest rating per age

WITH maxratings(age, maxrating) AS
  (SELECT age, max(rating)
   FROM Sailors
   GROUP BY age)
SELECT S.*
FROM Sailors S, maxratings m
WHERE S.age = m.age
  AND S.rating = m.maxrating;

NULL values

  • Field values are sometimes unknown
    – SQL provides a special value NULL for such situations.
    – Every data type can be NULL
  • The presence of null complicates many issues. E.g.:
    – Selection predicates (WHERE)
    – Aggregation
  • But NULLs come naturally from outer joins

NULL op x is NULL
Explicit NULL checks

SELECT * FROM sailors WHERE rating IS NOT NULL;

If the WHERE predicate evaluates to NULL, the row is not output.

NULL and Aggregation

How to format with Python logging.Formatter

I'm currently trying to center the log-level field in a Python logger, so the output looks like:

[    test_log    ][    DEBUG]  test (color_logger.py:66)
[    test_log    ][     INFO]  test (color_logger.py:67)
[    test_log    ][  WARNING]  test (color_logger.py:68)
[    test_log    ][    ERROR]  test (color_logger.py:69)
[    test_log    ][ CRITICAL]  test (color_logger.py:70)

But it currently looks like:

[__main__][DEBUG]  test (color_logger.py:67)
[__main__][INFO]  test (color_logger.py:68)
[__main__][WARNING]  test (color_logger.py:69)
[__main__][ERROR]  test (color_logger.py:70)
[__main__][CRITICAL]  test (color_logger.py:71)

There are two problems:

  • funcName is used instead of name

  • Need to handle right-alignment and centering

Solution

import logging

BLACK, RED, GREEN, YELLOW, BLUE, MAGENTA, CYAN, WHITE = range(8)

#The background is set with 40 plus the number of the color, and the foreground with 30
#These are the sequences needed to get colored output
RESET_SEQ = "\033[0m"
COLOR_SEQ = "\033[1;%dm"
BOLD_SEQ = "\033[1m"

def formatter_message(message, use_color = True):
    if use_color:
        message = message.replace("$RESET", RESET_SEQ).replace("$BOLD", BOLD_SEQ)
    else:
        message = message.replace("$RESET", "").replace("$BOLD", "")
    return message

COLORS = {
    'WARNING': YELLOW,
    'INFO': WHITE,
    'DEBUG': BLUE,
    'CRITICAL': YELLOW,
    'ERROR': RED
}

class ColoredFormatter(logging.Formatter):
    def __init__(self, msg, use_color = True):
        logging.Formatter.__init__(self, msg)
        self.use_color = use_color

    def format(self, record):
        levelname = record.levelname
        if self.use_color and levelname in COLORS:
            levelname_color = COLOR_SEQ % (30 + COLORS[levelname]) + levelname + RESET_SEQ
            record.levelname = levelname_color
        return logging.Formatter.format(self, record)



# Custom logger class with multiple destinations
class ColoredLogger(logging.Logger):
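    # Note: .center(20) pads the literal "%(funcName)s" placeholder to width 20,
    # i.e., it pads the format string itself, not the substituted value;
    # "%(levelname)20s" right-aligns the level name in a 20-character field.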
    FORMAT = "[$BOLD" + "%(funcName)s".center(20," ")+"$RESET]["+ "%(levelname)20s" +"]  %(message)s ($BOLD%(filename)s$RESET:%(lineno)d)"
    COLOR_FORMAT = formatter_message(FORMAT, True)
    def __init__(self, name):
        logging.Logger.__init__(self, name, logging.DEBUG)                

        color_formatter = ColoredFormatter(self.COLOR_FORMAT)

        console = logging.StreamHandler()
        console.setFormatter(color_formatter)

        self.addHandler(console)
        return
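
A minimal usage sketch (my own, not from the original answer):

logging.setLoggerClass(ColoredLogger)
logger = logging.getLogger("test_log")
logger.debug("test")
logger.warning("test")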

On std::sort: is the open-source version we can read the fastest version that actually runs?

https://godbolt.org/z/jxoYx7

We can clearly see that these all compile to a `call` into the library. The source I've read includes the introsort-based gnu libcxx and llvm's bucket-sort version, but Linux distros ship prebuilt binaries, and every company patches its own code. dhz's question caught me out today; my C++ is still not good enough.

zlw once asked me about this: std::sort is very fast, yet you can't see the source it actually runs, which didn't match my understanding. I've never debugged glibc, and the libcxx.so source may differ from what ships. Who knows?

It would be great if glibc could inject some odd tricks like that.

The Linux team's motto: make our guest quicker.

Parallel Computing Final Project: FineTunedHashML, Spring 2020

The final repo is stored on GitLab; we added some hash functions to the original algorithm.

Dear Professor,
I'm Yiwei Victor Yang, 2018533218, from your parallel computing class.
My teammate and I are struggling to pick a project. I proposed porting a CPU LSH deep-learning system (https://github.com/keroro824/HashingDeepLearning/tree/master/SLIDE), a technique newly proposed by Rice University and Intel that uses the AVX-512 instruction set and huge pages in locality-sensitive hashing, which is the project's main overhead (https://arxiv.org/pdf/1903.03129.pdf).

SLIDE is a project that reimplements backpropagation on the CPU. After LSH pre-processing, the paper proposes maximum inner product search (MIPS), which mainly does sparse matrix operations: sparse backpropagation and gradient updates.

After updating the weights stored in CSR format (SpMV), the paper does batch computation with OpenMP.

As the research in the paper shows, the main overhead is the pre-processing on the CPU, with AVX-512 and huge-page cache/heap optimizations enabled, which is around 30% faster than a raw Tesla V100 GPU. If we port that part (dynamic hashing) to the GPU, which the Intel authors say is not applicable, I think there may be a 50 percent chance it beats the current CPU version.
As one of the paper's references puts it, a similar system gained roughly a 2x speedup after porting to CUDA, which conflicts with that claim: https://github.com/src-d/minhashcuda

Our proposal is to compare raw TensorFlow on GPU, SLIDE on AVX-512, and SLIDE on GPU, using the dataset mentioned in the paper.

My question is: do you think porting SLIDE's dynamic cuckoo hashing from the CPU can achieve a real speedup over the original version? And if we attempt the port and get little speedup, will we still get the score?

Thanks a lot!

[RL] Monte-Carlo Methods

Monte-Carlo RL

  1. MC learn directly from episodes of experience.
  2. MC is model-free. No knowledge of MDP transitions or rewraeds.
  3. MC learns fro complete episodes. no bootstrapping
    1. Here Bootstrapping means Using one or more estimated values in the update step fro the same kind of estimated value.. like in \(SARSA(\lambda)\) or \(Q(\lambda)\)
    2. The single-step TD is to utilize the bootstrap by using a mix of results from different length trajectories.
  4. MC uses the simplest possible idea: value = mean return
  5. 2 ways:
    1. model-free: no model necessary and still attains optimality
    2. Simulated: needs only a simulation, not a full model
  6. Caveat: can only apply MC to episodic MDPs
    1. All episodes must terminate

policy evaluation

The goal is to learn \(v_\pi\) from episodes of experience under policy \(\pi\):
\[S_1, A_1, R_2, \ldots, S_k \sim \pi\]
and the return is the total discounted reward: \(G_t = R_{t+1} + \gamma R_{t+2} + \cdots + \gamma^{T-1} R_T\).

The value function is the expected return: \(v_\pi(s) = \mathbb{E}_{\pi}[G_t \mid S_t = s]\).
P.S. Monte-Carlo policy evaluation uses the empirical mean return rather than the expected return.

Depending on which visits to a state count within an episode, the methods divide into Every-Visit and First-Visit; both converge asymptotically.

First-visit

proof of convergence


In fact it's the running mean \(V(s) = S(s)/N(s)\), computed backward over the episode, that updates \(V(s)\): \(N(s)\) counts visits and \(S(s)\) accumulates returns.
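
A minimal first-visit MC evaluation sketch (my own toy code, not the course's): an episode is a list of (state, reward) pairs, the reward being received on leaving the state.

from collections import defaultdict

def first_visit_mc(episodes, gamma=1.0):
    N = defaultdict(int)      # visit counts  N(s)
    S = defaultdict(float)    # return sums   S(s)
    V = {}
    for episode in episodes:
        # compute returns backward: G_t = R_{t+1} + gamma * G_{t+1}
        G, returns = 0.0, []
        for s, r in reversed(episode):
            G = r + gamma * G
            returns.append((s, G))
        returns.reverse()
        seen = set()
        for s, G in returns:  # only the first visit to s counts
            if s not in seen:
                seen.add(s)
                N[s] += 1
                S[s] += G
                V[s] = S[s] / N[s]
    return V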

blackjack (21 points)

Sutton's book shows the resulting blackjack value-function plots (figure not reproduced here).

Every-Visit Monte-Carlo Policy Evaluation

The difference is that every visit to a state within an episode is used.

Incremental Mean

The mean \(\mu_1, \mu_2, \ldots\) of a sequence can be computed incrementally:
\[\mu_k = \mu_{k-1} + \frac{1}{k}\left(x_k - \mu_{k-1}\right)\]

Incremental MC updates
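
The standard incremental form: after each episode, for each state \(S_t\) with return \(G_t\),
\[N(S_t) \leftarrow N(S_t) + 1\]
\[V(S_t) \leftarrow V(S_t) + \frac{1}{N(S_t)}\left(G_t - V(S_t)\right)\]
In non-stationary problems, track a running mean instead: \(V(S_t) \leftarrow V(S_t) + \alpha\left(G_t - V(S_t)\right)\).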

Monte-Carlo Estimation of Action Values

Backup Diagram for Monte-Carlo

Similar to bandits: an explore/exploit dilemma, except the return includes the entire rest of the episode; the only choice considered is the one at each state, and MC doesn't bootstrap (unlike DP).

Time required to estimate one state does not depend on the total number of states.

Temporal-Difference Learning

Bootstrapping

Saying: to lift yourself up by your own bootstraps, again and again, which is impossible.

Modern definition: a re-sampling technique that draws from the same sample again and again, which has statistical meaning.
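
A tiny illustration of the statistical sense (my own example, not from the lecture):

import random

data = [2.0, 3.5, 1.0, 4.2, 2.8]
boot_means = []
for _ in range(1000):
    resample = [random.choice(data) for _ in data]  # draw with replacement
    boot_means.append(sum(resample) / len(resample))
# the spread of boot_means estimates the sampling variability of the mean
print(min(boot_means), max(boot_means))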

intro

  1. TD methods learn directly from episodes of experience
  2. TD is model-free: no knowledge of MDP transitions/rewards
  3. TD learns from incomplete episodes, by bootstrapping
  4. TD updates a guess towards a guess

TD\((n)\): the number represents how many steps we look ahead.

driving home example

TD is more flexible, since MC has to wait for the final outcome before updating. On-policy vs. off-policy.




MC updates each estimate toward the actual final outcome.


We can construct the estimated MDP and a corresponding example for the AB example.

Comparison

  1. TD can learn before knowing the final outcome
    1. TD can learn online after every step (less memory & peak computation)
    2. MC must wait until end of episode before return is known
  2. TD can learn without the final outcome
    1. TD can learn from incomplete sequences
    2. MC can only learn from complete sequences
    3. TD works in continuing (non-terminating) environments
    4. MC only works in episodic (terminating) environments

Result: TD performs better on the random-walk example.


Batch MC and TD

Add a step-size parameter \(\alpha\).

Unified comparison


  • Bootstrapping: update involves an estimate
    • MC does not bootstrap
    • DP bootstraps
    • TD bootstraps
  • Sampling: the update samples an expectation
    • MC samples
    • DP does not sample
    • TD samples

n-step TD

n-Step Return: consider the following n-step returns for \(n = 1, 2, \ldots, \infty\):

\[
\begin{aligned}
n &= 1 \quad (TD) & G_t^{(1)} &= R_{t+1} + \gamma V(S_{t+1}) \\
n &= 2 & G_t^{(2)} &= R_{t+1} + \gamma R_{t+2} + \gamma^2 V(S_{t+2}) \\
& \;\;\vdots & & \;\;\vdots \\
n &= \infty \quad (MC) & G_t^{(\infty)} &= R_{t+1} + \gamma R_{t+2} + \ldots + \gamma^{T-1} R_T
\end{aligned}
\]

Define the n-step return (the general case):
\[G_t^{(n)} = R_{t+1} + \gamma R_{t+2} + \ldots + \gamma^{n-1} R_{t+n} + \gamma^n V(S_{t+n})\]

n-step temporal-difference learning:
\[V(S_t) \leftarrow V(S_t) + \alpha\left(G_t^{(n)} - V(S_t)\right)\]
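
A runnable sketch of n-step TD prediction (my own toy code: a 7-state random walk with reward 1 at the right terminal, true values \(V(s) = s/6\)):

import random

def n_step_td(V, n=2, alpha=0.1, gamma=1.0, episodes=1000):
    for _ in range(episodes):
        s = 3                            # start in the middle of states 0..6
        states, rewards = [s], [0.0]     # rewards[t] is R_t (reward on entering S_t)
        T = float('inf')
        t = 0
        while True:
            if t < T:
                s2 = s + random.choice([-1, 1])          # random-walk policy
                r = 1.0 if s2 == 6 else 0.0
                states.append(s2)
                rewards.append(r)
                if s2 in (0, 6):
                    T = t + 1                            # episode terminates
                s = s2
            tau = t - n + 1                              # time whose state gets updated
            if tau >= 0:
                G = sum(gamma ** (i - tau - 1) * rewards[i]
                        for i in range(tau + 1, min(tau + n, T) + 1))
                if tau + n < T:
                    G += gamma ** n * V[states[tau + n]]  # bootstrap from V
                V[states[tau]] += alpha * (G - V[states[tau]])
            if tau == T - 1:
                break
            t += 1
    return V

V = n_step_td({s: 0.0 for s in range(7)})
print({s: round(v, 2) for s, v in V.items()})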

References

  1. What exactly is bootstrapping in reinforcement learning?
  2. First-visit Monte Carlo policy from WSU