Scala Contravariance and Covariance

source:
trait List[+T]{}
When type B is a subtype of type A, List[B] can also be regarded as a subtype of List[A]; that is, a List[B] can be generalized to a List[A]. The parameterized type varies in the same direction as its type parameter, which is why this is called covariance. For example, since Cat is a subtype of Animal, List[Cat] is a subtype of List[Animal].

Contravariance is declared as: trait Queue[-T] {}. When type B is a subtype of type A,

Queue[A] can conversely be regarded as a subtype of Queue[B].

That is, the parameterized type varies in the opposite direction of its type parameter, which is why this is called contravariance. The standard instance is Function1[-T, +R]: a function accepting Animal can be used anywhere a function accepting Cat is expected.

summary

Scala environment configuration

export SCALA_HOME=/root/Downloads/spark-2.4.3-bin-hadoop2.7/
export HADOOP_HOME=/root/Downloads/hadoop-2.7.1
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
SPARK_MASTER_IP=Master
SPARK_LOCAL_DIRS=/root/Downloads/spark-2.4.3-bin-hadoop2.7
SPARK_DRIVER_MEMORY=512M

[Signal & System] Fourier transform

Definition

Example

  1. a 3-dimensional Euclidean space \(\mathbb{R}^{3}\): \(\vec{a}=(1,0,0)\), \(\vec{b}=(0,1,0)\), \(\vec{c}=(0,0,1)\)
  2. define an n-dimensional space \(V\)
    \(x(t)=\sum^{\infty}_{k=-\infty}a_{k}e^{jk\omega_{0}t}\)
    \(S=\{x(t)\mid x(t)=x(t+T_{0})\ \forall t\}\), where \(T_{0}=\frac{2\pi}{\omega_{0}}\)
    \(B=\{\phi_{k}(t)\mid \phi_{k}(t)=e^{jk\omega_{0}t},\ k\in\mathbb{Z}\}\), \(B\subset C\)
    We have an inner product on \(V\); the expansion above can then be derived from the orthonormality of \(B\):
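A sketch of the inner product being used, with the \(\frac{1}{T_{0}}\) normalization assumed so that the basis is orthonormal:

\[
\langle \phi_{k},\phi_{l}\rangle=\frac{1}{T_{0}}\int_{T_{0}}\phi_{k}(t)\,\phi_{l}^{*}(t)\,dt=\frac{1}{T_{0}}\int_{T_{0}}e^{j(k-l)\omega_{0}t}\,dt=\begin{cases}1, & k=l\\0, & k\neq l\end{cases}
\]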


The sinusoids form an orthonormal set.

Fourier's Idea

Example


composition, shown as a graph

Periodic signals & Fourier Series Expansion cont.

Some simple theory

| time domain | frequency domain |
|-------------|------------------|
| periodic    | discrete         |
| aperiodic   | continuous       |

deduction

\(a_{k}\) is denoted as the Fourier coefficient.
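For reference, the standard synthesis/analysis pair consistent with the definitions above:

\[
x(t)=\sum_{k=-\infty}^{\infty}a_{k}e^{jk\omega_{0}t},\qquad a_{k}=\frac{1}{T_{0}}\int_{T_{0}}x(t)\,e^{-jk\omega_{0}t}\,dt
\]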

example


Coefficients \(a_{k}\) -> the Fourier series expression



Both the signal and its Fourier coefficients are real-valued even functions.

Error

Convergence


Absolutely integrable

Dirichlet conditions for convergence: (1) \(x(t)\) is absolutely integrable over one period; (2) it has a finite number of maxima and minima within one period; (3) it has a finite number of discontinuities within one period.


Only the third one satisfies the conditions.

Conditions

Properties in the frequency domain and the time domain are equivalent.

Property 8

example

Turn any signal into an expansion over the orthonormal set.

[Parallel Computing] Architecture

implicit parallelism - pipelining

break up one instruction and execute pieces

each pipeline stage utilizes a different unit of the CPU
instruction fetch -> decode -> exec -> mem read -> write back
at most 5x speedup with a 5-stage pipeline

modern processors have 10-20 stages

if code branches, the processor must guess which way the branch goes to keep the pipeline full.

  1. a branch misprediction requires flushing the pipeline.
  2. typical code branches roughly every 5 instructions,
    so prediction is important (see the sketch after this list).
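A minimal sketch of why prediction matters (my own illustration, not from the lecture; the classic sorted-vs-unsorted comparison): the same loop runs much faster once the data-dependent branch becomes predictable.

```cpp
#include <algorithm>
#include <chrono>
#include <cstdio>
#include <random>
#include <vector>

// Illustrative sketch: sum all elements >= 128. The branch is
// unpredictable on random data, but nearly perfectly predictable
// once the data is sorted.
static long long sum_big(const std::vector<int>& v) {
    long long s = 0;
    for (int x : v)
        if (x >= 128) s += x;   // data-dependent branch
    return s;
}

int main() {
    std::mt19937 gen(0);
    std::uniform_int_distribution<int> d(0, 255);
    std::vector<int> v(1 << 24);
    for (int& x : v) x = d(gen);

    auto time_it = [&](const char* label) {
        auto t0 = std::chrono::steady_clock::now();
        volatile long long s = sum_big(v);
        (void)s;
        auto t1 = std::chrono::steady_clock::now();
        long long ms = std::chrono::duration_cast<std::chrono::milliseconds>(t1 - t0).count();
        std::printf("%s: %lld ms\n", label, ms);
    };

    time_it("unsorted (mispredicts often)");
    std::sort(v.begin(), v.end());
    time_it("sorted   (predicts well)");
}
```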

implicit parallelism - Superscalar

  1. we can have 2 fetch units each clock

    ((i) has fewer data dependencies, (ii) should be rescheduled.)
    e.g.
    1000   1004   1008   100C
      o      o      o      o
       \    /        \    /
        R1            R2
          \          /
           \        /
            (result)
  2. execution must respect data dependencies.

    To deal with data hazards, use a DAG to track the dependencies.
    Look ahead using CA to reorder the instructions.

implicit parallelism - VLIW (compile time)

memory performance

latency & bandwidth
speed & lanes

memory hierarchy

caching

locality

Temporal locality - small set of data accessed repeatedly
spatial locality - nearby pieces of data accessed together

e.g.

Matrix Multiplications have both temporal locality and spatial locality.
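A minimal sketch of that claim (my own illustration, not from the notes): in C = A·B, each element of A is reused across a whole row of C (temporal locality), while the i-k-j loop order walks B and C row by row (spatial locality with row-major storage).

```cpp
#include <vector>

// Illustrative sketch: multiply two n x n row-major matrices, C = A * B.
// Caller must zero-initialize C. The i-k-j order streams through B and C
// with unit stride (spatial locality), and A[i][k] is loaded once and
// reused across the whole inner loop (temporal locality).
void matmul(const std::vector<std::vector<double>>& A,
            const std::vector<std::vector<double>>& B,
            std::vector<std::vector<double>>& C) {
    const std::size_t n = A.size();
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t k = 0; k < n; ++k) {
            const double a = A[i][k];          // reused n times
            for (std::size_t j = 0; j < n; ++j)
                C[i][j] += a * B[k][j];        // unit-stride accesses
        }
}
```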

performance

cache hit rate is the proportion of data accesses serviced from the cache.
e.g.

processor architectures

  1. a multiprocessor contains several cores communicating through memory.
  2. a multicore has several cores on a single die.
    1. each core is a complete processor, and all cores run simultaneously.
    2. they share L2/L3 cache.
    3. cheap to replicate CPU state.(registers, program counter, stack) (SMT)

GPGPU

1000+ cores process streams simultaneously.

Flynn's taxonomy

  • SISD conventional sequential processor.
  • SIMD single control unit for multiple processing units
  • MISD uncommon
  • MIMD Distributed systems

SIMD

  • the control unit fetches a single instruction and broadcasts it to the processing units.
  • PEs each apply the instruction to their own data item in synchrony.
    • e.g. adding two arrays (see the sketch after this list).
    • very popular today.
      • dgemm, graphics and video, ML
      • cheap to implement, don't duplicate hardware for fetch.
      • used in GPUs (SIMT), AVX, Phi
      • lanes 100s of bits wide -> 4-32 lanes
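A minimal sketch of the adding-two-arrays example using AVX intrinsics (my own illustration; assumes an x86 CPU with AVX support):

```cpp
#include <immintrin.h>
#include <cstdio>

// Illustrative sketch: one SIMD instruction operates on 8 floats at once.
int main() {
    alignas(32) float a[8] = {1, 2, 3, 4, 5, 6, 7, 8};
    alignas(32) float b[8] = {8, 7, 6, 5, 4, 3, 2, 1};
    alignas(32) float c[8];

    __m256 va = _mm256_load_ps(a);      // load 8 floats in one go
    __m256 vb = _mm256_load_ps(b);
    __m256 vc = _mm256_add_ps(va, vb);  // 8 additions in a single SIMD op
    _mm256_store_ps(c, vc);

    for (float x : c) std::printf("%.0f ", x);  // prints 9 eight times
    std::printf("\n");
}
```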

Instruction divergence

SIMD processors can handle code whose lanes diverge: when processing elements take different branch paths, each path is executed in turn with the inactive lanes masked off, at a performance cost.

Interconnection networks

  • built using links and switches
    • links are fixed connections between 2 processors.
    • switches connect a set of processors on input ports with processors on output ports
  • static vs. dynamic
  • a fully connected network would need a quadratic number of links

shared memory architecture

all processors share the same memory space.
Pros: easy to write program
Cons: limited scalability.
UMA vs. NUMA

Distributed memory architecture


combination of shared memory and non-shared memory.

Network of workstation(NOW)

connect workstations with networking.
e.g. Beowulf

Bus architecture.

all processors use mem on the same bus.
the bandwidth is the bottleneck.

Crossbar architecture.

switched network for higher-end shared memory systems.
allows all processors to communicate with all memory modules simultaneously - nonblocking
Higher bandwidth

topology

multihop networks.
A combination of the bus and the crossbar, offering a flexible trade-off between bandwidth and scalability.
feature:

  • Diameter
    • Distance between farthest pair of processors.
    • Gives worst case latency.
  • Bisection width - fault tolerance
    • Minimum number of links that must be cut to partition the network into 2 equal halves.
    • Indicates bottlenecks.
    • bisection width × per-link bandwidth gives the bisection bandwidth.
  • cost
    • the number of links.

1d topologies


meshes and tori



hypercube

advantage

  1. good routing algorithm: flip the differing address bits one at a time (the number of hops equals the number of differing bits)
  2. Bisection width is always \(\frac{P}{2}\), so there's little bottleneck
  3. nodes do not have hierarchy (v.s. fat tree)
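A minimal sketch of that routing rule (my own illustration): each hop flips one bit in which the current and destination addresses differ.

```cpp
#include <cstdio>

// Illustrative sketch: print a route between two hypercube nodes by
// flipping one differing address bit per hop. The hop count equals
// the Hamming distance between src and dst.
void route(unsigned src, unsigned dst, int dims) {
    unsigned cur = src;
    std::printf("%u", cur);
    for (int b = 0; b < dims; ++b) {
        unsigned bit = 1u << b;
        if ((cur ^ dst) & bit) {   // addresses differ in dimension b
            cur ^= bit;            // move along dimension b
            std::printf(" -> %u", cur);
        }
    }
    std::printf("\n");
}

int main() {
    route(0b000, 0b110, 3);  // prints: 0 -> 2 -> 6 in a 3-d hypercube
}
```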

[RL] Probability Review

Basics

The set

general definition of Probability

Sample space


The physical meaning of probability

frequentist view: a long-run frequency over a large number of repetitions of an experiment.

Bayesian view: a degree of belief about the event in question.
We can assign probabilities to hypotheses like "the candidate will win the election" or "the defendant is guilty", which can't be repeated as experiments.

Markov chain Monte Carlo (MCMC) + computing power + algorithms have made the Bayesian view thrive.

role


Conditional probability

Everything happens under conditions, and conditions give rise to probabilities.
e.g. Conditioning -> DIVIDE & CONQUER -> recursively apply to multi-stage problems.

P(A|B) = \(\frac{P(A\ and\ B)}{P(B)}\)

chain rules

Amenable to distributed computation.
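The chain rule being referred to, in its standard form:

\[
P(A_{1},A_{2},\dots,A_{n})=P(A_{1})\,P(A_{2}\mid A_{1})\,P(A_{3}\mid A_{1},A_{2})\cdots P(A_{n}\mid A_{1},\dots,A_{n-1})
\]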


Inference & Bayes' Rules

Probability distributions and limit theorems

PDF (probability density function)

Mixed type


PDF

valid PDF

  1. non-negative: \(f(x)\geq0\)
  2. integrates to 1:
    \(\int^{\infty}_{-\infty}f(x)dx=1\)

probability distribution

summary of probability distribution


Three distance measures in ML, DL, AI

Total variation distance

usually used in GANs

Law of small numbers (rare/sparse events) for the Poisson distribution


The number of people going to the canteen can be described by a Poisson distribution.

Sample mean

Strong law of large numbers (SLLN)


Converges to the true value with probability one (almost-sure convergence).

Weak law of large numbers (WLLN)


Convergence in probability.

Central limit theorem
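For reference, compact statements of the three results above, for i.i.d. \(X_{1},X_{2},\dots\) with mean \(\mu\), variance \(\sigma^{2}\), and sample mean \(\bar{X}_{n}=\frac{1}{n}\sum_{i=1}^{n}X_{i}\):

\[
\text{SLLN: } \bar{X}_{n}\xrightarrow{\text{a.s.}}\mu,\qquad
\text{WLLN: } \bar{X}_{n}\xrightarrow{P}\mu,\qquad
\text{CLT: } \frac{\sqrt{n}\,(\bar{X}_{n}-\mu)}{\sigma}\xrightarrow{d}\mathcal{N}(0,1)
\]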

Generating function

  1. PGF - Z
  2. MGF - Laplace
  3. CF - Fourier
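Written out (standard definitions), matching the transform correspondence above:

\[
\text{PGF: } G_{X}(z)=E\left[z^{X}\right],\qquad
\text{MGF: } M_{X}(s)=E\left[e^{sX}\right],\qquad
\text{CF: } \varphi_{X}(\omega)=E\left[e^{j\omega X}\right]
\]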

APPLICATION

  1. branching process
  2. bridge complex and probability
  3. play a role in large deviation theory

Multi variables

joint distribution provides complete information about how multiple r.v.s interact in high-dimensional space

joint CDF &PDF



marginal PMF

conditional PMF

joint PDF




techniques

general Bayes' Rules.
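Presumably the form with extra conditioning on evidence \(E\) (standard result):

\[
P(A\mid B,E)=\frac{P(B\mid A,E)\,P(A\mid E)}{P(B\mid E)}
\]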

general LOTP
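The law of total probability with extra conditioning, for a partition \(B_{1},B_{2},\dots\) of the sample space:

\[
P(A\mid E)=\sum_{i}P(A\mid B_{i},E)\,P(B_{i}\mid E)
\]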

change of variables
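The one-dimensional change-of-variables formula, for \(Y=g(X)\) with \(g\) strictly monotone and differentiable:

\[
f_{Y}(y)=f_{X}\bigl(g^{-1}(y)\bigr)\left|\frac{d}{dy}g^{-1}(y)\right|
\]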


summary

Order Statistics

CDF of order statistic

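Standard statement of this result: for \(X_{1},\dots,X_{n}\) i.i.d. with CDF \(F\), the event \(X_{(j)}\le x\) means at least \(j\) of the \(X_{i}\) are \(\le x\), so

\[
P\left(X_{(j)}\le x\right)=\sum_{k=j}^{n}\binom{n}{k}F(x)^{k}\bigl(1-F(x)\bigr)^{n-k}
\]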

proof

PDF of Order Statistic
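Standard form of the density (with PDF \(f\) and CDF \(F\)):

\[
f_{X_{(j)}}(x)=n\binom{n-1}{j-1}f(x)\,F(x)^{j-1}\bigl(1-F(x)\bigr)^{n-j}
\]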


two methods to find PDF

  1. differentiate the CDF to get the PDF (ugly)
  2. multiply the PDF by \(dx\) (the infinitesimal-probability argument)

proof

joint PDF
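The standard result for the joint density of all \(n\) order statistics:

\[
f_{X_{(1)},\dots,X_{(n)}}(x_{1},\dots,x_{n})=n!\,\prod_{i=1}^{n}f(x_{i})\quad\text{for } x_{1}<x_{2}<\dots<x_{n}
\]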

e.g. order statistics of Uniforms
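The key fact for this example (standard result): for i.i.d. \(U_{1},\dots,U_{n}\sim\mathrm{Unif}(0,1)\),

\[
U_{(j)}\sim\mathrm{Beta}(j,\ n-j+1)
\]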

story: Beta-Binomial conjugacy

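The standard conjugacy statement: with prior \(p\sim\mathrm{Beta}(a,b)\) and likelihood \(X\mid p\sim\mathrm{Bin}(n,p)\),

\[
p\mid X=k\ \sim\ \mathrm{Beta}(a+k,\ b+n-k)
\]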

Mean vs Bayes'


deduction

e.g. Laplace's sunrise problem

A problem due to the illustrious Laplace: given the historical record that the sun has risen every day so far, what is the probability that the sun will still rise tomorrow?

拉普拉斯自己的解法:
假定太阳升起这一事件服从一个未知参数A的伯努利过程,且A是[0,1]内均匀分布,则利用已给定的历史数据,太阳明天能升起这一事件的后验概率为
\(P(Xn+1|Xn=1,Xn-1=1,...,X1=1)=\frac{P(Xn+1,Xn=1,Xn-1=1,...,X1=1)}{P(Xn=1,Xn-1=1,...,X1=1)}\)=An+1 在[0,1]内对A的积分/An 在[0,1]内对A的积分=\(\frac{n+1}{n+2}\),即已知太阳从第1天到第n天都能升起,第n+1天能升起的概率接近于1.

Monte carlo

importance sampling

reduces the variance


example
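A minimal sketch (my own illustration, not from the notes): estimate the tail probability \(P(X>4)\) for a standard normal by sampling from a proposal centered in the tail; the importance weights \(p(x)/q(x)\) correct for the mismatch and greatly reduce variance versus naive Monte Carlo.

```cpp
#include <cmath>
#include <cstdio>
#include <random>

const double PI = 3.14159265358979323846;

// Illustrative sketch. Target density p: standard normal.
// Proposal density q: normal centered at 4, so samples land
// in the region that matters.
double p(double x) { return std::exp(-0.5 * x * x) / std::sqrt(2.0 * PI); }
double q(double x) { return std::exp(-0.5 * (x - 4.0) * (x - 4.0)) / std::sqrt(2.0 * PI); }

int main() {
    std::mt19937 gen(42);
    std::normal_distribution<double> proposal(4.0, 1.0);
    const int N = 100000;
    double sum = 0.0;
    for (int i = 0; i < N; ++i) {
        double x = proposal(gen);
        if (x > 4.0)              // indicator of the rare event
            sum += p(x) / q(x);   // importance weight
    }
    // True value is about 3.17e-5; naive sampling would see the
    // event only ~3 times in 100000 draws.
    std::printf("P(X > 4) ~= %g\n", sum / N);
}
```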

[Computer Architecture] Numbers Notes

big IDEAs

  1. Abstraction
  2. Moore's Law
  3. Principle of Locality/Memory Hierarchy
  4. Parallelism
  5. Performance Measurement and Improvement
  6. Dependability VIA Redundancy

old conventional wisdom

Moore's Law + Dennard Scaling = faster, cheaper, lower-power

signed & unsigned Integers

unsigned

e.g. for unsigned int: addresses
0000 0000 0000 0001\(_{two}\) = \(1_{ten}\)
0111 1111 1111 1111\(_{two}\) = \(2^{15}-1\ _{ten}\)

signed

e.g. for signed int: int x,y,z;
1000 0000 0000 0000\(_{two}\) = \(-2^{15}\ _{ten}\)

main idea

want \(\frac{1}{2}\) of the ints >= 0 and \(\frac{1}{2}\) of the ints < 0

two complement

basic ideas

for 3\(_{ten}\) = 0011\(_{two}\), negate by subtracting from 10000\(_{two}\) (equivalently: invert the bits and add 1):
10000\(_{two}\) - 0011\(_{two}\) = (1111\(_{two}\) - 0011\(_{two}\)) + 1\(_{two}\) = 1100\(_{two}\) + 1\(_{two}\) = 1101\(_{two}\) = \(-3_{ten}\)

more e.g.

Assume for simplicity a 4-bit width, representing -8 to +7
Overflow: when the magnitude of the result is too big to fit into the result representation
Carry in = carry from less significant bits
Carry out = carry to more significant bits

take care of the MSB (most significant bit):
the way to detect overflow is to check whether carry in = carry out at the MSB
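A minimal sketch of that rule in code (my own illustration, simulating 4-bit two's-complement addition):

```cpp
#include <cstdio>

// Illustrative sketch: simulate 4-bit two's-complement addition
// (range -8..7). Inputs a, b are 4-bit patterns (0..15). Overflow
// occurs exactly when the carry into the MSB differs from the
// carry out of it.
bool add4_overflows(unsigned a, unsigned b, int& result) {
    unsigned sum = (a + b) & 0xF;                             // keep 4 bits
    result = (sum & 0x8) ? (int)sum - 16 : (int)sum;          // sign-extend
    unsigned carry_in  = (((a & 0x7) + (b & 0x7)) >> 3) & 1;  // into bit 3
    unsigned carry_out = (((a & 0xF) + (b & 0xF)) >> 4) & 1;  // out of bit 3
    return carry_in != carry_out;
}

int main() {
    int r;
    // 5 + 6 = 11 does not fit in [-8, 7]: overflow, wraps to -5
    std::printf("%d result %d\n", add4_overflows(5, 6, r), r);
    // 3 + (-3): 13 is the 4-bit pattern for -3; no overflow, result 0
    std::printf("%d result %d\n", add4_overflows(3, 13, r), r);
}
```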

summary

test of vega sgemm

0.841471 0.540302 0 16384 16384 640 656 656 16400 0.6 0.6
9 3 0 512 512 256 512 256 512 1.4013e-44 2
0.841471 0.540302 0 512 512 256 272 272 528 1.4013e-44 2
cout << A[1] << " " << B[1] << " " << C[1] << " " << m << " " << n << " " << k << " " << lda << " " << ldb << " " << ldc << " " << alpha[1] << " " << beta[1] << endl;

[Parallel Computing] Intro

Administrivia

Rui Fan
Leshan Wang

current deadlines

score complement

what and why

applications

  1. Fluid dynamics
  2. DNA & drug
  3. Quantum / atomic simulation, cosmological

challenges

  1. Harnessing power of masses
  2. Communication
    1. Processors compute faster than they can communicate.
    2. Problem gets worse as the number of processors increases
    3. Main bottleneck to parallel computing
  3. Synchronization
  4. Scheduling
  5. Structured vs. unstructured
    1. Structured problems can be solved with custom hardware.
    2. Unstructured problems more general, but less efficient.
  6. Inherent limitations
    1. Some problems are not (or don't seem to be) parallelizable
      1. Dijkstra's shortest paths algorithm
    2. Other problems require clever algorithms to become parallel.
      1. Fibonacci series (\(a_{n}=a_{n-1}+a_{n-2}\))
    3. The human factor
      1. Hard to keep track of concurrent events and dependencies.
      2. Parallel algorithms are hard to design and debug.

Course Outline

  1. Parallel architectures.
    1. shared memory
    2. distributed memory
    3. many more
  2. Parallel languages
    1. OpenMP
    2. MPI
    3. cuda
    4. MapReduce
  3. Algorithm design techniques
    1. Decomposition
    2. Load balancing
    3. scheduling

state of the art

  4. parallel computers today are mainly based on four processor architectures.
    1. Multi-cores
    2. Many-cores
    3. FPGA
    4. ASIC
  5. power efficiency: goal 50GFLOPS/W
  6. Top-500

[Parallel Computing] CUDA default stream & non-default streams (async streams and visualization) analysis


Concurrent CUDA streams
The default stream has higher priority. In QuEST the default stream runs the Statevec_compactUnitaryKernel function, and most of the GPU time and computational load falls on this kernel.

Operations in a given stream execute in order.

  1. For operations in different non-default streams, there is no guarantee that they execute in any particular order relative to one another.
  2. The default stream is blocking: it can only run after all other streams have completed, and it in turn blocks the other streams from running until it has itself finished.

To define a non-default stream, declare cudaStream_t stream; and pass it as an argument, letting the compiler/runtime handle the stream's creation and copies automatically.
cudaMallocManaged / cudaMemPrefetchAsync

More fine-grained memory management:

  1. cudaMalloc allocates memory on the GPU
  2. cudaMallocHost allocates pinned memory on the CPU

The third trump card: cudaMemcpyAsync
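A minimal sketch tying these together (my own illustration, not code from QuEST; the kernel `scale` is hypothetical): create a non-default stream, then enqueue an async host-to-device copy, a kernel launch, and an async copy back in that stream.

```cpp
#include <cuda_runtime.h>
#include <cstdio>

// Illustrative kernel: multiply every element by a factor.
__global__ void scale(float *data, int n, float factor) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

int main() {
    const int N = 1 << 20;
    const size_t bytes = N * sizeof(float);

    float *h_a;                          // pinned host memory: needed for
    cudaMallocHost((void**)&h_a, bytes); // truly asynchronous copies
    for (int i = 0; i < N; ++i) h_a[i] = 1.0f;

    float *d_a;
    cudaMalloc((void**)&d_a, bytes);

    cudaStream_t stream;                 // a non-default stream
    cudaStreamCreate(&stream);

    // Work in the same stream runs in order relative to itself, but
    // asynchronously with respect to the host and other streams.
    cudaMemcpyAsync(d_a, h_a, bytes, cudaMemcpyHostToDevice, stream);
    scale<<<(N + 255) / 256, 256, 0, stream>>>(d_a, N, 2.0f);
    cudaMemcpyAsync(h_a, d_a, bytes, cudaMemcpyDeviceToHost, stream);

    cudaStreamSynchronize(stream);       // wait for all work in this stream
    std::printf("h_a[0] = %f\n", h_a[0]); // expect 2.0

    cudaStreamDestroy(stream);
    cudaFree(d_a);
    cudaFreeHost(h_a);
    return 0;
}
```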