凸包计算

https://juejin.cn/post/6844904054749921294

March 2, 2020November 19, 2023

[Signal & System] Fourier transform

Definition

Example

a 3-dim Euclid space \(R^{3} \vec{a}=(1,0,0) \vec{b}=(0,1,0)\vec{c}=(0,0,1)\)
define n-dim space \(V\)
\(X(t)=\Sigma^{\inf}_{k=-\inf}e^{jk\omega _{0}t}\)
\(S=\{x(t)|x(t)=x(t+t_{0})\forall \}\) \(T_{0}=\frac{2\pi}{\omega_{0}}\)
comment: <2\(T_{0}\)
\(B=\{\phi_{k}t|\phi_{k}(t)=e^{jkw_{0}t} k\in Z\}\) \(B\subset C\)
We have:

The equation above can be deducted then:

各个弦波都是标准正交积。

Fourier's Idea

Example

composition in graph

Periodic signals & Fourier Series Expansion cont.

Some simple theory

time	frequency
period	discrete
non-period	continuous

deduction

\(a_{k}\) is denoted as Fourier coefficient.

example

系数\(a_{k}\) -> \(Fourier expression\)

前后均为实系数偶函数

误差

Convergence

决对可积

狄利克雷性质-收敛

只有第三个满足条件。

条件

频率域和时间域的性质等价

性质8

example

把任何信号变成标准正交积

February 29, 2020February 9, 2022

[Parallel Computing] Architecture

implicit parallelism - pipelinig

break up one instruction and execute pieces

different unit of cpu utilization
instruction fetch -> decode -> exec -> mem read -> write back
at most 5x with 5 stage pipeline

modern processors has 10-20 stages

if code branches, must guess how to full pipelines.

branch misprediction requires flushing pipline.
Typical code every 5 instructions.
so prediction is important

implicit parallelism - Superscalar

we can have 2 fetch unit each clock

((i)has fewer data dependencies (iii)should be rescheduled.)
e.g.
1000 1004 1008 100C
--o---o---o---o
-- |---/ ---|---/
---R1 ----- R2
----|-----/
execution must respect data dependencies.

To deal with data race, Use DAG to label them.
Look ahead using CA to reorder the instruction set.
## implicit parallelism - VLIW(compile time)

memory performance

latency & bandwidth
speed & lanes

memory hierarchy

caching

locality

Temporal locality - small set of data accessed repeatedly
spacial locality - nearby pieces of data accessed together

e.g.

Matrix Multiplications have both temporal locality and spatial locality.
IMG_FAC796E5B556-1

performance

cache hit rate is the proportion of data accesses serviced from the cache.
e.g.

processor architectures

a multiprocessor contains several cores communicating through memory.
a multicore has several cores on a single die.
1. each core is a complete processor, but runs simultaneously.
2. they share L2/L3 cache.
3. cheap to replicate CPU state.(registers, program counter, stack) (SMT)

GPGPU

1000+ core stream simultaneously.

Flynn's taxonomy

SISD conventional sequential processor.
SIMD single control unit for multiple processing unit
MISD uncommon
MIMD Distributed systems

SIMD

control unit fetches single instruction and broadcast it to processing units.
PEs each apply instruction to its own data item in synchrony.
- e.g. adding two arrays.
- very popular today.
  - dgemm, graphics and video, ML
  - cheap to implement, don't duplicate hardware for fetch.
  - Used in GPU(SIMT), avx, phi
  - 100s bits lanes -> 4-32 lanes

Instruction divergence

SIMD processors can perform different instructions.

Interconnection networks

built using links and switches
- links are fixed connection between 2 processors.
- switch connect set of processors on input ports with processors on output ports
static vs. dynamic
quadratic needed

shared memory architecture

all processors shared the same mems-pace.
Pros: easy to write program
Cons: limited scalability.
UMA vs. NUMA

Distributed memory architecture

combination of shared memory and non-shared memory.

Network of workstation(NOW)

connect workstations with networking.
e.g. Beowulf

Bus architecture.

all processors use mem on the same bus.
the bandwidth is the bottlenech.

Crossbar architecture.

switched network for higher end shared memory system.
allow all processors to communicate with all mem modules simultaneously. -nonblocking
Higher bandwidth

topology

multihop networks.
The combination of the bus and cross bar. flexibility between bandwidth and scalability.
feature:

Diameter
- Distance between farthest pair of processors.
- Gives worst case latency.
Bisection width. - fault tolerence
- Minimum number of links to cut to partition the network into 3 equal halves.
- Indicates bottlenecks.
- links out bandwidth.
cost
- the number of links.

1d topologies

meshes and tori

hypercube

advantage

good algorithm to calculate the path(only change the number of different bit)
Bisection can always be \(\frac{P}{2}\) so there's little bottleneck
nodes do not have hierarchy (v.s. fat tree)

February 29, 2020February 9, 2022

Basics

The set

general definition of Probability

样本空间

IMG_B1B034C78691-1

概率的物理意义

frequentist view: a long-run frequency over a large number of repetitions of an experiment.

Bayesian view: a degree of belief about the event in question.
We can assign probabilities to hypotheses like "candidate will win the election" or "the defendant is guilty"can't be repeated.

Markov & Monta Carlo + computing power + algorithm thrives the Bayesian view.

role

IMG_F7558CBB4476-1

条件概率

所有事情都有条件，条件就会产生概率
e.g. Conditioning -> DIVIDE & CONCUER -> recursively apply to multi-stage problem.

P（A｜B） = \(\frac{P(A\ and\ B)}{P(B)}\)

chain rules

有利于分布式计算

IMG_EC132FE4D2D1-1

Inference & Bayes' Rules

概率分布和极限定理

PDF 概率密度函数

混合型

IMG_434B41011BCA-1

PDF

valid PDF

non negative \(f(x)\geq0\)
integral to 1:
\(\int^{\infty}_{-\infty}f(x)dx=1\)

probability distribution

summary of probability distribution

三种距离衡量 in ML, DL, AI

全变量距离

usually in GAN

小数定理（稀疏事件） in poisson

去食堂吃饭人数可以用柏松分布来描述

Sample mean

强大数定理SLLN

收敛到真正的概率值以概率为一收敛

弱大数定理WLLN

以概率收敛

中心极限定理

Generating function

PGF - Z
MGF - Laplace
CF - 傅立叶

APPLICATION

branching process
bridge complex and probability
play a role in large deviation theory
## Multi variables.
joint distribution provides complete information about how multiple r.v. interact in high-dimensional space

joint CDF &PDF

marginal PMF

conditional PMF

joint PDF

Screen Shot 2020-03-03 at 03.04.48
Screen Shot 2020-03-03 at 03.29.53
Screen Shot 2020-03-03 at 03.31.50

techniques

general Bayes' Rules.

general LOTP

change of variables

summary

Order Statistics

CDF of order statistic

Screen Shot 2020-03-03 at 03.57.04

proof

PDF of Order Statostic

two methods to find PDF

CDF -differentiate> PDF (ugly)
PDF*dx
###proof

## joint PDF

e.g. order statistics of Uniforms

story:beta-Binomial Conjugacy

Screen Shot 2020-03-03 at 16.07.50

Mean vs Bayes'

deduction

e.g. 拉普拉斯问题

来自大名鼎鼎的拉普拉斯的问题，若给定太阳每天都升起的历史记录，则太阳明天仍然能升起的概率是多少？

拉普拉斯自己的解法：
假定太阳升起这一事件服从一个未知参数A的伯努利过程，且A是[0,1]内均匀分布，则利用已给定的历史数据，太阳明天能升起这一事件的后验概率为
\(P(Xn+1|Xn=1,Xn-1=1,...,X1=1)=\frac{P(Xn+1,Xn=1,Xn-1=1,...,X1=1)}{P(Xn=1,Xn-1=1,...,X1=1)}\)=Aⁿ⁺¹ 在[0,1]内对A的积分/Aⁿ 在[0,1]内对A的积分=\(\frac{n+1}{n+2}\),即已知太阳从第1天到第n天都能升起，第n+1天能升起的概率接近于1.

Monte carlo

importance sampling

reduce the 方差

importance sampling

example

February 29, 2020February 9, 2022

[Computer Architecture] Numbers Notes

big IDEAs

Abstraction
Moore's Law
Principle of Locality/Memory Hierarchy
Parallelism
Performance Measurement and Improvement
Dependability VIA Redundancy

old conventional wisdom

Moore's Law t+ Dennard Scaling = faster, cheaper, low-power

signed &unsigned Intergers

unsigned

e.g. for unsigned int: adresses
0000 0000 0000 0001\(_{two}\) = \(1_{ten}\)
0111 1111 1111 1111\(_{two}\) = \(2^{11}-1\ _{ten}\)

signed

e.g. for signed int: int x,y,z;
1000 0000 0000 0000\(_{two}\) = \(-2^{11}\ _{ten}\)

main idea

want \(\frac{1}{2}\) of the int >=0, \(\frac{1}{2}\) of the int <0

two complement

basic ideas

for 3\(_{ten}\)=0011\(_{two}\)
for 10000\(_{two}\)-0011\(_{two}\)=1\(_{two}\)+1111\(_{two}\)-0011\(_{two}\)=1101\(_{two}\)

more e.g.

Assume for simplicity 4 bit width, -8 to +7
represented
PNG图像
There's an overflow here
Overflow when magnitude of result too big to fit into result representation
Carry in = carry form less significant bits
Carry out = carry to more significant bits

take care of the MSB(Most significant bits)
to detect overflow is to check whether carry in = carry out in the MSB

summary

February 28, 2020February 29, 2020

test of vega sgemm

0.841471 0.540302 0 16384 16384 640 656 656 16400 0.6 0.6
9 3 0 512 512 256 512 256 512 1.4013e-44 2
0.841471 0.540302 0 512 512 256 272 272 528 1.4013e-44 2
cout<<A[1]<<" "<<B[1]<<" "<<C[1]<<" " <<m << " "<<n<<" "<<k<<" "<<lda<<" "<<ldb<<" "<<ldc<<" "<<alpha[1]<<" "<<beta[1]<<endl;

February 28, 2020February 9, 2022

[Parallel Computing] Intro

Administrivia

Rui Fan
Leshan Wang

curent ddl

score complement

what and why

applications

Fluid dynamics
DNA & drug
Quantum / atomic simulation, cosmological

challenges

Harnessing power of masses
Communication
1. Processors compute faster than they can communicate.
2. Problem gets worse as number of processors increase
3. Main bottleneck to parallel computing
Synchronization
Scheduling
Structured vs. unstructured
1. Structured problems can be solved with custom hardware.
2. Unstructured problems more general, but less efficient.
Inherent linmitations
1. Some problems are not( or don't seem to be ) paralleizable
  1. Dijkstra's shortest paths algorithm
2. Other problems require clever algorithms to become parallel.
  1. Fibonacci series (\(a_{n}=a_{n-1}+a_{n-2}\))
3. The human factor
  1. Hard to keep track of concurrent events and dependencies.
  2. Parallel algorithms are hard to design and debug.

Course Outline

Parallel architectures.
1. shared memory
2. distributed memory
3. many more
Parallel languages
1. OpenMP
2. MPI
3. cuda
4. MapReduce
Algorithm design techniques
1. Decomposition
2. Load balancing
3. schedling
  ## state of the art
parellel computers today mainly based on four processor architectures.
1. Multi-cores
2. Many-cores
3. FPGA
4. ASIC
power efficiency: goal 50GFLOPS/W
Top-500

February 25, 2020February 11, 2022

Shanghaitech GeekPie 2020 WarmUP CTF Game - G20G Stage 3 Doc

Stage 3

3-1 失踪的快递

湖北省武汉市某同学家。从得到一个快递号，没来的去取，上面的身份证号已经模糊，需要猜出才能取得快递。

快递编号 3102511818424 → X

百度第一次认识 Google 是什么时候 → Y

身份证号

TL;DR - 目前的选择是 42011620060523627X。

前六位

前六位为行政区域编码。

湖北武汉黄陂区

考虑到能自己少做一些就少做一些，目前看到快递单号 3102511818424 还可以：

3102511818424

湖北武汉黄陂区

420116

这个单号的优点是直接用百度搜索会提示“：( 抱歉，查询出错，请重试或点击快递公司官网地址进行查询。”，但实际上到韵达快递官网是能找到的。

日期部分

需要一个带有年月日的日期。

百度第一次认识 Google 是什么时候

百度搜 Google

最早结果 2001年6月30日

20010631

百度第一次删除 Google 是什么时候

可能需要一些线索提示要去“百度百科”找答案

Google 词条编辑历史

最早删除 2006-05-23

20060523

其他方法

举例：

询问黑板报得到答案

最后四位

数字都给出来，顺序要你自己猜。选择校验码为 X，则可以使用 627X。

3-2 快递的秘密

Onion 域名

blgpxymqjoo35curmvsldxejuq5vsyf5orutrfp25bdan223t62a3vad.onion/42011620010631726X 跳转到一个教务系统的登陆界面。需要输入自己的邮箱和与之匹配的用户名登陆。之后跳转到

查看课件

ppt隐写

改ppt为zip解压出藏在最后的网址和线索

3-3 救人的药

ppt给出线索跳转到jupyter hub 的登陆界面

网址 victoryang00.xyz:5006 用户名 jupyter1-6 密码 g20

每运行一步能得到一部分分数

最后在前端输入所有可能的总共化合物的碳原子比氢原子的比值的最小值（保留13位有效数字）得到分数

此步骤有mutex 锁（显卡只有一块，而训练会allocate全部显存），即一组在运行时，另一组无法进行，也是拉大时间差距的一关

答案是 0.8235294117647

3-4 电话号码

承接上回给出的线索，以下线索按时间顺序（每10分钟）给出，要求得到电话号码

位置提示1

https://j.map.baidu.com/f3/fGw 最接近球状物体的地方

位置提示2

https://j.map.baidu.com/a2/C_w 和上一张图同时存在的地方

位置提示3

上海微小卫星工程中心

答案是 (021)50735001

February 25, 2020February 9, 2022

[Parallel Computing] cuda 默认流&非默认流（异步流和可视化）分析

并发 cuda stream
默认流具有更高的优先级，在QuEST中默认流为Statevec_compactUnitaryKernel 函数，同时占用的gpu时间和运算重点都在此核函数上。

给定流中的操作会按序执行。

就不同非默认流中的操作而言，无法保证其会按彼此之间的任何特定顺序执行。
默认流会受到阻碍，并在其他所有流完成之后方可运行，但其亦会阻碍其他流的运行直至其自身已运行完毕。

定义非默认流，把cudaStream_t stream; 作为参数传递，让编译器自动完成默认流的创建和复制。
用cudaMallocManaged 和 cudaMemPrefetchAsync

更详细的内存管理：

cudaMalloc GPU分配内存
cudaMallocHost 把内存分配在cpu上

第三个杀手锏 cudaMemcpyAsync

February 25, 2020February 9, 2022

What does Multi-Armed Bandit means?

credit:https://iosband.github.io/2015/07/19/Efficient-experimentation-and-multi-armed-bandits.html

At first, multi-armed bandit means using
\(f^* : \mathcal{X} \rightarrow \mathbb{R}\)

Each arm \(i\) pays out 1 dollar with probability \(p_i\) if it is played; otherwise it pays out nothing.
While the \(p_1,…,p_k\) are fixed, we don’t know any of their values.
Each timestep \(t\) we pick a single arm \(a_t\) to play.
Based on our choice, we receive a return of \(r_t \sim Ber(p_{a_t})\).
##How should we choose arms so as to maximize total expected return?##