[Parallel Computing] Loop dependence analysis

Intro

shared memory algorithm design

image-20200505182020225

For non-shared-memory algorithms, we have to use barriers and explicit communication to get the data transferred between processes.

image-20200505182116133

for a shared-memory-based algorithm, we have to decide which parts of the task share a lot of memory, and then simply insert the corresponding OpenMP (or MPI) directives

Design considerations

two main considerations are data dependence and load balance, so we have to apply the following steps (see the sketch after this list)

  • data dependence analysis
  • static vs. dynamic and block vs. cyclic work assignment
  • variable specification: shared, private, or reduction, plus row-wise vs. column-wise mapping
    • shared variables cause cache coherence traffic and much lower performance
    • private and reduction variables don't need synchronization
    • dimension mapping mostly affects cache locality
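
As a concrete illustration of these choices, here is a minimal OpenMP sketch (a hypothetical dot product, not from the notes) showing a scheduling clause and a reduction variable:

#include <omp.h>
#include <stdio.h>

#define N 1000000

/* hypothetical example: dot product */
int main() {
    static double a[N], b[N];
    double sum = 0.0;
    for (int i = 0; i < N; i++) { a[i] = 1.0; b[i] = 2.0; }

    /* schedule(static) gives each thread a contiguous block of iterations;
       schedule(dynamic, chunk) would hand out chunks on demand for load balance.
       reduction(+:sum) gives every thread a private partial sum, merged at the end,
       so there is no cache-coherence traffic on a shared accumulator. */
    #pragma omp parallel for schedule(static) reduction(+:sum)
    for (int i = 0; i < N; i++)
        sum += a[i] * b[i];

    printf("sum = %f\n", sum);
    return 0;
}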

Main Consideration for data dependence analysis

RAW & WAR & WAW

All of these hazards should be avoided, even though a particular interleaving might happen to produce the right result.

The goal is to run all dependent iterations on the same processor.
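
A minimal sketch (hypothetical loops, not from the slides) of the three hazards:

/* none of these loops is safe to parallelize as written */
void hazards(int n, int *a, int *s) {
    /* RAW (true dependence): iteration i reads what iteration i-1 wrote */
    for (int i = 1; i < n; i++) a[i] = a[i-1] + 1;

    /* WAR (anti dependence): iteration i reads a[i+1] before iteration i+1 overwrites it */
    for (int i = 0; i < n - 1; i++) a[i] = a[i+1] + 1;

    /* WAW (output dependence): every iteration writes *s, so the final value
       must be the one from the last iteration */
    for (int i = 0; i < n; i++) *s = a[i] * 2;
}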

loop dependence analysis

  • loop-carried dependence
    • dependence exists across different iterations of the loop
  • loop-independent dependence
    • dependence exists within the same iteration of the loop

example

image-20200505191450951
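
For instance (a hypothetical loop, not necessarily the one in the slide):

void example(int n, double *a, double *b) {
    for (int i = 1; i < n; i++) {
        a[i] = a[i-1] + 1.0;  /* S1: loop-carried true dependence on itself (a[i-1]) */
        b[i] = a[i] * 0.5;    /* S2: loop-independent dependence on S1 (same iteration i) */
    }
}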

iteration-space traversal graph (ITG)

  • iteration-space traversal graph is a line graph showing the order of traversal in the iteration space.
  • image-20200505192641248

loop-carried dependence graph (LDG)

  • Given the ITG, we can determine the dependence between different iterations.
  • The Loop-carried Dependence Graph (LDG) shows the loop-carried true/anti/output dependence relationships.
  • Node in LDG is a point in the iteration space.
  • Directed edge in LDG is the dependence.
  • LDG helps identify parts of the loop that can be done in parallel.

examples

image-20200505193459997
image-20200505193641884
image-20200505193657284
image-20200505194356931
image-20200505194556118

 

Distance and direction vector

image-20200505194745834
image-20200505195031615
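
For instance, if iteration \((i, j)\) writes a[i][j] and iteration \((i+1, j-2)\) reads it, the distance vector is \((1, -2)\) and the direction vector is \((<, >)\).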

Loop

image-20200505195226568
image-20200505195950287
image-20200505204740523
image-20200505204811678

Algorithm analysis

image-20200505205040713

Why you should use `express`

introduction

I used to mirror the pages manually with something like

wget -r -p -np -k http://ics.nju.edu.cn/~jyywiki/

Much of the time the pages you scrape from a website are rendered by JavaScript. Admittedly, you can keep using requests_html, whose idea is to use a Chromium core to dynamically render the JS page and then grab the important information.

from requests_html import HTMLSession

If you want to serve the mirrored pages locally, you can use express.

var express = require('express');
var path = require('path');
var ejs = require('ejs');
// import the packages here

var app = express();

// view engine setup: render the mirrored .html files under ./wiki with ejs
app.set('views', path.join(__dirname, '/wiki'));
app.engine('html', ejs.__express);
app.set('view engine', 'html');

// you can implement caching of the rendered pages here
app.get('/*', function (req, res) {
  res.type('html');
  // render the requested page relative to the views directory
  res.render(req.path.replace(/^\//, '') || 'index');
});

app.listen(3000);

//credit: https://blog.csdn.net/u011481543/article/details/79696565
Save it as server.js (with the mirrored wiki directory next to it) and run

node server.js

[Parallel Computing] SpMV

Intro

  • Sparse matrix vector multiplication.
  • Many scientific algorithms require multiplying a matrix by a vector.
    • Optimization (e.g. conjugate gradient on low-degree meshes), iterative methods (solving linear systems), eigenvalue methods (e.g. graph partitioning), simulations (e.g. finite elements), data analysis (e.g. PageRank on the web connectivity matrix).
  • The matrices are often sparse.
    • In an n×n sparse matrix, only \(O(n)\) of the \(n^2\) elements are nonzero.

image-20200420095011605

image-20200420095335093

image-20200420095418927

DIA Format

image-20200420095439285

ELL format

image-20200420095732832

COO format

image-20200420095827947

CSR format

image-20200420095843518

image-20200420095942099
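
For reference, a minimal serial CSR SpMV sketch (assuming the usual row_ptr / col_idx / val arrays; not the kernel from the slides):

void spmv_csr(int n, const int *row_ptr, const int *col_idx,
              const double *val, const double *x, double *y) {
    for (int row = 0; row < n; row++) {
        double sum = 0.0;
        /* the nonzeros of this row are stored contiguously in val/col_idx */
        for (int k = row_ptr[row]; k < row_ptr[row + 1]; k++)
            sum += val[k] * x[col_idx[k]];
        y[row] = sum;
    }
}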

Hybrid format

image-20200420100525622

what to use

image-20200420100455097

ELL kernel

image-20200420104856356

CSR scalar vector kernel

image-20200420104830242
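
Since the kernel image is hard to read here, below is a sketch of what a CSR scalar kernel typically looks like (one thread per row; my own reconstruction, not necessarily the slide's code):

__global__ void spmv_csr_scalar(int n, const int *row_ptr, const int *col_idx,
                                const double *val, const double *x, double *y) {
    int row = blockIdx.x * blockDim.x + threadIdx.x;  /* one thread per row */
    if (row < n) {
        double sum = 0.0;
        for (int k = row_ptr[row]; k < row_ptr[row + 1]; k++)
            sum += val[k] * x[col_idx[k]];   /* accesses are poorly coalesced across threads */
        y[row] = sum;
    }
}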

CSR vector kernel

image-20200420104812194

COO vector kernel

image-20200420105358361

image-20200420104747538

[Parallel Computing] Concurrency bugs

  • In parallel systems, high performance is achieved by using algorithms with high parallelism, good load balancing, memory locality, etc.
  • But must also ensure multiple concurrent threads / processes operate correctly.
  • Concurrency bugs can arise due to unexpected interleaving of concurrent threads.
  • One of the most difficult issues to deal with in parallel / distributed computing.
    • Bugs occur at random times depending on the interleaving.
    • Bugs don't occur during testing, but they will eventually occur in the deployed system.
    • Humans have a hard time anticipating and resolving concurrency bugs. Concurrency bugs can have very serious consequences.

      Therac-25 radiation therapy system had a concurrency bug that led to radiation overdose and death of several patients.

      Space shuttle aborted 20 minutes before maiden launch due to concurrency bug in its avionics software.

eliminating concurrency bugs

  • Multiple ways, each with pros and cons.
  • Critical sections and locks
    • Prevent processes from accessing a block of code at the same time.
    • Easy to use, effective for some problems.
    • But cause contention, overhead and serialization.
    • Need to decide how much code to lock.
    • Too little, and may still get concurrency bug.
    • Too much, and we lose parallelism and performance.
  • If processes acquire several locks, they need to coordinate to maintain correctness, avoid deadlock.
  • A low-priority thread that acquires a lock can delay a high-priority thread (priority inversion).
  • Despite these problems, locks are still the most widely used solution.
  • transactional memory
    • A block of code is defined as a transaction, i.e. the block of code either executes atomically or doesn’t execute at all.
    • Keep track of reads and writes done by a transaction. If two concurrent transactions read and write to same memory location, abort one of them, i.e. undo all the changes it made.
    • Two concurrent transactions accessing different memory locations can both commit, i.e. all the changes it made are made permanent.
    • Transactional memory can either be implemented in hardware (HTM) or software (STM).
    • HTM has limits of size and type of transactions it can handle. Implemented in e.g. Intel Haswell, IBM Power8.
    • STM is more flexible, but can be very slow.
  • Write your own concurrent code, without hardware support.
    • Challenging for most programmers. Not scalable in terms of productivity.
    • Correct, efficient algorithms are often research level publications.

Mutual exclusion

  • Given n concurrent processes that want to perform a critical section (CS), mutual exclusion can satisfy the following properties.
    • No two processes are in CS at same time.
    • If several processes want to enter the CS, at least one succeeds in finite time (deadlock freedom).
    • If several processes want to enter the CS, every process succeeds in finite time (wait freedom).
  • All (useful) mutex algorithms satisfy first and second properties.
  • Some algorithms satisfy the third property, but have lower performance.

Mutual exclusion algorithms

  • Mutex is provided by locks. But how are locks implemented?
    • Depends on the type of operations the underlying hardware supports.
    • First type of algorithm uses only read / write operations.
    • Second type uses hardware synchronization primitives such as test-and-set (TAS) or compare-and-swap (CAS), provided in most processors (see the spinlock sketch after this list).
  • TAS(x) tests if a Boolean variable x is true.
    • If x == false, it sets x to true.
    • Returns x's value before the TAS.
    • All these steps are done atomically, without interruption from other threads.
    • getAndSet(x) is like TAS(x), but allows non-Boolean x.
  • CAS(x,v,v’) tests if variable x currently equals v. If so, it sets x to v’. Otherwise, it doesn’t change x. It also returns x’s current value.
    • Again, all this is atomic.
    • Algorithms also depend on a processor’s memory model.
    • Some processors reorder instructions to avoid stalls and obtain higher performance. This can break many lock algorithms.
    • Most lock algorithms assume memory model is sequentially consistent, i.e. the execution order of instructions from different processes is an interleaving of the instructions of each process in program order.
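
As an illustration of what TAS gives you (a hypothetical C++ sketch, not one of the course's algorithms), a simple spinlock can be built directly on top of it:

#include <atomic>

class SpinLock {
    std::atomic_flag flag = ATOMIC_FLAG_INIT;
public:
    void lock() {
        /* test_and_set is the TAS primitive: atomically sets the flag and
           returns its previous value; spin until it was previously clear */
        while (flag.test_and_set(std::memory_order_acquire)) { /* spin */ }
    }
    void unlock() {
        flag.clear(std::memory_order_release);
    }
};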

first attempt

  1. image-20200330212904433
  2. image-20200330223832949

[Parallel Computing] Shared Memory Programming and OpenMP

image-20200330122725049

shared memory model compared with mpi

  • Threads (e.g. Pthreads, Java)
    • individual sequences of instructions that can execute in parallel and access shared data
    • very general, but the programmer has to manage everything.
  • parallel programming language / library
    • a parallel language or library is used to create the code
    • requires a new compiler
  • Compiler directives (OMP)
    • insert compiler directives into a sequential program to specify parallelism and indicate shared data; the compiler translates them into threads.
    • still uses threads underneath, but the system manages the threads.
    • easy to program, but less flexible (see the minimal example after this list).
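
A minimal sketch of the directive style (my own example, not from the slides):

#include <omp.h>
#include <stdio.h>

int main() {
    /* the only change to the sequential code is the pragma;
       the compiler/runtime creates and manages the threads */
    #pragma omp parallel
    printf("hello from thread %d of %d\n",
           omp_get_thread_num(), omp_get_num_threads());
    return 0;
}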

Processes and thread

  • Process(MPI)
    • separate program with its own variables, memory, stack and instruction pointer.
    • Different programs can't access each other's memory.

    image-20200330123249520

  • thread(OMP)
    • Concurrent routine that shares the variables and memory space, but has its own stack and instruction pointer.

    image-20200330123330196

[Parallel Computing] CUDA intrinsics

Reference: Cuda dev blog

why we introduce intrinsics

  • Race condition: Different outcomes depending on execution order.

  • Race conditions can occur in any concurrent system, including GPUs.

    • Code should print a = 1,000,000.
    • But it actually printed a = 88.
  • GPUs have atomic intrinsics for simple atomic operations.

  • Hard to implement general mutex in CUDA.

    • Codes using critical sections don’t perform well on GPUs anyway.

Building the critical section with per-thread atomics like this does not perform well. The example operates on the variable d_a.

image-20200415201026154

cuda atomics

Can perform atomics on global or shared memory variables.

  • int atomicInc(int *addr)
    • Reads value at addr, increments it, returns old value.
    • Hardware ensures all 3 instructions happen without interruption from any other thread.
  • int atomicAdd(int *addr, int val)
    • Reads value at addr, adds val to it, returns old value.
  • int atomicMax(int *addr, int val)
    • Reads value at addr, sets it to max of current value and val, returns old value.
  • int atomicExch(int *addr, int val)
    • Sets the value at addr to val, returns the old value at addr.
  • int atomicCAS(int *addr, old, new)
    • "Compare and swap", a conditional atomic.
    • Reads the value at addr. If the value equals old, sets it to new. Else does nothing.
    • The return value indicates whether the state changed, i.e. whether your view was up to date.
    • Universal operation, i.e. it can be used to build any other kind of synchronization (see the sketch after this list).
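
As an illustration of that universality (a sketch along the lines of the CUDA programming guide, not code from the slides), atomicCAS can be used to build an atomic add for doubles:

__device__ double atomicAddDouble(double *addr, double val) {
    unsigned long long *p = (unsigned long long *)addr;
    unsigned long long old = *p, assumed;
    do {
        assumed = old;
        /* try to install the new value; fails if another thread changed *addr */
        old = atomicCAS(p, assumed,
                        __double_as_longlong(val + __longlong_as_double(assumed)));
    } while (assumed != old);
    return __longlong_as_double(old);
}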

examples

Finding max of array

image-20200415201746698
image-20200415201926626

improve it

We can improve it by splitting the single global max into num_locals local max values.

Thread i atomically maxes its value into its own local max, local_max[locali].

image-20200415202001710

An even better solution is to organize it as a tree-structured reduction.

Compute the histogram

image-20200415202130295

! whether the data is stored in shared or global memory is up to the programmer
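
A sketch of the shared-memory variant (a hypothetical 256-bin kernel, not the slide's exact code): each block builds a private histogram in shared memory with atomics, then merges it into the global one.

__global__ void histogram256(const unsigned char *data, int n, unsigned int *hist) {
    __shared__ unsigned int local[256];
    for (int i = threadIdx.x; i < 256; i += blockDim.x) local[i] = 0;
    __syncthreads();

    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += gridDim.x * blockDim.x)
        atomicAdd(&local[data[i]], 1u);          /* shared-memory atomic */
    __syncthreads();

    for (int i = threadIdx.x; i < 256; i += blockDim.x)
        atomicAdd(&hist[i], local[i]);           /* global-memory atomic */
}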

Shuffle intrinsics

! this kind of intrinsic is not limited to CUDA; it exists on all SIMD architectures

For example, with AVX 256-bit vectors we can rotate the elements:

#include <immintrin.h>

/* load 8 floats and rotate them right by one position */
__m256 load_rotr(float *src)
{
#ifdef __AVX2__
    __m256 orig = _mm256_loadu_ps(src);
    __m256 rotated_right = _mm256_permutevar8x32_ps(orig, _mm256_set_epi32(0,7,6,5,4,3,2,1));
    return rotated_right;
#else
    __m256 shifted = _mm256_loadu_ps(src + 1);
    __m256 bcast = _mm256_set1_ps(*src);
    return _mm256_blend_ps(shifted, bcast, 0b10000000);
#endif
}


For the Kepler architecture, we have 4 shuffle intrinsics.

image-20200415225424500

The goal of the shuffle intrinsics is to optimize the memory-fetch model: threads in a warp can exchange register values directly instead of going through shared memory.

image-20200415230800174

butterfly operations

image-20200415223804087
image-20200415223136072

Assume we use shuffles instead of shared memory (see the warp-reduction sketch at the end of this section).

image-20200415223508673
image-20200415223701215
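
A sketch of the butterfly pattern with shuffles (using the newer _sync variants; my own reconstruction, not the slide's code): every lane ends up with the warp-wide sum without touching shared memory.

__inline__ __device__ float warpReduceSum(float val) {
    /* butterfly exchange: distances 16, 8, 4, 2, 1 within the 32-lane warp */
    for (int offset = 16; offset > 0; offset >>= 1)
        val += __shfl_xor_sync(0xffffffff, val, offset);
    return val;
}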

[Parallel Computing] Amdahl and Gustafson, metrics to measure parallel programs

Two great laws

Amdahl

Amdahl's law gives the speedup formula obtained when a serial system is partially parallelized.

The definition of the speedup: \(Speedup = \frac{Time\ before\ optimization}{Time\ after\ optimization}\). The higher the speedup, the better the optimization.

Proof: the time after optimization is \(T_n = T_1\left(F + \frac{1-F}{n}\right)\), where \(T_1\) is the time before optimization, \(F\) is the fraction of the program that is serial, \((1-F)\) is the fraction that is parallelized, and \(n\) is the number of processors. Substituting into the definition gives \(\frac{T_1}{T_n}\), namely \(\frac{1}{F + \frac{1-F}{n}}\).

From the equation we can deduce that the speedup is inversely related to \(F\). We can also see that adding processors is just one way of improving the \(Speedup\); lowering \(F\) has the same effect.
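
For instance, with \(F = 0.1\) (10% serial) and \(n = 16\) processors, the speedup is \(\frac{1}{0.1 + 0.9/16} = 6.4\), far below the ideal 16.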

For example, in ASC19, besides merely tuning the MPI parameters, it also helped to attack the serial parts, such as the initialization and the special operations on the qubits.

Gustafson

Gustafson's law is another important way of evaluating the speedup.

Define the serial processing time as \(a\) and the parallel processing time as \(b\) (in the single-CPU case), so the execution time on \(n\) CPUs is \(a+b\), while the equivalent serial execution time would be \(a+nb\), where \(n\) is the number of CPUs.

Define \(F=\frac{a}{a+b}\), the proportion of the serial part in the execution time.

We get the final speedup: \(s(n) = \frac{a+nb}{a+b} = \frac{a}{a+b} + \frac{nb}{a+b} = F + n(1-F)\).

From the equation, if the serial part is small enough, \(s(n)\) approaches the number of CPUs, so reducing the serial part makes sense, and so does adding CPU cores.
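
For instance, with \(F = 0.1\) and \(n = 16\), Gustafson's scaled speedup is \(0.1 + 16 \times 0.9 = 14.5\), much closer to 16 than Amdahl's fixed-size estimate.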

difference between the two criteria

The two laws look at parallel programs from two different angles. Amdahl says that when the serial ratio is fixed, there is an upper limit to what adding CPUs can achieve; the speedup can be improved by reducing the serial ratio as well as increasing the number of CPUs. Gustafson says that when the serial fraction becomes very small, the formula shows that adding CPUs keeps improving the speedup.

Multi-threaded features

Because of data inconsistency and safety issues in a multi-threaded environment, some rule-based control is required. Java's memory model (JMM) regulates the effective and correct execution of multi-threaded programs, and JMM is built precisely on atomicity, visibility, and ordering, so this post introduces atomicity, visibility, and ordering in multi-threading.

Atomic

For a single thread, operations are indeed atomic: take an int variable, change its value, and that is the value you read back. This is what we see when we run it on the system, because most systems are 32-bit or 64-bit and an int is 4 bytes, i.e. 32 bits. But try a long: a long is 8 bytes, i.e. 64 bits. If your system is 32-bit, a write to a long may take two steps, so if two threads read and write the same long, a thread that reads it right after another thread changed it is not guaranteed to see the new value; it may observe a half-updated one.

Visibility

How to understand visibility? First of all, visibility is not an issue for a single thread; it is about multiple threads: if one thread changes a shared variable, does another thread know it has changed? For example, let a be a shared variable. On cpu1 the variable a is cache-optimized, i.e. placed in cpu1's cache. Now a thread on cpu2 changes a. That change is invisible to the thread on cpu1, which keeps reading a from its cache and finds a value inconsistent with what was written on cpu2.

Ordered

For a single thread, code executes in program order, which is true, but in a multi-threaded environment this no longer necessarily holds, because instruction reordering may occur. In other words, in a multi-threaded environment code execution is not necessarily ordered.

Since the disorder is caused by reordering, will all instructions be reordered? Of course not. Reordering follows the happens-before rules.

Are pre-Gaokao essays just a "dialogue between literary youths across time and space"?

I am just an ordinary person. Recently I had to write something, looked back at my old essays, and felt something. In high school, whenever I wrote an essay, I would usually put in my views on my family and on society. I was born into a family that values science and reason, does not chase profit, and is materially comfortable. My wording back then often failed to convey what I meant, but it remains my philosophy of life to this day.

Finding the happiness in pain

Zhou Guoping gave a talk at Fudan called "The Happiness of Life". Someone in the audience sighed that although she was good-looking and talented, with a lovely family and excellent grades, she could not find happiness.

Happiness is not a state without pain.

I rather liked that line at the time. I despised people who radiate a sense of superiority while actually knowing nothing about themselves, yet sometimes I was exactly that kind of person. I am grateful to my family for the perspective and vantage point it gave me, which made me sniff at many "literary youths" back then, to the point of feeling at a loss and wanting to attack them in writing. I had one less pretentious notion, and at the same time some inferiority, because my grades were simply worse than my classmates' no matter how hard I worked. Maybe my ability fell short, or I wanted too much. So deep down I was quite self-abased, hoping to console myself by "looking down on others".

My overall idea for that essay was to write it as a dialogue with myself: if I already felt happiness in the worldly sense, should I lean into pain and look for my own value in it?

Zhou Guoping's line is really also a magnifying interpretation: pain is not the opposite of happiness. Perhaps sometimes entering pain can also bring happiness, and one can be happy without pain. For many literary youths that is exactly their room to play: the pain can be a Sisyphus-like "suffering yet happy" search, and finally finding the pain is itself happiness. Add some ornate phrasing that clicks with the graders and you can get a decent score.

At the time I had not long been back from Berkeley; it was a monthly exam early in my second-to-last year of high school. I had met many classmates who were both wealthy and excellent, some of whom later got into Berkeley's computer science program; the portrait of such people is that everything out of their mouths is brimming with resources. So I wrote such a classmate into the essay. They really are happy, able to change girlfriends every day; think of Wang Sicong. And so I began my rant. I also thought of the "aircraft carrier girl" of that time, the girl who studied online until 3 a.m. every day and finally got into Tsinghua's Xinya College. I said: don't trade pain for happiness like that. And then there were my do-nothing mother and my do-nothing Chinese teacher; if either of them had pushed a little harder, I would have gone a bit further.

I cited the classics, telling them that the so-called "minor achievers" were just like the many self-congratulating passengers on the night boats of the Ming and Qing who ended up "knowing nothing at all". And what if they did receive the true teaching of the Four Books and Five Classics? People who treat the eight-legged essay as gospel and sit in their comfort zone get only the scraps of Confucius, Mencius, Cheng and Zhu, and in the end can only play at matching couplets, that kind of vulgar literary-youth pastime. There is no innovation in any real sense, which is exactly the result of staying in one's comfort zone. That is the main reason China did not progress then and per-capita output stayed at a low point for a thousand years: we abandoned "science" and wallowed in the glory of the ancients.

äŗŗéœ€č¦ē—›č‹¦ļ¼Œéœ€č¦ē—›č‹¦åŽ»åœ†ę»”å……å®žäŗŗē”Ÿļ¼Œä»Žč€Œé€ ē¦ē¤¾ä¼šć€‚

ęˆ‘ęƒ³čÆ“ļ¼Œäŗŗę˜Æč¦ęœ‰å®¶å›½ęƒ…ę€€ēš„ć€‚ę‚Øé‚£äŗ›å¹øē¦äøčæ‡ę˜Æå±ę°‘ęƒ³ēš„å±äŗ‹ć€‚ä»“å»Ŗå®žåˆ™éœ€å…³ę³Øę°‘ē”Ÿļ¼Œč‡Ŗå·±ē—›č‹¦ļ¼Œč°‹ę±‚å…Øē¤¾ä¼šēš„å¹øē¦ć€‚

äø€äøŖčŖę˜Žäŗŗęœ€å®¹ę˜“ęƒ³å®Œäŗŗē”Ÿļ¼Œę„Ÿč§‰äŗŗē”Ÿå°±é‚£ę ·ļ¼Œåšå„½åˆ†å†…ä¹‹äŗ‹ļ¼Œä½•åæ…ē—›č‹¦å‘¢ļ¼ŸäŗŽę˜ÆęÆå¤©ę‹˜ę³„äŗŽč‡Ŗå·±ēš„å°ē”®å¹øäø­ļ¼Œå¦‚ā€œé¾™åŗ”å°ā€ä¹‹ęµļ¼Œčæ™ę˜Æā€œå¹³åŗøä¹‹ę¶ā€ć€‚é‚£ä½å­¦ē”Ÿåŗ”ęŠŠč‡Ŗå·±ēš„ę‰€č°“ēš„č“¢åŠ›ęŠ•å…„č‡Ŗå·±ēš„ā€œē—›č‹¦ā€ļ¼Œčµ°å‡ŗčˆ’é€‚åœˆć€‚č€Œäøåŗ”åƒå„’å£«äø€ę ·ę²”ęœ‰ēœ¼å…‰ć€‚å®ƒå› č½¬åŒ–äøŗåŠ›é‡ļ¼Œé€ ē¦äŗŗē±»ć€‚

äø¤äøŖä¾‹å­ļ¼Œäø€ę­£äø€åć€‚å¤å·“é“¶č”Œč”Œé•æåˆ‡ę ¼ē“¦ę‹‰å’Œå‚»é€¼ēŽ‹ę€čŖć€‚

ē‚®č½°ę²”ęœ‰å¹øē¦ēš„äŗŗļ¼Œę²”ęœ‰ē—›č‹¦ēš„äŗŗć€‚

ā€œäŗŗéœ€åÆ»å¾—ę‰€ēˆ±ć€‚ā€ä¹”åø®äø»ē”Øčæ™å„čÆē»“ęŸē»™ā€œäø–ē•Œå¤§č„‘ā€ä»¬ēš„ę¼”č®²ļ¼Œę„åœØčÆ“ę˜Žäŗŗę€»čƒ½ę‰¾åˆ°å±žäŗŽč‡Ŗå·±ēš„ä»·å€¼ļ¼Œäøŗä¹‹ē—›ļ¼Œäøŗä¹‹ä¹ć€‚čæ™ä»½ē—›ę˜Æåø¦ęœ‰äø»č§‚åˆ¤ę–­ēš„ć€‚äøčƒ½éŗ»ęœØē›²ä»Žåœ°ē—›č‹¦äø‹åŽ»ļ¼ŒęŠ˜ē£Øč‡Ŗå·±å¾ˆē®€å•ļ¼Œä½†ē»äøčƒ½é™·å…„ēäŗ‹ēš„ē—›č‹¦ļ¼Œäøčƒ½č®©ē‰‡é¢åŗøäæ—ēš„ē—›č‹¦ęˆäøŗęˆå°±é«˜ę°“å¹³å¹øē¦ēš„ē»Šč„šēŸ³ć€‚å²é“ē”Ÿäø€ē”Ÿā€œåŸ‹ę²”ā€åœØē—…ē—›ē—›č‹¦ēš„ęžč‡“äø­ļ¼Œå“åœØåœ°å›å…¬å›­äø­ē»™å…Øäø­å›½äŗŗę°‘åœØę–‡é©åŽä»„ē”Ÿå‘½ę„ä¹‰ēš„ę…°č—‰ć€‚č‹č½¼ęˆé•æä¹‹č·Æäø–äŗŗēš†ēŸ„ļ¼Œä»Žé’å¹“åˆ°č€å¹“ēš„č·ƒå˜ļ¼Œä»Žę— ę³•ę‘†č„±åÆ¹ę—¶å¼Šēš„ęŸē¼šåˆ°ę— ä»„ä¼¦ęÆ”ēš„čƒøč„Ÿć€‚ä»–ēŖē “äŗ†č‹¦ēš„č¾¹ē•Œļ¼Œäøŗē™¾å§“ē–¾č‹¦å…±čˆžć€‚ä»–ēš„å¹øę˜Æ č‹¦å åŠ ēš„ē»“ęžœć€‚ē›øåäø‡åŽ†ęœ±ēæŠé’§å€’åœØäŗ†č‡£å­é˜“é˜³åˆ©ē›Šę–—äŗ‰ēš„č”€ę³Šå’Œäø‰å¾é«˜äø½ēš„ē—›č‹¦äø­ļ¼Œå¤±äŗ†åæƒę™ŗļ¼ŒęœŖčƒ½ē»­å¤§ę˜Žä¹‹å¹øļ¼Œå¼ å±…ę­£ä¹‹ę‰˜ļ¼Œē»ˆäŗ«å…¶ä¹ć€‚

What I wanted to say in that passage is that meaningful, higher-level pain can bring higher-level happiness.

Finding the happiness that comes after pain is a tempering, and even more a metamorphosis. One should not simply settle for the conditions at hand, but think about how to turn those conditions into higher-level happiness. The higher-level happiness is the happiness of family and country. At the same time, happiness is not something simple and blind; it requires judgment and choice.

Such is my philosophy of life: to temper myself under constant pain and setbacks. I will learn technology and learn to manage people. Be confident. Take "the happiness of family and country" as my own duty: cultivate myself, put my family in order, govern the state, and bring peace to the world. I will not become the petty people in my eyes: my mom, my high-school Chinese teacher liujingwen, and that crowd of overseas students. I will fight for my own meaning all my life and work for the health of the motherland for 50 years.

As for why this essay was graded bottom of the third tier, i.e. 44/70: it largely went off topic, the wording was plain, and there was nothing striking. The teacher apparently felt no resonance, or simply couldn't be bothered to read it. And besides, where was Zhou Guoping in it?

This was the inner monologue of a 17-year-old finishing the paper with only 30 minutes left. I am not suited to passive learning. Ever since then I have been very proactive, from being dragged abroad by my parents (I saw myself as a deserter) to becoming a qualified Gaokao candidate. Perhaps I should have recognized the gap back then, but I was numbed by my outwardly arrogant, inwardly self-abased character, believing that with enough effort I would surely earn a piece of ground of my own. The fact is, the Chinese teacher did not understand what was on my mind.

I quite enjoyed reading those literary youths of that time, who by now have presumably turned to marketing, or law, or perhaps economics. I sniffed at them. I think the essays of that period, under our Chinese teacher's influence, became a garden where literary youths talked only to one another.

My life experience was too thin and I read too little at the time; later I read quite a lot, but rarely managed to distill it into words. I would write book reviews too, but never got any praise. I rarely put much effort into things that do not improve my abilities; probably from the very beginning I had decided that what those people wrote was not worth mentioning.

"Those who soar like kites toward the sky calm their ambitions when they gaze at these peaks; those busy managing worldly affairs forget to return when they peer into these valleys." — "Letter to Zhu Yuansi" (ć€ŠäøŽęœ±å…ƒę€ä¹¦ć€‹)

[Signal and System] Fourier transform

Time-domain and frequency-domain (amplitude/phase spectrum) plots of a sinusoid

clear;clf;
syms t;
f=2;T=1/f;t0=0;
A=1;
%ft=A*sin(2*pi*f*t);
ft=A*cos(2*pi*f*t+pi/4);
subplot(2,2,[1,3]); fplot(ft,[0 3]); title('time domain'); xlabel('t(second)'); ylabel('y(t)');
w = 2*pi*f; N = 10;
Fn = zeros(1,N);
Wn = zeros(1,N);
for k = 0:N-1
	Fn(k+1) = 2*1/T*int(ft*exp(-1j*k*w*t), t, [t0, t0+T]);
	Wn(k+1) = k*w;
end
subplot(2,2,2); stem(Wn/(2*pi), abs(Fn)); title('frequency amplitude'); xlabel('f(Hz)'); ylabel('Y');
subplot(2,2,4); stem(Wn/(2*pi), angle(Fn)*180/pi); title('frequency phase'); xlabel('f(Hz)'); ylabel('Y')

img

ę–¹ę³¢äæ”å·ēš„ę—¶é¢‘å›¾å’Œå¤é¢‘å›¾

å®žé™…äøŠę–¹ę³¢äæ”å·ä¹Ÿę˜Æäø‰č§’å‡½ę•°ēš„åŠ å’Œļ¼ŒåÆä»„ä»Žå¤é¢‘å›¾äø­ēœ‹å‡ŗć€‚

clear;clf;
t=0:0.01:3;
f=2;T=1/f;t0=0;
A=1;
%ft=A*sin(2*pi*f*t);
ft=square(2*pi/T*t);
subplot(2,2,[1,3]); plot(t,ft); title('time domain'); xlabel('t(second)'); ylabel('y(t)');
w = 2*pi*f; N = 10;
Fn = zeros(1,N);
Wn = zeros(1,N);
for k = 0:N-1
  fun=@(t) square(2*pi/T*t).*exp(-1j*k*w*t);
  Fn(k+1) = 2/T*integral(fun,t0,t0+T);
  Wn(k+1) = k*w;
end
subplot(2,2,2); stem(Wn/(2*pi), abs(Fn)); title('frequency amplitude'); xlabel('f(Hz)'); ylabel('Y');
subplot(2,2,4); stem(Wn/(2*pi), angle(Fn)*180/pi); title('frequency phase'); xlabel('f(Hz)'); ylabel('Y')

img

äæ”å·č½¬ę¢åÆä»„ę‰¾åˆ°äæ”å·ēš„ē‰¹ę€§

clear;clf;
Fs = 1000;
ts = 1/Fs;
N = 3000;
t = (0:N-1)*ts;
signal1 = 0.7*sin(2*pi*50*t)+sin(2*pi*120*t)+2*randn(size(t));
subplot(2,1,1);plot(t,signal1); title('time domain'); xlabel('t(second)');ylabel('y(t)');
Y = fft(signal1);
Y = abs(Y/N);
Y = Y(1:N/2+1);
Y(2:end) = 2*Y(2:end);
f = (0:N/2)*Fs/N;
subplot(2,1,2);plot(f,Y); title(' frequency domain' ); xlabel('f(Hz)');ylabel('Y');

imgäø€ē§ēœ‹ę—¶åŸŸå›¾ēš„č§’åŗ¦

A square wave is a sum of sinusoids in the frequency domain

t = -2:0.001:2;
N = input('N=');
a0 = 0.5;
f = a0*ones(1,length(t));
for n=1:2:N
f = f+cos(n*pi*t)*sinc(n/2);
end
plot(t,f);grid on;
title(['N=' num2str(N)]);
xlabel('t');ylabel('f(t)');
axis([-2,2,-0.2,1.2])

img

t = -2:0.001:2;
N = input('N=');
T1 = 2;
w1 = 2*pi/T1;
fun = @(t) t.^0;
a0 = 1/T1*integral(fun,-0.5,0.5);
f = a0;
an = zeros(1,N);
bn = zeros(1,N);
for i = 1:N
fun = @(t) (t.^0).*cos(i*w1.*t);
an(i) = 2/T1.*integral(fun,-0.5,0.5);
fun = @(t) (t.^0).*sin(i*w1.*t);
bn(i) = 2/T1.*integral(fun,-0.5,0.5);
f = f + an(i)*cos(i*w1.*t)+bn(i)*sin(i*w1.*t);
end
plot(t,f); grid on;
title(['N=' num2str(N)]);
axis([-2 2 -0.2 1.2]);

img

img

img