## Parallel Computing Final Project FineTunedHashML Spring 2020

The final repo is stored at gitlab, added some hash function to the original algorithm.

Dear Professor,
I’m Yiwei Victor Yang 2018533218 from your parallel computing class.
My teammate and I are struggling in selecting the project. I proposed to set up a projects transplant of a CPU LSH-Deep learning (https://github.com/keroro824/HashingDeepLearning/tree/master/SLIDE), which is a technique newly proposed by Rice University and Intel co, ltd to utilize the avx512 instruction set and huge pages intrinsic in locality sensitive hashing which is the main overhead of the project. ( https://arxiv.org/pdf/1903.03129.pdf)

SLIDE is a project to reimplement BP on the CPU. After the LSH pre-processing, the paper proposed to utilize maximum inner product search (MIPS) which mainly do the Sparse matrix operations- Sparse Backpropagation or Gradient Update.

After the update of weight stored in CSR format Spmv, the paper do the Batch computing by OpenMP.

As research in the paper, we can see that the main overhead is pre-processing on CPU avx512 with huge pages cache-heap optimization on, which gained around 30% faster than the raw GPU Tesla V100. If we transplant that part(dynamic hashing) in GPU which is said not applicable by the author from intel, I think it may have a 50 percent chance to be faster than the current CPU version.
As the paper’s reference paper puts it: it gains something like 2 times faster after porting to Cuda, which triggers conflicts. https://github.com/src-d/minhashcuda

Our proposal is to make a comparison amid the raw Tensorflow on GPU, SLIDE on avx512, and SLIDE on GPU using the dataset mentioned in the paper.

My question is whether the transplant of the CPU dynamic cuckoo in the SLIDE do you think can have real speed up than the original version and if we attempted to transplant the program and get little speed up, will we eventually get the score?

Thanks a lot!

## shared memory algorithm design

For non-shared memory algorithms, we have to utilize bararrier to gain data transformed.

for shared memory database algorithm, we have to decide the group part of the task that shared memory a lot . Then just apply the inserting directives of omp or mpi

### Design considerations

two main considerations lies in data dependence and load balance so we have to apply the following steps

• data dependence analysis
• static r dynamic and block r cyclic work assignment
• variable specification whether using shared private r reduction and row-wise r column-wise
• shared variables cause cache coherence traffic and much lower performance
• private and reduction variables don't need synchronization
• dimension mapping is more relying on the cache locality

## Main Consideration for data dependence analysis

RAW & WAR & WAW

all the collations should be avoided though they might run into right situations

Goal is to we should run all the dependent situation on the same processor.

## loop dependence analysis

• loop-carried dependence
• dependence exists across different iterations of loop
• loop-independent dependence
• dependence exists within the same iteration of loop

## iteration-space traversal graph (ITG)

• iteration-space traversal graph is a line graph showing the order of traversal in the iteration space.

## loop-carried dependence graph (LDG)

• Given the ITG, can determine the dependence between different loops.
• Loop-carried Dependence Graph (LDG) shows the loopcarried true/anti/output dependence relationships.
• Node in LDG is a point in the iteration space.
• Directed edge in LDG is the dependence.
• LDG helps identify parts of the loop that can be done in parallel.

## Intro

• Sparse matrix vector multiplication.
• Many scientific algorithms require multiplying a matrix by a vector.
• Optimization (e.g. conjugate gradient low degree mesh.), iterative methods (solving linear systems), eigenvalue methods (e.g. graph partitioning vx), simulations (e.g. finite elements), data analysis (e.g. Pagerank the web connectivity matrix).
• The matrices are often sparse.
• In an nxn matrix, there are $$O(n^2)$$ nonzero elements.

## [Parallel Computing] Concurrency bugs

• In parallel system, high performance achieved by using algorithm with high parallelism, good load balancing, memory locality, etc.
• But must also ensure multiple concurrent threads / processes operate correctly.
• Concurrency bugs can arise due to unexpected interleaving of concurrent threads.
• One of the most difficult issues to deal with in parallel / distributed computing.
• Bugs occur at random times depending on the interleaving.
• Bugs don’t occur during testing, but they will eventually occur in system deployed system.
• Humans have hard time anticipating and resolving concurrency bugs.Concurrency bugs can have very serious consequences.

Therac-25 radiation therapy system had a concurrency bug that led to radiation overdose and death of several patients.

Space shuttle aborted 20 minutes before maiden launch due to concurrency bug in its avionics software.

## eliminating concurrency bugs

• Multiple ways, each with pros and cons.
• Critical sections and locks
• Prevent processes from accessing a block of code at the same time.
• Easy to use, effective for some problems.
• But cause contention, overhead and serialization.
• Need to decide how much code to lock.
• Too little, and may still get concurrency bug.
• Too much, and we lose parallelism and performance.
• If processes acquire several locks, they need to coordinate to maintain correctness, avoid deadlock.
• Low priority that acquires a lock can delay high priority thread (priority inversion).
• Despite these problems, locks are still the most widely used solution.
• transactional memory
• A block of code is defined as a transaction, i.e. the block of code either executes atomically or doesn’t execute at all.
• Keep track of reads and writes done by a transaction. If two concurrent transactions read and write to same memory location, abort one of them, i.e. undo all the changes it made.
• Two concurrent transactions accessing different memory locations can both commit, i.e. all the changes it made are made permanent.
• Transactional memory can either be implemented in hardware (HTM) or software (STM).
• HTM has limits of size and type of transactions it can handle. Implemented in e.g. Intel Haswell, IBM Power8.
• STM is more flexible, but can be very slow.
• Write your own concurrent code, without hardware support.
• Challenging for most programmers. Not scalable in terms of productivity.
• Correct, efficient algorithms are often research level publications.

## Mutual exclution

• Given n concurrent processes that want to perform a critical section (CS), mutual exclusion can satisfy the following properties.
• No two processes are in CS at same time.
• If several processes want to enter the CS, at least one succeeds in finite time (deadlock freedom).
• If several processes want to enter the CS, every process succeeds in finite time (wait freedom).
• All (useful) mutex algorithms satisfy first and second properties.
• Some algorithms satisfy the third property, but have lower performance.

## Mutual exclusion algorithms

• Mutex is provided by locks. But how are locks implemented?
• Depends on the type of operations the underlying hardware supports.
• First type of algorithm uses only read / write operations.
• Second type uses hardware synchronization primitives such as test-and- set (TAS) or compare-and-swap (CAS), provided in most processors.
• TAS(x) tests if a Boolean variable x is true.
• If x == false, it sets x to true.
• Returns x’s value before the TS.
• All these steps done atomically, without interruption from other threads.
• getAndSet(x) is like TAS(x), but allows non-Boolean x.
• CAS(x,v,v’) tests if variable x currently equals v. If so, it sets x to v’. Otherwise, it doesn’t change x. It also returns x’s current value.
• Again, all this is atomic.
• Algorithms also depend on a processor’s memory model.
• Some processors reorder instructions to avoid stalls and obtain higher performance. This can break many lock algorithms.
• Most lock algorithms assume memory model is sequentially consistent, i.e. the execution order of instructions from different processes is an interleaving of the instructions of each process in program order.

## shared memory model compared with mpi

• individual sequences of instructions that can execute in parallel and access shared data
• very general, because programmer have to manage everything.
• parallel programming language/ library
• a parallel language or library is used to create code
• require new compiler
• Compiler directives (OMP)
• inserts compiler directives into a sequential program to specify parallelism and indicate shared data and the compiler translates into threads.
• easy to program. but lose flexibility.

• Process(MPI)
• separate program with its own variable, memory, stack and instruction pointer.
• Different programs can't access each other's memory

• Concurrent routine that shares the variables and memory space, but has its own stack and instruction pointer.

## [Parallel Comuting] Cuda intrinsics

Reference: Cuda dev blog

## why we introduce intrinsics

• Race condition: Different outcomes depending on execution order.

• Race conditions can occur in any concurrent system, including GPUs.

• Code should print a =1,000,000.
• But actually printed a = 88.
• GPUs have atomic intrinsics for simple atomic operations.

• Hard to implement general mutex in CUDA.

• Codes using critical sections don’t perform well on GPUs anyway.

The atomic thread in critical section is not good in performance. The example is identified for variable d_a

## cuda atomics

Can perform atomics on global or shared memory variables.

• Hardware ensures all 3 instructions happen without interruption from any other thread.
• int atomicMax(int *addr, int val)
• Reads value at addr, sets it to max of current value and val, returns old value.
• int atomicExch(int *addr1, int val)
• Sets val at addr to val, returns old value at val.
• int atomicCAS(int *addr, old, new)
• “Compare and swap”, a conditional atomic.
• Reads value at addr. If value equals old, sets value to new. Else does nothing.
• Indicates whether state changed, i.e. if your view is up to date.
• Universal operation, i.e. can be used to perform any other kind of synchronization

## Finding max of array

### improve it

we can improve it bu split the single global max into num_locals number of local max value.

Thread i atomically maxes with its local max. can max the local_max[locali],

a better solution is to make it into the tree DS+CA

## Compute the histogram

! whether the data is stored on shared or global memory is depend on programmers

## Shuffle intrinsics

! this intrinsic is not limited to cuda but all SIMD architecture

from avx256 vector elements we have

__m256 load_rotr(float *src)
{
#ifdef __AVX2__
__m256 rotated_right = _mm256_permutevar8x32_ps(orig, _mm256_set_epi32(0,7,6,5,4,3,2,1));
return rotated_right;
#else
__m256 shifted = _mm256_loadu_ps(src + 1);
__m256 bcast = _mm256_set1_ps(*src);
return _mm256_blend_ps(shifted, bcast, 0b10000000);
#endif
}


For Kepler architecture, we have 4 intrisics.

The goal of the shuffle intrinsics is actually for optimizing Memory-Fetch Model

## butterfly operations

Assume we have at most shuffle instead of shared memory

## Collective communication

• Important as the basis for many parallel algorithms.
• Implemented using multiple point to point communications, possibly in parallel.
• Assume communicating m words (without contention) between a source and destination takes $$t_{s} + m t_{w}$$ time.
• Ignore the the distance of the message, since per hop latency th is usually small.
• With contention c, time becomes $$t_{s} + c m t_{w}$$.
• Implementations depend on communication hardware architecture.
• Operations
• All-reduce, prefix sum.
• Scatter and gather.
• All-to-all scatter and reduce.
• Circular shift.

## Broadcast and reduction on ring

just cut them into halves.

### cost

• With p processors, log p steps.
• In each step, a processor sends a size m message.
• Total time$(t_{s} + m t_{w}) log p$ take $$(t_{s} + m t_{w})$$ as const

### algorithm

• Root first broadcasts along its row using ring algorithm.
• The nodes of root’s row then broadcast along their columns using ring algorithm.

### cost

• $$log \sqrt{p} = (log p) / 2$$ steps along row, same along columns.
• Total time (ts + m tw) log p.log
• p = (log p) / 2 steps along row, same along columns.
• Total time (ts + m tw) log p.

### algorithm

RING CAN BE PROJECTED TO THE HYPERCUBE but with more congestion.

## One of the most import algorithm in the new century

### Intro to the fft

#### The basic equation of the FFT

$$F(\omega)=F|f(t)|=\int ^{+\infty}_{-\infty}f(t)e^{-j\omega t}dt$$

## Roots of unity

$\begin{array}{l}\text { An n'th root of unity is an } \omega \text { s.t. } \ \omega^{n}=1 \text { . } \ \text { There are n roots n'th roots of } \ \text { unity, and they have the form } \ e^{\frac{2 \pi i k}{n}}, \text { for } 0 \leq k \leq n-1 \text { . } \ \text {Write } \omega_{n}=e^{\frac{2 \pi i}{n}} \text { , so that the n'th } \ \text { roots of unity are } \omega_{n}^{0}, \omega_{n}^{1}, \ldots, \omega_{n}^{n-1} \end{array}$

##### Some problems to differ DT,DFT,FFT,IFFT

They are Fourier Transform, Discrete Fourier Transform, Fast Fourier Transform and Inverse Fourier Transform.
The transform factor:
$$\text { Fact } 1 \omega_{n}^{n}=1 \text { . } \ \text { Fact } 2 \omega_{n}^{k+\frac{n}{2}}=-\omega_{n}^{k} \ \text { Fact } 3\left(\omega_{n}^{k}\right)^{2}=\omega_{n}^{2 k}=\omega_{n / 2}^{k}$$

Why we should have the DFT.
Because in theory, all the data stored in computer is Discrete. So we have to use the equation $$X(k)=\sum^{N-1}_0x(n)W^{kn}_N(k\in \mathbb{N})$$

The Transform factor is used to prove the
1) Periodicity
$$W_{N}^{m+l N}=W_{N}^{m},$$\) where $$\(: W_{N}^{-m}=W_{N}^{N-m}$$
2) Symmetry
$$W_{N}^{m+\frac{N}{2}}=-W_{N}^{m}$$
3) Contractability
$$W_{N / m}^{k}=W_{N}^{m k}$$
4) Special rotation factors
$$W_{N}^{0}=1$$

1. Why Fourier Fast Algorithm (aka FFT)?

FFT is a DFT-based algorithm designed to accelerate the computational speed of DFT.

Given a degree $$n$$ -1 polynomial
$$A(x)=a_{0}+a_{1} x+a_{2} x^{2}+\dots+a_{n-1} x^{n-1}$$
DFT computes $$A\left(\omega_{n}^{0}\right), A\left(\omega_{n}^{1}\right), \ldots, A\left(\omega_{n}^{n-1}\right)$$
since $$A(x)=a_{0}+x\left(a_{1}+x\left(a_{2}+\cdots\right) \ldots\right)$$
$$A(x)$$ can be evaluated in $$\mathrm{O}(\mathrm{n})$$ time and
$$\mathrm{O}(1)$$ memory.

• DFT can be computed trivially in $$\mathrm{O}\left(\mathrm{n}^{2}\right)$$
time.

For the DFT formula computer implementation the complexity is o(N²), while the complexity by FFT calculation is reduced to: N×log2(N)

1. What is the sequence split extraction in FFT?

The sequence splitting process of FFT is the extraction process, which is divided into: extraction by time and extraction by frequency.

1) Extraction by time (also called parity extraction)

2) Frequency, which we don’t apply here.

1. How does FFT reduce the amount of computation?

In simple terms, mathematicians use the above mentioned properties of the rotation factor W such as periodicity, symmetry, etc. to simplify the formula. In algorithmic programming, new points are constantly counted using points that have already been counted, i.e., old points count new points.

Theoretically, FFT computes the DFT in O(n log n) time using divide and conquer.
\square Suppose n is a power of 2.
Let $$A^{[0]}=a_{0}+a_{2} x^{1}+a_{4} x^{2}+\cdots+a_{n-2} x^{\frac{n}{2}-1}$$
$$A^{[1]}=a_{1}+a_{3} x^{1}+a_{5} x^{2}+\cdots+a_{n-1} x^{\frac{n}{2}-1}$$
Then $$A(x)=A^{[0]}\left(x^{2}\right)+x A^{[1]}\left(x^{2}\right)$$.
So can compute $$A\left(\omega_{n}^{0}\right), A\left(\omega_{n}^{1}\right), \ldots, A\left(\omega_{n}^{n-1}\right)$$ by computing $$A^{[0]}$$ and $$A^{[1]}$$
at $$\left(\omega_{n}^{0}\right)^{2},\left(\omega_{n}^{1}\right)^{2}, \ldots,\left(\omega_{n}^{n-1}\right)^{2}$$, and multiplying some terms by
$$\omega_{n}^{0}, \omega_{n}^{1}, \ldots, \omega_{n}^{n-1}$$.
But $$\left(\omega_{n}^{k+n / 2}\right)^{2}=\omega_{n}^{2 k+n}=\omega_{n}^{2 k}=\left(\omega_{n}^{k}\right)^{2}$$ by Fact 1.
A So $$\left\{\left(\omega_{n}^{0}\right)^{2},\left(\omega_{n}^{1}\right)^{2}, \ldots,\left(\omega_{n}^{n-1}\right)^{2}\right\}=\left\{\left(\omega_{n}^{0}\right)^{2},\left(\omega_{n}^{1}\right)^{2}, \ldots,\left(\omega_{n}^{n / 2-1}\right)^{2}\right\},$$ i.e. only need
to evaluate $$A^{[0]}$$ and $$A^{[1]}$$ at n/2 points instead of n.
Also, $$\left(\omega_{n}^{k}\right)^{2}=\omega_{n}^{2 k}=\omega_{n / 2}^{k}$$

Note: Simply splitting a multipoint sequence into a less point sequence without simplification is not a reduction in arithmetic volume!!! Splitting is only in the service of simplification, using the spin factor is the key to arithmetic reduction!!!

Time Complexity:
Thus, computing $$A(x)=A^{[0]}\left(x^{2}\right)+x A^{[1]}\left(x^{2}\right)$$ for
$$x \in\left\{\omega_{n}^{0}, \omega_{n}^{1}, \ldots, \omega_{n}^{n-1}\right\}$$ requires
$$\square$$ Computing for $$A^{[0]}(x)$$ and $$A^{[1]}(x)$$ for $$x \in$$
$$\left\{\omega_{n / 2}^{0}, \omega_{n / 2}^{1}, \ldots, \omega_{n / 2}^{n / 2-1}\right\}$$

• These are also DFT’s, so can be done recursively using two
n/2-point FFT’s.
\square For $$0 \leq k \leq \frac{n}{2}-1$$
$\begin{array}{l}\qquad A\left(\omega_{n}^{k}\right)=A^{[0]}\left(\omega_{n / 2}^{k}\right)+\omega_{n}^{k} A^{[1]}\left(\omega_{n / 2}^{k}\right) \ \begin{array}{l}\qquad A\left(\omega_{n}^{k+n / 2}\right)=A^{[0]}\left(\omega_{n / 2}^{k+n / 2}\right)+\omega_{n}^{k+n / 2} A^{[1]}\left(\omega_{n / 2}^{k+n / 2}\right) \ =A^{[0]}\left(\omega_{n / 2}^{k}\right)-\omega_{n}^{k} A^{[1]}\left(\omega_{n / 2}^{k}\right)\end{array}\end{array}$

1. Butterfly operation?

For a butterfly operation, you can understand it as an operation that is customizable by a graph.

The left side is the input and the right side is the output, for the letters on the horizontal line there are two cases.

1) A number on the left end line (none means C and D are 1).

2)The right end lines have the numerical representation, but if none, C & D are 0s.

The FFT takes out odd and even in accordance to time to change the original sequence. which have to sequentialize it to make the algorithm meet the required sequence. The new sequence is the reverse binary sequence of the original ones.

For example $$(a_0 a_4 a_2 a_6 a_1 a_5 a_3 a_7)$$ have the binary sequence $$(000，100，010，110，001，101，011，111)$$.
The reverse order can be simply treated as the change of 2 near binary number, which in this case is:$$（a_0 a_1 a_2 a_3 a_4 a_5 a_6 a_7）$$;

Which is $$(000，001，010，011，100，101，110，111)—>(000，100，010，110，001，101，011，111)$$.

The code for this transformation:

for(I=0;I<N;I++) // According to law 4, you need to perform inter-code inverse order for all elements of the array
{
/* Get the value of the inverse J of subscript I*/
J=0;
for(k=0;k<(M/2+0.5);k++) //k indicates operation
{
/* Reverse sequence operation*/
m=1;//m is a binary number with a minimum of 1
n=(unsigned int)pow(2,M-1);//n is a binary number with the Mth degree of 1.
m <<= k; // for m move left k
n>>= k; //shift k bits to the right of n
F0=I & n;//I & n by the kth position of the first half of the extracted
F1=I & m;//I with m-pressure bit corresponding to the lower part of the second half of the extracted F0
if(F0) J=J | m; //J and m are in position or so that F0 corresponds to a low position of 1
if(F1) J=J | n; //J and n are in the same position or so that F1 corresponds to a high level of 1
}
if(I<J)
{
Temp=A[I];
A[I] =A[J];
A[J]=Temp;
}
}


The butter fly operation:
Now let’s imagine that if we want to program the FFT algorithm, the most basic implementation of the FFT algorithm is the butterfly operation, for any butterfly such as.

When N=8.

As can be seen from the above figure, to perform the butterfly operation, we have to solve the following problems.

• Interval B between two input data

• Determination of the rotation factor W, including.

• Determination of the L-level rotation index P.
• Determination of the type of Level L W.
• The interval between the same W in level L.

$$\left\{\begin{array}{l}T_{R}=X_{R}(\mathrm{j}+B) \cos \frac{2 \pi}{N} p+X_{I}(j+B) \sin \frac{2 \pi}{N} p \cdots(1) \ T_{\mathrm{I}}=X_{I}(j+B) \cos \frac{2 \pi}{N} p-X_{R}(j+B) \sin \frac{2 \pi}{N} p \cdots(2) \ \mathrm{A}_{\mathrm{R}}(\mathrm{j})=X_{R}(\mathrm{j})+T_{R} \ \mathrm{A}_{\mathrm{I}}(\mathrm{j})=X_{I}(\mathrm{j})+T_{\mathrm{I}} \ \mathrm{A}_{\mathrm{R}}(\mathrm{j}+\mathrm{B})=X_{R}(\mathrm{j})-T_{R}(5) \ \mathrm{A}_{\mathrm{I}}(\mathrm{j}+\mathrm{B})=X_{I}(\mathrm{j})-T_{\mathrm{I}}(6)\end{array}\right.$$

for(L=1; L<=M;L++) //FFT butterfly level L from 1--M
{
/* L-level operations*/
//First calculate the interval B = 2^(L-1);
B = 1;
B = (int)pow(2,L-1);
for(j=0; j<=B-1;j++)
{
/* Homogeneous butterfly operation*/
// First increment k=2^(M-L)
k = (int)pow(2,M-L);
// Calculate the rotation index p in increments of k, then p = j*k
P=1;
P=j*k;
for(i=0; i<=k-1;i++)
{
/* Perform butterfly operations*/
//Array calibrated to r
r=1;
r=j+2*B*i;
Tr=dataR[r+B]*cos(2.0*PI*p/N) + dataI[r+B]*sin(2.0*PI*p/N);
Ti=dataI[r+B]*cos(2.0*PI*p/N) - dataR[r+B]*sin(2.0*PI*p/N);
dataR[r+B]=dataR[r]-Tr;
dataI[r+B]=dataI[r]-Ti;
dataR[r]=dataR[r]-Tr; dataI[r+B]=dataI[r]-Ti; dataI[r]-Ti; dataR[r]=dataR[r]+Tr;
dataI[r]=dataI[r]]+Ti;
}
}
}


IFFT is the reverse of the above operation.

1. What if we take it on the mesh or hypercube to make it scalable on gpu oprations?

Hpercube:

• Consider the binary exchange algorithm on a hypercube architecture.
• Each processor connected to d others, which differ in each digit of ID.
• Communication only with neighbors, send n/p values each time.
• since d = log p rounds of communication, communication time $$T_{c}=$$ $$t_{s} \log p+t_{w} \frac{7}{p} \log p .$$
• Each stage does n/p computation.
• Total computation time $$T_{p}=\frac{t_{c} n}{p} \log n$$.
• Efficiency is $$E=\frac{T_{p}}{T_{p}+T_{c}}=1 /\left(1+\frac{t_{s} p \log p}{t_{c} n \log n}+\frac{t_{w} \log p}{t_{c} \log n}\right)$$
• Define $$K$$ such that $$E=1 /(1+1 / K) .$$
• For isoefficiency, want last two terms in denominator to be constant.
• $$\frac{t_{s} p \log p}{t_{c} n \log n}=\frac{1}{K}$$ implies $$n \log n=W=K \frac{t_{s}}{t_{c}} p \log p .$$
• $$\frac{t_{w} \log p}{t_{c} \log n}=\frac{1}{K}$$ implies $$\log n=K_{t_{c}}^{t_{w}} \log p,$$ so $$n=p^{K t_{w} / t_{c}},$$ so
• $$W=K \frac{t_{w}}{t_{c}} p^{K t_{w} / t_{c}} \log p$$

The efficiency for this case depends on $$t_c,t_s,t_w$$, the wait time is the tradeoffs between two different lines. which is:

From the Efficiency law
, we have once $$\frac{Kt_w}{t_c}>1$$, the increasing time is polynomial with regard to the processor count.

Mese:

2D transpose FFT: The best: