Parallel Programming 23: Parallel FFT

One of the most important algorithms of the new century.

Intro to the FFT

The basic equation of the (continuous) Fourier transform:

\(F(\omega)=\mathcal{F}[f(t)]=\int ^{+\infty}_{-\infty}f(t)e^{-j\omega t}\,dt\)

Roots of unity

An \(n\)-th root of unity is an \(\omega\) such that \(\omega^{n}=1\). There are \(n\) \(n\)-th roots of unity, and they have the form \(e^{\frac{2 \pi i k}{n}}\) for \(0 \leq k \leq n-1\). Write \(\omega_{n}=e^{\frac{2 \pi i}{n}}\), so that the \(n\)-th roots of unity are \(\omega_{n}^{0}, \omega_{n}^{1}, \ldots, \omega_{n}^{n-1}\).

Some questions to distinguish the FT, DFT, FFT, and IFFT

They are the Fourier Transform, the Discrete Fourier Transform, the Fast Fourier Transform, and the Inverse Fourier Transform.
The transform factor:
Fact 1: \(\omega_{n}^{n}=1\).
Fact 2: \(\omega_{n}^{k+\frac{n}{2}}=-\omega_{n}^{k}\).
Fact 3: \(\left(\omega_{n}^{k}\right)^{2}=\omega_{n}^{2 k}=\omega_{n / 2}^{k}\).

Why do we need the DFT?
Because all data stored in a computer is discrete, so we use the discrete form \(X(k)=\sum^{N-1}_{n=0}x(n)W^{kn}_N\), where \(W_N=e^{-j\frac{2\pi}{N}}\) and \(k=0,1,\ldots,N-1\).
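
To make the definition concrete, here is a direct \(O(N^2)\) implementation of the DFT sum above (a minimal sketch, not the FFT; the function and array names are illustrative, and the input is assumed real):

#include <math.h>
#define PI 3.14159265358979323846

/* X[k] = sum_{n=0}^{N-1} x[n] * W_N^{kn}, with W_N = e^{-j 2*pi/N} */
void dft(const double *x, double *Xre, double *Xim, int N)
{
    for (int k = 0; k < N; k++) {
        Xre[k] = 0.0;
        Xim[k] = 0.0;
        for (int n = 0; n < N; n++) {
            double angle = -2.0 * PI * (double)k * n / N;   /* argument of W_N^{kn} */
            Xre[k] += x[n] * cos(angle);
            Xim[k] += x[n] * sin(angle);
        }
    }
}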

The transform factor has the following properties (used to derive the FFT):
1) Periodicity
\(W_{N}^{m+l N}=W_{N}^{m}\), and hence \(W_{N}^{-m}=W_{N}^{N-m}\)
2) Symmetry
\(W_{N}^{m+\frac{N}{2}}=-W_{N}^{m}\)
3) Reducibility (scaling)
\(W_{N / m}^{k}=W_{N}^{m k}\)
4) Special rotation factors
\(W_{N}^{0}=1\)

  1. Why the Fast Fourier Transform (FFT) algorithm?

The FFT is an algorithm for computing the DFT that is designed to greatly reduce the DFT's computational cost.

Given a degree-\((n-1)\) polynomial
\(A(x)=a_{0}+a_{1} x+a_{2} x^{2}+\dots+a_{n-1} x^{n-1}\),
the DFT computes \(A\left(\omega_{n}^{0}\right), A\left(\omega_{n}^{1}\right), \ldots, A\left(\omega_{n}^{n-1}\right)\).
Since \(A(x)=a_{0}+x\left(a_{1}+x\left(a_{2}+\cdots\right)\right)\) (Horner's rule),
\(A(x)\) can be evaluated at a single point in \(\mathrm{O}(n)\) time and
\(\mathrm{O}(1)\) extra memory.
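
As a small sketch of that nested (Horner) evaluation, assuming the coefficients \(a_0,\dots,a_{n-1}\) are stored in an array a[] (names illustrative):

/* Evaluate A(x) = a0 + x*(a1 + x*(a2 + ...)) in O(n) time and O(1) extra memory. */
double eval_poly(const double *a, int n, double x)
{
    double result = 0.0;
    for (int i = n - 1; i >= 0; i--)   /* fold in coefficients from the highest degree down */
        result = result * x + a[i];
    return result;
}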

  • DFT can be computed trivially in \(\mathrm{O}\left(\mathrm{n}^{2}\right)\)
    time.

A direct computer implementation of the DFT formula has complexity \(O(N^{2})\), while the FFT reduces this to \(O(N \log_{2} N)\).

  2. What is the sequence splitting (decimation) in the FFT?

The sequence-splitting process of the FFT is called decimation, and it comes in two forms: decimation in time and decimation in frequency.

1) Decimation in time (also called even/odd or parity extraction)


2) Decimation in frequency, which we do not use here.

  3. How does the FFT reduce the amount of computation?

In simple terms, the properties of the rotation factor W listed above (periodicity, symmetry, and so on) are used to simplify the formula. In the implementation, values that have already been computed are reused to compute new ones, i.e., old points are used to compute new points.

Theoretically, the FFT computes the DFT in O(n log n) time using divide and conquer.
Suppose \(n\) is a power of 2.
Let \(A^{[0]}=a_{0}+a_{2} x^{1}+a_{4} x^{2}+\cdots+a_{n-2} x^{\frac{n}{2}-1}\)
\(A^{[1]}=a_{1}+a_{3} x^{1}+a_{5} x^{2}+\cdots+a_{n-1} x^{\frac{n}{2}-1}\)
Then \(A(x)=A^{[0]}\left(x^{2}\right)+x A^{[1]}\left(x^{2}\right)\).
So we can compute \(A\left(\omega_{n}^{0}\right), A\left(\omega_{n}^{1}\right), \ldots, A\left(\omega_{n}^{n-1}\right)\) by computing \(A^{[0]}\) and \(A^{[1]}\)
at \(\left(\omega_{n}^{0}\right)^{2},\left(\omega_{n}^{1}\right)^{2}, \ldots,\left(\omega_{n}^{n-1}\right)^{2}\), and multiplying some terms by
\(\omega_{n}^{0}, \omega_{n}^{1}, \ldots, \omega_{n}^{n-1}\).
But \(\left(\omega_{n}^{k+n / 2}\right)^{2}=\omega_{n}^{2 k+n}=\omega_{n}^{2 k}=\left(\omega_{n}^{k}\right)^{2}\) by Fact 1.
So \(\left\{\left(\omega_{n}^{0}\right)^{2},\left(\omega_{n}^{1}\right)^{2}, \ldots,\left(\omega_{n}^{n-1}\right)^{2}\right\}=\left\{\left(\omega_{n}^{0}\right)^{2},\left(\omega_{n}^{1}\right)^{2}, \ldots,\left(\omega_{n}^{n / 2-1}\right)^{2}\right\},\) i.e. we only need
to evaluate \(A^{[0]}\) and \(A^{[1]}\) at n/2 points instead of n.
Also, \(\left(\omega_{n}^{k}\right)^{2}=\omega_{n}^{2 k}=\omega_{n / 2}^{k}\)

Note: simply splitting a long sequence into shorter ones, without any simplification, does not reduce the amount of computation! The splitting only serves the simplification; exploiting the rotation-factor properties is the key to reducing the arithmetic.

Time Complexity:
Thus, computing \(A(x)=A^{[0]}\left(x^{2}\right)+x A^{[1]}\left(x^{2}\right)\) for
\(x \in\left\{\omega_{n}^{0}, \omega_{n}^{1}, \ldots, \omega_{n}^{n-1}\right\}\) requires
  • Computing \(A^{[0]}(x)\) and \(A^{[1]}(x)\) for \(x \in\)
    \(\left\{\omega_{n / 2}^{0}, \omega_{n / 2}^{1}, \ldots, \omega_{n / 2}^{n / 2-1}\right\}\)

  • These are also DFTs, so they can be computed recursively using two
    n/2-point FFTs.
    For \(0 \leq k \leq \frac{n}{2}-1\):
    \[
    \begin{aligned}
    A\left(\omega_{n}^{k}\right) &= A^{[0]}\left(\omega_{n / 2}^{k}\right)+\omega_{n}^{k} A^{[1]}\left(\omega_{n / 2}^{k}\right) \\
    A\left(\omega_{n}^{k+n / 2}\right) &= A^{[0]}\left(\omega_{n / 2}^{k+n / 2}\right)+\omega_{n}^{k+n / 2} A^{[1]}\left(\omega_{n / 2}^{k+n / 2}\right) = A^{[0]}\left(\omega_{n / 2}^{k}\right)-\omega_{n}^{k} A^{[1]}\left(\omega_{n / 2}^{k}\right)
    \end{aligned}
    \]
    Each level therefore does \(\mathrm{O}(n)\) extra work on top of two half-size DFTs, so \(T(n)=2T(n/2)+\mathrm{O}(n)=\mathrm{O}(n \log n)\) (see the recursive sketch below).
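
To make the recursion concrete, here is a minimal recursive sketch in C99 using <complex.h>. It follows the engineering sign convention \(W_N=e^{-j2\pi/N}\) used later in this post, allocates temporary buffers for clarity (a production FFT would work in place), and assumes \(n\) is a power of 2:

#include <complex.h>
#include <math.h>
#include <stdlib.h>
#define PI 3.14159265358979323846

void fft_recursive(double complex *a, int n)
{
    if (n == 1) return;

    double complex *even = malloc(n / 2 * sizeof *even);
    double complex *odd  = malloc(n / 2 * sizeof *odd);
    for (int i = 0; i < n / 2; i++) {   /* split into A^[0] (even) and A^[1] (odd) coefficients */
        even[i] = a[2 * i];
        odd[i]  = a[2 * i + 1];
    }

    fft_recursive(even, n / 2);         /* two half-size DFTs */
    fft_recursive(odd,  n / 2);

    for (int k = 0; k < n / 2; k++) {   /* combine with the butterfly relations above */
        double complex w = cexp(-2.0 * I * PI * k / n);   /* twiddle factor w_n^k */
        a[k]         = even[k] + w * odd[k];
        a[k + n / 2] = even[k] - w * odd[k];
    }
    free(even);
    free(odd);
}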

  4. The butterfly operation

A butterfly operation can be understood as an operation defined by a small dataflow graph.

The left side is the input and the right side is the output; for the coefficients written on the horizontal lines there are two cases.

1) A number on the left input line (if none is written, the coefficients C and D are 1).

2) A number on the right output line (if none is written, C and D are 0).

The decimation-in-time FFT repeatedly separates the samples into even and odd parts, which changes the order of the original sequence. The input therefore has to be rearranged so that the algorithm receives it in the required order; the new order is obtained by bit-reversing the binary indices of the original one.

For example, the bit-reversed order \((a_0 a_4 a_2 a_6 a_1 a_5 a_3 a_7)\) corresponds to the binary indices \((000,100,010,110,001,101,011,111)\).
The reordering simply reverses the binary digits of each index of the natural order \((a_0 a_1 a_2 a_3 a_4 a_5 a_6 a_7)\):

\((000,001,010,011,100,101,110,111) \rightarrow (000,100,010,110,001,101,011,111)\).

The code for this transformation:

for (I = 0; I < N; I++)   // rearrange every element of the array into bit-reversed index order
{
    /* Compute J, the bit-reversed value of index I */
    J = 0;
    for (k = 0; k < (M + 1) / 2; k++)   // swap bit k with bit M-1-k of the M-bit index
    {
        m = 1u << k;                    // mask for the low bit k
        n = 1u << (M - 1 - k);          // mask for the mirrored high bit M-1-k
        F0 = I & n;                     // extract the high bit of I
        F1 = I & m;                     // extract the low bit of I
        if (F0) J |= m;                 // a set high bit moves to the low position of J
        if (F1) J |= n;                 // a set low bit moves to the high position of J
    }
    if (I < J)                          // swap each pair only once
    {
        Temp = A[I];
        A[I] = A[J];
        A[J] = Temp;
    }
}

The butterfly operation:
If we want to program the FFT, its most basic building block is the butterfly operation; consider any single butterfly.

When N = 8, as can be seen from the butterfly diagram, to perform the butterfly operations we have to solve the following problems:

  • The interval B between the two input data of a butterfly.

  • Determination of the rotation factor W, including:

    • determination of the rotation index P at level L;
    • determination of which W factors appear at level L;
    • the interval between butterflies that use the same W at level L.

\[
\left\{\begin{aligned}
T_{R} &= X_{R}(j+B)\cos\frac{2\pi}{N}p + X_{I}(j+B)\sin\frac{2\pi}{N}p && (1)\\
T_{I} &= X_{I}(j+B)\cos\frac{2\pi}{N}p - X_{R}(j+B)\sin\frac{2\pi}{N}p && (2)\\
A_{R}(j) &= X_{R}(j)+T_{R} && (3)\\
A_{I}(j) &= X_{I}(j)+T_{I} && (4)\\
A_{R}(j+B) &= X_{R}(j)-T_{R} && (5)\\
A_{I}(j+B) &= X_{I}(j)-T_{I} && (6)
\end{aligned}\right.
\]

for (L = 1; L <= M; L++)   // FFT butterfly levels L = 1 .. M
{
    /* Level-L operations */
    B = (int)pow(2, L - 1);          // distance between the two inputs of a butterfly: B = 2^(L-1)
    for (j = 0; j <= B - 1; j++)
    {
        /* All butterflies that share the same twiddle factor */
        k = (int)pow(2, M - L);      // number of butterfly groups at this level: k = 2^(M-L)
        p = j * k;                   // rotation (twiddle) index p = j * 2^(M-L)
        for (i = 0; i <= k - 1; i++)
        {
            /* One butterfly */
            r = j + 2 * B * i;       // index of the upper input of this butterfly
            Tr = dataR[r + B] * cos(2.0 * PI * p / N) + dataI[r + B] * sin(2.0 * PI * p / N);
            Ti = dataI[r + B] * cos(2.0 * PI * p / N) - dataR[r + B] * sin(2.0 * PI * p / N);
            dataR[r + B] = dataR[r] - Tr;   // equations (5) and (6)
            dataI[r + B] = dataI[r] - Ti;
            dataR[r]     = dataR[r] + Tr;   // equations (3) and (4)
            dataI[r]     = dataI[r] + Ti;
        }
    }
}

The IFFT is the inverse of the above operation: it has the same butterfly structure, but with conjugated twiddle factors and a final scaling by 1/N.
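
A common way to obtain the inverse transform without writing a second butterfly pass is the conjugation trick. The sketch below reuses the illustrative fft_recursive routine from the recursive example earlier (so it is an assumption-laden example, not the iterative implementation above):

#include <complex.h>

void fft_recursive(double complex *a, int n);   /* the recursive sketch shown earlier */

void ifft(double complex *a, int n)
{
    for (int i = 0; i < n; i++) a[i] = conj(a[i]);        /* conjugate the input */
    fft_recursive(a, n);                                  /* run the forward FFT */
    for (int i = 0; i < n; i++) a[i] = conj(a[i]) / n;    /* conjugate back and scale by 1/N */
}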

  5. What if we map it onto a mesh or hypercube to make it scale on GPU operations?

Hypercube:

  • Consider the binary exchange algorithm on a hypercube architecture.
  • Each processor is connected to d others, whose IDs differ from its own in exactly one bit.
    • Communication only with neighbors, send n/p values each time.
    • since d = log p rounds of communication, communication time \(T_{c}=\) \(t_{s} \log p+t_{w} \frac{n}{p} \log p .\)
  • Each stage does n/p computation.
    • Total computation time \(T_{p}=\frac{t_{c} n}{p} \log n\).
  • Efficiency is \(E=\frac{T_{p}}{T_{p}+T_{c}}=1 /\left(1+\frac{t_{s} p \log p}{t_{c} n \log n}+\frac{t_{w} \log p}{t_{c} \log n}\right)\)
  • Define \(K\) such that \(E=1 /(1+1 / K) .\)
  • For isoefficiency, want last two terms in denominator to be constant.
  • \(\frac{t_{s} p \log p}{t_{c} n \log n}=\frac{1}{K}\) implies \(n \log n=W=K \frac{t_{s}}{t_{c}} p \log p .\)
  • \(\frac{t_{w} \log p}{t_{c} \log n}=\frac{1}{K}\) implies \(\log n=K \frac{t_{w}}{t_{c}} \log p,\) so \(n=p^{K t_{w} / t_{c}},\) so
  • \(W=K \frac{t_{w}}{t_{c}} p^{K t_{w} / t_{c}} \log p\)

The efficiency in this case depends on \(t_c\), \(t_s\), and \(t_w\); the communication terms trade off against the computation term.

From the efficiency expression we see that once \(\frac{Kt_w}{t_c}>1\), the problem size required to keep efficiency constant grows polynomially (with exponent greater than 1) in the processor count.
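
As an illustrative numeric case (my own example, not from the slides): take \(t_w/t_c = 2\) and \(K = 1\) (i.e. a target efficiency of \(E = 1/2\)). The bandwidth term then gives

\[
W = K \frac{t_{w}}{t_{c}}\, p^{K t_{w}/t_{c}} \log p = 2\, p^{2} \log p,
\]

so the problem size must grow roughly quadratically in the number of processors to hold the efficiency constant, which is much worse than the \(K \frac{t_{s}}{t_{c}} p \log p\) isoefficiency coming from the startup term.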

Mesh:

The 2D transpose FFT is the best choice here.

Reference

  1. Prof Fan’s PPT
  2. Implement FFT in C by GoKing
  3. Implement FFT on QPU rpi3b+ by Andrew Holme

[Parallel Computing] Analysis of CUDA default and non-default streams (asynchronous streams and visualization)


Concurrent CUDA streams
The default stream has higher priority. In QuEST the default stream runs the Statevec_compactUnitaryKernel function, and most of the GPU time and computational work is concentrated in this kernel.

Operations within a given stream execute in order.

  1. For operations in different non-default streams, there is no guarantee that they will execute in any particular order relative to one another.
  2. The default stream is blocking: it can only run after all other streams have completed, and it in turn blocks other streams from running until it has itself finished.

To use a non-default stream, declare a cudaStream_t stream and pass it as a parameter, so that kernel launches and copies are issued on it rather than on the default stream.
cudaMallocManaged / cudaMemPrefetchAsync

More fine-grained memory management:

  1. cudaMalloc: allocate memory on the GPU
  2. cudaMallocHost: allocate (pinned) memory on the CPU host

The third trick: cudaMemcpyAsync
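
A minimal host-side sketch of how these pieces fit together (plain C against the CUDA runtime API, no kernel shown; sizes and names are illustrative):

#include <cuda_runtime.h>
#include <stdio.h>

int main(void)
{
    const size_t n = 1 << 20;
    float *h_buf, *d_buf;
    cudaStream_t stream;

    cudaStreamCreate(&stream);                            /* a non-default stream */
    cudaMallocHost((void **)&h_buf, n * sizeof(float));   /* pinned host memory: required for truly async copies */
    cudaMalloc((void **)&d_buf, n * sizeof(float));       /* device memory */

    for (size_t i = 0; i < n; i++) h_buf[i] = 1.0f;

    /* asynchronous copy issued on the non-default stream; the host thread returns immediately */
    cudaMemcpyAsync(d_buf, h_buf, n * sizeof(float), cudaMemcpyHostToDevice, stream);
    /* a kernel launched on the same stream here would run after the copy, in stream order */

    cudaStreamSynchronize(stream);                        /* wait for everything queued on the stream */
    printf("async copy done\n");

    cudaFree(d_buf);
    cudaFreeHost(h_buf);
    cudaStreamDestroy(stream);
    return 0;
}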

sgemm implementation in CAS-PRA

Reference:

Device:

https://gist.github.com/victoryang00/cd6324acffc5ac79464d8409f768656a

Other good references:

https://cas-pra.sugon.com/detail.html?tournament_id=6

How to use NCCL (a better option) instead of MPI (RDMA)

When working on scientific-computing projects we always run into this problem: how to allocate memory, especially when the computation uses GPUs. There are few ways to reduce the memory-transfer overhead. As the author of this ASC QuEST problem put it:

Firstly, QuEST uses its hardware to accelerate the simulation of a single quantum register at a time. While I think there are good uses of multi-GPU to speedup simultaneous simulation of multiple registers, this would be a totally new pattern to QuEST's simulation style. So let's consider using multi-GPU to accelerate a single register.

There are a few ways you can have "multiple GPUs":

  • multiple NVlinked GPUs
    This is when you have multiple GPUs tightly connected with a high-bandwidth fabric (e.g. this). The bandwidth is enough that you sort of can imagine it as a single big GPU, and hence it would be worthwhile for accelerating single-register simulation. However, this only exists right now as NVLink and NVSwitch, compatible only with IBM's POWER architecture - you could argue this is still esoteric, and not worth a big refactor. Note it wouldn't actually be very hard to refactor QuEST for this platform - indeed QuEST works out-of-the-box with POWER8. But it's not something on our TODO list currently.
  • multiple local GPUs
    This is when you have multiple GPUs on the same machine, but maybe on different sockets and hence with a much lower bandwidth between them. The most common case is two GPUs - is it worthwhile using two GPUs over one to speedup single register simulation? Often, no!
    In big QC simulation, having to move memory around is often the big killer, and should be avoided where possible. Unfortunately, simulating unitaries on registers often requires moving memory. If all the memory stays in the GPU (very high "internal bandwidth"), this is ok, but copying memory to the other GPU (across the socket) will introduce a huge per-gate overhead!
    Hence, using two GPUs to simulate the same register size can be slower than using just one, especially as the simulation size grows and saturates the sockets!
    There's hardly a benefit from the extra VRAM too, because doubling the memory enables simulation of one additional qubit. This is not worth the slowdown, or the hardware!
    Even with more than two GPUs, the connections are likely hierarchical and so even more prone to saturation.
  • distributed GPUs
    This is when you have a GPU(s) on each distributed node of a cluster. In this circumstance, simulating a unitary gate which requires data exchange not only costs us a VRAM to RAM overhead (similar to before), but a networking overhead to talk to the other nodes! This can be somewhat improved by having a direct GPU to network-card connection (and MPI abstraction), but I believe that's pretty cutting-edge.
    Let's say you have n nodes, each with a GPU and a multicore CPU, and you're resolved to a distributed simulation. When is it worthwhile to pay the extra memory overhead locally copying from RAM to VRAM (and use the GPU), over using just the CPUs? This is now the same trade-off to consider in the previous cases. So may or may not be worthwhile.

The main reason the author did not implement distributed GPU support is that he felt the time overhead of copying data across the limited VRAM bandwidth is too large, so it is better to just finish the job on a single GPU. But as GPU memory and compute keep growing, the demand for running on multiple cards grows by the day.

I went through a lot of material on GPU communication, whether single-node multi-GPU or multi-node multi-GPU, and MPI is the one thing you cannot get around. But then I found a better RDMA-based data-sharing protocol built on InfiniBand NICs; it caught my eye and I decided to use it. If you are writing PyTorch rather than CUDA, please go to the PyTorch distributed documentation instead; for a Python version of NCCL, there are bindings you can choose as well.

Reference documents:

Main steps: first install nv_peer_memory following the NCCL documentation, then install NCCL, and finally install the plugin.

After installation there are two tests, gdrcopy and nccl-tests. The command to run is: mpirun -N 1 --allow-run-as-root --hostfile host -x NCCL_IB_DISABLE=0 -x NCCL_IB_CUDA_SUPPORT=1 -x NCCL_IB_HCA=mlx4_0 -x NCCL_SOCKET_IFNAME=ib0 -x LD_LIBRARY_PATH=/usr/local/nccl/lib:/usr/local/cuda-10.0/lib64:$LD_LIBRARY_PATH -x NCCL_DEBUG=INFO ./build/all_reduce_perf -b 16M -e 1024M -g 4
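
For reference, a minimal hedged sketch of what an NCCL all-reduce looks like from C (one GPU per MPI rank, MPI used only to broadcast the NCCL unique id; error checking omitted and buffer names illustrative):

#include <mpi.h>
#include <nccl.h>
#include <cuda_runtime.h>

int main(int argc, char *argv[])
{
    int rank, nranks;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nranks);

    ncclUniqueId id;                         /* rank 0 creates the id, MPI broadcasts it */
    if (rank == 0) ncclGetUniqueId(&id);
    MPI_Bcast(&id, sizeof(id), MPI_BYTE, 0, MPI_COMM_WORLD);

    cudaSetDevice(rank % 4);                 /* assumption: up to 4 GPUs per node */

    ncclComm_t comm;
    ncclCommInitRank(&comm, nranks, id, rank);

    const size_t count = 1 << 24;
    float *sendbuf, *recvbuf;
    cudaMalloc((void **)&sendbuf, count * sizeof(float));
    cudaMalloc((void **)&recvbuf, count * sizeof(float));

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    /* the all-reduce itself: device buffers in and out, routed over IB/RDMA when available */
    ncclAllReduce(sendbuf, recvbuf, count, ncclFloat, ncclSum, comm, stream);
    cudaStreamSynchronize(stream);

    ncclCommDestroy(comm);
    cudaFree(sendbuf);
    cudaFree(recvbuf);
    MPI_Finalize();
    return 0;
}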

[CUDA Programming] Deep optimization

Tricks for reducing CPU-GPU communication

The double-buffering concept was validated during the ASC20 HPL optimization: 4 Tesla V100s with 32 GB each give 128 GB of device memory, and host memory usage was also 128 GB. In practice the host-to-device bandwidth is too small for large-scale memory migration, so double buffering is essential.
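
A minimal host-side sketch of that double-buffering pattern (plain C with the CUDA runtime API: two pinned staging buffers, two streams, and the compute kernel indicated only by a comment; all names and sizes are illustrative, not the actual HPL code):

#include <cuda_runtime.h>

#define CHUNK   (1 << 20)
#define NCHUNKS 8

void process_in_chunks(const float *src)   /* src holds NCHUNKS * CHUNK floats */
{
    float *h_buf[2], *d_buf[2];
    cudaStream_t stream[2];

    for (int b = 0; b < 2; b++) {
        cudaMallocHost((void **)&h_buf[b], CHUNK * sizeof(float));   /* pinned staging buffer */
        cudaMalloc((void **)&d_buf[b], CHUNK * sizeof(float));
        cudaStreamCreate(&stream[b]);
    }

    for (int c = 0; c < NCHUNKS; c++) {
        int b = c & 1;                            /* alternate between the two buffers */
        cudaStreamSynchronize(stream[b]);         /* ensure buffer b's previous transfer finished */
        for (size_t i = 0; i < CHUNK; i++)        /* stage chunk c into pinned memory */
            h_buf[b][i] = src[(size_t)c * CHUNK + i];

        cudaMemcpyAsync(d_buf[b], h_buf[b], CHUNK * sizeof(float),
                        cudaMemcpyHostToDevice, stream[b]);
        /* launch the compute kernel for chunk c on stream[b] here; while it runs,
           the next iteration stages and copies the other buffer */
    }

    for (int b = 0; b < 2; b++) {
        cudaStreamSynchronize(stream[b]);
        cudaStreamDestroy(stream[b]);
        cudaFreeHost(h_buf[b]);
        cudaFree(d_buf[b]);
    }
}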

Coalescing

GPU memory latency is hundreds of clock cycles.

The SM resource-scheduling model is used to resolve memory-access conflicts; there is also texture addressing.

Basic operations:

Several other optimization directions:

Converting single precision to double precision

Data Prefetching

Introduce prefetch operations

Compiler basic-block (bb) optimization

[CUDA Programming] Optimizing CUDA

A block is not the smallest scheduling unit; the best optimizations are done at the warp level, since the threads of a warp advance in lockstep.

P.S. Try not to write deep recursion: implementing it on a GPU requires launching an exponential number of threads, every thread launched carries its own memory overhead, and device memory fills up very quickly.

Thread scheduling is used to achieve latency hiding; for GPU warps the context-switch overhead is almost zero. At any given moment only one warp is executing on a given scheduler.

If the threads within a warp take different branches, those branches execute one after another (warp divergence), so divergent control flow inside a warp should be avoided.