## One of the most important algorithms of the new century

### Intro to the FFT

#### The underlying equation: the Fourier transform

$$F(\omega)=\mathcal{F}[f(t)]=\int ^{+\infty}_{-\infty}f(t)e^{-j\omega t}\,dt$$

## Roots of unity

An $n$'th root of unity is an $\omega$ s.t. $\omega^{n}=1$. There are $n$ $n$'th roots of unity, and they have the form $e^{\frac{2 \pi i k}{n}}$, for $0 \leq k \leq n-1$. Write $\omega_{n}=e^{\frac{2 \pi i}{n}}$, so that the $n$'th roots of unity are $\omega_{n}^{0}, \omega_{n}^{1}, \ldots, \omega_{n}^{n-1}$.

##### Distinguishing the FT, DFT, FFT, and IFFT

These stand for the Fourier Transform, the Discrete Fourier Transform, the Fast Fourier Transform, and the Inverse Fourier Transform.
The transform factor satisfies:
$$\text{Fact 1: } \omega_{n}^{n}=1 \qquad \text{Fact 2: } \omega_{n}^{k+\frac{n}{2}}=-\omega_{n}^{k} \qquad \text{Fact 3: } \left(\omega_{n}^{k}\right)^{2}=\omega_{n}^{2k}=\omega_{n/2}^{k}$$

Why do we need the DFT? Because, in practice, all data stored in a computer is discrete. So we use the equation $$X(k)=\sum^{N-1}_{n=0}x(n)W^{kn}_N \quad (0 \leq k \leq N-1)$$ The transform factor $W_N$ satisfies the following properties:
1) Periodicity
$$W_{N}^{m+lN}=W_{N}^{m}, \quad \text{and} \quad W_{N}^{-m}=W_{N}^{N-m}$$
2) Symmetry
$$W_{N}^{m+\frac{N}{2}}=-W_{N}^{m}$$
3) Contractability
$$W_{N / m}^{k}=W_{N}^{m k}$$
4) Special rotation factors
$$W_{N}^{0}=1$$

1. Why the Fast Fourier Transform (aka FFT)?

The FFT is an algorithm built on the DFT, designed to speed up the DFT's computation.

Given a degree-$(n-1)$ polynomial
$$A(x)=a_{0}+a_{1} x+a_{2} x^{2}+\dots+a_{n-1} x^{n-1}$$
the DFT computes $$A\left(\omega_{n}^{0}\right), A\left(\omega_{n}^{1}\right), \ldots, A\left(\omega_{n}^{n-1}\right)$$.
Since $$A(x)=a_{0}+x\left(a_{1}+x\left(a_{2}+\cdots\right) \ldots\right)$$,
$$A(x)$$ can be evaluated at a single point in $$\mathrm{O}(\mathrm{n})$$ time and
$$\mathrm{O}(1)$$ memory.

• The DFT can be computed trivially in $$\mathrm{O}\left(\mathrm{n}^{2}\right)$$ time.

A direct computer implementation of the DFT formula therefore costs $O(N^2)$, while the FFT reduces this to $O(N \log_2 N)$.

1. What is sequence splitting (decimation) in the FFT?

The sequence-splitting process of the FFT is called decimation, and it comes in two forms: decimation in time and decimation in frequency.

1) Decimation in time (also called parity, or even-odd, extraction), which is what we use here. 2) Decimation in frequency, which we do not apply here.

1. How does the FFT reduce the amount of computation?

In simple terms, mathematicians use the properties of the rotation factor $W$ listed above (periodicity, symmetry, etc.) to simplify the formula. In the program, points that have already been computed are constantly reused to compute new ones; in other words, old points compute new points.

Theoretically, FFT computes the DFT in O(n log n) time using divide and conquer.
Suppose $n$ is a power of 2.
Let $$A^{[0]}(x)=a_{0}+a_{2} x^{1}+a_{4} x^{2}+\cdots+a_{n-2} x^{\frac{n}{2}-1}$$
$$A^{[1]}(x)=a_{1}+a_{3} x^{1}+a_{5} x^{2}+\cdots+a_{n-1} x^{\frac{n}{2}-1}$$
Then $$A(x)=A^{[0]}\left(x^{2}\right)+x A^{[1]}\left(x^{2}\right)$$.
So we can compute $$A\left(\omega_{n}^{0}\right), A\left(\omega_{n}^{1}\right), \ldots, A\left(\omega_{n}^{n-1}\right)$$ by computing $$A^{[0]}$$ and $$A^{[1]}$$
at $$\left(\omega_{n}^{0}\right)^{2},\left(\omega_{n}^{1}\right)^{2}, \ldots,\left(\omega_{n}^{n-1}\right)^{2}$$, and multiplying some terms by
$$\omega_{n}^{0}, \omega_{n}^{1}, \ldots, \omega_{n}^{n-1}$$.
But $$\left(\omega_{n}^{k+n/2}\right)^{2}=\omega_{n}^{2k+n}=\omega_{n}^{2k}=\left(\omega_{n}^{k}\right)^{2}$$ by Fact 1.
So $$\left\{\left(\omega_{n}^{0}\right)^{2},\left(\omega_{n}^{1}\right)^{2}, \ldots,\left(\omega_{n}^{n-1}\right)^{2}\right\}=\left\{\left(\omega_{n}^{0}\right)^{2},\left(\omega_{n}^{1}\right)^{2}, \ldots,\left(\omega_{n}^{n/2-1}\right)^{2}\right\},$$ i.e. we only need
to evaluate $$A^{[0]}$$ and $$A^{[1]}$$ at $n/2$ points instead of $n$.
Also, $$\left(\omega_{n}^{k}\right)^{2}=\omega_{n}^{2k}=\omega_{n/2}^{k}$$ by Fact 3.

Note: simply splitting a long sequence into shorter ones, without any simplification, does not reduce the amount of computation! Splitting only serves the simplification; exploiting the rotation factor is the key to reducing the arithmetic.

Time complexity:
Thus, computing $$A(x)=A^{[0]}\left(x^{2}\right)+x A^{[1]}\left(x^{2}\right)$$ for
$$x \in\left\{\omega_{n}^{0}, \omega_{n}^{1}, \ldots, \omega_{n}^{n-1}\right\}$$ requires
computing $$A^{[0]}(x)$$ and $$A^{[1]}(x)$$ for $$x \in \left\{\omega_{n/2}^{0}, \omega_{n/2}^{1}, \ldots, \omega_{n/2}^{n/2-1}\right\}$$.

• These are also DFTs, so they can be done recursively using two
$n/2$-point FFTs.

For $$0 \leq k \leq \frac{n}{2}-1$$:
$$A\left(\omega_{n}^{k}\right)=A^{[0]}\left(\omega_{n/2}^{k}\right)+\omega_{n}^{k} A^{[1]}\left(\omega_{n/2}^{k}\right)$$
$$A\left(\omega_{n}^{k+n/2}\right)=A^{[0]}\left(\omega_{n/2}^{k+n/2}\right)+\omega_{n}^{k+n/2} A^{[1]}\left(\omega_{n/2}^{k+n/2}\right)=A^{[0]}\left(\omega_{n/2}^{k}\right)-\omega_{n}^{k} A^{[1]}\left(\omega_{n/2}^{k}\right)$$

1. What is the butterfly operation?

A butterfly operation is best understood through its flow graph (the figure is not reproduced here): the left side is the input, the right side is the output, and the letters on the horizontal lines cover two cases.

1) A number on a left-end line (if none, the multipliers C and D are 1). 2) A number on a right-end line (if none, C and D are 0). The decimation-in-time FFT pulls out the even- and odd-indexed samples, which scrambles the original sequence, so the input has to be re-sequenced for the algorithm to see the order it requires. The new order is the bit-reversed binary order of the original indices.

For example, $$(a_0 a_4 a_2 a_6 a_1 a_5 a_3 a_7)$$ has the binary index sequence $$(000, 100, 010, 110, 001, 101, 011, 111)$$.
The bit-reversed order is obtained by reversing the bit string of each index in the natural order $$(a_0 a_1 a_2 a_3 a_4 a_5 a_6 a_7)$$:

$$(000, 001, 010, 011, 100, 101, 110, 111) \rightarrow (000, 100, 010, 110, 001, 101, 011, 111)$$

The code for this transformation:

```c
for (I = 0; I < N; I++)   /* bit-reverse the index of every array element */
{
    /* Compute J, the bit reversal of index I */
    J = 0;
    for (k = 0; k < (M / 2 + 0.5); k++)   /* swap bit pairs (k, M-1-k) */
    {
        m = 1;                            /* mask for bit 0 */
        n = (unsigned int)pow(2, M - 1);  /* mask for bit M-1 */
        m <<= k;                          /* shift m left: mask for bit k */
        n >>= k;                          /* shift n right: mask for bit M-1-k */
        F0 = I & n;                       /* extract the high bit M-1-k of I */
        F1 = I & m;                       /* extract the low bit k of I */
        if (F0) J = J | m;                /* put the high bit into the low position */
        if (F1) J = J | n;                /* put the low bit into the high position */
    }
    if (I < J)                            /* swap each pair only once */
    {
        Temp = A[I];
        A[I] = A[J];
        A[J] = Temp;
    }
}
```


The butterfly operation:
Now imagine programming the FFT. The most basic building block of the implementation is the butterfly operation. Take N = 8 as the example in the figure; to perform the butterfly operations we have to solve the following problems:

• the interval B between the two input data points;
• the determination of the rotation factor W, including:
• the determination of the rotation index p at level L;
• the determination of how many distinct W's level L uses;
• the interval between occurrences of the same W within level L.

$$\left\{\begin{array}{l}
T_{R}=X_{R}(j+B) \cos \frac{2\pi}{N} p+X_{I}(j+B) \sin \frac{2\pi}{N} p \quad (1)\\
T_{I}=X_{I}(j+B) \cos \frac{2\pi}{N} p-X_{R}(j+B) \sin \frac{2\pi}{N} p \quad (2)\\
A_{R}(j)=X_{R}(j)+T_{R} \quad (3)\\
A_{I}(j)=X_{I}(j)+T_{I} \quad (4)\\
A_{R}(j+B)=X_{R}(j)-T_{R} \quad (5)\\
A_{I}(j+B)=X_{I}(j)-T_{I} \quad (6)
\end{array}\right.$$

```c
for (L = 1; L <= M; L++)   /* FFT butterfly levels L = 1..M */
{
    /* Level-L operations */
    /* Interval between the two butterfly inputs: B = 2^(L-1) */
    B = (int)pow(2, L - 1);
    for (j = 0; j <= B - 1; j++)
    {
        /* Butterflies sharing the same rotation factor */
        /* Increment k = 2^(M-L); rotation index p = j*k */
        k = (int)pow(2, M - L);
        p = j * k;
        for (i = 0; i <= k - 1; i++)
        {
            /* Perform one butterfly on elements r and r+B */
            r = j + 2 * B * i;
            Tr = dataR[r + B] * cos(2.0 * PI * p / N) + dataI[r + B] * sin(2.0 * PI * p / N);
            Ti = dataI[r + B] * cos(2.0 * PI * p / N) - dataR[r + B] * sin(2.0 * PI * p / N);
            dataR[r + B] = dataR[r] - Tr;
            dataI[r + B] = dataI[r] - Ti;
            dataR[r] = dataR[r] + Tr;
            dataI[r] = dataI[r] + Ti;
        }
    }
}
```


IFFT is the reverse of the above operation.

1. What if we run it on a mesh or a hypercube to make it scale, as in GPU-style parallel operation?

Hypercube:

• Consider the binary exchange algorithm on a hypercube architecture.
• Each processor connected to d others, which differ in each digit of ID.
• Communication only with neighbors, send n/p values each time.
• Since there are d = log p rounds of communication, the communication time is $$T_{c}= t_{s} \log p+t_{w} \frac{n}{p} \log p .$$
• Each stage does n/p computation.
• Total computation time $$T_{p}=\frac{t_{c} n}{p} \log n$$.
• Efficiency is $$E=\frac{T_{p}}{T_{p}+T_{c}}=1 /\left(1+\frac{t_{s} p \log p}{t_{c} n \log n}+\frac{t_{w} \log p}{t_{c} \log n}\right)$$
• Define $$K$$ such that $$E=1 /(1+1 / K) .$$
• For isoefficiency, want last two terms in denominator to be constant.
• $$\frac{t_{s} p \log p}{t_{c} n \log n}=\frac{1}{K}$$ implies $$n \log n=W=K \frac{t_{s}}{t_{c}} p \log p .$$
• $$\frac{t_{w} \log p}{t_{c} \log n}=\frac{1}{K}$$ implies $$\log n=K \frac{t_{w}}{t_{c}} \log p,$$ so $$n=p^{K t_{w} / t_{c}},$$ so
• $$W=K \frac{t_{w}}{t_{c}} p^{K t_{w} / t_{c}} \log p$$

The efficiency in this case depends on $$t_c, t_s, t_w$$; the waiting time is the trade-off between the two denominator terms. From the efficiency expression above, once $$\frac{Kt_w}{t_c}>1$$, the required problem size $W$ grows polynomially, with exponent greater than 1, in the processor count.

Mesh: the 2D transpose FFT performs best here.

## [Parallel Computing] CUDA default stream & non-default streams (asynchronous streams and profiling)

1. For operations in different non-default streams, there is no guarantee that they execute in any particular order relative to one another.
2. The default stream is blocked until all other streams have completed, and it in turn blocks all other streams from running until it has finished.

`cudaMallocManaged` / `cudaMemPrefetchAsync`

1. cudaMalloc allocates memory on the GPU.
2. cudaMallocHost allocates (pinned) memory on the CPU.

## [Parallel Computing] how to optimize sgemm strided

Basic theory for optimizing strided sgemm: https://ieeexplore.ieee.org/document/7839684

## How to use NCCL as a (better) replacement for MPI (RDMA)

Firstly, QuEST uses its hardware to accelerate the simulation of a single quantum register at a time. While I think there are good uses of multi-GPU to speedup simultaneous simulation of multiple registers, this would be a totally new pattern to QuEST's simulation style. So let's consider using multi-GPU to accelerate a single register.

There are a few ways you can have "multiple GPUs":

• tightly interconnected GPUs
This is when you have multiple GPUs tightly connected with a high-bandwidth fabric. The bandwidth is enough that you can sort of imagine it as a single big GPU, and hence it would be worthwhile for accelerating single-register simulation. However, this only exists right now as NVLink and NVSwitch, compatible only with IBM's POWER architecture - you could argue this is still esoteric, and not worth a big refactor. Note it wouldn't actually be very hard to refactor QuEST for this platform - indeed QuEST works out-of-the-box with POWER8. But it's not something on our TODO list currently.
• multiple local GPUs
This is when you have multiple GPUs on the same machine, but maybe on different sockets and hence with a much lower bandwidth between them. The most common case is two GPUs - is it worthwhile using two GPUs over one to speedup single register simulation? Often, no!
In big QC simulation, having to move memory around is often the big killer, and should be avoided where possible. Unfortunately, simulating unitaries on registers often requires moving memory. If all the memory stays in the GPU (very high "internal bandwidth"), this is ok, but copying memory to the other GPU (across the socket) will introduce a huge per-gate overhead!
Hence, using two GPUs to simulate the same register size can be slower than using just one, especially as the simulation size grows and saturates the sockets!
There's hardly a benefit from the extra VRAM too, because doubling the memory enables simulation of one additional qubit. This is not worth the slowdown, or the hardware!
Even with more than two GPUs, the connections are likely hierarchical and so even more prone to saturation.
• distributed GPUs
This is when you have a GPU(s) on each distributed node of a cluster. In this circumstance, simulating a unitary gate which requires data exchange not only costs us a VRAM to RAM overhead (similar to before), but a networking overhead to talk to the other nodes! This can be somewhat improved by having a direct GPU to network-card connection (and MPI abstraction), but I believe that's pretty cutting-edge.
Let's say you have n nodes, each with a GPU and a multicore CPU, and you're resolved to a distributed simulation. When is it worthwhile to pay the extra memory overhead locally copying from RAM to VRAM (and use the GPU), over using just the CPUs? This is now the same trade-off to consider in the previous cases. So may or may not be worthwhile.

## [CUDA programming] In-depth optimization

GPU latency is on the order of hundreds of clock cycles.

The SM resource-scheduling model is used to resolve memory-access conflicts. There is also texture addressing.

Converting single precision to double precision.

Data prefetching.

## [CUDA programming] Optimizing CUDA

P.S. Avoid writing deep recursion: implementing it on a GPU requires spawning an exponential number of threads, every spawned thread implies a certain memory overhead, and device memory is easily exhausted.