[Parallel Computig] Cuda intrinsics plus
Parallel Computing Final Project FineTunedHashML Spring 2020
The final repo is stored at gitlab, added some hash function to the original algorithm.
Dear Professor,
I’m Yiwei Victor Yang 2018533218 from your parallel computing class.
My teammate and I are struggling in selecting the project. I proposed to set up a projects transplant of a CPU LSHDeep learning (https://github.com/keroro824/HashingDeepLearning/tree/master/SLIDE), which is a technique newly proposed by Rice University and Intel co, ltd to utilize the avx512 instruction set and huge pages intrinsic in locality sensitive hashing which is the main overhead of the project. ( https://arxiv.org/pdf/1903.03129.pdf)
SLIDE is a project to reimplement BP on the CPU. After the LSH preprocessing, the paper proposed to utilize maximum inner product search (MIPS) which mainly do the Sparse matrix operations Sparse Backpropagation or Gradient Update.
After the update of weight stored in CSR format Spmv, the paper do the Batch computing by OpenMP.
As research in the paper, we can see that the main overhead is preprocessing on CPU avx512 with huge pages cacheheap optimization on, which gained around 30% faster than the raw GPU Tesla V100. If we transplant that part(dynamic hashing) in GPU which is said not applicable by the author from intel, I think it may have a 50 percent chance to be faster than the current CPU version.
As the paper’s reference paper puts it: it gains something like 2 times faster after porting to Cuda, which triggers conflicts. https://github.com/srcd/minhashcuda
Our proposal is to make a comparison amid the raw Tensorflow on GPU, SLIDE on avx512, and SLIDE on GPU using the dataset mentioned in the paper.
My question is whether the transplant of the CPU dynamic cuckoo in the SLIDE do you think can have real speed up than the original version and if we attempted to transplant the program and get little speed up, will we eventually get the score?
Thanks a lot!
[Parallel Computing] Loop dependence analysis
Intro
shared memory algorithm design
For nonshared memory algorithms, we have to utilize bararrier
to gain data transformed.
for shared memory database algorithm, we have to decide the group part of the task that shared memory a lot . Then just apply the inserting directives of omp
or mpi
Design considerations
two main considerations lies in data dependence and load balance so we have to apply the following steps
 data dependence analysis
 static r dynamic and block r cyclic work assignment
 variable specification whether using shared private r reduction and rowwise r columnwise
 shared variables cause cache coherence traffic and much lower performance
 private and reduction variables don't need synchronization
 dimension mapping is more relying on the cache locality
Main Consideration for data dependence analysis
RAW
& WAR
& WAW
all the collations should be avoided though they might run into right situations
Goal is to we should run all the dependent situation on the same processor.
loop dependence analysis
 loopcarried dependence
 dependence exists across different iterations of loop
 loopindependent dependence
 dependence exists within the same iteration of loop
example
iterationspace traversal graph (ITG)
 iterationspace traversal graph is a line graph showing the order of traversal in the iteration space.
loopcarried dependence graph (LDG)
 Given the ITG, can determine the dependence between different loops.
 Loopcarried Dependence Graph (LDG) shows the loopcarried true/anti/output dependence relationships.
 Node in LDG is a point in the iteration space.
 Directed edge in LDG is the dependence.
 LDG helps identify parts of the loop that can be done in parallel.
examples
Distance and direction vector
Loop
Algorithm analysis
[Parallel Computing] SpMV
Intro
 Sparse matrix vector multiplication.
 Many scientific algorithms require multiplying a matrix by a vector.
 Optimization (e.g. conjugate gradient
low degree mesh
.), iterative methods (solving linear systems), eigenvalue methods (e.g. graph partitioning vx), simulations (e.g. finite elements), data analysis (e.g. Pagerankthe web connectivity matrix
).
 Optimization (e.g. conjugate gradient
 The matrices are often sparse.
 In an nxn matrix, there are \(O(n^2)\) nonzero elements.
DIA Format
ELL format
COO format
CSR format
Hybird format
what to use
ELL kernel
CSR scalar vector kernel
CSR vector kernel
COO vector kernel
[Parallel Computing] Concurrency bugs
 In parallel system, high performance achieved by using algorithm with high parallelism, good load balancing, memory locality, etc.
 But must also ensure multiple concurrent threads / processes operate correctly.
 Concurrency bugs can arise due to unexpected interleaving of concurrent threads.
 One of the most difficult issues to deal with in parallel / distributed computing.
 Bugs occur at random times depending on the interleaving.
 Bugs don’t occur during testing, but they will eventually occur in system deployed system.
 Humans have hard time anticipating and resolving concurrency bugs.Concurrency bugs can have very serious consequences.
Therac25 radiation therapy system had a concurrency bug that led to radiation overdose and death of several patients.
Space shuttle aborted 20 minutes before maiden launch due to concurrency bug in its avionics software.
eliminating concurrency bugs
 Multiple ways, each with pros and cons.
 Critical sections and locks
 Prevent processes from accessing a block of code at the same time.
 Easy to use, effective for some problems.
 But cause contention, overhead and serialization.
 Need to decide how much code to lock.
 Too little, and may still get concurrency bug.
 Too much, and we lose parallelism and performance.
 If processes acquire several locks, they need to coordinate to maintain correctness, avoid deadlock.
 Low priority that acquires a lock can delay high priority thread (priority inversion).
 Despite these problems, locks are still the most widely used solution.
 transactional memory
 A block of code is defined as a transaction, i.e. the block of code either executes atomically or doesn’t execute at all.
 Keep track of reads and writes done by a transaction. If two concurrent transactions read and write to same memory location, abort one of them, i.e. undo all the changes it made.
 Two concurrent transactions accessing different memory locations can both commit, i.e. all the changes it made are made permanent.
 Transactional memory can either be implemented in hardware (HTM) or software (STM).
 HTM has limits of size and type of transactions it can handle. Implemented in e.g. Intel Haswell, IBM Power8.
 STM is more flexible, but can be very slow.
 Write your own concurrent code, without hardware support.
 Challenging for most programmers. Not scalable in terms of productivity.
 Correct, efficient algorithms are often research level publications.
Mutual exclution
 Given
n
concurrent processes that want to perform a critical section (CS), mutual exclusion can satisfy the following properties. No two processes are in CS at same time.
 If several processes want to enter the CS, at least one succeeds in finite time (deadlock freedom).
 If several processes want to enter the CS, every process succeeds in finite time (wait freedom).
 All (useful) mutex algorithms satisfy first and second properties.
 Some algorithms satisfy the third property, but have lower performance.
Mutual exclusion algorithms
 Mutex is provided by locks. But how are locks implemented?
 Depends on the type of operations the underlying hardware supports.
 First type of algorithm uses only read / write operations.
 Second type uses hardware synchronization primitives such as testand set (TAS) or compareandswap (CAS), provided in most processors.
 TAS(x) tests if a Boolean variable x is true.
 If x == false, it sets x to true.
 Returns x’s value before the TS.
 All these steps done atomically, without interruption from other threads.
 getAndSet(x) is like TAS(x), but allows nonBoolean x.
 CAS(x,v,v’) tests if variable x currently equals v. If so, it sets x to v’. Otherwise, it doesn’t change x. It also returns x’s current value.
 Again, all this is atomic.
 Algorithms also depend on a processor’s memory model.
 Some processors reorder instructions to avoid stalls and obtain higher performance. This can break many lock algorithms.
 Most lock algorithms assume memory model is sequentially consistent, i.e. the execution order of instructions from different processes is an interleaving of the instructions of each process in program order.
first attempt
[Parallel Computing] Shared Memory Programming and OpenMP
shared memory model compared with mpi
 Threads (e.g. Pthreads, Java)
 individual sequences of instructions that can execute in parallel and access shared data
 very general, because programmer have to manage everything.
 parallel programming language/ library
 a parallel language or library is used to create code
 require new compiler
 Compiler directives (OMP)
 inserts compiler directives into a sequential program to specify parallelism and indicate shared data and the compiler translates into threads.
 still uses threads underneath, but system manages the threads.
 easy to program. but lose flexibility.
Processes and thread
 Process(MPI)
 separate program with its own variable, memory, stack and instruction pointer.
 Different programs can't access each other's memory
 thread(OMP)
 Concurrent routine that shares the variables and memory space, but has its own stack and instruction pointer.
[Parallel Comuting] Cuda intrinsics
Reference: Cuda dev blog
why we introduce intrinsics

Race condition: Different outcomes depending on execution order.

Race conditions can occur in any concurrent system, including GPUs.
 Code should print a =1,000,000.
 But actually printed a = 88.

GPUs have atomic intrinsics for simple atomic operations.

Hard to implement general mutex in CUDA.
 Codes using critical sections don’t perform well on GPUs anyway.
The atomic thread in critical section is not good in performance. The example is identified for variable d_a
cuda atomics
Can perform atomics on global or shared memory variables.
 int atomicInc(int *addr)
 Reads value at addr, increments it, returns old value.
 Hardware ensures all 3 instructions happen without interruption from any other thread.
 int atomicAdd(int *addr, int val)
 Reads value at addr, adds val to it, returns old value.
 int atomicMax(int *addr, int val)
 Reads value at addr, sets it to max of current value and val, returns old value.
 int atomicExch(int *addr1, int val)
 Sets val at addr to val, returns old value at val.
 int atomicCAS(int *addr, old, new)
 “Compare and swap”, a conditional atomic.
 Reads value at addr. If value equals old, sets value to new. Else does nothing.
 Indicates whether state changed, i.e. if your view is up to date.
 Universal operation, i.e. can be used to perform any other kind of synchronization
examples
Finding max of array
improve it
we can improve it bu split the single global max into num_locals
number of local max value.
Thread i
atomically maxes with its local max. can max the local_max[locali]
,
a better solution is to make it into the tree DS+CA
Compute the histogram
! whether the data is stored on shared or global memory is depend on programmers
Shuffle intrinsics
! this intrinsic is not limited to cuda but all SIMD architecture
from avx256 vector elements we have
__m256 load_rotr(float *src) { #ifdef __AVX2__ __m256 orig = _mm256_loadu_ps(src); __m256 rotated_right = _mm256_permutevar8x32_ps(orig, _mm256_set_epi32(0,7,6,5,4,3,2,1)); return rotated_right; #else __m256 shifted = _mm256_loadu_ps(src + 1); __m256 bcast = _mm256_set1_ps(*src); return _mm256_blend_ps(shifted, bcast, 0b10000000); #endif }
For Kepler architecture, we have 4 intrisics.
The goal of the shuffle intrinsics is actually for optimizing MemoryFetch Model
butterfly operations
Assume we have at most shuffle instead of shared memory
[Parallel Computing] Implementing collective communication
Collective communication
 Important as the basis for many parallel algorithms.
 Implemented using multiple point to point communications, possibly in parallel.
 Assume communicating m words (without contention) between a source and destination takes \(t_{s} + m t_{w}\) time.
 Ignore the the distance of the message, since per hop latency th is usually small.
 With contention c, time becomes \(t_{s} + c m t_{w}\).
 Implementations depend on communication hardware architecture.
 Operations
 Broadcast and reduction.
 Alltoall broadcast and reduction.
 Allreduce, prefix sum.
 Scatter and gather.
 Alltoall scatter and reduce.
 Circular shift.
Broadcast and reduction on ring
just cut them into halves.
cost
 With p processors, log p steps.
 In each step, a processor sends a size m message.
 Total time$ (t_{s} + m t_{w}) log p$ take \((t_{s} + m t_{w})\) as const
Broadcast on mesh
algorithm
 Root first broadcasts along its row using ring algorithm.
 The nodes of root’s row then broadcast along their columns using ring algorithm.
cost
 \(log \sqrt{p} = (log p) / 2\) steps along row, same along columns.
 Total time (ts + m tw) log p.log
 p = (log p) / 2 steps along row, same along columns.
 Total time (ts + m tw) log p.
Broadcast on hypercube
algorithm
RING CAN BE PROJECTED TO THE HYPERCUBE but with more congestion.
prefix sum on hypercube
parallel programing 23 parallel fft
One of the most import algorithm in the new century
Intro to the fft
The basic equation of the FFT
\(F(\omega)=Ff(t)=\int ^{+\infty}_{\infty}f(t)e^{j\omega t}dt\)
Roots of unity
\[
\begin{array}{l}\text { An n'th root of unity is an } \omega \text { s.t. } \ \omega^{n}=1 \text { . } \ \text { There are n roots n'th roots of } \ \text { unity, and they have the form } \ e^{\frac{2 \pi i k}{n}}, \text { for } 0 \leq k \leq n1 \text { . } \ \text {Write } \omega_{n}=e^{\frac{2 \pi i}{n}} \text { , so that the n'th } \ \text { roots of unity are } \omega_{n}^{0}, \omega_{n}^{1}, \ldots, \omega_{n}^{n1} \end{array}\]
Some problems to differ DT,DFT,FFT,IFFT
They are Fourier Transform, Discrete Fourier Transform, Fast Fourier Transform and Inverse Fourier Transform.
The transform factor:
\(\text { Fact } 1 \omega_{n}^{n}=1 \text { . } \ \text { Fact } 2 \omega_{n}^{k+\frac{n}{2}}=\omega_{n}^{k} \ \text { Fact } 3\left(\omega_{n}^{k}\right)^{2}=\omega_{n}^{2 k}=\omega_{n / 2}^{k}\)
Why we should have the DFT.
Because in theory, all the data stored in computer is Discrete. So we have to use the equation \(X(k)=\sum^{N1}_0x(n)W^{kn}_N(k\in \mathbb{N})\)
The Transform factor is used to prove the
1) Periodicity
\(W_{N}^{m+l N}=W_{N}^{m},\)\) where \(\(: W_{N}^{m}=W_{N}^{Nm}\)
2) Symmetry
\(W_{N}^{m+\frac{N}{2}}=W_{N}^{m}\)
3) Contractability
\(W_{N / m}^{k}=W_{N}^{m k}\)
4) Special rotation factors
\(W_{N}^{0}=1\)
 Why Fourier Fast Algorithm (aka FFT)?
FFT is a DFTbased algorithm designed to accelerate the computational speed of DFT.
Given a degree \(n\) 1 polynomial
\(A(x)=a_{0}+a_{1} x+a_{2} x^{2}+\dots+a_{n1} x^{n1}\)
DFT computes \(A\left(\omega_{n}^{0}\right), A\left(\omega_{n}^{1}\right), \ldots, A\left(\omega_{n}^{n1}\right)\)
since \(A(x)=a_{0}+x\left(a_{1}+x\left(a_{2}+\cdots\right) \ldots\right)\)
\(A(x)\) can be evaluated in \(\mathrm{O}(\mathrm{n})\) time and
\(\mathrm{O}(1)\) memory.
 DFT can be computed trivially in \(\mathrm{O}\left(\mathrm{n}^{2}\right)\)
time.
For the DFT formula computer implementation the complexity is o(N²), while the complexity by FFT calculation is reduced to: N×log2(N)
 What is the sequence split extraction in FFT?
The sequence splitting process of FFT is the extraction process, which is divided into: extraction by time and extraction by frequency.
1) Extraction by time (also called parity extraction)
2) Frequency, which we don’t apply here.
 How does FFT reduce the amount of computation?
In simple terms, mathematicians use the above mentioned properties of the rotation factor W such as periodicity, symmetry, etc. to simplify the formula. In algorithmic programming, new points are constantly counted using points that have already been counted, i.e., old points count new points.
Theoretically, FFT computes the DFT in O(n log n) time using divide and conquer.
\square Suppose n is a power of 2.
Let \(A^{[0]}=a_{0}+a_{2} x^{1}+a_{4} x^{2}+\cdots+a_{n2} x^{\frac{n}{2}1}\)
\(A^{[1]}=a_{1}+a_{3} x^{1}+a_{5} x^{2}+\cdots+a_{n1} x^{\frac{n}{2}1}\)
Then \(A(x)=A^{[0]}\left(x^{2}\right)+x A^{[1]}\left(x^{2}\right)\).
So can compute \(A\left(\omega_{n}^{0}\right), A\left(\omega_{n}^{1}\right), \ldots, A\left(\omega_{n}^{n1}\right)\) by computing \(A^{[0]}\) and \(A^{[1]}\)
at \(\left(\omega_{n}^{0}\right)^{2},\left(\omega_{n}^{1}\right)^{2}, \ldots,\left(\omega_{n}^{n1}\right)^{2}\), and multiplying some terms by
\(\omega_{n}^{0}, \omega_{n}^{1}, \ldots, \omega_{n}^{n1}\).
But \(\left(\omega_{n}^{k+n / 2}\right)^{2}=\omega_{n}^{2 k+n}=\omega_{n}^{2 k}=\left(\omega_{n}^{k}\right)^{2}\) by Fact 1.
A So \(\left\{\left(\omega_{n}^{0}\right)^{2},\left(\omega_{n}^{1}\right)^{2}, \ldots,\left(\omega_{n}^{n1}\right)^{2}\right\}=\left\{\left(\omega_{n}^{0}\right)^{2},\left(\omega_{n}^{1}\right)^{2}, \ldots,\left(\omega_{n}^{n / 21}\right)^{2}\right\},\) i.e. only need
to evaluate \(A^{[0]}\) and \(A^{[1]}\) at n/2 points instead of n.
Also, \(\left(\omega_{n}^{k}\right)^{2}=\omega_{n}^{2 k}=\omega_{n / 2}^{k}\)
Note: Simply splitting a multipoint sequence into a less point sequence without simplification is not a reduction in arithmetic volume!!! Splitting is only in the service of simplification, using the spin factor is the key to arithmetic reduction!!!
Time Complexity:
Thus, computing \(A(x)=A^{[0]}\left(x^{2}\right)+x A^{[1]}\left(x^{2}\right)\) for
\(x \in\left\{\omega_{n}^{0}, \omega_{n}^{1}, \ldots, \omega_{n}^{n1}\right\}\) requires
\(\square\) Computing for \(A^{[0]}(x)\) and \(A^{[1]}(x)\) for \(x \in\)
\(\left\{\omega_{n / 2}^{0}, \omega_{n / 2}^{1}, \ldots, \omega_{n / 2}^{n / 21}\right\}\)

These are also DFT’s, so can be done recursively using two
n/2point FFT’s.
\square For \(0 \leq k \leq \frac{n}{2}1\)
\[
\begin{array}{l}\qquad A\left(\omega_{n}^{k}\right)=A^{[0]}\left(\omega_{n / 2}^{k}\right)+\omega_{n}^{k} A^{[1]}\left(\omega_{n / 2}^{k}\right) \ \begin{array}{l}\qquad A\left(\omega_{n}^{k+n / 2}\right)=A^{[0]}\left(\omega_{n / 2}^{k+n / 2}\right)+\omega_{n}^{k+n / 2} A^{[1]}\left(\omega_{n / 2}^{k+n / 2}\right) \ =A^{[0]}\left(\omega_{n / 2}^{k}\right)\omega_{n}^{k} A^{[1]}\left(\omega_{n / 2}^{k}\right)\end{array}\end{array}
\]
 Butterfly operation?
For a butterfly operation, you can understand it as an operation that is customizable by a graph.
The left side is the input and the right side is the output, for the letters on the horizontal line there are two cases.
1) A number on the left end line (none means C and D are 1).
2)The right end lines have the numerical representation, but if none, C & D are 0s.
The FFT takes out odd and even in accordance to time to change the original sequence. which have to sequentialize it to make the algorithm meet the required sequence. The new sequence is the reverse binary sequence of the original ones.
For example \((a_0 a_4 a_2 a_6 a_1 a_5 a_3 a_7)\) have the binary sequence \((000，100，010，110，001，101，011，111)\).
The reverse order can be simply treated as the change of 2 near binary number, which in this case is:\(（a_0 a_1 a_2 a_3 a_4 a_5 a_6 a_7）\);
Which is \((000，001，010，011，100，101，110，111)—>(000，100，010，110，001，101，011，111)\).
The code for this transformation:
for(I=0;I<N;I++) // According to law 4, you need to perform intercode inverse order for all elements of the array
{
/* Get the value of the inverse J of subscript I*/
J=0;
for(k=0;k<(M/2+0.5);k++) //k indicates operation
{
/* Reverse sequence operation*/
m=1;//m is a binary number with a minimum of 1
n=(unsigned int)pow(2,M1);//n is a binary number with the Mth degree of 1.
m <<= k; // for m move left k
n>>= k; //shift k bits to the right of n
F0=I & n;//I & n by the kth position of the first half of the extracted
F1=I & m;//I with mpressure bit corresponding to the lower part of the second half of the extracted F0
if(F0) J=J  m; //J and m are in position or so that F0 corresponds to a low position of 1
if(F1) J=J  n; //J and n are in the same position or so that F1 corresponds to a high level of 1
}
if(I<J)
{
Temp=A[I];
A[I] =A[J];
A[J]=Temp;
}
}
The butter fly operation:
Now let’s imagine that if we want to program the FFT algorithm, the most basic implementation of the FFT algorithm is the butterfly operation, for any butterfly such as.
When N=8.
As can be seen from the above figure, to perform the butterfly operation, we have to solve the following problems.

Interval B between two input data

Determination of the rotation factor W, including.
 Determination of the Llevel rotation index P.
 Determination of the type of Level L W.
 The interval between the same W in level L.
\(\left\{\begin{array}{l}T_{R}=X_{R}(\mathrm{j}+B) \cos \frac{2 \pi}{N} p+X_{I}(j+B) \sin \frac{2 \pi}{N} p \cdots(1) \ T_{\mathrm{I}}=X_{I}(j+B) \cos \frac{2 \pi}{N} pX_{R}(j+B) \sin \frac{2 \pi}{N} p \cdots(2) \ \mathrm{A}_{\mathrm{R}}(\mathrm{j})=X_{R}(\mathrm{j})+T_{R} \ \mathrm{A}_{\mathrm{I}}(\mathrm{j})=X_{I}(\mathrm{j})+T_{\mathrm{I}} \ \mathrm{A}_{\mathrm{R}}(\mathrm{j}+\mathrm{B})=X_{R}(\mathrm{j})T_{R}(5) \ \mathrm{A}_{\mathrm{I}}(\mathrm{j}+\mathrm{B})=X_{I}(\mathrm{j})T_{\mathrm{I}}(6)\end{array}\right.\)
for(L=1; L<=M;L++) //FFT butterfly level L from 1M
{
/* Llevel operations*/
//First calculate the interval B = 2^(L1);
B = 1;
B = (int)pow(2,L1);
for(j=0; j<=B1;j++)
{
/* Homogeneous butterfly operation*/
// First increment k=2^(ML)
k = (int)pow(2,ML);
// Calculate the rotation index p in increments of k, then p = j*k
P=1;
P=j*k;
for(i=0; i<=k1;i++)
{
/* Perform butterfly operations*/
//Array calibrated to r
r=1;
r=j+2*B*i;
Tr=dataR[r+B]*cos(2.0*PI*p/N) + dataI[r+B]*sin(2.0*PI*p/N);
Ti=dataI[r+B]*cos(2.0*PI*p/N)  dataR[r+B]*sin(2.0*PI*p/N);
dataR[r+B]=dataR[r]Tr;
dataI[r+B]=dataI[r]Ti;
dataR[r]=dataR[r]Tr; dataI[r+B]=dataI[r]Ti; dataI[r]Ti; dataR[r]=dataR[r]+Tr;
dataI[r]=dataI[r]]+Ti;
}
}
}
IFFT is the reverse of the above operation.
 What if we take it on the mesh or hypercube to make it scalable on gpu oprations?
Hpercube:
 Consider the binary exchange algorithm on a hypercube architecture.
 Each processor connected to d others, which differ in each digit of ID.
 Communication only with neighbors, send n/p values each time.
 since d = log p rounds of communication, communication time \(T_{c}=\) \(t_{s} \log p+t_{w} \frac{7}{p} \log p .\)
 Each stage does n/p computation.
 Total computation time \(T_{p}=\frac{t_{c} n}{p} \log n\).
 Efficiency is \(E=\frac{T_{p}}{T_{p}+T_{c}}=1 /\left(1+\frac{t_{s} p \log p}{t_{c} n \log n}+\frac{t_{w} \log p}{t_{c} \log n}\right)\)
 Define \(K\) such that \(E=1 /(1+1 / K) .\)
 For isoefficiency, want last two terms in denominator to be constant.
 \(\frac{t_{s} p \log p}{t_{c} n \log n}=\frac{1}{K}\) implies \(n \log n=W=K \frac{t_{s}}{t_{c}} p \log p .\)
 \(\frac{t_{w} \log p}{t_{c} \log n}=\frac{1}{K}\) implies \(\log n=K_{t_{c}}^{t_{w}} \log p,\) so \(n=p^{K t_{w} / t_{c}},\) so
 \(W=K \frac{t_{w}}{t_{c}} p^{K t_{w} / t_{c}} \log p\)
The efficiency for this case depends on \(t_c,t_s,t_w\), the wait time is the tradeoffs between two different lines. which is:
From the Efficiency law
, we have once \(\frac{Kt_w}{t_c}>1\), the increasing time is polynomial with regard to the processor count.
Mese:
2D transpose FFT: The best: