Running a Program - CALL (Compiling, Assembling, Linking, and Loading)

Clarifications

Project 1: RISC-V emulator

  1. The RISC-V ISA does not define an assembly syntax
    the emulator should behave exactly like Venus
  2. All I-type instructions (including sltiu)
    • sign-extend their immediates (see the sketch below)
    • (in Venus) the input number is treated as signed, even when written in hex
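A minimal C sketch of that sign-extension rule (the helper name sext12 and the shift trick are my own illustration, not project-provided code):

#include <stdint.h>

/* Sign-extend a 12-bit I-type immediate: bit 11 is the sign bit, so 0xFFF
   becomes -1 and 0x7FF stays +2047. Illustrative only; relies on arithmetic
   right shift of signed ints, which common compilers provide. */
static int32_t sext12(uint32_t imm12) {
    return ((int32_t)(imm12 << 20)) >> 20; /* move bit 11 up to bit 31, then shift back */
}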

a good tool

Interpretation

  1. Any good reason to interpret machine language in software? Debuggers, e.g. Venus.
  2. Translated/compiled code is almost always more efficient and therefore higher performance:
    • important for many applications, particularly operating systems.

Steps in compiling a C program

Pseudo-instruction Replacement

Producing Machine Language

  • simple case
    • arithmetic, logical, shifts, and so on
    • all necessary info is already within the instruction (see the encoding sketch below)
  • branches can be optimized
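As a concrete illustration of the simple case, here is a hedged C sketch that packs the fields of an R-type instruction; the helper name encode_rtype is mine, but the field layout is the standard RISC-V one (add x3, x1, x2 assembles to 0x002081B3):

#include <stdint.h>

/* Pack a RISC-V R-type instruction from its fields.
   add x3, x1, x2: funct7=0, rs2=2, rs1=1, funct3=0, rd=3, opcode=0x33. */
static uint32_t encode_rtype(uint32_t funct7, uint32_t rs2, uint32_t rs1,
                             uint32_t funct3, uint32_t rd, uint32_t opcode) {
    return (funct7 << 25) | (rs2 << 20) | (rs1 << 15) |
           (funct3 << 12) | (rd << 7) | opcode;
}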

Parallel Programming 23: Parallel FFT

One of the most important algorithms of the new century.

Intro to the FFT

The basic equation of the FFT

\(F(\omega)=\mathcal{F}[f(t)]=\int_{-\infty}^{+\infty}f(t)e^{-j\omega t}\,dt\)

Roots of unity

An \(n\)'th root of unity is an \(\omega\) s.t. \(\omega^{n}=1\). There are \(n\) \(n\)'th roots of unity, and they have the form \(e^{\frac{2 \pi i k}{n}}\), for \(0 \leq k \leq n-1\).
Write \(\omega_{n}=e^{\frac{2 \pi i}{n}}\), so that the \(n\)'th roots of unity are \(\omega_{n}^{0}, \omega_{n}^{1}, \ldots, \omega_{n}^{n-1}\).

Some questions to distinguish the FT, DFT, FFT, and IFFT

They are Fourier Transform, Discrete Fourier Transform, Fast Fourier Transform and Inverse Fourier Transform.
The transform factor:
Fact 1: \(\omega_{n}^{n}=1\).
Fact 2: \(\omega_{n}^{k+\frac{n}{2}}=-\omega_{n}^{k}\).
Fact 3: \(\left(\omega_{n}^{k}\right)^{2}=\omega_{n}^{2 k}=\omega_{n / 2}^{k}\).

Why do we need the DFT?
Because all data stored in a computer is discrete, we use the equation \(X(k)=\sum_{n=0}^{N-1}x(n)W_{N}^{kn}\) for \(k=0,1,\ldots,N-1\), where \(W_{N}=e^{-j\frac{2\pi}{N}}\).
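A direct C translation of that sum (a minimal sketch; the name dft_naive and the array names are mine), which also makes the O(N²) cost discussed below visible as two nested loops:

#include <complex.h>

#define PI 3.14159265358979323846

/* Naive DFT straight from X(k) = sum_n x(n) * W_N^{kn}, with W_N = e^{-j*2*pi/N}.
   Two nested loops over N points => O(N^2) complex multiply-adds. */
void dft_naive(const double complex *x, double complex *X, int N) {
    for (int k = 0; k < N; k++) {
        X[k] = 0;
        for (int n = 0; n < N; n++)
            X[k] += x[n] * cexp(-2.0 * PI * I * k * n / N);
    }
}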

The transform factor is used to prove the following properties:
1) Periodicity
\(W_{N}^{m+lN}=W_{N}^{m}\), and in particular \(W_{N}^{-m}=W_{N}^{N-m}\)
2) Symmetry
\(W_{N}^{m+\frac{N}{2}}=-W_{N}^{m}\)
3) Contractability
\(W_{N / m}^{k}=W_{N}^{m k}\)
4) Special rotation factors
\(W_{N}^{0}=1\)

  1. Why the Fast Fourier Transform algorithm (aka FFT)?

FFT is a DFT-based algorithm designed to accelerate the computational speed of DFT.

Given a degree \(n-1\) polynomial
\(A(x)=a_{0}+a_{1} x+a_{2} x^{2}+\dots+a_{n-1} x^{n-1}\),
the DFT computes \(A\left(\omega_{n}^{0}\right), A\left(\omega_{n}^{1}\right), \ldots, A\left(\omega_{n}^{n-1}\right)\).
Since \(A(x)=a_{0}+x\left(a_{1}+x\left(a_{2}+\cdots\right) \ldots\right)\) (Horner's rule),
\(A(x)\) can be evaluated at a single point in \(\mathrm{O}(n)\) time and
\(\mathrm{O}(1)\) memory.

  • DFT can be computed trivially in \(\mathrm{O}\left(\mathrm{n}^{2}\right)\)
    time.

A direct computer implementation of the DFT formula therefore costs \(O(N^{2})\), while the FFT reduces this to \(O(N \log_{2} N)\).

  1. What is the sequence-splitting (decimation) in the FFT?

The FFT splits the sequence by decimation, which comes in two flavors: decimation in time and decimation in frequency.

1) Decimation in time (also called parity extraction: the even- and odd-indexed samples are separated)

2) Decimation in frequency, which we don't use here.

  1. How does FFT reduce the amount of computation?

In simple terms, the properties of the rotation (twiddle) factor W mentioned above, such as periodicity and symmetry, are used to simplify the formula. In the program, values that have already been computed are constantly reused to compute new ones.

Theoretically, FFT computes the DFT in O(n log n) time using divide and conquer.
Suppose n is a power of 2.
Let \(A^{[0]}=a_{0}+a_{2} x^{1}+a_{4} x^{2}+\cdots+a_{n-2} x^{\frac{n}{2}-1}\)
\(A^{[1]}=a_{1}+a_{3} x^{1}+a_{5} x^{2}+\cdots+a_{n-1} x^{\frac{n}{2}-1}\)
Then \(A(x)=A^{[0]}\left(x^{2}\right)+x A^{[1]}\left(x^{2}\right)\).
So can compute \(A\left(\omega_{n}^{0}\right), A\left(\omega_{n}^{1}\right), \ldots, A\left(\omega_{n}^{n-1}\right)\) by computing \(A^{[0]}\) and \(A^{[1]}\)
at \(\left(\omega_{n}^{0}\right)^{2},\left(\omega_{n}^{1}\right)^{2}, \ldots,\left(\omega_{n}^{n-1}\right)^{2}\), and multiplying some terms by
\(\omega_{n}^{0}, \omega_{n}^{1}, \ldots, \omega_{n}^{n-1}\).
But \(\left(\omega_{n}^{k+n / 2}\right)^{2}=\omega_{n}^{2 k+n}=\omega_{n}^{2 k}=\left(\omega_{n}^{k}\right)^{2}\) by Fact 1.
So \(\left\{\left(\omega_{n}^{0}\right)^{2},\left(\omega_{n}^{1}\right)^{2}, \ldots,\left(\omega_{n}^{n-1}\right)^{2}\right\}=\left\{\left(\omega_{n}^{0}\right)^{2},\left(\omega_{n}^{1}\right)^{2}, \ldots,\left(\omega_{n}^{n / 2-1}\right)^{2}\right\}\), i.e. we only need
to evaluate \(A^{[0]}\) and \(A^{[1]}\) at n/2 points instead of n.
Also, \(\left(\omega_{n}^{k}\right)^{2}=\omega_{n}^{2 k}=\omega_{n / 2}^{k}\) by Fact 3, so those n/2 points are exactly the (n/2)'th roots of unity.

Note: simply splitting a long sequence into shorter ones, without simplification, does not reduce the amount of arithmetic. Splitting only serves the simplification; exploiting the rotation-factor properties is the key to reducing the computation.

Time complexity:
Thus, computing \(A(x)=A^{[0]}\left(x^{2}\right)+x A^{[1]}\left(x^{2}\right)\) for
\(x \in\left\{\omega_{n}^{0}, \omega_{n}^{1}, \ldots, \omega_{n}^{n-1}\right\}\) requires
computing \(A^{[0]}(x)\) and \(A^{[1]}(x)\) for \(x \in\left\{\omega_{n / 2}^{0}, \omega_{n / 2}^{1}, \ldots, \omega_{n / 2}^{n / 2-1}\right\}\).

  • These are also DFTs, so they can be done recursively using two
    n/2-point FFTs (a recursive sketch follows after the formulas below).
    For \(0 \leq k \leq \frac{n}{2}-1\):
    \[
    \begin{array}{l}
    A\left(\omega_{n}^{k}\right)=A^{[0]}\left(\omega_{n / 2}^{k}\right)+\omega_{n}^{k} A^{[1]}\left(\omega_{n / 2}^{k}\right) \\
    A\left(\omega_{n}^{k+n / 2}\right)=A^{[0]}\left(\omega_{n / 2}^{k+n / 2}\right)+\omega_{n}^{k+n / 2} A^{[1]}\left(\omega_{n / 2}^{k+n / 2}\right)=A^{[0]}\left(\omega_{n / 2}^{k}\right)-\omega_{n}^{k} A^{[1]}\left(\omega_{n / 2}^{k}\right)
    \end{array}
    \]
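A minimal recursive C sketch of this divide and conquer (my own illustration, separate from the iterative code later; assumes n is a power of 2 and uses the \(e^{-j2\pi/n}\) convention of the DFT formula above):

#include <complex.h>

#define PI 3.14159265358979323846

/* Recursive radix-2 FFT: split into even/odd coefficients (A^[0] and A^[1]),
   solve two half-size problems, then combine with the butterfly equations above.
   Recurrence T(n) = 2T(n/2) + O(n) => O(n log n). */
void fft_rec(double complex *a, int n) {
    if (n == 1) return;
    double complex even[n / 2], odd[n / 2];   /* fine for a sketch; large n would need malloc */
    for (int i = 0; i < n / 2; i++) {
        even[i] = a[2 * i];      /* A^[0]: even-indexed coefficients */
        odd[i]  = a[2 * i + 1];  /* A^[1]: odd-indexed coefficients  */
    }
    fft_rec(even, n / 2);
    fft_rec(odd, n / 2);
    for (int k = 0; k < n / 2; k++) {
        double complex w = cexp(-2.0 * PI * I * k / n);  /* omega_n^k */
        a[k]         = even[k] + w * odd[k];
        a[k + n / 2] = even[k] - w * odd[k];
    }
}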

  1. Butterfly operation?

A butterfly operation is easiest to understand as an operation defined by a small diagram (the butterfly graph).

The left side is the input and the right side is the output; for the letters on the horizontal lines there are two cases:

1) A number on the left end of a line (if none is shown, C and D are 1).

2) A number on the right end of a line (if none is shown, C and D are 0).

The decimation-in-time FFT repeatedly splits the sequence into even- and odd-indexed samples, which reorders the original sequence, so the input has to be rearranged into the order the algorithm expects. The new order is given by bit-reversing the original indices.

For example, \((a_0, a_4, a_2, a_6, a_1, a_5, a_3, a_7)\) corresponds to the index sequence \((000,100,010,110,001,101,011,111)\).
This order is obtained simply by reversing the bits of each index of the natural sequence \((a_0, a_1, a_2, a_3, a_4, a_5, a_6, a_7)\);

that is, \((000,001,010,011,100,101,110,111)\rightarrow(000,100,010,110,001,101,011,111)\).

The code for this transformation:

/* Bit-reversal permutation: assumes N = 2^M points stored in A[0..N-1], with
   unsigned int I, J, k, m, n, F0, F1 and an element Temp declared elsewhere. */
for (I = 0; I < N; I++)   // reorder every element of the array
{
  /* Compute J, the bit-reversed value of the index I */
  J = 0;
  for (k = 0; k < (M / 2 + 0.5); k++)   // walk the bit positions from both ends
  {
    /* Mirror bit k (from the bottom) and bit M-1-k (from the top) */
    m = 1;                             // mask with a 1 in the lowest bit
    n = (unsigned int)pow(2, M - 1);   // mask with a 1 in the highest (M-th) bit
    m <<= k;                           // shift m left by k
    n >>= k;                           // shift n right by k
    F0 = I & n;                        // extract the high-half bit of I
    F1 = I & m;                        // extract the low-half bit of I
    if (F0) J = J | m;                 // high bit set => set the mirrored low bit of J
    if (F1) J = J | n;                 // low bit set => set the mirrored high bit of J
  }
  if (I < J)                           // swap each pair only once
  {
    Temp = A[I];
    A[I] = A[J];
    A[J] = Temp;
  }
}

The butterfly operation:
Now, if we want to program the FFT, its most basic building block is the butterfly operation; any butterfly looks like the one shown below.

When N=8.

As can be seen from the above figure, to perform the butterfly operation, we have to solve the following problems.

  • The interval B between the two input data

  • Determination of the rotation factor W, including:

    • determination of the rotation index P at level L
    • determination of the kinds of W used at level L
    • the interval between occurrences of the same W within level L

\[
\left\{\begin{array}{l}
T_{R}=X_{R}(j+B)\cos\frac{2\pi}{N}p+X_{I}(j+B)\sin\frac{2\pi}{N}p \quad (1)\\
T_{I}=X_{I}(j+B)\cos\frac{2\pi}{N}p-X_{R}(j+B)\sin\frac{2\pi}{N}p \quad (2)\\
A_{R}(j)=X_{R}(j)+T_{R} \quad (3)\\
A_{I}(j)=X_{I}(j)+T_{I} \quad (4)\\
A_{R}(j+B)=X_{R}(j)-T_{R} \quad (5)\\
A_{I}(j+B)=X_{I}(j)-T_{I} \quad (6)
\end{array}\right.
\]

/* Butterfly passes: assumes N = 2^M points in dataR[]/dataI[] (already in
   bit-reversed order), with int L, j, i, k, p, r, B and double Tr, Ti declared. */
for (L = 1; L <= M; L++)   // FFT butterfly level L runs from 1 to M
{
  /* Level-L operations */
  // Interval between the two inputs of a butterfly: B = 2^(L-1)
  B = (int)pow(2, L - 1);
  for (j = 0; j <= B - 1; j++)
  {
    /* Butterflies sharing the same rotation factor */
    // Rotation-index increment k = 2^(M-L), so the rotation index is p = j*k
    k = (int)pow(2, M - L);
    p = j * k;
    for (i = 0; i <= k - 1; i++)
    {
      /* Perform one butterfly, following equations (1)-(6) above */
      // Array index of the upper input
      r = j + 2 * B * i;
      Tr = dataR[r + B] * cos(2.0 * PI * p / N) + dataI[r + B] * sin(2.0 * PI * p / N);
      Ti = dataI[r + B] * cos(2.0 * PI * p / N) - dataR[r + B] * sin(2.0 * PI * p / N);
      dataR[r + B] = dataR[r] - Tr;
      dataI[r + B] = dataI[r] - Ti;
      dataR[r] = dataR[r] + Tr;
      dataI[r] = dataI[r] + Ti;
    }
  }
}

IFFT is the reverse of the above operation.
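One common way to realize the inverse (a sketch assuming a forward transform such as fft_rec above; not necessarily how the referenced implementations do it): conjugate, run the forward FFT, conjugate again and divide by N.

#include <complex.h>

/* IFFT via the forward FFT: ifft(x) = conj(fft(conj(x))) / N */
void ifft(double complex *a, int n) {
    for (int i = 0; i < n; i++) a[i] = conj(a[i]);
    fft_rec(a, n);                           /* forward FFT sketched earlier */
    for (int i = 0; i < n; i++) a[i] = conj(a[i]) / n;
}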

  1. What if we run it on a mesh or hypercube to make it scalable (e.g. for GPU operations)?

Hypercube:

  • Consider the binary exchange algorithm on a hypercube architecture.
  • Each processor connected to d others, which differ in each digit of ID.
    • Communication only with neighbors; send n/p values each time.
    • Since there are d = log p rounds of communication, communication time \(T_{c}=t_{s} \log p+t_{w} \frac{n}{p} \log p\).
  • Each stage does n/p computation.
    • Total computation time \(T_{p}=\frac{t_{c} n}{p} \log n\).
  • Efficiency is \(E=\frac{T_{p}}{T_{p}+T_{c}}=1 /\left(1+\frac{t_{s} p \log p}{t_{c} n \log n}+\frac{t_{w} \log p}{t_{c} \log n}\right)\)
  • Define \(K\) such that \(E=1 /(1+1 / K) .\)
  • For isoefficiency, want last two terms in denominator to be constant.
  • \(\frac{t_{s} p \log p}{t_{c} n \log n}=\frac{1}{K}\) implies \(n \log n=W=K \frac{t_{s}}{t_{c}} p \log p .\)
  • \(\frac{t_{w} \log p}{t_{c} \log n}=\frac{1}{K}\) implies \(\log n=K \frac{t_{w}}{t_{c}} \log p\), so \(n=p^{K t_{w} / t_{c}}\), so
  • \(W=K \frac{t_{w}}{t_{c}} p^{K t_{w} / t_{c}} \log p\)

The efficiency in this case depends on \(t_c, t_s, t_w\); the two communication terms determine the trade-off between the startup cost and the per-word cost.

From the efficiency expression above, once \(\frac{Kt_w}{t_c}>1\) the required problem size \(W=K \frac{t_{w}}{t_{c}} p^{K t_{w}/t_{c}} \log p\) grows polynomially, with degree greater than 1, in the processor count (e.g. as \(p^{2}\log p\) when \(Kt_w/t_c=2\)).

Mesh:

2D transpose FFT: the best choice on the mesh.

Reference

  1. Prof Fan’s PPT
  2. Implement FFT in C by GoKing
  3. Implement FFT on QPU rpi3b+ by Andrew Holme

Scala Contravariance and Covariance

source:
trait List[+T]{}
When type B is a subtype of type A, List[B] can also be regarded as a subtype of List[A], i.e. List[B] can be generalized to List[A]. The parameterized type generalizes in the same direction as its type parameter, which is why this is called covariance.

Contravariance is declared as: trait Queue[-T]{}. When type B is a subtype of type A,

Queue[A] can conversely be regarded as a subtype of Queue[B].

The parameterized type generalizes in the opposite direction of its type parameter, which is why this is called contravariance.

summary

Scala environment configuration

export SCALA_HOME=/root/Downloads/spark-2.4.3-bin-hadoop2.7/
export HADOOP_HOME=/root/Downloads/hadoop-2.7.1
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
SPARK_MASTER_IP=Master
SPARK_LOCAL_DIRS=/root/Downloads/spark-2.4.3-bin-hadoop2.7
SPARK_DRIVER_MEMORY=512M

[Parallel Computing] Architecture

implicit parallelism - pipelining

break up one instruction and execute the pieces

in different units of the CPU:
instruction fetch -> decode -> execute -> memory read -> write back
at most 5x speedup with a 5-stage pipeline
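A standard back-of-the-envelope bound (my addition, with n instructions and k single-cycle stages) for why the speedup tops out at the stage count:
\[
\text{speedup}=\frac{n k}{k+n-1}\;\longrightarrow\; k \quad (n\to\infty),
\]
so with k = 5 stages the ideal speedup approaches, but never exceeds, 5.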

modern processors have 10-20 stages

if the code branches, the processor must guess in order to keep the pipeline full.

  1. branch misprediction requires flushing the pipeline.
  2. typical code has a branch roughly every 5 instructions,
    so prediction is important.

implicit parallelism - Superscalar

  1. we can have 2 fetch units each clock, issuing multiple instructions per cycle,
    provided the instructions (i) have few data dependencies and (ii) can be rescheduled.

    e.g. the instructions at addresses 1000, 1004, 1008, 100C form a dependency DAG:
    1000 and 1004 feed R1, 1008 and 100C feed R2, then R1 and R2 are combined.
  2. execution must respect data dependencies.

    To deal with data hazards, use a DAG to label them.
    Look ahead using CA to reorder the instruction stream.

implicit parallelism - VLIW (compile time)

memory performance

latency & bandwidth
speed & lanes

memory hierarchy

caching

locality

Temporal locality - a small set of data accessed repeatedly
Spatial locality - nearby pieces of data accessed together

e.g.

Matrix multiplication has both temporal locality and spatial locality (see the sketch below).
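A minimal C sketch (my own example, not from the notes) showing where both kinds of locality appear in the classic triple loop, using the i-k-j order:

#define N 512
double A[N][N], B[N][N], C[N][N];

/* i-k-j matrix multiply: A[i][k] is reused across the whole inner loop
   (temporal locality, kept in a register), while B[k][j] and C[i][j] are
   walked along rows with unit stride (spatial locality: consecutive
   addresses share cache lines). */
void matmul(void) {
    for (int i = 0; i < N; i++)
        for (int k = 0; k < N; k++) {
            double a = A[i][k];
            for (int j = 0; j < N; j++)
                C[i][j] += a * B[k][j];
        }
}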

performance

cache hit rate is the proportion of data accesses serviced from the cache.
e.g.
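A made-up worked example (the numbers are illustrative, not from the notes): if a fraction h of accesses hit the cache with access time t_cache and the rest take the full memory time t_mem, the average access time is
\[
t_{avg}=h\,t_{cache}+(1-h)\,t_{mem},\qquad \text{e.g. } 0.95\cdot 1+0.05\cdot 100=5.95\ \text{cycles}.
\]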

processor architectures

  1. a multiprocessor contains several cores communicating through memory.
  2. a multicore has several cores on a single die.
    1. each core is a complete processor, and all cores run simultaneously.
    2. they share the L2/L3 cache.
    3. it is cheap to replicate CPU state (registers, program counter, stack), which is what SMT exploits.

GPGPU

1000+ cores streaming simultaneously.

Flynn's taxonomy

  • SISD: conventional sequential processor.
  • SIMD: single control unit for multiple processing units.
  • MISD: uncommon.
  • MIMD: distributed systems.

SIMD

  • control unit fetches single instruction and broadcast it to processing units.
  • PEs each apply instruction to its own data item in synchrony.
    • e.g. adding two arrays (see the sketch after this list).
    • very popular today.
      • dgemm, graphics and video, ML
      • cheap to implement, don't duplicate hardware for fetch.
      • Used in GPU(SIMT), avx, phi
      • vectors hundreds of bits wide -> 4-32 lanes
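A minimal C sketch of the "adding two arrays" example using AVX intrinsics (my own illustration; assumes the CPU supports AVX and n is a multiple of 8):

#include <immintrin.h>

/* SIMD array addition: one instruction adds 8 float lanes at a time. */
void add_arrays(const float *x, const float *y, float *z, int n) {
    for (int i = 0; i < n; i += 8) {
        __m256 a = _mm256_loadu_ps(&x[i]);   /* load 8 floats */
        __m256 b = _mm256_loadu_ps(&y[i]);
        _mm256_storeu_ps(&z[i], _mm256_add_ps(a, b));
    }
}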

Instruction divergence

SIMD processors handle divergent branches by masking: each path is executed in turn with the lanes that did not take it switched off, rather than truly performing different instructions at once.

Interconnection networks

  • built using links and switches
    • links are fixed connection between 2 processors.
    • switches connect a set of processors on input ports with processors on output ports
  • static vs. dynamic
  • a fully connected network would need a quadratic number of links

shared memory architecture

all processors share the same memory space.
Pros: easy to write program
Cons: limited scalability.
UMA vs. NUMA

Distributed memory architecture


a combination of shared memory (within a node) and non-shared memory (across nodes).

Network of workstation(NOW)

connect workstations with networking.
e.g. Beowulf

Bus architecture.

all processors access memory over the same bus.
the bus bandwidth is the bottleneck.

Crossbar architecture.

switched network for higher-end shared memory systems.
allows all processors to communicate with all memory modules simultaneously (non-blocking).
higher bandwidth.

topology

multihop networks.
A combination of the bus and the crossbar, trading off between bandwidth and scalability.
Features:

  • Diameter
    • Distance between farthest pair of processors.
    • Gives worst case latency.
  • Bisection width - fault tolerance
    • Minimum number of links to cut to partition the network into two equal halves.
    • Indicates bottlenecks.
    • Multiplied by the link bandwidth, it gives the bisection bandwidth.
  • cost
    • the number of links.

1d topologies


meshes and tori



hypercube

advantages

  1. simple routing algorithm: at each hop, flip one of the bits in which the current and destination node IDs differ
  2. the bisection width is always \(\frac{P}{2}\), so there's little bottleneck
  3. nodes do not have a hierarchy (vs. a fat tree)

[Computer Architecture] Numbers Notes

big IDEAs

  1. Abstraction
  2. Moore's Law
  3. Principle of Locality/Memory Hierarchy
  4. Parallelism
  5. Performance Measurement and Improvement
  6. Dependability VIA Redundancy

old conventional wisdom

Moore's Law + Dennard Scaling = faster, cheaper, low-power

signed & unsigned integers

unsigned

e.g. for unsigned int: addresses
0000 0000 0000 0001\(_{two}\) = \(1_{ten}\)
0111 1111 1111 1111\(_{two}\) = \(2^{15}-1\ _{ten}\)

signed

e.g. for signed int: int x,y,z;
1000 0000 0000 0000\(_{two}\) = \(-2^{15}\ _{ten}\)

main idea

want \(\frac{1}{2}\) of the int >=0, \(\frac{1}{2}\) of the int <0

two's complement

basic ideas

For 3\(_{ten}\) = 0011\(_{two}\):
\(-3\) = 10000\(_{two}\)-0011\(_{two}\) = 1\(_{two}\)+(1111\(_{two}\)-0011\(_{two}\)) = 1101\(_{two}\), i.e. invert the bits and add 1.
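A tiny C check of the invert-and-add-1 rule (my own example), restricted to a 4-bit field:

#include <stdio.h>

/* Two's complement negation is bitwise-invert then add 1.
   In a 4-bit field, -3 should come out as 1101_two = 0xD. */
int main(void) {
    unsigned x = 3;
    unsigned neg = (~x + 1u) & 0xFu;   /* keep only the low 4 bits */
    printf("%X\n", neg);               /* prints D */
    return 0;
}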

more e.g.

Assume for simplicity a 4-bit width, so -8 to +7 can be represented.
(figure: 4-bit addition example where an overflow occurs)
Overflow occurs when the magnitude of the result is too big to fit into the result representation.
Carry in = carry from less significant bits
Carry out = carry to more significant bits

take care of the MSB (most significant bit):
the way to detect overflow is to check whether carry in = carry out at the MSB (see the sketch below).
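A software equivalent of that carry-in/carry-out check (my own sketch): for addition, overflow happens exactly when both operands have the same sign and the result's sign differs.

#include <stdint.h>
#include <stdbool.h>

/* Signed-overflow check for 32-bit addition, equivalent to comparing the carry
   into and out of the MSB. The add is done in unsigned arithmetic to get
   well-defined wraparound. */
bool add_overflows(int32_t a, int32_t b) {
    int32_t sum = (int32_t)((uint32_t)a + (uint32_t)b);
    return ((a ^ sum) & (b ^ sum)) < 0;   /* MSB set => both operands disagree with the result's sign */
}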

summary

test of vega sgemm

0.841471 0.540302 0 16384 16384 640 656 656 16400 0.6 0.6
9 3 0 512 512 256 512 256 512 1.4013e-44 2
0.841471 0.540302 0 512 512 256 272 272 528 1.4013e-44 2
cout<<A[1]<<" "<<B[1]<<" "<<C[1]<<" " <<m << " "<<n<<" "<<k<<" "<<lda<<" "<<ldb<<" "<<ldc<<" "<<alpha[1]<<" "<<beta[1]<<endl;

[Parallel Computing] Intro

Administrivia

Rui Fan
Leshan Wang

current deadlines

score complement

what and why

applications

  1. Fluid dynamics
  2. DNA & drug
  3. Quantum / atomic simulation, cosmological

challenges

  1. Harnessing power of masses
  2. Communication
    1. Processors compute faster than they can communicate.
    2. Problem gets worse as number of processors increase
    3. Main bottleneck to parallel computing
  3. Synchronization
  4. Scheduling
  5. Structured vs. unstructured
    1. Structured problems can be solved with custom hardware.
    2. Unstructured problems more general, but less efficient.
  6. Inherent limitations
    1. Some problems are not (or don't seem to be) parallelizable
      1. Dijkstra's shortest paths algorithm
    2. Other problems require clever algorithms to become parallel.
      1. Fibonacci series (\(a_{n}=a_{n-1}+a_{n-2}\))
    3. The human factor
      1. Hard to keep track of concurrent events and dependencies.
      2. Parallel algorithms are hard to design and debug.

Course Outline

  1. Parallel architectures.
    1. shared memory
    2. distributed memory
    3. many more
  2. Parallel languages
    1. OpenMP
    2. MPI
    3. CUDA
    4. MapReduce
  3. Algorithm design techniques
    1. Decomposition
    2. Load balancing
    3. scheduling

state of the art

  1. parallel computers today are mainly based on four processor architectures:
    1. Multi-cores
    2. Many-cores
    3. FPGA
    4. ASIC
  2. power efficiency: goal 50 GFLOPS/W
  3. Top-500