[Parallel Computing] Architecture

implicit parallelism - pipelining

break up each instruction into stages and execute the stages of different instructions in parallel

each stage uses a different unit of the CPU
instruction fetch -> decode -> exec -> mem read -> write back
speedup is at most 5x with a 5-stage pipeline

modern processors have 10-20 stages

if code branches, the processor must guess which path to take in order to keep the pipeline full.

  1. A branch misprediction requires flushing the pipeline.
  2. Typical code branches about every 5 instructions,
    so prediction is important.
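
A back-of-the-envelope model of the numbers above, as a sketch only. The function names, the flush penalty of `stages - 1` cycles and the 10% misprediction rate are assumptions for illustration, not measurements.

```python
def pipeline_cycles(n_instr, stages, branch_every=5, mispredict_rate=0.0):
    """Cycles to run n_instr instructions on an idealized pipeline."""
    fill = stages - 1                   # cycles before the first instruction completes
    branches = n_instr // branch_every  # notes: roughly one branch every 5 instructions
    flush_penalty = stages - 1          # assumption: a flush throws away stages - 1 cycles
    return fill + n_instr + branches * mispredict_rate * flush_penalty


def speedup(n_instr, stages, **kwargs):
    """Speedup over an unpipelined CPU that needs `stages` cycles per instruction."""
    return n_instr * stages / pipeline_cycles(n_instr, stages, **kwargs)


perfect = speedup(1_000_000, 5)
mispredicting = speedup(1_000_000, 5, mispredict_rate=0.1)
print(f"perfect prediction: {perfect:.2f}x, 10% mispredictions: {mispredicting:.2f}x")
```

Even in this simple model the speedup stays below the stage count, and mispredictions pull it down further.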

implicit parallelism - Superscalar

  1. we can have 2 fetch units, so 2 instructions are fetched each clock

    (instructions issued together (i) should have few data dependencies, (ii) otherwise should be rescheduled.)
    e.g.
    1000 1004 1008 100C
    --o---o---o---o
    -- |---/ ---|---/
    ---R1 ----- R2
    ----|-----/
  2. execution must respect data dependencies.

    To deal with data dependencies (hazards), use a DAG to label them.
    Look ahead using CA to reorder the instruction stream.
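
The DAG idea can be sketched as a greedy 2-wide scheduler. This is a toy model: the instruction names and the reading of the diagram (four loads feeding two adds and a final add) are assumptions, and only read-after-write dependencies are tracked.

```python
def schedule(instrs, width=2):
    """Greedy list scheduling: each cycle issues up to `width` ready instructions.

    instrs is a list of (name, [names it depends on]) in program order.
    An instruction is ready once all of its sources finished in an earlier cycle.
    """
    done = set()
    remaining = list(instrs)
    cycles = []
    while remaining:
        ready = [name for name, srcs in remaining
                 if all(s in done for s in srcs)][:width]
        if not ready:
            raise ValueError("cyclic or unsatisfied dependency")
        cycles.append(ready)
        done.update(ready)
        remaining = [(n, s) for n, s in remaining if n not in done]
    return cycles


program = [
    ("load_a", []), ("load_b", []), ("load_c", []), ("load_d", []),
    ("r1", ["load_a", "load_b"]),
    ("r2", ["load_c", "load_d"]),
    ("sum", ["r1", "r2"]),
]
for cycle, issued in enumerate(schedule(program), start=1):
    print(cycle, issued)
```

With width 2 the seven instructions finish in 4 cycles; with width 1 they take 7.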

implicit parallelism - VLIW (compile time)

memory performance

latency & bandwidth
speed & lanes

memory hierarchy

caching

locality

Temporal locality - small set of data accessed repeatedly
Spatial locality - nearby pieces of data accessed together

e.g.

Matrix Multiplications have both temporal locality and spatial locality.

performance

cache hit rate is the proportion of data accesses serviced from the cache.
e.g.
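
A small illustration of hit rate, as a sketch. The direct-mapped cache, the 64 lines of 64 bytes and the two access patterns are assumptions chosen to keep the arithmetic easy.

```python
def hit_rate(addresses, n_lines=64, line_bytes=64):
    """Fraction of accesses served by a direct-mapped cache (toy model)."""
    tags = [None] * n_lines
    hits = 0
    for addr in addresses:
        block = addr // line_bytes   # which memory block the address falls in
        index = block % n_lines      # the single cache line that block maps to
        if tags[index] == block:
            hits += 1
        else:
            tags[index] = block      # miss: load the block, evicting the old one
    return hits / len(addresses)


sequential = [4 * i for i in range(4096)]   # walk an array of 4-byte words
strided = [64 * i for i in range(4096)]     # touch one word per cache line

print(hit_rate(sequential))  # 0.9375
print(hit_rate(strided))     # 0.0
```

The sequential walk misses once per 16-word line and hits 15 times, which is spatial locality at work; the strided walk never reuses a line.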

processor architectures

  1. a multiprocessor contains several cores communicating through memory.
  2. a multicore has several cores on a single die.
    1. each core is a complete processor, and the cores run simultaneously.
    2. they share L2/L3 cache.
    3. cheap to replicate CPU state (registers, program counter, stack) (SMT).

GPGPU

1000+ cores streaming simultaneously.

Flynn's taxonomy

  • SISD conventional sequential processor.
  • SIMD single control unit for multiple processing units
  • MISD uncommon
  • MIMD Distributed systems

SIMD

  • control unit fetches a single instruction and broadcasts it to the processing elements (PEs).
  • PEs each apply the instruction to their own data item in lockstep.
    • e.g. adding two arrays.
    • very popular today.
      • dgemm, graphics and video, ML
      • cheap to implement, no need to duplicate hardware for fetch.
      • used in GPUs (SIMT), AVX, Xeon Phi
      • registers hundreds of bits wide -> 4-32 lanes

Instruction divergence

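SIMD lanes cannot execute different instructions at the same time. When lanes take different branches, each branch path is executed in turn with the non-participating lanes masked off.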
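
A toy model of divergence, as a sketch. The function name and the example branch are made up for illustration; real hardware does this with predicate or mask registers.

```python
def simd_if_else(data, cond, then_op, else_op):
    """Run `if cond: then_op else: else_op` over all lanes using masks.

    Returns the results and the number of passes the lanes had to make.
    """
    mask = [cond(x) for x in data]
    out = list(data)
    passes = 0
    if any(mask):          # "then" path: lanes with mask False sit idle
        out = [then_op(x) if m else o for x, m, o in zip(data, mask, out)]
        passes += 1
    if not all(mask):      # "else" path: lanes with mask True sit idle
        out = [o if m else else_op(x) for x, m, o in zip(data, mask, out)]
        passes += 1
    return out, passes


is_even = lambda x: x % 2 == 0
halve = lambda x: x // 2
triple_plus_one = lambda x: 3 * x + 1

print(simd_if_else([1, 2, 3, 4], is_even, halve, triple_plus_one))  # ([4, 1, 10, 2], 2)
print(simd_if_else([2, 4, 6, 8], is_even, halve, triple_plus_one))  # ([1, 2, 3, 4], 1)
```

When all lanes agree only one pass is needed; a single dissenting lane doubles the cost.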

Interconnection networks

  • built using links and switches
    • a link is a fixed connection between 2 processors.
    • a switch connects a set of processors on its input ports with processors on its output ports
  • static vs. dynamic
  • connecting every pair of processors directly needs a quadratic number of links

shared memory architecture

all processors share the same memory space.
Pros: easy to write program
Cons: limited scalability.
UMA vs. NUMA

Distributed memory architecture


combination of shared memory and non-shared memory.

Network of workstations (NOW)

connect workstations with networking.
e.g. Beowulf

Bus architecture.

all processors access memory over the same bus.
the bus bandwidth is the bottleneck.

Crossbar architecture.

switched network for higher-end shared memory systems.
allows all processors to communicate with all memory modules simultaneously (nonblocking).
Higher bandwidth

topology

multihop networks.
A combination of the bus and crossbar ideas, giving flexibility between bandwidth and scalability.
features:

  • Diameter
    • Distance between farthest pair of processors.
    • Gives worst case latency.
  • Bisection width (also an indicator of fault tolerance).
    • Minimum number of links to cut to partition the network into 2 equal halves.
    • Indicates bottlenecks.
    • Limits the bandwidth between the two halves.
  • cost
    • the number of links.
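
The three metrics can be checked by brute force on small networks. This is a sketch: the helper names are made up, and the exhaustive bisection search is only feasible for a handful of nodes.

```python
from collections import deque
from itertools import combinations


def ring(p):
    """Adjacency lists of a ring with p nodes."""
    return {i: sorted({(i - 1) % p, (i + 1) % p}) for i in range(p)}


def hypercube(d):
    """Adjacency lists of a d-dimensional hypercube (2**d nodes)."""
    return {i: [i ^ (1 << b) for b in range(d)] for i in range(1 << d)}


def diameter(adj):
    """Largest shortest-path distance, found by a BFS from every node."""
    best = 0
    for src in adj:
        dist = {src: 0}
        queue = deque([src])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        best = max(best, max(dist.values()))
    return best


def num_links(adj):
    """Cost: every link appears in two adjacency lists."""
    return sum(len(nbrs) for nbrs in adj.values()) // 2


def bisection_width(adj):
    """Minimum number of links crossing any split into two equal halves."""
    nodes = list(adj)
    best = None
    for half in combinations(nodes, len(nodes) // 2):
        half = set(half)
        cut = sum(1 for u in half for v in adj[u] if v not in half)
        best = cut if best is None else min(best, cut)
    return best


for name, net in [("ring(8)", ring(8)), ("hypercube(3)", hypercube(3))]:
    print(name, diameter(net), bisection_width(net), num_links(net))
```

For 8 nodes the ring has diameter 4, bisection width 2 and 8 links, while the hypercube has diameter 3, bisection width 4 and 12 links.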

1d topologies


meshes and tori



hypercube

advantage

  1. simple routing: a path is found by flipping the differing bits one at a time, so its length is the number of differing bits
  2. Bisection width is always \(\frac{P}{2}\), so there's little bottleneck
  3. nodes do not have a hierarchy (vs. fat tree)
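
The routing rule in point 1 can be sketched as follows. The function name is made up, and fixing the differing bits from the lowest bit upward is one possible order among several.

```python
def hypercube_route(src, dst):
    """Path from src to dst that fixes one differing address bit per hop."""
    path = [src]
    cur = src
    diff = src ^ dst   # set bits mark the dimensions where the addresses differ
    bit = 0
    while diff:
        if diff & 1:
            cur ^= 1 << bit   # hop along this dimension
            path.append(cur)
        diff >>= 1
        bit += 1
    return path


print(hypercube_route(0b000, 0b101))  # [0, 1, 5]
print(hypercube_route(0b110, 0b011))  # [6, 7, 3]
```

The number of hops equals the Hamming distance between the two addresses, so it never exceeds the dimension of the cube.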

[RL] Probability Review

Basics

The set

general definition of Probability

Sample space


Physical meaning of probability

frequentist view: a long-run frequency over a large number of repetitions of an experiment.

Bayesian view: a degree of belief about the event in question.
We can assign probabilities to hypotheses like "candidate will win the election" or "the defendant is guilty", even though such events can't be repeated.

Markov chain Monte Carlo, more computing power and better algorithms have made the Bayesian view thrive.

role


Conditional probability

Everything comes with conditions, and conditioning is what gives rise to probabilities.
e.g. Conditioning -> divide & conquer -> recursively apply to multi-stage problems.

\(P(A|B) = \frac{P(A \cap B)}{P(B)}\)
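
A worked instance of the formula by enumeration, as a sketch. The two-dice experiment and the events are made up for illustration.

```python
from fractions import Fraction
from itertools import product

# Sample space: all 36 equally likely outcomes of rolling two dice.
outcomes = list(product(range(1, 7), repeat=2))


def prob(event):
    """Probability of an event given as a predicate on outcomes."""
    return Fraction(sum(1 for o in outcomes if event(o)), len(outcomes))


def conditional(a, b):
    """P(A | B) = P(A and B) / P(B)."""
    return prob(lambda o: a(o) and b(o)) / prob(b)


sum_is_8 = lambda o: o[0] + o[1] == 8
first_even = lambda o: o[0] % 2 == 0

print(prob(sum_is_8))                     # 5/36
print(conditional(sum_is_8, first_even))  # 1/6
```

Conditioning on the first die being even raises the probability of a sum of 8 from 5/36 to 1/6.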

chain rules

Useful for distributed computation


Inference & Bayes' Rule

Probability distributions and limit theorems

PDF (probability density function)

Mixed type


PDF

valid PDF

  1. non-negative: \(f(x)\geq0\)
  2. integrates to 1:
    \(\int^{\infty}_{-\infty}f(x)dx=1\)

probability distribution

summary of probability distribution


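Three distance measures in ML, DL, AI

Total variation distance

usually in GAN

Law of small numbers (rare events) and the Poisson distribution


The number of people eating in the canteen can be described by a Poisson distribution.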
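
The two ideas above meet in a classical fact: Binomial(n, λ/n) approaches Poisson(λ) as n grows, and the total variation distance between them is at most n p² (Le Cam's inequality). A numerical sketch follows; the function name and the choice λ = 3 are assumptions for illustration.

```python
import math


def tv_binomial_poisson(n, lam):
    """Total variation distance between Binomial(n, lam/n) and Poisson(lam).

    Both pmfs are built iteratively to avoid overflow; requires lam < n.
    """
    p = lam / n
    b = (1 - p) ** n          # Binomial pmf at k = 0
    q = math.exp(-lam)        # Poisson pmf at k = 0
    total = abs(b - q)
    q_sum = q
    for k in range(1, n + 1):
        b *= (n - k + 1) / k * p / (1 - p)
        q *= lam / k
        total += abs(b - q)
        q_sum += q
    total += max(0.0, 1 - q_sum)   # Poisson mass beyond n, where the Binomial is 0
    return total / 2


for n in (10, 100, 1000):
    print(n, tv_binomial_poisson(n, 3.0), "bound:", n * (3.0 / n) ** 2)
```

The distance shrinks as the events become rarer, which is why counts of many rare, independent events (such as canteen visitors) are well described by a Poisson distribution.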

Sample mean

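Strong law of large numbers (SLLN)


The sample mean converges to the true value with probability 1.

Weak law of large numbers (WLLN)


The sample mean converges in probability.

Central limit theorem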

Generating function

  1. PGF - Z-transform
  2. MGF - Laplace transform
  3. CF - Fourier transform

APPLICATION

  1. branching process
  2. bridge complex analysis and probability
  3. play a role in large deviation theory

Multiple variables

The joint distribution provides complete information about how multiple r.v.s interact in high-dimensional space.

joint CDF & PDF



marginal PMF

conditional PMF

joint PDF




techniques

general Bayes' Rules.

general LOTP

change of variables


summary

Order Statistics

CDF of order statistic


proof

PDF of Order Statistic


two methods to find PDF

  1. CDF -> differentiate -> PDF (ugly)
  2. PDF * dx

proof

joint PDF

e.g. order statistics of Uniforms

story: Beta-Binomial conjugacy


Mean vs Bayes'


deduction

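e.g. Laplace's problem

A question from the famous Laplace: given the historical record that the sun has risen every day, what is the probability that the sun will rise again tomorrow?

Laplace's own solution:
Assume that the sun rising follows a Bernoulli process with an unknown parameter \(A\), and that \(A\) is uniformly distributed on [0,1]. Then, using the given historical data, the posterior probability that the sun rises tomorrow is
\(P(X_{n+1}=1 \mid X_n=1,X_{n-1}=1,\dots,X_1=1)=\frac{P(X_{n+1}=1,X_n=1,X_{n-1}=1,\dots,X_1=1)}{P(X_n=1,X_{n-1}=1,\dots,X_1=1)}=\frac{\int_0^1 A^{n+1}\,dA}{\int_0^1 A^{n}\,dA}=\frac{n+1}{n+2}\), i.e. given that the sun has risen on every day from day 1 to day n, the probability that it rises on day n+1 is close to 1.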
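
The ratio of integrals can be checked exactly, and generalized to histories that contain failures. This is a sketch with made-up function names; it uses the standard closed form \(\int_0^1 A^a(1-A)^b\,dA=\frac{a!\,b!}{(a+b+1)!}\).

```python
from fractions import Fraction
from math import factorial


def beta_integral(a, b):
    """Exact integral of A^a (1-A)^b over [0, 1] for integers a, b >= 0."""
    return Fraction(factorial(a) * factorial(b), factorial(a + b + 1))


def prob_next_success(successes, trials):
    """Posterior predictive P(next = 1) under a uniform prior on A."""
    failures = trials - successes
    return beta_integral(successes + 1, failures) / beta_integral(successes, failures)


print(prob_next_success(10, 10))      # 11/12, i.e. (n+1)/(n+2) with n = 10
print(prob_next_success(3, 10))       # 1/3,  i.e. (s+1)/(n+2)
print(float(prob_next_success(10000, 10000)))
```

With no failures this reproduces \(\frac{n+1}{n+2}\); in general it gives \(\frac{s+1}{n+2}\) for s successes in n trials.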

Monte Carlo

importance sampling

reduce the variance

importance sampling

example
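
One possible example, as a sketch: estimating the rare-event probability P(Z > 4) for a standard normal Z. The shifted proposal N(4, 1), the sample size and the seed are assumptions for illustration.

```python
import math
import random


def tail_prob_importance(threshold, n, seed=0):
    """Estimate P(Z > threshold), Z ~ N(0, 1), sampling from N(threshold, 1)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n):
        x = rng.gauss(threshold, 1.0)
        if x > threshold:
            # weight = target density / proposal density
            #        = phi(x) / phi(x - threshold) = exp(-threshold*x + threshold^2/2)
            total += math.exp(-threshold * x + threshold ** 2 / 2)
    return total / n


def tail_prob_naive(threshold, n, seed=0):
    """Plain Monte Carlo: the fraction of N(0, 1) samples above the threshold."""
    rng = random.Random(seed)
    return sum(rng.gauss(0.0, 1.0) > threshold for _ in range(n)) / n


exact = 0.5 * math.erfc(4 / math.sqrt(2))
print("exact     ", exact)
print("importance", tail_prob_importance(4.0, 20_000))
print("naive     ", tail_prob_naive(4.0, 20_000))
```

With 20,000 samples the naive estimator expects to see less than one hit, while about half of the proposal's samples land in the region of interest, so the weighted estimate has far lower variance.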

[Computer Architecture] Numbers Notes

big IDEAs

  1. Abstraction
  2. Moore's Law
  3. Principle of Locality/Memory Hierarchy
  4. Parallelism
  5. Performance Measurement and Improvement
  6. Dependability VIA Redundancy

old conventional wisdom

Moore's Law + Dennard Scaling = faster, cheaper, lower-power

signed & unsigned Integers

unsigned

e.g. for unsigned int: addresses
0000 0000 0000 0001\(_{two}\) = \(1_{ten}\)
0111 1111 1111 1111\(_{two}\) = \(2^{15}-1\ _{ten}\)

signed

e.g. for signed int: int x,y,z;
1000 0000 0000 0000\(_{two}\) = \(-2^{15}\ _{ten}\)

main idea

want \(\frac{1}{2}\) of the int >=0, \(\frac{1}{2}\) of the int <0

two's complement

basic ideas

for 3\(_{ten}\)=0011\(_{two}\)
for -3\(_{ten}\): 10000\(_{two}\)-0011\(_{two}\)=1\(_{two}\)+1111\(_{two}\)-0011\(_{two}\)=1101\(_{two}\)
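
The same encoding can be sketched in a few lines. The helper names are made up; the bit width is a parameter.

```python
def to_twos(x, bits=4):
    """Two's complement bit string of x, i.e. x modulo 2**bits."""
    return format(x & ((1 << bits) - 1), f"0{bits}b")


def from_twos(s):
    """Signed value of a two's complement bit string."""
    value = int(s, 2)
    return value - (1 << len(s)) if s[0] == "1" else value


print(to_twos(3), to_twos(-3))         # 0011 1101
print(from_twos("1000"))               # -8
print(from_twos("1000000000000000"))   # -32768 = -2**15
print(from_twos("0111111111111111"))   # 32767 = 2**15 - 1
```

Negating x is the same as computing 2**bits - x, which is exactly the 10000 - 0011 subtraction in the 4-bit case.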

more e.g.

Assume for simplicity a 4-bit width, so -8 to +7 can be represented
There's an overflow here
Overflow occurs when the magnitude of the result is too big to fit into the result representation
Carry in = carry from less significant bits
Carry out = carry to more significant bits

take care of the MSB (most significant bit)
to detect overflow, compare the carries at the MSB: overflow occurs exactly when carry in != carry out
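
The carry rule can be sketched for the 4-bit case. The function name is made up; the return value is the wrapped result together with an overflow flag.

```python
def add_signed(a, b, bits=4):
    """Add two signed values in `bits`-bit two's complement.

    Returns (wrapped result, overflow flag), where the flag is computed as
    carry into the MSB != carry out of the MSB.
    """
    mask = (1 << bits) - 1
    low_mask = (1 << (bits - 1)) - 1
    ua, ub = a & mask, b & mask
    carry_in = ((ua & low_mask) + (ub & low_mask)) >> (bits - 1)   # into the MSB
    carry_out = (ua + ub) >> bits                                  # out of the MSB
    raw = (ua + ub) & mask
    result = raw - (1 << bits) if raw >> (bits - 1) else raw
    return result, carry_in != carry_out


print(add_signed(7, 1))     # (-8, True): 7 + 1 does not fit in 4 bits
print(add_signed(-8, -1))   # (7, True)
print(add_signed(3, -3))    # (0, False)
print(add_signed(-1, -1))   # (-2, False)
```

A carry out of the MSB alone is not an overflow: -1 + -1 produces one, yet the result -2 is correct.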

summary

test of vega sgemm

0.841471 0.540302 0 16384 16384 640 656 656 16400 0.6 0.6
9 3 0 512 512 256 512 256 512 1.4013e-44 2
0.841471 0.540302 0 512 512 256 272 272 528 1.4013e-44 2
cout<<A[1]<<" "<<B[1]<<" "<<C[1]<<" " <<m << " "<<n<<" "<<k<<" "<<lda<<" "<<ldb<<" "<<ldc<<" "<<alpha[1]<<" "<<beta[1]<<endl;

[Parallel Computing] Intro

Administrivia

Rui Fan
Leshan Wang

current deadlines

score complement

what and why

applications

  1. Fluid dynamics
  2. DNA & drug
  3. Quantum / atomic simulation, cosmological

challenges

  1. Harnessing power of masses
  2. Communication
    1. Processors compute faster than they can communicate.
    2. Problem gets worse as the number of processors increases
    3. Main bottleneck to parallel computing
  3. Synchronization
  4. Scheduling
  5. Structured vs. unstructured
    1. Structured problems can be solved with custom hardware.
    2. Unstructured problems more general, but less efficient.
  6. Inherent limitations
    1. Some problems are not (or don't seem to be) parallelizable
      1. Dijkstra's shortest paths algorithm
    2. Other problems require clever algorithms to become parallel.
      1. Fibonacci series (\(a_{n}=a_{n-1}+a_{n-2}\))
    3. The human factor
      1. Hard to keep track of concurrent events and dependencies.
      2. Parallel algorithms are hard to design and debug.
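
One known trick for the Fibonacci example, as a sketch: rewrite the recurrence as a product of 2x2 matrices. Matrix multiplication is associative, so the product can be evaluated as a tree whose rounds contain independent multiplications. The code below runs sequentially and only counts the rounds; the function names are made up.

```python
def matmul(x, y):
    """Product of two 2x2 integer matrices."""
    return [
        [x[0][0] * y[0][0] + x[0][1] * y[1][0], x[0][0] * y[0][1] + x[0][1] * y[1][1]],
        [x[1][0] * y[0][0] + x[1][1] * y[1][0], x[1][0] * y[0][1] + x[1][1] * y[1][1]],
    ]


def tree_product(mats):
    """Multiply a list of matrices pairwise; each round could run in parallel."""
    rounds = 0
    while len(mats) > 1:
        nxt = [matmul(mats[i], mats[i + 1]) for i in range(0, len(mats) - 1, 2)]
        if len(mats) % 2:
            nxt.append(mats[-1])   # odd one out waits for the next round
        mats = nxt
        rounds += 1
    return mats[0], rounds


def fib(n):
    """F(n) for n >= 1, using [[1,1],[1,0]]^n = [[F(n+1),F(n)],[F(n),F(n-1)]]."""
    power, rounds = tree_product([[[1, 1], [1, 0]] for _ in range(n)])
    return power[0][1], rounds


print(fib(10))   # (55, 4)
```

The n - 1 sequential additions become about log2(n) rounds of independent products.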

Course Outline

  1. Parallel architectures.
    1. shared memory
    2. distributed memory
    3. many more
  2. Parallel languages
    1. OpenMP
    2. MPI
    3. CUDA
    4. MapReduce
  3. Algorithm design techniques
    1. Decomposition
    2. Load balancing
    3. scheduling

state of the art

  4. parallel computers today are mainly based on four processor architectures.
    1. Multi-cores
    2. Many-cores
    3. FPGA
    4. ASIC
  5. power efficiency: goal 50GFLOPS/W
  6. Top-500

Shanghaitech GeekPie 2020 WarmUP CTF Game - G20G Stage 3 Doc

Stage 3

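3-1 The missing parcel

At the home of a student in Wuhan, Hubei. The student has received a parcel tracking number but has not had time to pick the parcel up; the ID card number on it is blurred and must be guessed before the parcel can be collected.

Tracking number 3102511818424 → X

When did Baidu first get to know Google → Y

ID card number

TL;DR - the current choice is 42011620060523627X.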

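First six digits

The first six digits are the administrative division code.

Huangpi District, Wuhan, Hubei

To keep our own work to a minimum, the tracking number 3102511818424 looks usable for now:

3102511818424 → Huangpi District, Wuhan, Hubei → 420116

The advantage of this tracking number is that searching for it directly on Baidu shows ":( Sorry, the query failed, please retry or click the courier company's official website to query.", while it can actually be found on the Yunda Express official website.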

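Date part

We need a date with a year, a month and a day.

When did Baidu first get to know Google
Search Baidu for Google; earliest result: June 30, 2001 → 20010631
When did Baidu first delete Google

A hint may be needed to point players to Baidu Baike for the answer

Earliest deletion in the edit history of the Google entry: 2006-05-23 → 20060523
Other methods

For example:

  • Ask the blackboard bulletin (黑板报) to get the answer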

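Last four digits

All the digits are given, but you have to guess their order yourself. If X is chosen as the check character, 627X can be used.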
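
The check character can be verified with the standard checksum for 18-digit resident ID numbers (the ISO 7064 MOD 11-2 scheme). The weights and the lookup string below come from that standard, not from this document, and the function name is made up.

```python
from itertools import permutations

# Standard weights for the first 17 digits and the remainder -> check character map.
WEIGHTS = [7, 9, 10, 5, 8, 4, 2, 1, 6, 3, 7, 9, 10, 5, 8, 4, 2]
CHECK_CHARS = "10X98765432"


def check_char(first17):
    """Check character for the first 17 digits of an ID number."""
    total = sum(int(d) * w for d, w in zip(first17, WEIGHTS))
    return CHECK_CHARS[total % 11]


prefix = "420116" + "20060523"   # region code + date from the sections above
print(check_char(prefix + "627"))   # X

# Which orderings of the digits 6, 2, 7 give the check character X?
orders = ["".join(p) for p in permutations("627") if check_char(prefix + "".join(p)) == "X"]
print(orders)   # ['627']
```

Only the order 627 is consistent with a check character of X, so the checksum is what lets players pin down the order of the given digits.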

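3-2 The secret of the parcel

Onion domain

blgpxymqjoo35curmvsldxejuq5vsyf5orutrfp25bdan223t62a3vad.onion/42011620010631726X redirects to the login page of an academic affairs system. You need to log in with your own email address and the matching username. It then redirects to

View the courseware

PPT steganography

Rename the ppt to zip and unzip it to extract the URL and clue hidden at the end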

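3-3 The life-saving medicine

The ppt gives a clue that leads to the JupyterHub login page

URL victoryang00.xyz:5006, usernames jupyter1-6, password g20

Each step that runs earns part of the score

Finally, enter in the front end the minimum ratio of carbon atoms to hydrogen atoms over all possible compounds (to 13 significant digits) to get the score

This step is protected by a mutex lock (there is only one GPU, and training allocates all of its memory), so while one team is running, another team cannot proceed; this stage also widens the time gap between teams

The answer is 0.8235294117647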

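3-4 The phone number

Following on from the clue given in the previous stage, the clues below are released in chronological order (one every 10 minutes); the goal is to obtain the phone number

Location hint 1

https://j.map.baidu.com/f3/fGw the place closest to the spherical object

Location hint 2

https://j.map.baidu.com/a2/C_w a place that appears together with the one in the previous picture

Location hint 3

Shanghai Engineering Center for Microsatellites (上海微小卫星工程中心)

The answer is (021)50735001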

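[Parallel Computing] CUDA default stream & non-default streams (asynchronous streams and visualization) analysis


Concurrent CUDA streams
The default stream has higher priority. In QuEST the default stream runs the Statevec_compactUnitaryKernel function, and both the GPU time consumed and the bulk of the computation are concentrated in this kernel.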

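Operations within a given stream execute in order.

  1. Operations in different non-default streams are not guaranteed to execute in any particular order relative to each other.
  2. The default stream is blocking: it waits for all other streams to finish before it runs, and it also blocks the other streams from running until it has finished.

Define a non-default stream with cudaStream_t stream; and pass it as an argument, letting the compiler automatically handle the creation and copying of the default stream.
cudaMallocManaged, cudaMemPrefetchAsync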

More detailed memory management:

  1. cudaMalloc allocates memory on the GPU
  2. cudaMallocHost allocates (pinned) memory on the CPU

The third trump card: cudaMemcpyAsync

What does Multi-Armed Bandit mean?

credit:https://iosband.github.io/2015/07/19/Efficient-experimentation-and-multi-armed-bandits.html

At first, multi-armed bandit means using
\(f^* : \mathcal{X} \rightarrow \mathbb{R}\)

  1. Each arm \(i\) pays out 1 dollar with probability \(p_i\) if it is played; otherwise it pays out nothing.
  2. While the \(p_1,…,p_k\) are fixed, we don’t know any of their values.
  3. Each timestep \(t\) we pick a single arm \(a_t\) to play.
  4. Based on our choice, we receive a return of \(r_t \sim Ber(p_{a_t})\).
  5. How should we choose arms so as to maximize total expected return?
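
The setup in the list above can be simulated directly. The sketch below plays it with epsilon-greedy, which is just one simple strategy and not necessarily the one the credited post recommends; the arm probabilities, epsilon and seed are assumptions for illustration.

```python
import random


def epsilon_greedy(probs, steps, epsilon=0.1, seed=0):
    """Play a Bernoulli bandit; `probs` is hidden from the decision rule."""
    rng = random.Random(seed)
    counts = [0] * len(probs)    # how often each arm was played
    means = [0.0] * len(probs)   # running estimate of each p_i
    total = 0
    for _ in range(steps):
        if rng.random() < epsilon:
            arm = rng.randrange(len(probs))                    # explore
        else:
            arm = max(range(len(probs)), key=lambda i: means[i])   # exploit
        reward = 1 if rng.random() < probs[arm] else 0         # r_t ~ Ber(p_{a_t})
        counts[arm] += 1
        means[arm] += (reward - means[arm]) / counts[arm]      # incremental average
        total += reward
    return total, counts


total, counts = epsilon_greedy([0.2, 0.5, 0.8], steps=5000)
print("total reward:", total)
print("plays per arm:", counts)
```

Most plays end up on the best arm, but the fixed 10% of random exploration keeps paying a price forever, which is the exploration/exploitation trade-off the question is about.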

HPL result after mending the OpenIB

================================================================================
HPLinpack 2.1  --  High-Performance Linpack benchmark  --   October 26, 2012
Written by A. Petitet and R. Clint Whaley,  Innovative Computing Laboratory, UTK
Modified by Piotr Luszczek, Innovative Computing Laboratory, UTK
Modified by Julien Langou, University of Colorado Denver
================================================================================

An explanation of the input/output parameters follows:
T/V    : Wall time / encoded variant.
N      : The order of the coefficient matrix A.
NB     : The partitioning blocking factor.
P      : The number of process rows.
Q      : The number of process columns.
Time   : Time in seconds to solve the linear system.
Gflops : Rate of execution for solving the linear system.

The following parameter values will be used:

N      :  178176   178176   178176
NB     :     384
PMAP   : Row-major process mapping
P      :       2
Q      :       4
PFACT  :    Left
NBMIN  :       2
NDIV   :       2
RFACT  :    Left
BCAST  :   2ring
DEPTH  :       0
SWAP   : Spread-roll (long)
L1     : no-transposed form
U      : transposed form
EQUIL  : yes
ALIGN  : 8 double precision words

--------------------------------------------------------------------------------

- The matrix A is randomly generated for each test.
- The following scaled residual check will be computed:
      ||Ax-b||_oo / ( eps * ( || x ||_oo * || A ||_oo + || b ||_oo ) * N )
- The relative machine precision (eps) is taken to be               1.110223e-16
- Computational tests pass if scaled residuals are less than                16.0

trsm_cutoff from environment variable 9000000
gpu_dgemm_split from environment variable 1.000
check_cpu_dgemm_perf from environment variable 0

        ******** TESTING SYSTEM PARAMETERS ********
        PARAM   [UNITS]         MIN     MAX     AVG
        -----   -------         ---     ---     ---
CPU :
        CPU_BW  [GB/s ]         17.0    17.5    17.3
        CPU_FP  [GFLPS]
                NB =   32         30      51      43
                NB =   64         69      74      71
                NB =  128         78     101      94
                NB =  256         98     116     112
                NB =  512        114     125     122
PCIE (NVLINK on IBM) :
        H2D_BW  [GB/s ]         10.9    11.0    10.9
        D2H_BW  [GB/s ]         12.0    12.3    12.2
        BID_BW  [GB/s ]         16.8    17.5    17.1
CPU_BW concurrent with BID_BW :
        CPU_BW  [GB/s ]         9.3     10.3    9.9
        BID_BW  [GB/s ]         10.4    10.9    10.6
GPU :
        GPU_BW  [GB/s ]         768     774     772
        GPU_FP  [GFLPS]
                NB =  128       5456    5497    5479
                NB =  256       6312    6346    6335
                NB =  384       6635    6785    6729
                NB =  512       6146    6566    6385
                NB =  640       6255    6765    6529
                NB =  768       6178    6677    6463
                NB =  896       6296    6887    6601
                NB = 1024       6318    6760    6497
NET :
        PROC COL NET_BW [MB/s ]
                     8 B           9      10      10
                    64 B          71      82      76
                   512 B         374     425     399
                     4 KB       1660    1738    1698
                    32 KB       2562    2603    2578
                   256 KB       2551    2566    2558
                  2048 KB       2521    2686    2564
                 16384 KB       2543    2549    2545
        NET_LAT [ us  ]         2.7     3.3     3.0

        PROC ROW NET_BW [MB/s ]
                     8 B          26      29      27
                    64 B         176     185     181
                   512 B         810     867     839
                     4 KB       3487    3547    3517
                    32 KB       4715    4938    4827
                   256 KB       10310   10896   10603
                  2048 KB       3793    3812    3802
                 16384 KB       3643    3754    3699
        NET_LAT [ us  ]         0.6     0.9     0.7

displaying Prog:%complete, N:columns, Time:seconds
iGF:instantaneous GF, GF:avg GF, GF_per: process GF


Per-Process Host Memory Estimate: 32.16 GB (MAX) 32.16 GB (MIN)

PCOL: 0 GPU_COLS: 44545 CPU_COLS: 0
PCOL: 1 GPU_COLS: 44545 CPU_COLS: 0
PCOL: 3 GPU_COLS: 44545 CPU_COLS: 0
PCOL: 2 GPU_COLS: 44545 CPU_COLS: 0
2020-02-16 03:09:22.935
 Prog= 1.93%    N_left= 177024  Time= 2.05      Time_left= 104.25       iGF= 35476.18   GF= 35476.18    iGF_per= 4434.52        GF_per= 4434.52
 Prog= 3.20%    N_left= 176256  Time= 3.03      Time_left= 91.75        iGF= 48776.89   GF= 39787.78    iGF_per= 6097.11        GF_per= 4973.47
 Prog= 4.46%    N_left= 175488  Time= 4.01      Time_left= 86.02        iGF= 48350.88   GF= 41884.18    iGF_per= 6043.86        GF_per= 5235.52
 Prog= 6.33%    N_left= 174336  Time= 5.52      Time_left= 81.68        iGF= 46884.70   GF= 43246.86    iGF_per= 5860.59        GF_per= 5405.86
 Prog= 7.56%    N_left= 173568  Time= 6.45      Time_left= 78.89        iGF= 49726.67   GF= 44185.60    iGF_per= 6215.83        GF_per= 5523.20
 Prog= 8.78%    N_left= 172800  Time= 7.47      Time_left= 77.63        iGF= 45087.65   GF= 44308.93    iGF_per= 5635.96        GF_per= 5538.62
 Prog= 10.59%   N_left= 171648  Time= 8.94      Time_left= 75.41        iGF= 46760.45   GF= 44709.92    iGF_per= 5845.06        GF_per= 5588.74
 Prog= 11.79%   N_left= 170880  Time= 9.85      Time_left= 73.67        iGF= 49499.57   GF= 45152.71    iGF_per= 6187.45        GF_per= 5644.09
 Prog= 12.97%   N_left= 170112  Time= 10.84     Time_left= 72.74        iGF= 44732.85   GF= 45114.06    iGF_per= 5591.61        GF_per= 5639.26
 Prog= 14.73%   N_left= 168960  Time= 12.20     Time_left= 70.60        iGF= 48990.39   GF= 45543.73    iGF_per= 6123.80        GF_per= 5692.97
 Prog= 15.89%   N_left= 168192  Time= 13.16     Time_left= 69.67        iGF= 45314.51   GF= 45526.95    iGF_per= 5664.31        GF_per= 5690.87
 Prog= 17.03%   N_left= 167424  Time= 14.06     Time_left= 68.48        iGF= 48046.22   GF= 45688.27    iGF_per= 6005.78        GF_per= 5711.03
 Prog= 18.73%   N_left= 166272  Time= 15.46     Time_left= 67.07        iGF= 45757.55   GF= 45694.55    iGF_per= 5719.69        GF_per= 5711.82
 Prog= 19.85%   N_left= 165504  Time= 16.43     Time_left= 66.33        iGF= 43464.21   GF= 45562.56    iGF_per= 5433.03        GF_per= 5695.32
 Prog= 20.97%   N_left= 164736  Time= 17.28     Time_left= 65.14        iGF= 49538.96   GF= 45757.11    iGF_per= 6192.37        GF_per= 5719.64
 Prog= 22.61%   N_left= 163584  Time= 18.67     Time_left= 63.88        iGF= 44748.96   GF= 45682.17    iGF_per= 5593.62        GF_per= 5710.27
 Prog= 23.70%   N_left= 162816  Time= 19.52     Time_left= 62.86        iGF= 47723.89   GF= 45771.82    iGF_per= 5965.49        GF_per= 5721.48
 Prog= 24.77%   N_left= 162048  Time= 20.46     Time_left= 62.13        iGF= 43350.60   GF= 45661.18    iGF_per= 5418.83        GF_per= 5707.65
 Prog= 26.36%   N_left= 160896  Time= 21.82     Time_left= 60.94        iGF= 44130.63   GF= 45565.69    iGF_per= 5516.33        GF_per= 5695.71
 Prog= 27.41%   N_left= 160128  Time= 22.62     Time_left= 59.90        iGF= 49145.98   GF= 45693.12    iGF_per= 6143.25        GF_per= 5711.64
 Prog= 28.45%   N_left= 159360  Time= 23.58     Time_left= 59.30        iGF= 40933.20   GF= 45499.84    iGF_per= 5116.65        GF_per= 5687.48
 Prog= 29.99%   N_left= 158208  Time= 24.78     Time_left= 57.83        iGF= 48574.61   GF= 45648.24    iGF_per= 6071.83        GF_per= 5706.03
 Prog= 31.01%   N_left= 157440  Time= 25.68     Time_left= 57.14        iGF= 42424.31   GF= 45535.02    iGF_per= 5303.04        GF_per= 5691.88
 Prog= 32.01%   N_left= 156672  Time= 26.47     Time_left= 56.22        iGF= 47689.89   GF= 45599.69    iGF_per= 5961.24        GF_per= 5699.96
 Prog= 33.50%   N_left= 155520  Time= 27.81     Time_left= 55.20        iGF= 42112.81   GF= 45432.53    iGF_per= 5264.10        GF_per= 5679.07
 Prog= 34.48%   N_left= 154752  Time= 28.66     Time_left= 54.45        iGF= 43461.57   GF= 45374.03    iGF_per= 5432.70        GF_per= 5671.75
 Prog= 35.45%   N_left= 153984  Time= 29.40     Time_left= 53.53        iGF= 49076.85   GF= 45467.95    iGF_per= 6134.61        GF_per= 5683.49
 Prog= 36.41%   N_left= 153216  Time= 30.33     Time_left= 52.97        iGF= 38942.69   GF= 45267.77    iGF_per= 4867.84        GF_per= 5658.47
 Prog= 37.84%   N_left= 152064  Time= 31.44     Time_left= 51.65        iGF= 48678.40   GF= 45387.41    iGF_per= 6084.80        GF_per= 5673.43
 Prog= 38.77%   N_left= 151296  Time= 32.32     Time_left= 51.03        iGF= 40203.99   GF= 45246.42    iGF_per= 5025.50        GF_per= 5655.80
 Prog= 39.70%   N_left= 150528  Time= 33.05     Time_left= 50.20        iGF= 47611.40   GF= 45299.00    iGF_per= 5951.43        GF_per= 5662.37
 Prog= 41.08%   N_left= 149376  Time= 34.29     Time_left= 49.19        iGF= 41721.17   GF= 45169.44    iGF_per= 5215.15        GF_per= 5646.18
 Prog= 41.98%   N_left= 148608  Time= 35.17     Time_left= 48.61        iGF= 38661.58   GF= 45006.27    iGF_per= 4832.70        GF_per= 5625.78
 Prog= 42.87%   N_left= 147840  Time= 35.86     Time_left= 47.78        iGF= 48952.92   GF= 45082.13    iGF_per= 6119.11        GF_per= 5635.27
 Prog= 44.20%   N_left= 146688  Time= 37.07     Time_left= 46.80        iGF= 41375.08   GF= 44961.37    iGF_per= 5171.89        GF_per= 5620.17
 Prog= 45.07%   N_left= 145920  Time= 37.79     Time_left= 46.06        iGF= 45648.36   GF= 44974.46    iGF_per= 5706.04        GF_per= 5621.81
 Prog= 45.93%   N_left= 145152  Time= 38.60     Time_left= 45.44        iGF= 40090.04   GF= 44871.78    iGF_per= 5011.25        GF_per= 5608.97
 Prog= 47.21%   N_left= 144000  Time= 39.82     Time_left= 44.52        iGF= 39552.58   GF= 44709.14    iGF_per= 4944.07        GF_per= 5588.64
 Prog= 48.05%   N_left= 143232  Time= 40.47     Time_left= 43.75        iGF= 49061.56   GF= 44778.59    iGF_per= 6132.69        GF_per= 5597.32
 Prog= 48.88%   N_left= 142464  Time= 41.31     Time_left= 43.20        iGF= 37191.35   GF= 44623.80    iGF_per= 4648.92        GF_per= 5577.98
 Prog= 50.11%   N_left= 141312  Time= 42.27     Time_left= 42.08        iGF= 48488.80   GF= 44711.28    iGF_per= 6061.10        GF_per= 5588.91
 Prog= 50.92%   N_left= 140544  Time= 43.06     Time_left= 41.50        iGF= 38233.96   GF= 44591.27    iGF_per= 4779.24        GF_per= 5573.91
 Prog= 51.72%   N_left= 139776  Time= 43.70     Time_left= 40.79        iGF= 47352.86   GF= 44631.53    iGF_per= 5919.11        GF_per= 5578.94
 Prog= 52.91%   N_left= 138624  Time= 44.84     Time_left= 39.92        iGF= 39078.20   GF= 44490.06    iGF_per= 4884.77        GF_per= 5561.26
 Prog= 53.68%   N_left= 137856  Time= 45.67     Time_left= 39.40        iGF= 35469.32   GF= 44326.60    iGF_per= 4433.67        GF_per= 5540.82
 Prog= 54.45%   N_left= 137088  Time= 46.27     Time_left= 38.70        iGF= 48835.00   GF= 44384.52    iGF_per= 6104.37        GF_per= 5548.07
 Prog= 55.59%   N_left= 135936  Time= 47.38     Time_left= 37.85        iGF= 38524.68   GF= 44246.68    iGF_per= 4815.59        GF_per= 5530.84
 Prog= 56.34%   N_left= 135168  Time= 47.98     Time_left= 37.18        iGF= 46924.35   GF= 44280.25    iGF_per= 5865.54        GF_per= 5535.03
 Prog= 57.08%   N_left= 134400  Time= 48.78     Time_left= 36.68        iGF= 34793.40   GF= 44124.28    iGF_per= 4349.17        GF_per= 5515.54
 Prog= 58.18%   N_left= 133248  Time= 49.88     Time_left= 35.86        iGF= 37458.72   GF= 43977.09    iGF_per= 4682.34        GF_per= 5497.14
 Prog= 58.89%   N_left= 132480  Time= 50.45     Time_left= 35.21        iGF= 48334.19   GF= 44025.55    iGF_per= 6041.77        GF_per= 5503.19
 Prog= 59.60%   N_left= 131712  Time= 51.24     Time_left= 34.73        iGF= 33628.72   GF= 43863.84    iGF_per= 4203.59        GF_per= 5482.98
 Prog= 60.31%   N_left= 130944  Time= 51.82     Time_left= 34.11        iGF= 45809.84   GF= 43885.56    iGF_per= 5726.23        GF_per= 5485.69
 Prog= 61.35%   N_left= 129792  Time= 52.86     Time_left= 33.31        iGF= 37783.38   GF= 43765.91    iGF_per= 4722.92        GF_per= 5470.74
 Prog= 62.03%   N_left= 129024  Time= 53.42     Time_left= 32.71        iGF= 45345.68   GF= 43782.68    iGF_per= 5668.21        GF_per= 5472.84
 Prog= 62.70%   N_left= 128256  Time= 54.15     Time_left= 32.21        iGF= 34954.61   GF= 43664.13    iGF_per= 4369.33        GF_per= 5458.02
 Prog= 63.70%   N_left= 127104  Time= 55.82     Time_left= 31.81        iGF= 22502.08   GF= 43031.33    iGF_per= 2812.76        GF_per= 5378.92
 Prog= 64.35%   N_left= 126336  Time= 56.67     Time_left= 31.39        iGF= 29211.49   GF= 42825.40    iGF_per= 3651.44        GF_per= 5353.18
 Prog= 65.00%   N_left= 125568  Time= 57.40     Time_left= 30.91        iGF= 33256.17   GF= 42703.25    iGF_per= 4157.02        GF_per= 5337.91
 Prog= 65.95%   N_left= 124416  Time= 58.21     Time_left= 30.05        iGF= 44190.67   GF= 42724.06    iGF_per= 5523.83        GF_per= 5340.51
 Prog= 66.58%   N_left= 123648  Time= 59.04     Time_left= 29.63        iGF= 28645.97   GF= 42527.36    iGF_per= 3580.75        GF_per= 5315.92
 Prog= 67.20%   N_left= 122880  Time= 59.54     Time_left= 29.07        iGF= 45998.47   GF= 42556.94    iGF_per= 5749.81        GF_per= 5319.62
 Prog= 68.11%   N_left= 121728  Time= 60.42     Time_left= 28.29        iGF= 39488.82   GF= 42512.62    iGF_per= 4936.10        GF_per= 5314.08
 Prog= 68.71%   N_left= 120960  Time= 61.07     Time_left= 27.81        iGF= 34586.02   GF= 42427.74    iGF_per= 4323.25        GF_per= 5303.47
 Prog= 69.30%   N_left= 120192  Time= 61.61     Time_left= 27.29        iGF= 41400.68   GF= 42418.75    iGF_per= 5175.09        GF_per= 5302.34
 Prog= 70.18%   N_left= 119040  Time= 62.86     Time_left= 26.71        iGF= 26402.30   GF= 42100.61    iGF_per= 3300.29        GF_per= 5262.58
 Prog= 70.75%   N_left= 118272  Time= 63.34     Time_left= 26.18        iGF= 45075.96   GF= 42123.15    iGF_per= 5634.50        GF_per= 5265.39
 Prog= 71.32%   N_left= 117504  Time= 64.14     Time_left= 25.80        iGF= 26608.65   GF= 41929.10    iGF_per= 3326.08        GF_per= 5241.14
 Prog= 72.15%   N_left= 116352  Time= 64.94     Time_left= 25.06        iGF= 39254.82   GF= 41896.06    iGF_per= 4906.85        GF_per= 5237.01
 Prog= 72.70%   N_left= 115584  Time= 65.45     Time_left= 24.57        iGF= 41171.51   GF= 41890.50    iGF_per= 5146.44        GF_per= 5236.31
 Prog= 73.24%   N_left= 114816  Time= 66.06     Time_left= 24.13        iGF= 33393.87   GF= 41811.98    iGF_per= 4174.23        GF_per= 5226.50
 Prog= 74.04%   N_left= 113664  Time= 66.79     Time_left= 23.42        iGF= 41230.05   GF= 41805.63    iGF_per= 5153.76        GF_per= 5225.70
 Prog= 74.56%   N_left= 112896  Time= 67.39     Time_left= 22.99        iGF= 32345.48   GF= 41720.09    iGF_per= 4043.19        GF_per= 5215.01
 Prog= 75.08%   N_left= 112128  Time= 67.91     Time_left= 22.54        iGF= 37707.11   GF= 41689.62    iGF_per= 4713.39        GF_per= 5211.20
 Prog= 75.84%   N_left= 110976  Time= 68.77     Time_left= 21.91        iGF= 33207.24   GF= 41583.13    iGF_per= 4150.90        GF_per= 5197.89
 Prog= 76.34%   N_left= 110208  Time= 69.40     Time_left= 21.51        iGF= 30057.75   GF= 41479.33    iGF_per= 3757.22        GF_per= 5184.92
 Prog= 76.83%   N_left= 109440  Time= 69.84     Time_left= 21.06        iGF= 42264.53   GF= 41484.26    iGF_per= 5283.07        GF_per= 5185.53
 Prog= 77.31%   N_left= 108672  Time= 71.08     Time_left= 20.86        iGF= 14652.32   GF= 41013.65    iGF_per= 1831.54        GF_per= 5126.71
 Prog= 78.03%   N_left= 107520  Time= 71.99     Time_left= 20.27        iGF= 29812.22   GF= 40873.13    iGF_per= 3726.53        GF_per= 5109.14
 Prog= 78.49%   N_left= 106752  Time= 72.71     Time_left= 19.92        iGF= 24299.69   GF= 40707.76    iGF_per= 3037.46        GF_per= 5088.47
 Prog= 78.95%   N_left= 105984  Time= 73.13     Time_left= 19.49        iGF= 41940.69   GF= 40714.74    iGF_per= 5242.59        GF_per= 5089.34
 Prog= 79.63%   N_left= 104832  Time= 73.95     Time_left= 18.91        iGF= 31090.48   GF= 40607.58    iGF_per= 3886.31        GF_per= 5075.95
 Prog= 80.08%   N_left= 104064  Time= 74.53     Time_left= 18.54        iGF= 28726.10   GF= 40514.59    iGF_per= 3590.76        GF_per= 5064.32
 Prog= 80.51%   N_left= 103296  Time= 74.98     Time_left= 18.15        iGF= 37150.07   GF= 40494.65    iGF_per= 4643.76        GF_per= 5061.83
 Prog= 81.16%   N_left= 102144  Time= 75.72     Time_left= 17.58        iGF= 32584.68   GF= 40416.72    iGF_per= 4073.08        GF_per= 5052.09
 Prog= 81.58%   N_left= 101376  Time= 76.20     Time_left= 17.20        iGF= 33444.77   GF= 40373.20    iGF_per= 4180.60        GF_per= 5046.65
 Prog= 82.00%   N_left= 100608  Time= 77.02     Time_left= 16.91        iGF= 19037.34   GF= 40145.25    iGF_per= 2379.67        GF_per= 5018.16
 Prog= 82.61%   N_left= 99456   Time= 77.77     Time_left= 16.37        iGF= 31036.06   GF= 40058.23    iGF_per= 3879.51        GF_per= 5007.28
 Prog= 83.01%   N_left= 98688   Time= 78.28     Time_left= 16.02        iGF= 29557.06   GF= 39989.80    iGF_per= 3694.63        GF_per= 4998.73
 Prog= 83.40%   N_left= 97920   Time= 78.75     Time_left= 15.67        iGF= 31521.92   GF= 39939.16    iGF_per= 3940.24        GF_per= 4992.40
 Prog= 83.98%   N_left= 96768   Time= 79.38     Time_left= 15.14        iGF= 34413.02   GF= 39895.00    iGF_per= 4301.63        GF_per= 4986.87
 Prog= 84.36%   N_left= 96000   Time= 79.89     Time_left= 14.81        iGF= 27780.50   GF= 39817.11    iGF_per= 3472.56        GF_per= 4977.14
 Prog= 84.73%   N_left= 95232   Time= 80.34     Time_left= 14.48        iGF= 31196.37   GF= 39768.82    iGF_per= 3899.55        GF_per= 4971.10
 Prog= 85.28%   N_left= 94080   Time= 81.14     Time_left= 14.01        iGF= 25847.14   GF= 39631.79    iGF_per= 3230.89        GF_per= 4953.97
 Prog= 85.64%   N_left= 93312   Time= 81.75     Time_left= 13.71        iGF= 22376.33   GF= 39504.58    iGF_per= 2797.04        GF_per= 4938.07
 Prog= 85.99%   N_left= 92544   Time= 82.14     Time_left= 13.38        iGF= 33941.78   GF= 39478.11    iGF_per= 4242.72        GF_per= 4934.76
 Prog= 86.50%   N_left= 91392   Time= 82.88     Time_left= 12.93        iGF= 26158.93   GF= 39358.40    iGF_per= 3269.87        GF_per= 4919.80
 Prog= 86.84%   N_left= 90624   Time= 83.22     Time_left= 12.61        iGF= 37628.00   GF= 39351.37    iGF_per= 4703.50        GF_per= 4918.92
 Prog= 87.17%   N_left= 89856   Time= 83.72     Time_left= 12.32        iGF= 24988.33   GF= 39265.49    iGF_per= 3123.54        GF_per= 4908.19
 Prog= 88.60%   N_left= 86400   Time= 85.50     Time_left= 11.00        iGF= 30110.08   GF= 39074.56    iGF_per= 3763.76        GF_per= 4884.32
 Prog= 90.05%   N_left= 82560   Time= 88.07     Time_left= 9.73 iGF= 21336.49   GF= 38557.09    iGF_per= 2667.06        GF_per= 4819.64
 Prog= 91.25%   N_left= 79104   Time= 90.30     Time_left= 8.66 iGF= 20259.09   GF= 38105.32    iGF_per= 2532.39        GF_per= 4763.17
 Prog= 92.35%   N_left= 75648   Time= 92.03     Time_left= 7.63 iGF= 23894.57   GF= 37837.86    iGF_per= 2986.82        GF_per= 4729.73
 Prog= 93.45%   N_left= 71808   Time= 94.32     Time_left= 6.61 iGF= 18238.09   GF= 37362.12    iGF_per= 2279.76        GF_per= 4670.26
 Prog= 94.35%   N_left= 68352   Time= 96.12     Time_left= 5.75 iGF= 18874.93   GF= 37016.15    iGF_per= 2359.37        GF_per= 4627.02
 Prog= 95.17%   N_left= 64896   Time= 97.55     Time_left= 4.95 iGF= 21433.84   GF= 36787.46    iGF_per= 2679.23        GF_per= 4598.43
 Prog= 95.90%   N_left= 61440   Time= 99.00     Time_left= 4.23 iGF= 19137.51   GF= 36530.45    iGF_per= 2392.19        GF_per= 4566.31
 Prog= 96.62%   N_left= 57600   Time= 100.51    Time_left= 3.51 iGF= 18001.20   GF= 36251.72    iGF_per= 2250.15        GF_per= 4531.46
 Prog= 97.19%   N_left= 54144   Time= 101.67    Time_left= 2.94 iGF= 18516.54   GF= 36048.39    iGF_per= 2314.57        GF_per= 4506.05
 Prog= 99.14%   N_left= 36480   Time= 106.77    Time_left= 0.92 iGF= 14414.00   GF= 35015.81    iGF_per= 1801.75        GF_per= 4376.98
 Prog= 99.89%   N_left= 18432   Time= 110.04    Time_left= 0.12 iGF=  8624.06   GF= 34231.83    iGF_per= 1078.01        GF_per= 4278.98
 Prog= 100.00%  N_left= 768     Time= 111.76    Time_left= 0.00 iGF=  2427.21   GF= 33742.39    iGF_per= 303.40         GF_per= 4217.80
2020-02-16 03:11:15.757
================================================================================
T/V                N    NB     P     Q               Time                 Gflops
--------------------------------------------------------------------------------
WR02L2L2      178176   384     2     4             112.82              3.342e+04
--------------------------------------------------------------------------------
||Ax-b||_oo/(eps*(||A||_oo*||x||_oo+||b||_oo)*N)=        0.0031046 ...... PASSED