On database of Go and Kubernetes and Rust

Over a few days of the long vacation in China, I found some fun things to play with. The original goal was to find a fast framework for my ugly blog, which turned out to be pointless. But the process of figuring it all out was really interesting.

My demands

What I need is a working Gitment embed so that spam comments get filtered. I could write that by hand, but I don't currently have time to study JavaScript systematically, so reusing the latest existing wheel is enough for me. Besides, I really want a Markdown writing experience, so Gatsby, Hexo and Hugo are my candidates.

Rust

First, I looked at some blogs written in Rust, which I put a great deal of effort into. https://github.com/ramsayleung/blog was a fantastic one. I found that it uses diesel to map database tables to Rust structs, like this:

table! {
    post (id) {
        id -> Int4,
        title -> Varchar,
        subtitle -> Varchar,
        raw_content -> Text,
        rendered_content -> Text,
        create_time -> Timestamp,
        modify_time -> Timestamp,
        post_type -> Int4,
        hit_time -> Int4,
        published -> Bool,
        slug_url -> Varchar,
        enable_comment -> Bool,
        tag -> Jsonb,
    }
}

table! {
    user (id) {
        id -> Int4,
        username -> Varchar,
        hashed_password -> Varchar,
        create_time -> Timestamp,
        modify_time -> Timestamp,
        email -> Varchar,
        avatar_url -> Nullable<Varchar>,
    }
}

table! {
    visitor_log (id) {
        id -> Int4,
        ip -> Inet,
        access_time -> Timestamp,
        user_id -> Int4,
    }
}

allow_tables_to_appear_in_same_query!(
    post,
    user,
    visitor_log,
);

I looked into the database our school uses. I guessed it was PostgreSQL, and I guessed right. For a static blogging site, a database feels laggy and out of date. I eventually found that even though the database API is written in Rust, calls to PostgreSQL are not that fast: on the order of tens of milliseconds per foreign-key lookup. The web part is pure JavaScript, not wasm. The Rust part only handles the request logic against the backend; it is still JavaScript that fetches the data from the database, which is sad.

Rust is still not a frontend-ready language, although it claims to be fast and high-throughput when it comes to processing data. There is https://github.com/SASUKE40/yew-starter for wasm, but it still needs JavaScript, so why not just use JavaScript?

Nearly every language that stores data through an API uses some kind of mapping between stored records and in-language structures, JSON for example.
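To make the mapping idea concrete, here is a minimal Go sketch that maps a few columns of the post table above onto a struct with encoding/json tags. The Post type, the helper names, and the choice to model the tag column as a list of strings are my own assumptions for illustration; none of this is code from the blog repository.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Post mirrors a few columns of the diesel `post` table shown earlier.
// The struct is hypothetical; only the column names come from the schema.
type Post struct {
	ID        int32    `json:"id"`
	Title     string   `json:"title"`
	Published bool     `json:"published"`
	Tag       []string `json:"tag"` // assumption: the Jsonb column holds a list of strings
}

// encodePost maps a Post onto its JSON representation.
func encodePost(p Post) (string, error) {
	b, err := json.Marshal(p)
	return string(b), err
}

// decodePost maps a JSON document back onto a Post.
func decodePost(s string) (Post, error) {
	var p Post
	err := json.Unmarshal([]byte(s), &p)
	return p, err
}

func main() {
	s, err := encodePost(Post{ID: 1, Title: "hello", Published: true, Tag: []string{"go"}})
	if err != nil {
		panic(err)
	}
	fmt.Println(s)
}
```

The struct tags play the same role as the column declarations in the table! macro: they state once how a field is named on the wire, and the library does the rest.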

Hugo

Hugo is written in Go. To begin with, I already had some experience using a Go API to handle time-series data (LSM) from HPC workloads. Go really is an out-of-the-box language, so you don't have to worry much about memory leaks or semaphores in multithreaded programs. Because many companies use the language, there are plenty of resources and a large community for CRUD and business code, from databases to HTTP servers, from JSON to YAML. Hugo is just one more part of that ecosystem. I learned a lot from this blog: https://draveness.me/few-words-time-management/.

Gatsby

Gatsby is implemented in React and requires React components. I'm not very familiar with JavaScript and have only done one project with LEAFERX, a nice guy. I eventually went back to PHP with WordPress.

Why Rust is not ready and Go is ready.

While choosing a blog framework, I noted that Rust is currently porting everything over into its own safe model. The diesel schema above strikes me as clumsy. Rust is not ready for high-throughput programs until it has better packages for native web deployment. Go is ready because it has its own coroutines (goroutines); C++2a is only catching up with that later. Go also draws in Java developers by offering close to C++ speed in a few lines of code, like Drogan/Drongan.

http package of Go

Go's net/http package wraps both the HTTP client and the HTTP server implementations. For better extensibility it introduces two interfaces, net/http.RoundTripper and net/http.Handler. net/http.RoundTripper represents the execution of an HTTP request: the caller passes the request as an argument and gets back the corresponding response. net/http.Handler is mainly used by the HTTP server to respond to client requests.
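The two sides of net/http can be seen together in a small sketch. Below, a hypothetical handlerTransport implements http.RoundTripper by calling an http.Handler directly in-process (via httptest.NewRecorder), so the client and server interfaces meet without opening a socket. The type name, the handler, and the blog.example host are assumptions made up for this example.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
)

// helloHandler is the server side: an http.Handler that answers every request.
var helloHandler = http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
	fmt.Fprintf(w, "hello %s", r.URL.Path)
})

// handlerTransport is a hypothetical http.RoundTripper that answers requests
// in-process by calling an http.Handler, instead of going over the network.
type handlerTransport struct {
	handler http.Handler
}

// RoundTrip takes the request as an argument and returns its response.
func (t handlerTransport) RoundTrip(req *http.Request) (*http.Response, error) {
	rec := httptest.NewRecorder()
	t.handler.ServeHTTP(rec, req)
	return rec.Result(), nil
}

// fetch sends one GET through a client that uses handlerTransport and
// returns the response body.
func fetch(path string) (string, error) {
	client := &http.Client{Transport: handlerTransport{handler: helloHandler}}
	resp, err := client.Get("http://blog.example" + path)
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()
	body, err := io.ReadAll(resp.Body)
	return string(body), err
}

func main() {
	body, err := fetch("/posts")
	if err != nil {
		panic(err)
	}
	fmt.Println(body)
}
```

Because http.Client only depends on the RoundTripper interface, the same client code would work unchanged with http.DefaultTransport against a real server.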

scheduler of go

Signal-based preemptive scheduler: 1.14 to present

  1. Implements true preemptive scheduling based on signals.
  2. Garbage collection triggers preemption when it scans the stack.
  3. There are still not enough preemption points to cover every edge case.

For comparison, the core of the schedule function in the early C-based runtime looked like this:

static void schedule(G *gp) {
    schedlock();
    if(gp != nil) {
        gp->m = nil;
        uint32 v = runtime·xadd(&runtime·sched.atomic, -1<<mcpuShift);
        if(atomic_mcpu(v) > maxgomaxprocs)
            runtime·throw("negative mcpu in scheduler");
        switch(gp->status){
        case Grunning:
            gp->status = Grunnable;
            gput(gp);
            break;
        case ...:
        }
    } else {
        ...
    }
    gp = nextgandunlock();
    gp->status = Grunning;
    m->curg = gp;
    gp->m = m;
    runtime·gogo(&gp->sched, 0);
}
    

How overlay networks are written in Go.

Overlay networking is not actually a new technology. It is a computer network built on top of another network, a form of network virtualization whose adoption has been driven by the evolution of cloud virtualization in recent years.

In practice, we typically use Virtual Extensible LAN (VxLAN) to set up an overlay network. In the following diagram, two physical machines can reach each other over a layer-3 IP network.
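To show what "a network built on another network" means at the packet level, here is a minimal sketch of VXLAN encapsulation in Go. It only builds and parses the 8-byte VXLAN header described in RFC 7348 (a flags byte with the I bit set, then a 24-bit VNI); in a real overlay the result would travel as the payload of a UDP datagram, conventionally on port 4789, between the two physical hosts. The function names are my own and this is not how Kubernetes or any particular CNI plugin implements it.

```go
package main

import (
	"errors"
	"fmt"
)

const vxlanHeaderLen = 8

// encapsulate prepends a VXLAN header carrying the given VNI to an inner
// Ethernet frame. Layout: flags(1) reserved(3) VNI(3) reserved(1).
func encapsulate(vni uint32, inner []byte) ([]byte, error) {
	if vni > 0xFFFFFF {
		return nil, errors.New("VNI must fit in 24 bits")
	}
	pkt := make([]byte, vxlanHeaderLen+len(inner))
	pkt[0] = 0x08 // I flag: the VNI field is valid
	pkt[4] = byte(vni >> 16)
	pkt[5] = byte(vni >> 8)
	pkt[6] = byte(vni)
	copy(pkt[vxlanHeaderLen:], inner)
	return pkt, nil
}

// decapsulate validates the header and returns the VNI and the inner frame.
func decapsulate(pkt []byte) (uint32, []byte, error) {
	if len(pkt) < vxlanHeaderLen {
		return 0, nil, errors.New("packet shorter than VXLAN header")
	}
	if pkt[0]&0x08 == 0 {
		return 0, nil, errors.New("VNI flag not set")
	}
	vni := uint32(pkt[4])<<16 | uint32(pkt[5])<<8 | uint32(pkt[6])
	return vni, pkt[vxlanHeaderLen:], nil
}

func main() {
	pkt, err := encapsulate(42, []byte("inner frame"))
	if err != nil {
		panic(err)
	}
	vni, inner, err := decapsulate(pkt)
	if err != nil {
		panic(err)
	}
	fmt.Println(vni, string(inner))
}
```

The VNI is what lets many virtual layer-2 segments share one physical layer-3 network: the receiving host uses it to decide which virtual network the inner frame belongs to.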

Reference

  1. https://draveness.me/whys-the-design-overlay-network/
  2. Kubernetes 源码剖析 (Kubernetes Source Code Analysis)

HPL result after fixing OpenIB

================================================================================
HPLinpack 2.1  --  High-Performance Linpack benchmark  --   October 26, 2012
Written by A. Petitet and R. Clint Whaley,  Innovative Computing Laboratory, UTK
Modified by Piotr Luszczek, Innovative Computing Laboratory, UTK
Modified by Julien Langou, University of Colorado Denver
================================================================================

An explanation of the input/output parameters follows:
T/V    : Wall time / encoded variant.
N      : The order of the coefficient matrix A.
NB     : The partitioning blocking factor.
P      : The number of process rows.
Q      : The number of process columns.
Time   : Time in seconds to solve the linear system.
Gflops : Rate of execution for solving the linear system.

The following parameter values will be used:

N      :  178176   178176   178176
NB     :     384
PMAP   : Row-major process mapping
P      :       2
Q      :       4
PFACT  :    Left
NBMIN  :       2
NDIV   :       2
RFACT  :    Left
BCAST  :   2ring
DEPTH  :       0
SWAP   : Spread-roll (long)
L1     : no-transposed form
U      : transposed form
EQUIL  : yes
ALIGN  : 8 double precision words

--------------------------------------------------------------------------------

- The matrix A is randomly generated for each test.
- The following scaled residual check will be computed:
      ||Ax-b||_oo / ( eps * ( || x ||_oo * || A ||_oo + || b ||_oo ) * N )
- The relative machine precision (eps) is taken to be               1.110223e-16
- Computational tests pass if scaled residuals are less than                16.0

trsm_cutoff from environment variable 9000000
gpu_dgemm_split from environment variable 1.000
check_cpu_dgemm_perf from environment variable 0

        ******** TESTING SYSTEM PARAMETERS ********
        PARAM   [UNITS]         MIN     MAX     AVG
        -----   -------         ---     ---     ---
CPU :
        CPU_BW  [GB/s ]         17.0    17.5    17.3
        CPU_FP  [GFLPS]
                NB =   32         30      51      43
                NB =   64         69      74      71
                NB =  128         78     101      94
                NB =  256         98     116     112
                NB =  512        114     125     122
PCIE (NVLINK on IBM) :
        H2D_BW  [GB/s ]         10.9    11.0    10.9
        D2H_BW  [GB/s ]         12.0    12.3    12.2
        BID_BW  [GB/s ]         16.8    17.5    17.1
CPU_BW concurrent with BID_BW :
        CPU_BW  [GB/s ]         9.3     10.3    9.9
        BID_BW  [GB/s ]         10.4    10.9    10.6
GPU :
        GPU_BW  [GB/s ]         768     774     772
        GPU_FP  [GFLPS]
                NB =  128       5456    5497    5479
                NB =  256       6312    6346    6335
                NB =  384       6635    6785    6729
                NB =  512       6146    6566    6385
                NB =  640       6255    6765    6529
                NB =  768       6178    6677    6463
                NB =  896       6296    6887    6601
                NB = 1024       6318    6760    6497
NET :
        PROC COL NET_BW [MB/s ]
                     8 B           9      10      10
                    64 B          71      82      76
                   512 B         374     425     399
                     4 KB       1660    1738    1698
                    32 KB       2562    2603    2578
                   256 KB       2551    2566    2558
                  2048 KB       2521    2686    2564
                 16384 KB       2543    2549    2545
        NET_LAT [ us  ]         2.7     3.3     3.0

        PROC ROW NET_BW [MB/s ]
                     8 B          26      29      27
                    64 B         176     185     181
                   512 B         810     867     839
                     4 KB       3487    3547    3517
                    32 KB       4715    4938    4827
                   256 KB       10310   10896   10603
                  2048 KB       3793    3812    3802
                 16384 KB       3643    3754    3699
        NET_LAT [ us  ]         0.6     0.9     0.7

displaying Prog:%complete, N:columns, Time:seconds
iGF:instantaneous GF, GF:avg GF, GF_per: process GF


Per-Process Host Memory Estimate: 32.16 GB (MAX) 32.16 GB (MIN)

PCOL: 0 GPU_COLS: 44545 CPU_COLS: 0
PCOL: 1 GPU_COLS: 44545 CPU_COLS: 0
PCOL: 3 GPU_COLS: 44545 CPU_COLS: 0
PCOL: 2 GPU_COLS: 44545 CPU_COLS: 0
2020-02-16 03:09:22.935
 Prog= 1.93%    N_left= 177024  Time= 2.05      Time_left= 104.25       iGF= 35476.18   GF= 35476.18    iGF_per= 4434.52        GF_per= 4434.52
 Prog= 3.20%    N_left= 176256  Time= 3.03      Time_left= 91.75        iGF= 48776.89   GF= 39787.78    iGF_per= 6097.11        GF_per= 4973.47
 Prog= 4.46%    N_left= 175488  Time= 4.01      Time_left= 86.02        iGF= 48350.88   GF= 41884.18    iGF_per= 6043.86        GF_per= 5235.52
 Prog= 6.33%    N_left= 174336  Time= 5.52      Time_left= 81.68        iGF= 46884.70   GF= 43246.86    iGF_per= 5860.59        GF_per= 5405.86
 Prog= 7.56%    N_left= 173568  Time= 6.45      Time_left= 78.89        iGF= 49726.67   GF= 44185.60    iGF_per= 6215.83        GF_per= 5523.20
 Prog= 8.78%    N_left= 172800  Time= 7.47      Time_left= 77.63        iGF= 45087.65   GF= 44308.93    iGF_per= 5635.96        GF_per= 5538.62
 Prog= 10.59%   N_left= 171648  Time= 8.94      Time_left= 75.41        iGF= 46760.45   GF= 44709.92    iGF_per= 5845.06        GF_per= 5588.74
 Prog= 11.79%   N_left= 170880  Time= 9.85      Time_left= 73.67        iGF= 49499.57   GF= 45152.71    iGF_per= 6187.45        GF_per= 5644.09
 Prog= 12.97%   N_left= 170112  Time= 10.84     Time_left= 72.74        iGF= 44732.85   GF= 45114.06    iGF_per= 5591.61        GF_per= 5639.26
 Prog= 14.73%   N_left= 168960  Time= 12.20     Time_left= 70.60        iGF= 48990.39   GF= 45543.73    iGF_per= 6123.80        GF_per= 5692.97
 Prog= 15.89%   N_left= 168192  Time= 13.16     Time_left= 69.67        iGF= 45314.51   GF= 45526.95    iGF_per= 5664.31        GF_per= 5690.87
 Prog= 17.03%   N_left= 167424  Time= 14.06     Time_left= 68.48        iGF= 48046.22   GF= 45688.27    iGF_per= 6005.78        GF_per= 5711.03
 Prog= 18.73%   N_left= 166272  Time= 15.46     Time_left= 67.07        iGF= 45757.55   GF= 45694.55    iGF_per= 5719.69        GF_per= 5711.82
 Prog= 19.85%   N_left= 165504  Time= 16.43     Time_left= 66.33        iGF= 43464.21   GF= 45562.56    iGF_per= 5433.03        GF_per= 5695.32
 Prog= 20.97%   N_left= 164736  Time= 17.28     Time_left= 65.14        iGF= 49538.96   GF= 45757.11    iGF_per= 6192.37        GF_per= 5719.64
 Prog= 22.61%   N_left= 163584  Time= 18.67     Time_left= 63.88        iGF= 44748.96   GF= 45682.17    iGF_per= 5593.62        GF_per= 5710.27
 Prog= 23.70%   N_left= 162816  Time= 19.52     Time_left= 62.86        iGF= 47723.89   GF= 45771.82    iGF_per= 5965.49        GF_per= 5721.48
 Prog= 24.77%   N_left= 162048  Time= 20.46     Time_left= 62.13        iGF= 43350.60   GF= 45661.18    iGF_per= 5418.83        GF_per= 5707.65
 Prog= 26.36%   N_left= 160896  Time= 21.82     Time_left= 60.94        iGF= 44130.63   GF= 45565.69    iGF_per= 5516.33        GF_per= 5695.71
 Prog= 27.41%   N_left= 160128  Time= 22.62     Time_left= 59.90        iGF= 49145.98   GF= 45693.12    iGF_per= 6143.25        GF_per= 5711.64
 Prog= 28.45%   N_left= 159360  Time= 23.58     Time_left= 59.30        iGF= 40933.20   GF= 45499.84    iGF_per= 5116.65        GF_per= 5687.48
 Prog= 29.99%   N_left= 158208  Time= 24.78     Time_left= 57.83        iGF= 48574.61   GF= 45648.24    iGF_per= 6071.83        GF_per= 5706.03
 Prog= 31.01%   N_left= 157440  Time= 25.68     Time_left= 57.14        iGF= 42424.31   GF= 45535.02    iGF_per= 5303.04        GF_per= 5691.88
 Prog= 32.01%   N_left= 156672  Time= 26.47     Time_left= 56.22        iGF= 47689.89   GF= 45599.69    iGF_per= 5961.24        GF_per= 5699.96
 Prog= 33.50%   N_left= 155520  Time= 27.81     Time_left= 55.20        iGF= 42112.81   GF= 45432.53    iGF_per= 5264.10        GF_per= 5679.07
 Prog= 34.48%   N_left= 154752  Time= 28.66     Time_left= 54.45        iGF= 43461.57   GF= 45374.03    iGF_per= 5432.70        GF_per= 5671.75
 Prog= 35.45%   N_left= 153984  Time= 29.40     Time_left= 53.53        iGF= 49076.85   GF= 45467.95    iGF_per= 6134.61        GF_per= 5683.49
 Prog= 36.41%   N_left= 153216  Time= 30.33     Time_left= 52.97        iGF= 38942.69   GF= 45267.77    iGF_per= 4867.84        GF_per= 5658.47
 Prog= 37.84%   N_left= 152064  Time= 31.44     Time_left= 51.65        iGF= 48678.40   GF= 45387.41    iGF_per= 6084.80        GF_per= 5673.43
 Prog= 38.77%   N_left= 151296  Time= 32.32     Time_left= 51.03        iGF= 40203.99   GF= 45246.42    iGF_per= 5025.50        GF_per= 5655.80
 Prog= 39.70%   N_left= 150528  Time= 33.05     Time_left= 50.20        iGF= 47611.40   GF= 45299.00    iGF_per= 5951.43        GF_per= 5662.37
 Prog= 41.08%   N_left= 149376  Time= 34.29     Time_left= 49.19        iGF= 41721.17   GF= 45169.44    iGF_per= 5215.15        GF_per= 5646.18
 Prog= 41.98%   N_left= 148608  Time= 35.17     Time_left= 48.61        iGF= 38661.58   GF= 45006.27    iGF_per= 4832.70        GF_per= 5625.78
 Prog= 42.87%   N_left= 147840  Time= 35.86     Time_left= 47.78        iGF= 48952.92   GF= 45082.13    iGF_per= 6119.11        GF_per= 5635.27
 Prog= 44.20%   N_left= 146688  Time= 37.07     Time_left= 46.80        iGF= 41375.08   GF= 44961.37    iGF_per= 5171.89        GF_per= 5620.17
 Prog= 45.07%   N_left= 145920  Time= 37.79     Time_left= 46.06        iGF= 45648.36   GF= 44974.46    iGF_per= 5706.04        GF_per= 5621.81
 Prog= 45.93%   N_left= 145152  Time= 38.60     Time_left= 45.44        iGF= 40090.04   GF= 44871.78    iGF_per= 5011.25        GF_per= 5608.97
 Prog= 47.21%   N_left= 144000  Time= 39.82     Time_left= 44.52        iGF= 39552.58   GF= 44709.14    iGF_per= 4944.07        GF_per= 5588.64
 Prog= 48.05%   N_left= 143232  Time= 40.47     Time_left= 43.75        iGF= 49061.56   GF= 44778.59    iGF_per= 6132.69        GF_per= 5597.32
 Prog= 48.88%   N_left= 142464  Time= 41.31     Time_left= 43.20        iGF= 37191.35   GF= 44623.80    iGF_per= 4648.92        GF_per= 5577.98
 Prog= 50.11%   N_left= 141312  Time= 42.27     Time_left= 42.08        iGF= 48488.80   GF= 44711.28    iGF_per= 6061.10        GF_per= 5588.91
 Prog= 50.92%   N_left= 140544  Time= 43.06     Time_left= 41.50        iGF= 38233.96   GF= 44591.27    iGF_per= 4779.24        GF_per= 5573.91
 Prog= 51.72%   N_left= 139776  Time= 43.70     Time_left= 40.79        iGF= 47352.86   GF= 44631.53    iGF_per= 5919.11        GF_per= 5578.94
 Prog= 52.91%   N_left= 138624  Time= 44.84     Time_left= 39.92        iGF= 39078.20   GF= 44490.06    iGF_per= 4884.77        GF_per= 5561.26
 Prog= 53.68%   N_left= 137856  Time= 45.67     Time_left= 39.40        iGF= 35469.32   GF= 44326.60    iGF_per= 4433.67        GF_per= 5540.82
 Prog= 54.45%   N_left= 137088  Time= 46.27     Time_left= 38.70        iGF= 48835.00   GF= 44384.52    iGF_per= 6104.37        GF_per= 5548.07
 Prog= 55.59%   N_left= 135936  Time= 47.38     Time_left= 37.85        iGF= 38524.68   GF= 44246.68    iGF_per= 4815.59        GF_per= 5530.84
 Prog= 56.34%   N_left= 135168  Time= 47.98     Time_left= 37.18        iGF= 46924.35   GF= 44280.25    iGF_per= 5865.54        GF_per= 5535.03
 Prog= 57.08%   N_left= 134400  Time= 48.78     Time_left= 36.68        iGF= 34793.40   GF= 44124.28    iGF_per= 4349.17        GF_per= 5515.54
 Prog= 58.18%   N_left= 133248  Time= 49.88     Time_left= 35.86        iGF= 37458.72   GF= 43977.09    iGF_per= 4682.34        GF_per= 5497.14
 Prog= 58.89%   N_left= 132480  Time= 50.45     Time_left= 35.21        iGF= 48334.19   GF= 44025.55    iGF_per= 6041.77        GF_per= 5503.19
 Prog= 59.60%   N_left= 131712  Time= 51.24     Time_left= 34.73        iGF= 33628.72   GF= 43863.84    iGF_per= 4203.59        GF_per= 5482.98
 Prog= 60.31%   N_left= 130944  Time= 51.82     Time_left= 34.11        iGF= 45809.84   GF= 43885.56    iGF_per= 5726.23        GF_per= 5485.69
 Prog= 61.35%   N_left= 129792  Time= 52.86     Time_left= 33.31        iGF= 37783.38   GF= 43765.91    iGF_per= 4722.92        GF_per= 5470.74
 Prog= 62.03%   N_left= 129024  Time= 53.42     Time_left= 32.71        iGF= 45345.68   GF= 43782.68    iGF_per= 5668.21        GF_per= 5472.84
 Prog= 62.70%   N_left= 128256  Time= 54.15     Time_left= 32.21        iGF= 34954.61   GF= 43664.13    iGF_per= 4369.33        GF_per= 5458.02
 Prog= 63.70%   N_left= 127104  Time= 55.82     Time_left= 31.81        iGF= 22502.08   GF= 43031.33    iGF_per= 2812.76        GF_per= 5378.92
 Prog= 64.35%   N_left= 126336  Time= 56.67     Time_left= 31.39        iGF= 29211.49   GF= 42825.40    iGF_per= 3651.44        GF_per= 5353.18
 Prog= 65.00%   N_left= 125568  Time= 57.40     Time_left= 30.91        iGF= 33256.17   GF= 42703.25    iGF_per= 4157.02        GF_per= 5337.91
 Prog= 65.95%   N_left= 124416  Time= 58.21     Time_left= 30.05        iGF= 44190.67   GF= 42724.06    iGF_per= 5523.83        GF_per= 5340.51
 Prog= 66.58%   N_left= 123648  Time= 59.04     Time_left= 29.63        iGF= 28645.97   GF= 42527.36    iGF_per= 3580.75        GF_per= 5315.92
 Prog= 67.20%   N_left= 122880  Time= 59.54     Time_left= 29.07        iGF= 45998.47   GF= 42556.94    iGF_per= 5749.81        GF_per= 5319.62
 Prog= 68.11%   N_left= 121728  Time= 60.42     Time_left= 28.29        iGF= 39488.82   GF= 42512.62    iGF_per= 4936.10        GF_per= 5314.08
 Prog= 68.71%   N_left= 120960  Time= 61.07     Time_left= 27.81        iGF= 34586.02   GF= 42427.74    iGF_per= 4323.25        GF_per= 5303.47
 Prog= 69.30%   N_left= 120192  Time= 61.61     Time_left= 27.29        iGF= 41400.68   GF= 42418.75    iGF_per= 5175.09        GF_per= 5302.34
 Prog= 70.18%   N_left= 119040  Time= 62.86     Time_left= 26.71        iGF= 26402.30   GF= 42100.61    iGF_per= 3300.29        GF_per= 5262.58
 Prog= 70.75%   N_left= 118272  Time= 63.34     Time_left= 26.18        iGF= 45075.96   GF= 42123.15    iGF_per= 5634.50        GF_per= 5265.39
 Prog= 71.32%   N_left= 117504  Time= 64.14     Time_left= 25.80        iGF= 26608.65   GF= 41929.10    iGF_per= 3326.08        GF_per= 5241.14
 Prog= 72.15%   N_left= 116352  Time= 64.94     Time_left= 25.06        iGF= 39254.82   GF= 41896.06    iGF_per= 4906.85        GF_per= 5237.01
 Prog= 72.70%   N_left= 115584  Time= 65.45     Time_left= 24.57        iGF= 41171.51   GF= 41890.50    iGF_per= 5146.44        GF_per= 5236.31
 Prog= 73.24%   N_left= 114816  Time= 66.06     Time_left= 24.13        iGF= 33393.87   GF= 41811.98    iGF_per= 4174.23        GF_per= 5226.50
 Prog= 74.04%   N_left= 113664  Time= 66.79     Time_left= 23.42        iGF= 41230.05   GF= 41805.63    iGF_per= 5153.76        GF_per= 5225.70
 Prog= 74.56%   N_left= 112896  Time= 67.39     Time_left= 22.99        iGF= 32345.48   GF= 41720.09    iGF_per= 4043.19        GF_per= 5215.01
 Prog= 75.08%   N_left= 112128  Time= 67.91     Time_left= 22.54        iGF= 37707.11   GF= 41689.62    iGF_per= 4713.39        GF_per= 5211.20
 Prog= 75.84%   N_left= 110976  Time= 68.77     Time_left= 21.91        iGF= 33207.24   GF= 41583.13    iGF_per= 4150.90        GF_per= 5197.89
 Prog= 76.34%   N_left= 110208  Time= 69.40     Time_left= 21.51        iGF= 30057.75   GF= 41479.33    iGF_per= 3757.22        GF_per= 5184.92
 Prog= 76.83%   N_left= 109440  Time= 69.84     Time_left= 21.06        iGF= 42264.53   GF= 41484.26    iGF_per= 5283.07        GF_per= 5185.53
 Prog= 77.31%   N_left= 108672  Time= 71.08     Time_left= 20.86        iGF= 14652.32   GF= 41013.65    iGF_per= 1831.54        GF_per= 5126.71
 Prog= 78.03%   N_left= 107520  Time= 71.99     Time_left= 20.27        iGF= 29812.22   GF= 40873.13    iGF_per= 3726.53        GF_per= 5109.14
 Prog= 78.49%   N_left= 106752  Time= 72.71     Time_left= 19.92        iGF= 24299.69   GF= 40707.76    iGF_per= 3037.46        GF_per= 5088.47
 Prog= 78.95%   N_left= 105984  Time= 73.13     Time_left= 19.49        iGF= 41940.69   GF= 40714.74    iGF_per= 5242.59        GF_per= 5089.34
 Prog= 79.63%   N_left= 104832  Time= 73.95     Time_left= 18.91        iGF= 31090.48   GF= 40607.58    iGF_per= 3886.31        GF_per= 5075.95
 Prog= 80.08%   N_left= 104064  Time= 74.53     Time_left= 18.54        iGF= 28726.10   GF= 40514.59    iGF_per= 3590.76        GF_per= 5064.32
 Prog= 80.51%   N_left= 103296  Time= 74.98     Time_left= 18.15        iGF= 37150.07   GF= 40494.65    iGF_per= 4643.76        GF_per= 5061.83
 Prog= 81.16%   N_left= 102144  Time= 75.72     Time_left= 17.58        iGF= 32584.68   GF= 40416.72    iGF_per= 4073.08        GF_per= 5052.09
 Prog= 81.58%   N_left= 101376  Time= 76.20     Time_left= 17.20        iGF= 33444.77   GF= 40373.20    iGF_per= 4180.60        GF_per= 5046.65
 Prog= 82.00%   N_left= 100608  Time= 77.02     Time_left= 16.91        iGF= 19037.34   GF= 40145.25    iGF_per= 2379.67        GF_per= 5018.16
 Prog= 82.61%   N_left= 99456   Time= 77.77     Time_left= 16.37        iGF= 31036.06   GF= 40058.23    iGF_per= 3879.51        GF_per= 5007.28
 Prog= 83.01%   N_left= 98688   Time= 78.28     Time_left= 16.02        iGF= 29557.06   GF= 39989.80    iGF_per= 3694.63        GF_per= 4998.73
 Prog= 83.40%   N_left= 97920   Time= 78.75     Time_left= 15.67        iGF= 31521.92   GF= 39939.16    iGF_per= 3940.24        GF_per= 4992.40
 Prog= 83.98%   N_left= 96768   Time= 79.38     Time_left= 15.14        iGF= 34413.02   GF= 39895.00    iGF_per= 4301.63        GF_per= 4986.87
 Prog= 84.36%   N_left= 96000   Time= 79.89     Time_left= 14.81        iGF= 27780.50   GF= 39817.11    iGF_per= 3472.56        GF_per= 4977.14
 Prog= 84.73%   N_left= 95232   Time= 80.34     Time_left= 14.48        iGF= 31196.37   GF= 39768.82    iGF_per= 3899.55        GF_per= 4971.10
 Prog= 85.28%   N_left= 94080   Time= 81.14     Time_left= 14.01        iGF= 25847.14   GF= 39631.79    iGF_per= 3230.89        GF_per= 4953.97
 Prog= 85.64%   N_left= 93312   Time= 81.75     Time_left= 13.71        iGF= 22376.33   GF= 39504.58    iGF_per= 2797.04        GF_per= 4938.07
 Prog= 85.99%   N_left= 92544   Time= 82.14     Time_left= 13.38        iGF= 33941.78   GF= 39478.11    iGF_per= 4242.72        GF_per= 4934.76
 Prog= 86.50%   N_left= 91392   Time= 82.88     Time_left= 12.93        iGF= 26158.93   GF= 39358.40    iGF_per= 3269.87        GF_per= 4919.80
 Prog= 86.84%   N_left= 90624   Time= 83.22     Time_left= 12.61        iGF= 37628.00   GF= 39351.37    iGF_per= 4703.50        GF_per= 4918.92
 Prog= 87.17%   N_left= 89856   Time= 83.72     Time_left= 12.32        iGF= 24988.33   GF= 39265.49    iGF_per= 3123.54        GF_per= 4908.19
 Prog= 88.60%   N_left= 86400   Time= 85.50     Time_left= 11.00        iGF= 30110.08   GF= 39074.56    iGF_per= 3763.76        GF_per= 4884.32
 Prog= 90.05%   N_left= 82560   Time= 88.07     Time_left= 9.73 iGF= 21336.49   GF= 38557.09    iGF_per= 2667.06        GF_per= 4819.64
 Prog= 91.25%   N_left= 79104   Time= 90.30     Time_left= 8.66 iGF= 20259.09   GF= 38105.32    iGF_per= 2532.39        GF_per= 4763.17
 Prog= 92.35%   N_left= 75648   Time= 92.03     Time_left= 7.63 iGF= 23894.57   GF= 37837.86    iGF_per= 2986.82        GF_per= 4729.73
 Prog= 93.45%   N_left= 71808   Time= 94.32     Time_left= 6.61 iGF= 18238.09   GF= 37362.12    iGF_per= 2279.76        GF_per= 4670.26
 Prog= 94.35%   N_left= 68352   Time= 96.12     Time_left= 5.75 iGF= 18874.93   GF= 37016.15    iGF_per= 2359.37        GF_per= 4627.02
 Prog= 95.17%   N_left= 64896   Time= 97.55     Time_left= 4.95 iGF= 21433.84   GF= 36787.46    iGF_per= 2679.23        GF_per= 4598.43
 Prog= 95.90%   N_left= 61440   Time= 99.00     Time_left= 4.23 iGF= 19137.51   GF= 36530.45    iGF_per= 2392.19        GF_per= 4566.31
 Prog= 96.62%   N_left= 57600   Time= 100.51    Time_left= 3.51 iGF= 18001.20   GF= 36251.72    iGF_per= 2250.15        GF_per= 4531.46
 Prog= 97.19%   N_left= 54144   Time= 101.67    Time_left= 2.94 iGF= 18516.54   GF= 36048.39    iGF_per= 2314.57        GF_per= 4506.05
 Prog= 99.14%   N_left= 36480   Time= 106.77    Time_left= 0.92 iGF= 14414.00   GF= 35015.81    iGF_per= 1801.75        GF_per= 4376.98
 Prog= 99.89%   N_left= 18432   Time= 110.04    Time_left= 0.12 iGF=  8624.06   GF= 34231.83    iGF_per= 1078.01        GF_per= 4278.98
 Prog= 100.00%  N_left= 768     Time= 111.76    Time_left= 0.00 iGF=  2427.21   GF= 33742.39    iGF_per= 303.40         GF_per= 4217.80
2020-02-16 03:11:15.757
================================================================================
T/V                N    NB     P     Q               Time                 Gflops
--------------------------------------------------------------------------------
WR02L2L2      178176   384     2     4             112.82              3.342e+04
--------------------------------------------------------------------------------
||Ax-b||_oo/(eps*(||A||_oo*||x||_oo+||b||_oo)*N)=        0.0031046 ...... PASSED
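The Gflops figure in the summary line can be sanity-checked by hand. The sketch below assumes the standard HPL operation count for an LU solve, 2/3*N^3 + 3/2*N^2 floating-point operations, divided by the wall time; the function name is mine. With N = 178176 and Time = 112.82 from the table it gives roughly 3.34e4 Gflops, in line with the reported 3.342e+04, and dividing by the P x Q = 8 processes gives a per-process rate a little under 4200 Gflops.

```go
package main

import "fmt"

// hplGflops returns the rate for an N x N solve that took `seconds`,
// using the usual LU operation count 2/3*N^3 + 3/2*N^2.
func hplGflops(n, seconds float64) float64 {
	flops := (2.0/3.0)*n*n*n + 1.5*n*n
	return flops / seconds / 1e9
}

func main() {
	total := hplGflops(178176, 112.82) // N and Time from the summary line
	perProcess := total / (2 * 4)      // P = 2, Q = 4
	fmt.Printf("total: %.0f Gflops, per process: %.0f Gflops\n", total, perProcess)
}
```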

The result of our implementation in CAS

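rocBLAS itself is implemented as architecture-generic Tensile machine code and rarely takes ISA-specific optimizations into account, such as memory loads, register allocation, and block size. After profiling and extracting the kernels of both rocblas_sgemm_strided_batch and our own naive batched version, we identified several key difficulties.

The first concerns blocks and threads. One point is the peak thread rate: the rate at which the hardware generates threads directly affects the efficiency of the final program, for example the read and write speed of GPU memory. Our tests show that gfx906 reaches a limit of 64 threads/cycle. The other point is the thread-rate curve for 1D shapes. Tests show that the thread generation rate peaks only when BlockDim = 256, 512, or 1024. In other words, if the work of 4 threads can be merged into one thread, each thread handles 4 times as many transactions (reads and writes, for example), which greatly raises the theoretical limit. After also testing 2D and 3D shapes, we concluded that a BlockDim that is a multiple of 256 performs best.

The second concerns instruction cache lines, where the results show that adding more threads raises SIMD efficiency more effectively.

After analyzing the dumped VGPR registers, we found the following. HIPCC does not use s_load_dword most of the time. C++ statements cannot use the output of inline assembly as operands; C++ operands must come from C++. 64-bit addresses or data are hard to use in inline assembly. Inline assembly is hard to control with C++ variables, if-else, for loops and so on. HIPCC does not allocate SGPRs and VGPRs very well. There is a GCN restriction: for s_load_dwordx4, x8, and x16 the SGPR must start at an x4 address. And it is hard for inline assembly to end up with the correct VGPR/SGPR settings.

This led to our solution, which is about how the Macro-Tile and Micro-Tile are assigned for each WorkGroup. In VLP SGEMM, a WorkGroup of size 128 uses a Macro-Tile of M=64, N=128, and a WorkGroup of size 256 uses M=64, N=256. Each thread's Micro-Tile is M=64, N=1, that is, each thread computes with Matrix A = 64xK and Matrix B = Kx64, with the result in a 64x1 part of Matrix C. For the 64 threads, the Matrix-C addresses for each M are contiguous. The base offset of Matrix A for each wave is set to N/64 *64 *lda, and for B it is M/64 *64 lda; the data is fetched with s_load_dwordx8 s[32:39], s[12:13], s18. The GCN architecture has 96 usable SGPRs in total. This algorithm uses s32 to s95, only 64 of them, to read A, and the 88 parallel-load design improves efficiency.

For B, each thread uses a Micro-Tile of M = 64, N = 1. Each thread needs 8 VGPRs to load the 8xK data of one N. The algorithm uses global_load_dwordx4 to get the best cache-line hit rate; the next memory read instruction reads the next 4 DWORDs of the same cache line. As for VGPR allocation, each thread needs V[2:3] as its per-thread offset into Matrix B. Double-buffered loading of Matrix B needs 16 VGPRs. This way a total of 83 remaining VGPRs serve 3 waves per SIMD, which gave very good performance.

In addition, assigning variables to registers first and then disassembling to machine code (equivalent to inlining a static library) means there are no scheduler-induced barriers and no LDS at all (LDS access is slower than L1 and VGPRs). In the end, these steps let gfx906 reach up to 77% of its peak performance.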
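For readers who have not met the operation before, here is a plain CPU reference in Go for what a strided batched SGEMM computes. It is only a definition-level sketch under my own assumptions (column-major storage, no transposes, the usual alpha/beta form C_i = alpha*A_i*B_i + beta*C_i); it has nothing to do with the GPU kernel or its register layout, and the function name is hypothetical.

```go
package main

import "fmt"

// sgemmStridedBatched computes C_i = alpha*A_i*B_i + beta*C_i for every i in
// [0, batch). Matrices are column-major: A_i is m x k, B_i is k x n and C_i is
// m x n. Matrix i of each operand starts at offset i*stride in its slice.
func sgemmStridedBatched(m, n, k int, alpha float32,
	a []float32, lda, strideA int,
	b []float32, ldb, strideB int,
	beta float32,
	c []float32, ldc, strideC int, batch int) {
	for p := 0; p < batch; p++ {
		ap, bp, cp := a[p*strideA:], b[p*strideB:], c[p*strideC:]
		for j := 0; j < n; j++ {
			for i := 0; i < m; i++ {
				var acc float32
				for l := 0; l < k; l++ {
					acc += ap[i+l*lda] * bp[l+j*ldb]
				}
				cp[i+j*ldc] = alpha*acc + beta*cp[i+j*ldc]
			}
		}
	}
}

func main() {
	// Two 2x2 problems stored back to back (stride 4), column-major.
	a := []float32{1, 2, 3, 4, 1, 0, 0, 1} // A_0 = [[1,3],[2,4]], A_1 = identity
	b := []float32{1, 0, 0, 1, 5, 6, 7, 8} // B_0 = identity, B_1 = [[5,7],[6,8]]
	c := make([]float32, 8)
	sgemmStridedBatched(2, 2, 2, 1, a, 2, 4, b, 2, 4, 0, c, 2, 4, 2)
	fmt.Println(c) // C_0 equals A_0 and C_1 equals B_1
}
```

A reference like this is mainly useful for checking the output of an optimized kernel on small sizes before trusting its timings.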

doc

Build and run

make
./lib/sgemm_strided_batch_final -m 512 -n 512 -k 256 --batch_count 10

Generate

make compile_co

Creativity

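We studied GPU architecture in depth, specifically the GCN architecture, and optimized by working with assembly and disassembled code. The main contribution is an optimization approach for code, GPU code in particular: first compile the inline functions with pre-assigned VGPRs, together with some other tools, to machine code, then disassemble into a .co file that the cpp files needing it use as an external library. This takes full advantage of the architecture and thereby accelerates sgemm.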

Applications

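SgemmBatchedStrided has a great many applications. But I think what best shows the value of this submission is CNN convolution, that is, optimizing existing CNN code with architecture-aware techniques. Use profiling and dump tools to analyze the assembly of an existing CNN convolution; the main things to look at include L1/L2 cache hit latency and cache-line length. After that, the code can be optimized by inlining simple assembly and disassembling again.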

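A few things to watch when booting the supercomputer

This one is Xiao Yuan's fault.

BIOS: CentOS must come first in the boot order. flexyboot on the Supermicro motherboard must be disabled, otherwise the machine will hang the way it did this time.

Services to start at boot:

  • 1. NIC: nm r ncli
  • 2. login service
  • 3. ssh
  • 4. disable firewall

Everything else can be done after boot, so it doesn't matter.

Then check whether HT is on, whether nvidia-smi is there, and whether any memory has dropped out.

To reduce power draw, Supermicro boards allow PCIe hot-plugging: a GPU draws 25 W just by being plugged in, and as heat builds up it rises to around 30 W on average even when unused. So check at boot whether hot-plugging works.

Finally, in any case, don't reboot unless you have to.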

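[Parallel Computing] The MPI message-passing model

MPI's design for the message-passing model
Before starting the tutorial, I will first explain some classic concepts in how MPI designs its message-passing model. The first concept is the communicator. A communicator defines a group of processes that can send messages to one another. Within this group, each process is assigned a number called its rank, and processes communicate explicitly by specifying ranks.

Communication is built on send and receive operations between processes. A process can send a message to another process by specifying that process's rank and a unique message tag. The receiver can post a request to receive a message with a particular tag (or ignore the tag entirely and receive any message), and then handle the received data in order. Communication like this, involving one sender and one receiver, is called point-to-point communication.

Of course, in many cases a process may need to communicate with all the other processes, for example when the master process wants to broadcast to all the worker processes. In that case, hand-writing point-to-point message passing process by process is clumsy, and in practice it also leads to poor network utilization. MPI has dedicated interfaces to handle this kind of collective communication among all processes.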
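The concepts above (communicator, rank, tag, point-to-point, and the clumsy hand-written broadcast) can be modelled in a few lines. To be clear, the sketch below is not MPI: it is a toy model in Go where goroutines stand in for processes and channels stand in for the network, and every name in it (comm, send, recv, anyTag) is hypothetical. In real MPI the corresponding calls would be MPI_Send and MPI_Recv, with MPI_ANY_TAG for the wildcard and MPI_Bcast for the collective.

```go
package main

import (
	"fmt"
	"sync"
)

const anyTag = -1 // wildcard: receive a message regardless of its tag

type message struct {
	source, tag int
	data        string
}

// comm is a toy communicator: a group of ranks, each with its own inbox.
type comm struct {
	inbox   []chan message
	pending [][]message // received but not yet matched, per rank
}

func newComm(size int) *comm {
	c := &comm{inbox: make([]chan message, size), pending: make([][]message, size)}
	for i := range c.inbox {
		c.inbox[i] = make(chan message, 16)
	}
	return c
}

// send delivers data from rank `from` to rank `to` under the given tag.
func (c *comm) send(from, to, tag int, data string) {
	c.inbox[to] <- message{source: from, tag: tag, data: data}
}

// recv blocks until rank `me` has a message whose tag matches.
// Messages with other tags are set aside for later calls.
func (c *comm) recv(me, tag int) message {
	for i, m := range c.pending[me] {
		if tag == anyTag || m.tag == tag {
			c.pending[me] = append(c.pending[me][:i], c.pending[me][i+1:]...)
			return m
		}
	}
	for {
		m := <-c.inbox[me]
		if tag == anyTag || m.tag == tag {
			return m
		}
		c.pending[me] = append(c.pending[me], m)
	}
}

// run lets rank 0 "broadcast" by hand with one point-to-point send per
// worker (tag 1), then collects one reply per worker (tag 2).
func run(size int) []string {
	c := newComm(size)
	replies := make([]string, size)
	var wg sync.WaitGroup
	for rank := 1; rank < size; rank++ {
		wg.Add(1)
		go func(rank int) {
			defer wg.Done()
			m := c.recv(rank, 1)
			c.send(rank, 0, 2, fmt.Sprintf("rank %d got %q", rank, m.data))
		}(rank)
	}
	for to := 1; to < size; to++ {
		c.send(0, to, 1, "hello")
	}
	for i := 1; i < size; i++ {
		m := c.recv(0, 2)
		replies[m.source] = m.data
	}
	wg.Wait()
	return replies[1:]
}

func main() {
	for _, r := range run(4) {
		fmt.Println(r)
	}
}
```

The loop in run that sends to each worker in turn is exactly the hand-written broadcast the text calls clumsy; a collective operation replaces that loop with a single call that the library is free to implement more efficiently.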