References:
- https://github.com/fsword73/SGEMM_on_VEGA/blob/master/sgemm_sqc_test.cpp
- https://github.com/NervanaSystems/maxas/wiki/SGEMM
- https://github.com/ROCm-Developer-Tools/HIP/tree/master/samples/2_Cookbook/10_inline_asm
Device:
- https://gist.github.com/victoryang00/cd6324acffc5ac79464d8409f768656a
Other good references:
- https://cas-pra.sugon.com/detail.html?tournament_id=6
- https://s0docs0nvidia0com.icopy.site/cuda/archive/10.0/cublas/index.html#cublas-level-1-function-reference
- https://devblogs.nvidia.com/cublas-strided-batched-matrix-multiply/
- https://ieeexplore.ieee.org/document/7839684
- https://hgpu.org/?p=15361
- https://devblogs.nvidia.com/cutlass-linear-algebra-cuda/
- https://devblogs.nvidia.com/cuda-pro-tip-fast-robust-computation-givens-rotations/