QE - vickieGPT’s blog

GPU section

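By changing the CMake switches, the GPU code is now compiled directly with PTX and the Fortran intrinsic ABI only, replacing the OpenACC kernel implementation. The MPI library's transport was switched to UCX gdr_copy. The AUARF32 runtime improved from 4m59s to 1m51s. Multi-GPU runs still carry a large communication overhead, however: because cuda_ipc on x86 can only transfer over PCIe, a 4-GPU run takes about 1 hour, even with -nk 2 enabled. (issue)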
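As a quick check of the single-GPU numbers above, the two timings work out to a speedup of roughly 2.7x (299 s down to 111 s). The sketch below only reproduces that arithmetic; the helper names `parse_min_sec` and `speedup` are made up for illustration and are not part of QE or any tool mentioned here.

```rust
/// Parse a duration written like "4m59s" into whole seconds.
/// Returns None if the string is not of the form `<minutes>m<seconds>s`.
fn parse_min_sec(s: &str) -> Option<u64> {
    let (mins, rest) = s.split_once('m')?;
    let secs = rest.strip_suffix('s')?;
    let mins: u64 = mins.parse().ok()?;
    let secs: u64 = secs.parse().ok()?;
    Some(mins * 60 + secs)
}

/// Speedup factor, defined as old runtime divided by new runtime.
/// Returns None on a parse failure or a zero new runtime.
fn speedup(before: &str, after: &str) -> Option<f64> {
    let b = parse_min_sec(before)? as f64;
    let a = parse_min_sec(after)? as f64;
    if a == 0.0 {
        return None;
    }
    Some(b / a)
}
```

Calling `speedup("4m59s", "1m51s")` gives the ratio of the two measured runtimes. Against the same 111 s baseline, the roughly 1 hour 4-GPU run is more than 30 times slower, which matches the communication bottleneck described above.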

Profiling summary

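cuFFT accounts for the largest share, followed by the GPU kernels. The dominant computation is floating-point arithmetic on high-dimensional arrays, which may offer some temporal cache locality. Possible optimization directions: grid parameter tuning, PTX tuning, and using unified GPU memory.
VkFFT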

pitfalls

https://github.com/MPAS-Dev/MPAS-Model/issues/554
https://forums.developer.nvidia.com/t/problem-with-nvfortran-and-r/155366
libgomp "not implemented": fftw/scalapack/hdf5/elpa are not dependent on the compiler's library.

Updating the Fortran code in the wiki with examples

In the vecAdd example, the OpenACC : CUDA kernel : Fortran PTX (range) ratio is roughly 1.3:1:0.8.
Calling cuFFT and cuBLAS from Fortran.

CPU section

Cache optimization for the test case

For the AUARF112 test case, cache tuning targets the SCF field in core.F90.
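The notes do not say what the cache tuning in core.F90 consists of, so the following is only a generic illustration of one common technique, loop tiling (blocking), shown on a matrix transpose. The function names, the row-major layout, and the `block` parameter are assumptions for the sketch and do not come from the QE source.

```rust
/// Naive transpose of a row-major `n x n` matrix.
/// Writes to `dst` with stride `n`, which is cache-unfriendly for large `n`.
fn transpose_naive(src: &[f64], n: usize) -> Vec<f64> {
    assert_eq!(src.len(), n * n);
    let mut dst = vec![0.0; n * n];
    for i in 0..n {
        for j in 0..n {
            dst[j * n + i] = src[i * n + j];
        }
    }
    dst
}

/// Tiled transpose: same result, but the work is done one
/// `block x block` tile at a time so that the rows touched in both
/// `src` and `dst` can stay in cache while the tile is processed.
fn transpose_blocked(src: &[f64], n: usize, block: usize) -> Vec<f64> {
    assert!(block > 0);
    assert_eq!(src.len(), n * n);
    let mut dst = vec![0.0; n * n];
    for ii in (0..n).step_by(block) {
        for jj in (0..n).step_by(block) {
            // Clamp tile edges so sizes that are not a multiple of `block` work.
            let i_end = (ii + block).min(n);
            let j_end = (jj + block).min(n);
            for i in ii..i_end {
                for j in jj..j_end {
                    dst[j * n + i] = src[i * n + j];
                }
            }
        }
    }
    dst
}
```

A suitable `block` depends on the cache sizes of the target CPU and is best chosen by measurement; the same restructuring carries over to nested Fortran loops, with the index order adjusted for column-major storage.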

Optimization by swapping the malloc library

Trying to compare memory usage.

IO500 section

While working through pitfalls with the Rust compiler, gained a deeper understanding of Send and Sync.
Learned how ctor behaves in the client.
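For readers who have not met the Send and Sync marker traits mentioned above, here is a minimal standard-library sketch. It is not taken from the IO500 client; `parallel_count`, `assert_send`, `assert_sync`, and `check_markers` are hypothetical names used only for illustration. `Send` means a value may be moved to another thread, and `Sync` means a shared reference to it may be used from several threads.

```rust
use std::sync::{Arc, Mutex};
use std::thread;

// Compile-time probes: these only compile when `T` has the given marker trait.
fn assert_send<T: Send>() {}
fn assert_sync<T: Sync>() {}

/// Spawn `workers` threads that each add 1 to a shared counter
/// `per_worker` times, then return the final total.
fn parallel_count(workers: usize, per_worker: u64) -> u64 {
    // Arc<Mutex<u64>> is Send + Sync, so clones may be moved into threads.
    let total = Arc::new(Mutex::new(0u64));
    let handles: Vec<_> = (0..workers)
        .map(|_| {
            let total = Arc::clone(&total);
            // `thread::spawn` requires the closure (and its captures) to be Send.
            thread::spawn(move || {
                for _ in 0..per_worker {
                    *total.lock().unwrap() += 1;
                }
            })
        })
        .collect();
    for h in handles {
        h.join().unwrap();
    }
    let result = *total.lock().unwrap();
    result
}

fn check_markers() {
    assert_send::<Arc<Mutex<u64>>>();
    assert_sync::<Arc<Mutex<u64>>>();
    // The next line would fail to compile, because Rc is neither Send nor Sync:
    // assert_send::<std::rc::Rc<u64>>();
}
```

Replacing `Arc` with `Rc` in `parallel_count` makes the compiler reject the `thread::spawn` call, which is the kind of error these traits tend to surface.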