Phosphor - My Pitfalls writing dependency

Currently, I'm busy writing emails for my Ph.D and taking TOEFL and taking care of the Quantum ESPRESSO library changing and MadFS Optimization, so it may waste some time. Till now, I have to apply the DTA tool of phosphor for the java order dependency project.

about surfire integration into normal tests.

  • Maven extension
    • Integration into Maven add the redirector
      • Insert phosphor plugin one class by one into.
      • Configuration to the phosphor
      • Class Visitor, Method Visitor, Adaptor Mode Visitor
    • Mutable field in the Dependency Tainter
      • Start the taint for some place attach the tainted check after the test
      • Assert the junit stuf in check=omparison.
      • Brittle assertions in check(Taint) recursively.
    • Output the tainted version into the sufire executable folder
  • Debug
    • mvn install -Dmaven.surefire.debug -f /Volumes/DataCorrupted/project/UIUC/bramble/integration-tests/pom.xml and attach the trace point.
      • Start from the maven compilation.

Brittle Assertion

This outputs only the dependency for one test introduced in Oracle Polish JPF. For dependenct between test1 and test2,

For NPE, get the pair by idflakies test first.

 JVM Asm

Reference

  1. https://www.kingkk.com/2020/08/ASM%E5%8E%86%E9%99%A9%E8%AE%B0/

QE

GPU 部分

cmake 通过修改开关,只让GPU使用 ptx 及 fortran intrinsic ABI 直接编译,替代 OpenACC的 kernel 实现。MPI 库实现替换为 ucx gpr_copy. AUARF32 时间从 4m59s 提升至 1m51s。 但在多卡上仍就有很大的通讯开销。由于 cuda_ipc 在 x86 架构上只能通过 PCIe 传输,4卡高达1小时左右。即便是开了-nk 2的情况下。 issue

profiling 总结

  1. cuFFT 使用较多, GPU实现的kernel次之,最多的运算是高维矩阵的浮点数运算,但有一定的 cache time locality 的可能,可能的优化方向是 grid 调参, ptx 调优、 Unified GPU memory 利用。
  2. VKFFT

pitfalls

  1. https://github.com/MPAS-Dev/MPAS-Model/issues/554
  2. https://forums.developer.nvidia.com/t/problem-with-nvfortran-and-r/155366
  3. LibGOMP not IMPLEMENTED: fftw/scalapack/hdf5/elpa is not dependent on the compiler's lib.

通过例子更新 wiki Fortran 部分代码

  1. vecAdd 例子 OpenAcc 与 cuda kernel 与 fortran ptx(range) 的比值约为 1.3:1:0.8.
  2. 使用 fortran 调用 cufft 及 cublas

CPU 部分

测试案例 cache 优化

针对 AUARF112 测试案例,在针对 core.F90 中的 scf 场进行 cache 调优

换 malloc 库优化

尝试对比内存使用率

IO500 部分

  1. Rust 编译器踩坑中,深入了解了 Send Sync
  2. 了解 client 中 ctor 的特性。