I expect this code challenge to be fiercely competitive, since HPC powerhouses like GaTech and ETHz are taking part; it feels like NVIDIA wants to squeeze some optimizations out of us students, much as 罡兴投资 got everyone interested in SIMD optimization on NICs. This post briefly records the points that can be optimized.
Xcompact3d is a computational fluid dynamics simulation code. The task is to port its MPI alltoall to the DPU and tune it.
- ConnectX-6 network adapter with 200Gbps InfiniBand
- System-on-chip containing eight 64-bit ARMv8 A72 cores at 2.75 GHz each
- 16 GB of memory for the ARM cores
What the BlueField DPU-aware MPI does is automatically offload the data of non-blocking collectives into the memory of the Arm cores, which process it and then send the packets out. (Though with carefully engineered packet handling, probably no FPGA could do it much faster; alternatively, FPGA code could be generated automatically.) All the code on the Arm cores is open for development and is used together with the SDK.
With the default library on a machine without offloading, MPI non-blocking collectives achieve very little overlap; with offloading, the host CPU can spend more time on useful computation instead of waiting for the all-to-all result.
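To make the overlap concrete, here is a minimal host-side sketch using the plain MPI-3 non-blocking API (which is what the DPU library offloads transparently); the program and buffer names are illustrative, not taken from Incompact3d:

program ialltoall_overlap
  use mpi
  implicit none
  integer, parameter :: n = 1024          ! elements sent to each rank (illustrative)
  integer :: nprocs, rank, req, ierr
  real(8), allocatable :: sendbuf(:), recvbuf(:)

  call MPI_Init(ierr)
  call MPI_Comm_size(MPI_COMM_WORLD, nprocs, ierr)
  call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
  allocate(sendbuf(n*nprocs), recvbuf(n*nprocs))
  sendbuf = real(rank, 8)

  ! post the non-blocking all-to-all; with DPU offload the progress runs
  ! on the BlueField Arm cores instead of stealing host CPU cycles
  call MPI_Ialltoall(sendbuf, n, MPI_DOUBLE_PRECISION, &
                     recvbuf, n, MPI_DOUBLE_PRECISION, &
                     MPI_COMM_WORLD, req, ierr)

  ! ... useful host computation goes here: this is the overlap window ...

  ! synchronisation point: only now do we need the exchanged data
  call MPI_Wait(req, MPI_STATUS_IGNORE, ierr)

  call MPI_Finalize(ierr)
end program ialltoall_overlap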
Requirements for Next-Generation MPI Libraries
- Message Passing Interface (MPI) libraries are used for HPC and AI applications
- Requirements for a high-performance and scalable MPI library:
- Low latency communication
- High bandwidth communication
- Minimum contention for host CPU resources to progress non-blocking collectives
- High overlap of computation with communication
- CPU-based non-blocking communication progress can lead to sub-par performance, as the main application has fewer CPU resources left for useful application-level computation
Network offload mechanisms are gaining traction as they have the potential to completely offload the communication of MPI primitives into the network.
Current release
- Supports offloads of the following non-blocking collectives
- Alltoall (MPI_Ialltoall)
- Allgather (MPI_Iallgather)
- Broadcast (MPI_Ibcast)
Task 1 Modifying the given apps to leverage the NV DPU
We need to get https://github.com/xcompact3d/incompact3d running in DPU offload mode and tweak its parameters.
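For reference, the parameters one typically tweaks are the problem size and the 2D pencil decomposition; in recent Xcompact3d versions these live in a Fortran namelist input file, roughly of the shape below (values and exact group layout are illustrative; check the example input files in the repo):

&BasicParam
  nx = 257              ! grid points in x (illustrative values)
  ny = 128
  nz = 128
  p_row = 8             ! 2DECOMP pencil grid: p_row * p_col = total MPI ranks
  p_col = 16
/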
Task 2 Performance assessment of the original versus the modified version
That is, produce an 8-node strong-scaling plot, with PPN (processes per node) ranging from 4 to 32.
Task 3 Summarize all the findings and results
This part covers the division of work and the improvements, e.g. the optimization of the FFT (which all the PDE solvers here need).
subroutine transpose_x_to_y_real(src, dst, opt_decomp)
  implicit none
  real(mytype), dimension(:,:,:), intent(IN) :: src
  real(mytype), dimension(:,:,:), intent(OUT) :: dst
  TYPE(DECOMP_INFO), intent(IN), optional :: opt_decomp
  TYPE(DECOMP_INFO) :: decomp
  ! rearrange source array as send buffer
  call mem_split_xy_real(src, s1, s2, s3, work1_r, dims(1), &
       decomp%x1dist, decomp)
  ! transpose using MPI_ALLTOALL(V)
  call MPI_ALLTOALLV(work1_r, decomp%x1cnts, decomp%x1disp, &
       real_type, work2_r, decomp%y1cnts, decomp%y1disp, &
       real_type, DECOMP_2D_COMM_COL, ierror)
  ! rearrange receive buffer
  call mem_merge_xy_real(work2_r, d1, d2, d3, dst, dims(1), &
       decomp%y1dist, decomp)
  return
end subroutine transpose_x_to_y_real
Change this to a non-blocking transpose prototype: post the MPI communication first (start), let the main process keep computing, while the progress side still blocks on the transfer; only the wait call blocks the caller (see the sketch after the interface block below).
interface transpose_x_to_y_start
module procedure transpose_x_to_y_real_start
module procedure transpose_x_to_y_complex_start
end interface transpose_x_to_y_start
interface transpose_x_to_y_wait
module procedure transpose_x_to_y_real_wait
module procedure transpose_x_to_y_complex_wait
end interface transpose_x_to_y_wait
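As a rough illustration (not the actual 2DECOMP&FFT implementation), the split could look like the following, assuming the same module context as the excerpt above (mytype, real_type, dims, DECOMP_2D_COMM_COL, mem_split_xy_real/mem_merge_xy_real) and that the request handle plus the send/receive work buffers are carried between the two calls:

subroutine transpose_x_to_y_real_start(handle, src, dst, sbuf, rbuf, decomp)
  implicit none
  integer, intent(OUT) :: handle                       ! MPI request, completed in _wait
  real(mytype), dimension(:,:,:), intent(IN) :: src
  real(mytype), dimension(:,:,:), intent(OUT) :: dst   ! not touched until _wait
  real(mytype), dimension(:), intent(OUT) :: sbuf, rbuf
  TYPE(DECOMP_INFO), intent(IN) :: decomp
  integer :: ierror
  ! pack the x-pencil data into the send buffer
  call mem_split_xy_real(src, size(src,1), size(src,2), size(src,3), &
       sbuf, dims(1), decomp%x1dist, decomp)
  ! post the all-to-all; it is progressed (e.g. on the DPU) while the host computes
  call MPI_IALLTOALLV(sbuf, decomp%x1cnts, decomp%x1disp, real_type, &
       rbuf, decomp%y1cnts, decomp%y1disp, real_type, &
       DECOMP_2D_COMM_COL, handle, ierror)
end subroutine transpose_x_to_y_real_start

subroutine transpose_x_to_y_real_wait(handle, src, dst, sbuf, rbuf, decomp)
  implicit none
  integer, intent(INOUT) :: handle
  real(mytype), dimension(:,:,:), intent(IN) :: src
  real(mytype), dimension(:,:,:), intent(OUT) :: dst
  real(mytype), dimension(:), intent(IN) :: sbuf, rbuf
  TYPE(DECOMP_INFO), intent(IN) :: decomp
  integer :: ierror
  ! synchronisation point: block until the posted all-to-all has completed
  call MPI_WAIT(handle, MPI_STATUS_IGNORE, ierror)
  ! unpack the receive buffer into the y-pencil destination array
  call mem_merge_xy_real(rbuf, size(dst,1), size(dst,2), size(dst,3), &
       dst, dims(1), decomp%y1dist, decomp)
end subroutine transpose_x_to_y_real_wait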
On top of that, they also implemented a shared-memory optimization.
#ifdef SHM
if (decomp%COL_INFO%CORE_ME==1) THEN
call MPI_ALLTOALLV(work1, decomp%x1cnts_s, decomp%x1disp_s, &
complex_type, work2, decomp%y1cnts_s, decomp%y1disp_s, &
complex_type, decomp%COL_INFO%SMP_COMM, ierror)
end if
#endif
The synchronization point is at the wait call; the time between start and wait is where computation and communication overlap.
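As a caller-side illustration (the compute routines are hypothetical placeholders, and the argument lists follow the sketch above), everything between the start and the wait is the overlap window:

call transpose_x_to_y_start(handle, ux, uy, sbuf, rbuf)   ! post the exchange
call compute_not_needing_uy(ux)                           ! overlap window: work that does not need uy
call transpose_x_to_y_wait(handle, ux, uy, sbuf, rbuf)    ! synchronisation point
call compute_needing_uy(uy)                               ! uy is valid only after the wait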
When launching, you need to add the -dpufile parameter.
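For example, with MVAPICH2-DPU the launch line looks roughly like this (the host list, DPU list, and executable names are placeholders; check the MVAPICH2-DPU user guide for the exact syntax):

mpirun_rsh -np 256 -hostfile hosts -dpufile dpus ./incompact3d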