如何使用nccl （更好的）代替mpi（rdma）

我们在开科学计算的选项的时候总是会碰到这种问题：内存如何分配。尤其是用到gpu的计算的时候。很难有减少内存传输overhead 的办法。正如这次asc quest题目作者所说的：

Firstly, QuEST uses its hardware to accelerate the simulation of a single quantum register at a time. While I think there are good uses of multi-GPU to speedup simultaneous simulation of multiple registers, this would be a totally new pattern to QuEST's simulation style. So let's consider using multi-GPU to accelerate a single register.

There are a few ways you can have "multiple GPUs":

multiple NVlinked GPUs
This is when you have multiple GPUs tightly connected with a high-bandwidth fabric (e.g. this). The bandwidth is enough that you sort of can imagine it as a single big GPU, and hence it would be worthwhile for accelerating single-register simulation. However, this only exists right now as NVLink and NVSwitch, compatible only with IBM's POWER architecture - you could argue this is still esoteric, and not worth a big refactor. Note it wouldn't actually be very hard to refactor QuEST for this platform - indeed QuEST works out-of-the-box with POWER8. But it's not something on our TODO list currently.
multiple local GPUs
This is when you have multiple GPUs on the same machine, but maybe on different sockets and hence with a much lower bandwidth between them. The most common case is two GPUs - is it worthwhile using two GPUs over one to speedup single register simulation? Often, no!
In big QC simulation, having to move memory around is often the big killer, and should be avoided where possible. Unfortunately, simulating unitaries on registers often requires moving memory. If all the memory stays in the GPU (very high "internal bandwidth"), this is ok, but copying memory to the other GPU (across the socket) will introduce a huge per-gate overhead!
Hence, using two GPUs to simulate the same register size can be slower than using just one, especially as the simulation size grows and saturates the sockets!
There's hardly a benefit from the extra VRAM too, because doubling the memory enables simulation of one additional qubit. This is not worth the slowdown, or the hardware!
Even with more than two GPUs, the connections are likely hierarchical and so even more prone to saturation.
distributed GPUs
This is when you have a GPU(s) on each distributed node of a cluster. In this circumstance, simulating a unitary gate which requires data exchange not only costs us a VRAM to RAM overhead (similar to before), but a networking overhead to talk to the other nodes! This can be somewhat improved by having a direct GPU to network-card connection (and MPI abstraction), but I believe that's pretty cutting-edge.
Let's say you have n nodes, each with a GPU and a multicore CPU, and you're resolved to a distributed simulation. When is it worthwhile to pay the extra memory overhead locally copying from RAM to VRAM (and use the GPU), over using just the CPUs? This is now the same trade-off to consider in the previous cases. So may or may not be worthwhile.

作者没有实现distributed gpu的主要原因是他觉得显存带宽复制的时间overhead比较大。不如就在单gpu上完成就行了。但是未来随着gpu显存的不断增加和计算的不断增加，这个放在多卡上的需求也与日俱增。

翻了很多关于gpu显卡的通讯，无论是单node 多卡还是多node多卡。最绕不开的就是mpi。但是当我找到一个更优秀的基于 infiniband 网卡的rdma的数据共享协议，让我眼前一亮，决定就用这个。如果你不是要写cuda 而是pytorch 请绕步pytorch distributed doc，如果是python 版本的nccl 可以选择

参考文档：

主要的步骤：先照着nccl文档安装nv_peer_memory, 再装nccl 最后装plugin。

安装后之后有两个test 一个是 gdrcopy 另一个是 nccl-tests。跑的命令是 mpirun -N 1 --allow-run-as-root --hostfile host -x NCCL_IB_DISABLE=0 -x NCCL_IB_CUDA_SUPPORT=1 -x NCCL_IB_HCA=mlx4_0 -x NCCL_SOCKET_IFNAME=ib0 -x LD_LIBRARY_PATH=/usr/local/nccl/lib:/usr/local/cuda-10.0/lib64:$LD_LIBRARY_PATH -x NCCL_DEBUG=INFO ./build/all_reduce_perf -b 16M -e 1024M -g 4