Rethinking the Design of a CXL Fault-Tolerant Distributed System

  1. Ray-style software-level replication introduces 20% overhead compared with MPI for distributed PyTorch training. Hardware-assisted replication and erasure coding should therefore be implemented in the remote memory pool itself. The distributed kernel should be aware of where each piece of data resides, how long reconstruction takes, and the reliability rate of each device when deciding where to place the data (see the first sketch after this list).
  2. The page-table approach to memory mapping seems tedious: it taxes local hardware resources (MSHRs, TLB entries, ROB slots) that are needed to hide latency. The bounds check could instead happen in the ACL logic of the remote CXL pool, and the language runtime should also comply with the same check (see the second sketch after this list).
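
For the placement decision in item 1, here is a minimal sketch assuming a hypothetical distributed kernel that knows, for each remote CXL pool, a reconstruction latency and a per-epoch failure rate. The scheme names, struct fields, and the 4x decode penalty for Reed-Solomon are all illustrative assumptions, not measurements or a real API.

```c
/* Sketch of a reliability-aware placement policy. All names and
 * constants here are hypothetical, for illustration only. */
#include <stdio.h>

enum scheme { REPLICA_3X, ERASURE_RS_8_2 };

struct device_stats {
    double reconstruct_ms;   /* time to rebuild a lost shard/replica */
    double failure_rate;     /* probability of loss per epoch        */
};

/* Expected recovery cost = P(loss) * time-to-rebuild. Replication
 * rebuilds with a plain copy but stores 3x; RS(8,2) stores only
 * 1.25x but must decode, so its rebuild time is scaled up here by
 * an assumed 4x penalty. */
static double expected_cost(enum scheme s, const struct device_stats *d)
{
    double rebuild = (s == REPLICA_3X) ? d->reconstruct_ms
                                       : d->reconstruct_ms * 4.0;
    return d->failure_rate * rebuild;
}

/* Weigh expected recovery cost against storage blowup and pick the
 * cheaper scheme for this pool. */
static enum scheme choose_scheme(const struct device_stats *d,
                                 double capacity_weight)
{
    double c_rep = expected_cost(REPLICA_3X, d)     + capacity_weight * 3.0;
    double c_ec  = expected_cost(ERASURE_RS_8_2, d) + capacity_weight * 1.25;
    return (c_ec <= c_rep) ? ERASURE_RS_8_2 : REPLICA_3X;
}

int main(void)
{
    struct device_stats hot  = { .reconstruct_ms = 5.0,  .failure_rate = 0.01 };
    struct device_stats cold = { .reconstruct_ms = 50.0, .failure_rate = 0.10 };
    printf("reliable pool  -> %s\n",
           choose_scheme(&hot, 1.0)  == ERASURE_RS_8_2 ? "RS(8,2)" : "3x replica");
    printf("flaky pool     -> %s\n",
           choose_scheme(&cold, 1.0) == ERASURE_RS_8_2 ? "RS(8,2)" : "3x replica");
    return 0;
}
```

Under these assumed numbers, the reliable pool gets erasure coding (capacity wins) while the flaky, slow-to-rebuild pool falls back to plain replication.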
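For item 2, a minimal sketch of a bounds check that could live in the remote pool's ACL unit, with the language runtime mirroring the same predicate before issuing a request. The `acl_entry` layout and function names are hypothetical, not a real CXL interface.

```c
/* Sketch: the pool exposes per-tenant (base, limit) windows and
 * faults on any access outside them; the runtime repeats the same
 * check so violations become recoverable errors instead of device
 * faults. All names below are hypothetical. */
#include <stdint.h>
#include <stdbool.h>
#include <stdio.h>

struct acl_entry {
    uint64_t base;   /* first valid byte of the tenant's window */
    uint64_t limit;  /* one past the last valid byte            */
};

/* Device-side view: the pool's ACL logic would evaluate this per
 * request, so no host-side MSHR/TLB/ROB resources are spent on
 * page-table translation for the check. */
static bool acl_permits(const struct acl_entry *e, uint64_t addr, uint64_t len)
{
    return addr >= e->base && addr < e->limit && len <= e->limit - addr;
}

/* Runtime-side view: the language runtime complies with the same
 * check before the access crosses the link. */
static bool runtime_checked_read(const struct acl_entry *e,
                                 uint64_t addr, uint64_t len)
{
    if (!acl_permits(e, addr, len)) {
        fprintf(stderr, "bounds violation: [%#lx, +%lu)\n",
                (unsigned long)addr, (unsigned long)len);
        return false;
    }
    /* ...issue the actual CXL.mem load here... */
    return true;
}

int main(void)
{
    struct acl_entry win = { .base = 0x1000, .limit = 0x2000 };
    runtime_checked_read(&win, 0x1800, 64);   /* in bounds           */
    runtime_checked_read(&win, 0x1fc0, 128);  /* straddles the limit */
    return 0;
}
```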