About the MSHRs of LLC misses with CXL.mem devices

In [1], the authors present the Asynchronous Memory Unit (AMU), a co-design that both the CPU and the memory controller need to support.

The overhead of hardware consistency checking is one reason the capacity of traditional load/store queues and MSHRs stays limited. The AMU leaves the consistency problem to software; the authors argue that software/hardware cooperation is the right way to exploit memory parallelism across the long latencies the AMU targets.

As shown in the sensitivity-test figure in [2], the latency decomposition of DirectCXL shows a completely different picture: there is no software overhead and no data-copy overhead. As the payload grows, the main component of DirectCXL latency becomes the LLC (CPU cache). The Miss Status Holding Registers (MSHRs) in the CPU LLC can only track 16 concurrent misses, so with a large payload many 64 B memory requests stall on the CPU, and for a 4 KB payload this accounts for 67% of the total latency.
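
As a back-of-envelope check (my own arithmetic, not a figure from [2]): a 4 KB payload splits into 64 cache-line misses, but with only 16 MSHR entries those misses are issued in serialized waves, so the transfer time is roughly waves × per-miss latency. The per-miss CXL.mem latency below is an assumed value for illustration only.

```c
/* Back-of-envelope sketch: how few MSHR entries serialize a large payload. */
#include <stdio.h>

int main(void)
{
    const int payload      = 4096;   /* bytes per request */
    const int line         = 64;     /* cache-line size in bytes */
    const int mshr_entries = 16;     /* concurrent LLC misses the MSHRs can track */
    const double miss_ns   = 600.0;  /* assumed CXL.mem load-to-use latency */

    int lines = payload / line;                              /* 64 outstanding misses */
    int waves = (lines + mshr_entries - 1) / mshr_entries;   /* 4 serialized waves */
    printf("%d lines, %d waves, ~%.0f ns lower bound\n",
           lines, waves, waves * miss_ns);
    return 0;
}
```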

The conclusion is that the MSHRs inside the CPU are not enough to handle memory loads in the CXL.mem world, where both latency and bandwidth vary widely across serial PCIe 5.0 lanes. Compared with an RDMA SRQ-style controller approach, we think the PMU and coherency semantics still matter, and that the future path to persistency, in both Huawei's approach and the SRQ approaches, will fall back to plain loads/stores but with a smarter memory controller that issues them asynchronously.

Reference

  1. Asynchronous memory access unit for general purpose processors
  2. Direct Access, High-Performance Memory Disaggregation with DirectCXL

Is MMAP still good for the post-CXL era?

The short answer is no.

  1. MMAP of a huge file needs the OS to register a virtual address range to map the file onto; once any request touches the file, a page fault loads the data from disk into private DRAM, sets up the VA-to-PA mapping, buffers that part of the file in DRAM, and possibly relies on the TLB to cache the translation for the next read. Every CXL device has its own mapping of memory; if you MMAP memory that was swapped onto a CXL.mem device such as a memory-semantic SSD, the SSD controller may decide whether to place it in on-SSD DRAM or on flash and, in the backend, write everything through to physical media. CXL vendors badly want to implement deferred allocation that lazily binds physical memory to virtual memory, which overlaps with what the MMAP mechanism already does.
  2. MMAP plus madvise/numa binding to a specific CXL-attached memory may trigger migration work (see the sketch after this list). Once you dirty-write the pages, there is no transaction support yet in the CXL protocol, and implementing that mechanism correctly by hand is painful. Instead, we can do something like TPP or CXLSwap and make everything transparent to applications, or build 3D memory and extend the computability of the CXL controller so that it decides where to place data and maintains the transaction beneath physical memory.
  3. MMAP was originally designed for a fast tier (memory) in front of a slower tier (disks such as HDDs). Say you are loading graph edges from a large HDD-backed pool: the frequently accessed part is defined in software as a stream pool for cold/hot data management, and MMAP can transparently piggyback on the OS page-cache semantics. That is no longer the case with more, and faster, endpoints. With the growing complexity of CXL NUMA topologies, we can afford to handle fewer errors at a time while serving at main-bus speed; thus we should not stop for page faults and should instead require them to be handled on the endpoint side.
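
As a minimal sketch of the madvise/numa-binding path in item 2: the snippet mmaps an anonymous region and binds it to the NUMA node that a CXL.mem expander is assumed to appear as (node 2 is an assumption, as is the 1 GiB size). Physical pages are still allocated lazily, on first touch.

```c
/* Minimal sketch: bind an mmap'ed region to a (hypothetical) CXL-attached
 * NUMA node. Build with: gcc demo.c -lnuma */
#include <numaif.h>   /* mbind, MPOL_BIND */
#include <sys/mman.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
    size_t len = 1UL << 30;              /* 1 GiB region */
    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED) { perror("mmap"); return 1; }

    unsigned long nodemask = 1UL << 2;   /* assumption: CXL memory is node 2 */
    /* MPOL_BIND: pages faulted in below must come from the CXL node;
     * the physical allocation itself stays deferred until first touch. */
    if (mbind(p, len, MPOL_BIND, &nodemask, sizeof(nodemask) * 8, 0) != 0)
        perror("mbind");

    memset(p, 0, len);                   /* first touch triggers the page faults */
    munmap(p, len);
    return 0;
}
```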

Thus we still need a management layer like SMDK that combines jemalloc + libnuma + CXLSwap for CXL.mem. For interfacing with CXL.cache devices, I think deferred allocation and managing everything through virtual memory would be fine. Then we don't need programming models like CUDA; instead, we can use static analysis through MLIR to give good data-movement hints to every CXL controller's MMU and TLB, and treat CXL.cache cacheline states as a streaming buffer so that every possible endpoint reads the data and then updates it on the next write.

Reference

  1. https://db.cs.cmu.edu/mmap-cidr2022/
  2. https://blog.csdn.net/juS3Ve/article/details/90153094

Azure Mapper vs. Graph DB vs. RDBMS

The motivation dataset

CompuCache Design of VM

The comparison

Azure Mapper
  Pros: parallel computing, recursive pointer dereference
  Benefits by: avoiding strictly serialized round-trip RPCs

Graph DBs
  Pros: chasing edges across servers, various neighborhood queries
  Benefits by: avoiding strictly serialized full round-trip RPC overheads

RDBMS
  Pros: extended buffer pool, semantic cache, temporary-data store
  Benefits by: index lookups, predicates on materialized views (MVs)

Sproc (Stored Procedure) and eRPC abstractions

Sproc
  1. An application specifies a sproc by calling the Register function and executes it by calling the Execute function.
  2. Sproc code is a parameter to Register and is broadcast to all CompuCache servers.
  3. On each server, the code is compiled locally as a dynamic library and loaded into the server's runtime.
  4. A CompuCache server might not have all the data needed to execute a sproc, which requires coordination with other CompuCache servers.

CompuCache networking (eRPC)
  1. CompuCache uses eRPC, a user-space RPC library that runs on DPDK or RDMA.
  2. DPDK avoids OS kernel overhead by executing the networking in user space.
  3. RDMA offers the further benefit of offloading the CPU by running most of the networking protocol on the network interface card (NIC).
  4. To leverage the full bandwidth of high-speed networks, CompuCache batches small operations into a single network transfer, including small I/O requests and responses and all sproc execution requests.
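
To make the Register/Execute flow above concrete, here is a hypothetical sketch: the function names mirror the description, but the signatures, the stub bodies, and the sproc body string are all invented for illustration and are not CompuCache's actual API.

```c
/* Hypothetical sketch of the Register/Execute flow; stubbed so it compiles. */
#include <stdint.h>
#include <stdio.h>
#include <string.h>

typedef uint64_t sproc_id_t;

/* Broadcast sproc code to every CompuCache server, where it is compiled
 * into a dynamic library and loaded; returns a handle for later calls. */
static sproc_id_t Register(const char *sproc_code, size_t code_len)
{
    (void)sproc_code; (void)code_len;
    return 42;                          /* pretend handle */
}

/* Invoke a registered sproc; request and response ride in one eRPC
 * message, so many small calls can be batched per network transfer. */
static int Execute(sproc_id_t id, const void *args, size_t args_len,
                   void *result, size_t result_cap)
{
    (void)id; (void)args; (void)args_len;
    return snprintf(result, result_cap, "ok");
}

int main(void)
{
    const char *code = "sum_neighbors(vertex_id)";   /* placeholder sproc body */
    sproc_id_t id = Register(code, strlen(code));

    char reply[16];
    uint64_t vertex = 7;
    Execute(id, &vertex, sizeof(vertex), reply, sizeof(reply));
    printf("sproc %llu -> %s\n", (unsigned long long)id, reply);
    return 0;
}
```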

CompuCache Server

Server-side pointer chasing is supported by the LocalTranslator data structure, which provides a Translate function.

A sproc invokes this function to map virtual cache addresses into physical server locations.

Cross-server pointer chasing is handled through out-of-bounds (OOB) exceptions.
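
A rough illustration of the Translate step and the OOB signal (all type and field names here are invented, not CompuCache's actual data structures): the local translator resolves a virtual cache address to a location on this server and reports out-of-bounds when the address belongs to another server, so the chase can continue there.

```c
/* Illustrative sketch of local translation with an OOB fallback. */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

typedef struct {
    uint16_t server;     /* which CompuCache server owns this slice */
    uint64_t offset;     /* location inside that server's cache */
} loc_t;

typedef struct {
    uint64_t base, len;  /* virtual cache address range held locally */
    uint16_t self;       /* this server's id */
} local_translator_t;

/* Returns false ("OOB") when the address is not resident on this server,
 * which is the signal to continue the pointer chase elsewhere. */
static bool translate(const local_translator_t *lt, uint64_t vaddr, loc_t *out)
{
    if (vaddr < lt->base || vaddr >= lt->base + lt->len)
        return false;                    /* OOB: forward to the owning server */
    out->server = lt->self;
    out->offset = vaddr - lt->base;
    return true;
}

int main(void)
{
    local_translator_t lt = { .base = 0x1000, .len = 0x1000, .self = 3 };
    loc_t loc;

    if (translate(&lt, 0x1800, &loc))    /* resident on this server */
        printf("local: server %u offset %llu\n",
               (unsigned)loc.server, (unsigned long long)loc.offset);

    if (!translate(&lt, 0x9000, &loc))   /* not resident: cross-server chase */
        printf("OOB: forward the chase to the owning server\n");
    return 0;
}
```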

Result

Possible Migration to CXL

  1. Highly correlated with a far-memory setup: a CXL Type 3 device can provide better memory access without needing pointer chasing to exploit information about far memory.
  2. Can also leverage near-data processing on the memory node in a memory-pooling setup.

Reference

  1. https://www.youtube.com/watch?v=XXJj1nJuLbo

SMDK: Samsung's CXL Development Kit

What is this for?

Samsung and SK hynix have both released their own prototypes; for now these are roughly proofs of concept that emulate CXL.mem logic with a PCIe-attached FPGA plus DDR5. Since CPUs with CXL 2.0 are not out yet and the corresponding CFMWS/Type 3 support (including pmem reads and writes) is not implemented, CXL.mem logic attached over PCIe 4.0 is easy to implement and makes it easy to finish a performance PoC first.

Comparison with PMDK

ndctl

PMDK's hw/sw interface is implemented in ndctl (I played with it back when doing PM reverse engineering). It has commands for the iMC telling it which mode to start in (fsdax / devdax / Memory Mode). Then, underneath PMDK, when you issue clflush and similar instructions, you still need the FS to maintain the index and crash atomicity.
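
As a minimal sketch of the flush path mentioned above (the bare-metal equivalent of what PMDK's pmem_persist() wraps): after a store, the cache line is written back with CLFLUSHOPT and ordered with SFENCE. The fsdax/devdax mapping is omitted here and the target word is ordinary DRAM, purely to keep the sketch self-contained; build with -mclflushopt.

```c
#include <immintrin.h>
#include <stdint.h>
#include <stdio.h>

/* Flush one cache line after a store; on a real fsdax/devdax mapping this is
 * what makes the store durable. */
static void persist_u64(volatile uint64_t *p, uint64_t v)
{
    *p = v;                       /* store to the (assumed) DAX mapping */
    _mm_clflushopt((void *)p);    /* write the dirty line back toward media */
    _mm_sfence();                 /* order the flush before later stores */
}

int main(void)
{
    static uint64_t cell;         /* stand-in for a persistent-memory word */
    persist_u64(&cell, 0x42);
    printf("wrote %llu\n", (unsigned long long)cell);
    return 0;
}
```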

At kernel boot, SMDK passes one of three BIOS boot parameters to the kernel: SRAT (memory affinity), CEDT (CXL Early Discovery Table), or DVSEC (Designated Vendor-Specific Extended Capability), telling ACPI at which address a CXL device sits. To drive the standard, Intel has also integrated cxl-cli into ndctl; the command can retrieve hardware information, create labels, groups, and so on. The main logic is the same as PMDK's.

In-kernel zone memory management

At boot time a memory channel type is configured (the configuration process is visible in mm/kaslr.c; the range sits in a different region from the pmem e820 device, so the startup code under driver/cxl/exmem.c is called and the range is configured as exmem). All of the PCIe/CXL logic itself is implemented in the hardware PoC.

When writing to the PCIe device, a mov from a CPU address to the PCIe DMA-mapped address is issued first; the mov retires once the DMA device has read the MMIO, and the PCIe device then reads from the DMA address into its own BAR and copies the data into device RAM. What Samsung does here is fully emulate the CXL flow, transport layer and transaction layer included, while the physical layer still runs over PCIe 5.0. A mov in the opposite direction is simply the reverse process. Whether an IOMMU is implemented on the board does not matter; without one, plain DMA is enough.
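
For the "mov to a PCIe address" path above, one user-space way to reproduce it is to map a BAR through the sysfs resource file and issue plain loads and stores. The device address, BAR size, and register offset below are placeholders, not anything from the Samsung PoC.

```c
/* Minimal sketch: map a PCIe BAR into user space and issue MMIO accesses. */
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
    /* hypothetical device 0000:17:00.0, BAR0 */
    int fd = open("/sys/bus/pci/devices/0000:17:00.0/resource0", O_RDWR | O_SYNC);
    if (fd < 0) { perror("open"); return 1; }

    size_t bar_len = 4096;               /* assume a 4 KiB BAR for the sketch */
    volatile uint32_t *bar = mmap(NULL, bar_len, PROT_READ | PROT_WRITE,
                                  MAP_SHARED, fd, 0);
    if (bar == MAP_FAILED) { perror("mmap"); close(fd); return 1; }

    bar[0] = 0xdeadbeef;                 /* a plain store becomes a posted MMIO write */
    uint32_t v = bar[0];                 /* a load becomes a non-posted MMIO read */
    printf("readback: 0x%x\n", v);

    munmap((void *)bar, bar_len);
    close(fd);
    return 0;
}
```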

If I write to an address range mapped onto exmem, Samsung's device receives the corresponding DMA request and starts the memory-write request on the PoC board. At page level the kernel checks whether the page lives on exmem (a shift-and-compare on the virtual address is enough, as sketched below). Since PCIe-attached memory is still comparatively slow (a single PCIe 5.0 round trip takes at least about 300 ns), this emulation is mostly good for demonstration.
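
The "shift and compare" check could look something like the helper below; EXMEM_START/EXMEM_END and the shift granularity are made-up constants standing in for wherever the exmem range actually sits in the address map.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define EXMEM_SHIFT 30U                  /* compare at 1 GiB granularity (assumed) */
#define EXMEM_START 0x1000000000ULL      /* assumed start of the exmem range */
#define EXMEM_END   0x2000000000ULL      /* assumed end of the exmem range */

/* Page-level check: a couple of shifts and compares, no table walk needed. */
static bool page_is_exmem(uint64_t addr)
{
    uint64_t blk = addr >> EXMEM_SHIFT;
    return blk >= (EXMEM_START >> EXMEM_SHIFT) &&
           blk <  (EXMEM_END   >> EXMEM_SHIFT);
}

int main(void)
{
    printf("%d %d\n", page_is_exmem(0x1800000000ULL),   /* inside exmem */
                      page_is_exmem(0x0040000000ULL));  /* ordinary DRAM */
    return 0;
}
```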

Their roadmap supports configuring different zones across different CXL nodes, or via an expander on a single node.

libnuma integration

Samsung adds a new zone to libnuma to expose an interface to their smalloc.

jemalloc integration

Reference

  1. https://www.youtube.com/watch?v=Uff2yvtzONc
  2. https://www.youtube.com/watch?v=dZXLDUpR6cU
  3. https://www.youtube.com/watch?v=b9APU03pJiU
  4. https://github.com/OpenMPDK/SMDK
  5. PCIe 体系结构 (PCIe Architecture)