SMDK: Samsung's CXL Development Kit (and SK hynix's HMSDK)

What is this for?

Samsung and SK hynix have each shipped their own prototype. For now these are roughly proof-of-concepts: a PCIe-attached FPGA plus DDR5 emulating the CXL.mem logic. Because CPUs with CXL 2.0 support are not out yet and the corresponding CFMWS/Type 3 support (pmem-style reads and writes) is not implemented, a CXL.mem device hanging off PCIe 4.0 is easy to build and makes for a quick performance PoC. Word is that the latency of the Rambus IP Samsung uses is no better than what the "Demystifying CXL" paper reports.

Comparison with PMDK

ndctl

PMDK's hardware/software interface is implemented in ndctl (I played with it back when doing PM reverse engineering). It issues commands to the iMC telling it which mode to bring the DIMMs up in (fsdax / devdax / Memory Mode). Underneath PMDK, even if you issue clflush and friends yourself, you still need the file system to maintain the index and crash atomicity.

At kernel boot, SMDK passes one of SRAT (memory affinity) / CEDT (CXL Early Discovery Table) / DVSEC (Designated Vendor-Specific Extended Capability) as a BIOS boot parameter to the kernel, telling ACPI at which address the CXL device sits. To push the standard forward, Intel has folded the cxl control tool into ndctl as well; this command can query hardware information, create labels, regions and so on. The main logic is the same. On new machines, as long as CXL.mem is implemented, the kernel provides this cxl command to do similar things. What is still unsettled is how to manage hardware-assisted hot-page inspection for CXL memory: whether CXL.cache/.mem should be managed with the kernel's HMM or with io_uring.

in-Kernel zone memory management

At boot time a memory channel type is configured (the configuration path is visible in mm/kaslr.c; the range sits in a different region from the pmem e820 devices, so the init code under drivers/cxl/exmem.c is called and the range is configured as exmem). All of the PCIe/CXL logic itself is implemented on the hardware PoC.

When writing to a PCIe device, the CPU first issues a mov from a CPU address to the PCIe DMA-mapped address; the mov retires once the DMA engine has read the MMIO, and the PCIe device reads from the DMA address into its BAR and copies the data into device RAM. What Samsung did is fully emulate the CXL flow at this point, transaction and transport layers included, while the physical layer still runs over PCIe 5.0. A mov in the opposite direction is the reverse process. Whether an IOMMU is implemented on the board does not really matter; without one, plain DMA is enough.

If I write to an address mapped into exmem, Samsung's device receives the corresponding DMA request and starts a memory write on the PoC board; at page level the kernel checks whether the page lives in exmem (a shift-and-compare on the virtual address is enough). Since PCIe memory is still comparatively slow, at least around 300 ns even for a single PCIe 5.0 hop, this emulation is really only good for show. What I find questionable: 1. reusing mmap flags to drive the memory allocator means the placement decision is made at allocation time, which is not the same logic as the NUMA allocator; 2. mmap is not friendly to CXL memory with deferrable writes.
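As a concrete illustration of point 1, here is a minimal sketch of the allocation-time placement decision. MAP_EXMEM is the flag name used by SMDK's patched kernel; the fallback value below is a placeholder I made up so the snippet compiles on a stock kernel, where it degenerates into a plain anonymous mapping.

```c
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

#ifndef MAP_EXMEM
#define MAP_EXMEM 0x0 /* placeholder: the real flag comes from the SMDK kernel headers */
#endif

int main(void)
{
    size_t len = 2UL << 20; /* 2 MiB */

    /* Anonymous mapping that the SMDK kernel is asked to back with exmem
     * (CXL) pages instead of normal-zone pages; placement is fixed here,
     * at allocation time, rather than by the NUMA allocator. */
    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS | MAP_EXMEM, -1, 0);
    if (p == MAP_FAILED) { perror("mmap"); return 1; }

    memset(p, 0xab, len); /* every fault lands in the exmem zone (~300 ns away) */
    munmap(p, len);
    return 0;
}
```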


Their roadmap supports configuring different zones across multiple CXL nodes or on a single node with an expander.

libnuma integration

Samsung inserts a new zone into libnuma to expose an interface for their smalloc.
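From the caller's perspective, the libnuma route boils down to "allocate on the node that the CXL expander shows up as". The sketch below is mine, not smalloc; CXL_NODE is an assumed node id that you would have to discover on a real system (e.g. with numactl -H), and the build needs -lnuma.

```c
#include <numa.h>
#include <stdio.h>
#include <string.h>

#define CXL_NODE 1   /* hypothetical node id of the CXL/exmem memory */

int main(void)
{
    if (numa_available() < 0) {
        fprintf(stderr, "no NUMA support\n");
        return 1;
    }

    size_t len = 64UL << 20;                     /* 64 MiB */
    void *p = numa_alloc_onnode(len, CXL_NODE);  /* pages bound to the CXL node */
    if (!p) return 1;

    memset(p, 0, len);                           /* touch so the pages are actually faulted in */
    numa_free(p, len);
    return 0;
}
```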

jemalloc integration

What they expose is a jemalloc interface, since zone-wise user-space memory allocators are already very well studied.
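For context, here is a sketch of the jemalloc plumbing such a zone-wise allocator sits on: create a dedicated arena and allocate from it explicitly. This uses stock jemalloc APIs (mallctl/mallocx), not SMDK's smalloc; actually backing the arena's extents with exmem/CXL pages would need custom extent hooks or their patched jemalloc, which is omitted here.

```c
#include <stdio.h>
#include <string.h>
#include <jemalloc/jemalloc.h>

int main(void)
{
    unsigned arena;
    size_t sz = sizeof(arena);

    /* Ask jemalloc for a fresh arena; SMDK would back this one with exmem pages. */
    if (mallctl("arenas.create", &arena, &sz, NULL, 0) != 0) {
        fprintf(stderr, "arenas.create failed\n");
        return 1;
    }

    /* Allocate from that arena only, bypassing the thread cache. */
    void *p = mallocx(1 << 20, MALLOCX_ARENA(arena) | MALLOCX_TCACHE_NONE);
    if (!p) return 1;
    memset(p, 0, 1 << 20);

    dallocx(p, MALLOCX_TCACHE_NONE);
    return 0;
}
```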

YYW's experiments

I tried measuring performance on plain NUMA nodes without their kernel. The problem is lazy allocation: when something like MonetDB mmaps a huge region, it is hard to handle.
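Concretely, a huge anonymous mapping is only backed at first touch, so placement happens at fault time and the measured window is full of page faults. A sketch of the workaround I wanted, using the standard Linux MAP_POPULATE flag to prefault the whole range up front (the size here is made up):

```c
#include <sys/mman.h>
#include <stdio.h>

int main(void)
{
    size_t len = 8UL << 30; /* 8 GiB, a MonetDB-like working set */

    /* Without MAP_POPULATE: mmap returns instantly and pages are allocated lazily.
     * With MAP_POPULATE: the kernel prefaults (and NUMA-places) everything now. */
    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS | MAP_POPULATE, -1, 0);
    if (p == MAP_FAILED) { perror("mmap"); return 1; }

    munmap(p, len);
    return 0;
}
```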

Reference

  1. https://www.youtube.com/watch?v=Uff2yvtzONc
  2. https://www.youtube.com/watch?v=dZXLDUpR6cU
  3. https://www.youtube.com/watch?v=b9APU03pJiU
  4. https://github.com/OpenMPDK/SMDK
  5. PCIe 体系结构
  6. SMT: Software Defined Memory Tiering for Heterogeneous Computing Systems With CXL Memory Expander
  7. https://www.arxiv-vanity.com/papers/1905.01135/
  8. https://arxiv.org/pdf/2303.15375v1.pdf

PCIe/CXL "Network Layer" Communication Explained

Introduction

First, why do we need a PCIe-attached memory or cache protocol at all? The point is the limitation of memory channels on the CPU: you cannot keep adding parallel memory buses. Parallel buses are good for random memory access, and today's channel count roughly satisfies the CPU-to-memory ratio; however many threads you run, load & store traffic keeps up with the CPU's demand.

Excess threads just spin inside the core; they will not issue more memory instructions than the CPU's frequency and compute capacity allow. Think about why the x-axis of the roofline model is arithmetic intensity. At the same time, retiring 3D XPoint, which wastes memory bandwidth, became inevitable. Serial PCIe access to memory therefore makes a lot of sense: Meta's workloads tell us that 80% of Internet applications are capacity bound, meaning I have a huge data warehouse but the part that needs low-latency access, the part about to be displayed on the user's device, is actually small. As long as it can be loaded into private DRAM within a short time, that is enough.
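For reference, the textbook roofline bound behind that remark (nothing CXL-specific assumed here) is $P_{\text{attainable}} = \min(P_{\text{peak}},\; I \times B_{\text{mem}})$, where the x-axis $I$ is arithmetic intensity (operations per byte moved) and $B_{\text{mem}}$ is memory bandwidth. Adding threads does not change $I$; once a workload sits on the bandwidth-limited slope, more cores do not help, which is the argument for buying capacity behind a slower serial link rather than more parallel channels.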

Examples

Let's start with two concrete examples.

  1. If someone today wants to build a PCIe-based memory expansion device and wants it to expose coherent, byte-addressable memory, there are really only two viable options. One can expose that memory as memory-mapped I/O (MMIO) through a Base Address Register (BAR). Without hacks, the only sane way to do this requires CPU support to map the MMIO as uncached (UC), which has an obvious performance impact. For more details on coherent memory access to GPUs, see Nvidia's hack. The protocol itself does not forbid such access to device memory; we simply have not managed to get there yet. In fact, the NVMe 1.4 specification introduced the Persistent Memory Region (PMR, not to be confused with C++20's pmr), which can do this, but it is still limited. (A minimal user-space sketch of the BAR/MMIO route follows these two examples.)

  2. If you build a PCIe-based device whose main job is network address translation (NAT), or some other kind of IP packet rewriting, that work is done by the CPU, and it burns precious memory bandwidth doing it. This is because the CPU has to read the data from the device, modify it, and write it back, and the only way to do that over PCIe is via main memory.
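The sketch promised above: map BAR0 of a device through sysfs and touch it as MMIO. The device BDF is hypothetical, and every access is an uncached PCIe round trip, which is exactly the performance problem option 1 runs into.

```c
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
    /* hypothetical device address; use lspci to find a real one */
    const char *bar0 = "/sys/bus/pci/devices/0000:17:00.0/resource0";
    int fd = open(bar0, O_RDWR | O_SYNC);
    if (fd < 0) { perror("open"); return 1; }

    size_t len = 4096;                      /* map one page of BAR0 */
    volatile uint32_t *mmio =
        mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (mmio == MAP_FAILED) { perror("mmap"); close(fd); return 1; }

    mmio[0] = 0xdeadbeef;                   /* UC store: goes straight to the device */
    uint32_t v = mmio[0];                   /* UC load: a full PCIe round trip */
    printf("read back 0x%x\n", v);

    munmap((void *)mmio, len);
    close(fd);
    return 0;
}
```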

Transfer format

Over a serial transport we get non-deterministic memory latency. Leaving aside extreme cases, say sitting next to a nuclear power plant and dropping packets nonstop, it is mostly affected by CXL switch oversubscription.

DRAM attached directly to the CPU, including NVDIMMs, comes in under 100 ns; cache-coherent protocols carried over serial PCIe, such as CXL (XMM and NV-XMM modules, add-in cards) and CCIX, land around 350 ns; OpenCAPI's DDIMM is only about 40 ns; and Gen-Z, which goes through an external switch/fabric, sits around the 800 ns level.

PCIe transfer format

Packet headers and the transfer format of each layer they correspond to:



Memory configuration space is limited by 32-bit BARs; you have to declare 32-bit or 64-bit up front, which determines whether requests carry a 3DW or a 4DW header.

Each Completion that comes back acknowledges one of the earlier Memory requests.

Finally, it is worth noting that the Transaction Descriptor's Attribute field specifies I/O ordering and CPU ordering/snooping (the relaxed-ordering and no-snoop bits).
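As a rough mental model of the 3DW/4DW distinction and the Attribute field above, here is an illustrative C view of a memory-request TLP header. Real hardware packs these fields into raw DWs, not C struct members, so treat the layout as a reading aid rather than a spec-accurate encoding.

```c
#include <stdint.h>

/* Illustrative only: field names follow the PCIe memory request header,
 * but the packing is not the on-the-wire bit layout. */
struct tlp_mem_req_hdr {
    /* DW0 */
    uint8_t  fmt;          /* 3DW vs 4DW, with or without data (read vs write) */
    uint8_t  type;         /* MRd / MWr                                        */
    uint8_t  tc;           /* traffic class                                    */
    uint8_t  attr;         /* attribute bits: relaxed ordering, no-snoop       */
    uint16_t length;       /* payload length in DWs                            */
    /* DW1 */
    uint16_t requester_id; /* who to send the completion back to               */
    uint8_t  tag;          /* matches the completion to this request           */
    uint8_t  first_be, last_be;
    /* DW2 (3DW header) or DW2-DW3 (4DW header) */
    uint64_t address;      /* 32-bit address -> 3DW form, 64-bit -> 4DW form   */
};
```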

The End Point is usually what we are most interested in, because that is where the high-performance device sits. In the sample block diagram it is the GPU; in real life it could be a high-speed Ethernet card, a data capture/processing card, or an InfiniBand card talking to some storage device in a large data center. Below is a block diagram zooming in on how these components are interconnected.

Based on this topology, let's talk through a typical scenario in which Remote Direct Memory Access (RDMA) lets the endpoint PCIe device write incoming data directly into pre-allocated system memory, offloading the CPU from any involvement as much as possible. The device therefore initiates a write request carrying the data and sends it toward the Root Complex, which deposits the data into system memory.

What does CXL add?

  1. ATS / MSI-X
  2. transport metadata
  3. flits (faster in 3.0)

Problems

Even in 2.0, .mem needs a directory-based coherency protocol maintained behind the scenes, and how to implement it is an open question. 3.0 raises plenty of memory-ordering questions: is MESI too slow a design? How should the fabric be laid out?

Reference

  1. https://par.nsf.gov/servlets/purl/10078086
  2. https://www.youtube.com/watch?v=Uff2yvtzONc
  3. https://bwidawsk.net/blog/2022/6/compute-express-link-intro/#cxl.mem
  4. https://www.computeexpresslink.org/download-the-specification
  5. https://www.youtube.com/watch?v=fpAFvLhTpqw
  6. https://www.youtube.com/watch?v=caiREMKP0-E&t=7s

Software-Defined Far Memory in Warehouse-Scale Computers

  1. design learning-based autotuning to periodically adapt our design to fleet-wide changes without a human in the loop.
    • machine learning algorithm called Gaussian Process (GP) Bandit [17, 21, 39].
  2. we demonstrate that zswap [1], a Linux kernel mechanism that stores memory compressed in DRAM, can be used to implement software-defined far memory that keeps tail latency within SLOs.
    • The control mechanism for far memory in WSCs requires
      1. tight control over performance slowdowns to meet defined SLOs
      2. low CPU overhead so as to maximize the TCO savings from far memory.
  3. Cold Page Identification Mechanism
    • We base this mechanism on prior work [28, 42, 46].
    • we design our system to keep the promotion rate below P% of the application’s working set size per minute, which serves as a Service Level Objective (SLO) for far memory performance.
    • define the working set size of an application as the total number of pages that are accessed within the minimum cold age threshold (120 s in our system).
    • The exact value of P depends on the performance difference between near memory and far memory. For our deployment, we conducted months-long A/B testing at scale with production workloads and empirically determined P to be 0.2%/min.
  4. Controlling the Cold Age Threshold: build a promotion histogram for each job in the OS kernel (a sketch of how such a histogram can drive threshold selection follows this list).
    • our system builds a per-job cold-page histogram for a given set of predefined cold age thresholds.
    • We use Linux’s memory cgroup (memcg) [2] to isolate jobs in our WSC.
    • We use the lzo algorithm to achieve low CPU overhead for compression and decompression
    • We maintain a global zsmalloc arena per machine, with an explicit compaction interface that can be triggered by the node agent when needed.
    • Empirically, there are no gains to be derived by storing zsmalloc payloads larger than 2990 bytes (73% of a 4 KiB x86 page), where metadata overhead becomes higher than savings from compressing the page.
    • Gaussian Process (GP) Bandit [17, 21, 39] learns the shape of the search space and guides the parameter search towards the optimal point with a minimal number of trials.
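The sketch promised in item 4: a toy reconstruction (my own, not Google's code) of how a per-job promotion histogram plus the P%/min SLO could pick the cold age threshold. The candidate thresholds and the histogram contents are assumptions; the idea is to take the lowest threshold whose predicted promotion rate still fits under P% of the working set size per minute.

```c
#include <stddef.h>

/* Candidate cold-age thresholds in seconds; 120 s is the minimum used to
 * define the working set, per the paper. The other values are illustrative. */
static const unsigned thresholds_s[] = { 120, 240, 480, 960, 1920 };
#define NUM_THRESHOLDS (sizeof(thresholds_s) / sizeof(thresholds_s[0]))

/* promoted_pages_per_min[i]: how many cold pages would be promoted (touched
 * again) per minute if thresholds_s[i] were used; this is what the per-job
 * promotion histogram estimates. working_set_pages: pages touched within the
 * minimum threshold. Returns the chosen threshold, or 0 for "mark nothing cold". */
unsigned pick_cold_age_threshold(const double promoted_pages_per_min[NUM_THRESHOLDS],
                                 double working_set_pages,
                                 double slo_p_percent_per_min /* e.g. 0.2 */)
{
    double budget = working_set_pages * slo_p_percent_per_min / 100.0;

    /* Lower thresholds mark more pages cold (more savings) but promote more;
     * take the lowest threshold whose promotion rate fits the SLO budget. */
    for (size_t i = 0; i < NUM_THRESHOLDS; i++) {
        if (promoted_pages_per_min[i] <= budget)
            return thresholds_s[i];   /* thresholds_s[] is sorted ascending */
    }
    return 0;
}
```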

When Prefetching works, when it doesn’t, and why

For HPC systems, getting hardware and software prefetching to perform well requires sophisticated simulation and measurement.

The authors' motivation is twofold: first, there is little guidance on how best to insert prefetch intrinsics; second, the complex interactions between hardware and software prefetching are not well understood.

Software prefetching is found to target short array streams, irregular memory address patterns, and L1 cache miss reduction. The effect is mostly positive; however, since software prefetching also trains the hardware prefetcher, in a portion of cases the interaction is harmful. Since stream and stride prefetchers are the only two implemented commercially today, the hardware prefetching strategies considered are restricted to those.

As the picture above depicts, software can easily anticipate array accesses and prefetch the data in advance. "Stream" refers to accesses that walk through consecutive cache lines, while "stride" refers to accesses that jump across more than two cache lines at a time.

For software prefetch intrinsics, the temporal hint is for data that will be used again: the lower levels keep a copy as well (an L1 prefetch also fills L2 and L3), so that when the data is evicted from L1 it can still be served from L2 or L3. The non-temporal hint is for data that will not be reused: it is placed in L1 only, and once evicted it will not be reloaded from L2 or L3.

In the software prefetch distance graph, prefetching data too early makes it sit in the cache for a long time, and it may be evicted before it is actually used, while prefetching too late means the cache miss latency is not hidden. This is captured by $D \geq \left\lceil \frac{l}{s} \right\rceil$, where $l$ is the prefetch latency and $s$ is the length of the shortest path through the loop body. When D (the D of a[i+D] in the table) is large enough it covers the cache miss latency, but when D is too large the prefetches may evict useful data from the cache, and the beginning of the array may not get prefetched, both of which can cause extra cache misses.
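A small sketch (mine, using GCC/Clang's __builtin_prefetch rather than code from the paper) of what the prefetch distance D and the temporal/non-temporal locality hints look like in practice; PREFETCH_DIST is a tuning knob chosen per the inequality above.

```c
#include <stddef.h>

/* Illustrative prefetch distance D; in practice it is tuned so that
 * D >= ceil(prefetch_latency / shortest_loop_path), as described above. */
#define PREFETCH_DIST 16

/* Sum an array with an explicit software prefetch running D iterations ahead. */
long sum_with_prefetch(const long *a, size_t n)
{
    long sum = 0;
    for (size_t i = 0; i < n; i++) {
        if (i + PREFETCH_DIST < n) {
            /* args: address, rw (0 = read), locality (3 = keep in all cache
             * levels, i.e. the temporal hint; 0 would be the non-temporal hint). */
            __builtin_prefetch(&a[i + PREFETCH_DIST], 0, 3);
        }
        sum += a[i];
    }
    return sum;
}
```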

Indirect memory indexing depends on a software-side computation, so we expect software prefetching to be more effective there than hardware prefetching.

The hardware prefetching baselines include stream, GHB, and content-directed mechanisms.

The cases where software prefetching beats hardware prefetching are: a large number of streams, short streams, irregular memory accesses, cache locality hints, and (where available) loop bounds. The negative aspects of software prefetching are increased instruction count (extra instructions are needed to compute the memory offsets), static insertion with no adaptivity, and code structure changes. As for antagonistic effects, the two can train each other negatively and may harm the original program.

In the evaluation, the authors first introduce a prefetch intrinsic insertion algorithm that quantitatively sets the distance from IPC and memory latency; the method has been used in a C++ static prefetcher. Second, they implement different hardware prefetchers in MacSim and run SPEC 2006 programs built with different compilers (icc, gcc, etc.) to test the effectiveness of those prefetchers. One commonly cited restriction is the need to choose a static prefetch distance. The primary metric for selecting a prefetching scheme should be coverage: by this metric, data structures with weak stream behavior or irregular stride patterns are poor candidates for software prefetching and may even do worse than hardware prefetching alone, whereas data structures with strong stream behavior or regular stride patterns are good candidates.

Reference

When Prefetching works, when it doesn’t and why

Four Years of CS at ShanghaiTech in Five Minutes

Out of curiosity I sat through ShanghaiTech's intro CS course CS100 and somehow wanted to drop out afterwards? ShanghaiTech's CS program only suits a certain kind of person: a new school like this is high risk and high reward. I have always believed that whether a school is good for you comes down to whether you can actually use its resources to create a win-win between the school's reputation and your own growth. As someone who set out to compete in supercomputing contests right after enrolling, I felt the usual "forced ripening" was not Great-Leap-Forward enough, so I recklessly took on much more, with very mixed returns. But I did figure out what I want and now focus only on what I am best at; most of what you learn in university turns out to be useless as time passes, so beyond the required motions, you should avoid drowning in time-wasting. All of the CS projects below were solo projects, so they took a huge amount of time.

Continue reading "5分钟看完上科大4年计算机"

Me as a Trans: The Changes I Could and Could Not Control

Human nature is fickle; I cannot bring myself to believe that most of the world is capable of empathy. By the laws of economics, even the best ideas grind to a halt for lack of money, and when the right time, place, and people have not come together, the next change of state cannot be set off.

Gender identity is something that accompanies you for life, and my longing for skirts seems to have been lifelong as well. Society cages the mind, disciplining a person's gender and the behavior expected of it. Yet this kind of thing seems to protect women a little better, at least in Shanghai, where the call to protect women is pushed to the extreme.

Continue reading "Me as a Trans 那些可控的不可控的变化"