CXL SIG
Special Interest Group on new Compute eXpress Link protocols atop the memory-optimized PCIe 5 PHY. We focus broadly on topics such as disaggregated memory and coherent accelerators, studying the related evolution of processors and operating systems, and understanding how workloads will (evolve to) benefit from the rise of far memory and computational memory enabled by CXL.
This group includes external (industry) participants.
See site banner above for information about our next meeting. No CXL SIG meeting on Apr 16 due to OCP face to face.
[02/07/2024] SPECIAL INDUSTRY SESSION ON MEMORY POOLING with speakers from Google/Stanford and Microsoft/UW
[06/06/2023] Call for Participation: Disaggregated Memory Workshop at Symp on OS Principles (SOSP23) link
[12/13/2022] Notes and slides from our Industry Panel on Memory Disaggregation held Nov 16, 2022 are now online.
CXL SIG Google Drive folder
2024
We meet in the Tuesday 2-3pm Pacific time slot during Winter quarter at UCSC.
>> Apr 23: Yiwei will lead the discussion of updated results, recently reported by industry authors, on interleaving DDR and CXL pooled memory on Astera Labs hardware.
Apr 9: We discussed the latest updates on CXL products and prototypes.
Apr 2: Tim Pezarro (Senior Product Manager at Microchip in Burnaby, Canada) joined us remotely to speak about their smart memory controllers (https://www.microchip.com/en-us/products/memory/smart-memory-controllers).
Mar 26: Yiwei Yang led the discussion of "Salus: Efficient Security Support for CXL-Expanded GPU Memory" and "PIMFlow: Compiler and Runtime Support for CNN Models on Processing-in-Memory DRAM"
Mar 5: CXL-ANNS: Pankaj and/or Jayjeet will talk about the KAIST paper on billion-scale Approximate Nearest Neighbor Search over CXL, from Usenix ATC 2023. link
Feb 20: Pankaj walked the team through GPU memory usage during long-context inference in Transformer-based LLMs
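As a back-of-the-envelope illustration of the pressure long contexts put on GPU memory, the KV cache alone grows linearly with context length. The model shape below is a generic 7B-class configuration assumed for illustration, not a figure from the talk:

```python
# Rough KV-cache size for a 7B-class transformer (illustrative numbers).
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, batch, bytes_per_elt=2):
    # 2x for the separate K and V tensors; fp16 = 2 bytes per element
    return 2 * layers * kv_heads * head_dim * seq_len * batch * bytes_per_elt

# Assumed shape: 32 layers, 32 KV heads, head_dim 128, fp16, batch 1
gib = kv_cache_bytes(32, 32, 128, 4096, 1) / 2**30
print(f"KV cache at 4K context: {gib:.1f} GiB")    # 2.0 GiB
print(f"KV cache at 32K context: {kv_cache_bytes(32, 32, 128, 32768, 1) / 2**30:.0f} GiB")  # 16 GiB
```

At 32K context the cache alone rivals the weights, which is exactly the capacity pressure that motivates spilling it to far memory.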
Feb 13: Lokesh followed up his Dec 5 presentation with a related, more recent paper from Apple, "LLM in a flash: Efficient Large Language Model Inference with Limited Memory": link
Feb 7: SPECIAL INDUSTRY SESSION ON MEMORY POOLING
[5] Presenters' intro and opening Pankaj Mehra
[10+10] Hotnets'23 paper "A Case against CXL Memory Pooling" link (Philip Levis from Google, Stanford)
[20+5] IEEE Micro paper "Design Tradeoffs in CXL Based Memory Pools for Cloud Platforms" by Berger, et al. (Daniel Berger from Microsoft, UW)
[10] Moderated Q&A: Short, written, clarifying questions only. Feel free to presubmit to moderator for sharing with authors
Jan 30: Allen led a discussion on Arm's CMS presentation about area considerations of snoop filters in CXL SoCs.
Jan 23: We discussed Database Kernels: Seamless Integration of Database Systems and Fast Storage via CXL link
Jan 16: Yiwei discussed SDM: Sharing-enabled Disaggregated Memory System with Cache Coherent Compute Express Link link
Jan 9: Pooneh S. presented the gShard paper from Hot Chips 32 (2020). link We discussed the compute-memory tradeoff that causes nearly 40 percent of activations to be recomputed.
2023
Dec 5: Lokesh will present a paper from Kioxia about using XLFLASH in the GPU's memory hierarchy, because supposedly GPU algorithms for graph traversal are more latency tolerant than CPU-oriented algorithms. Shintaro Sano, Yosuke Bando, Kazuhiro Hiwada, Hirotsugu Kajihara, Tomoya Suzuki, Yu Nakanishi, Daisuke Taki, Akiyuki Kaneko, and Tatsuo Shiozawa. 2023. GPU Graph Processing on CXL-Based Microsecond-Latency External Memory. In Proceedings of the SC '23 Workshops of The International Conference on High Performance Computing, Network, Storage, and Analysis (SC-W '23). Association for Computing Machinery, New York, NY, USA, 962–972. https://doi.org/10.1145/3624062.3624173 link
Nov 28: Pankaj will catch the team up on OCP CMS activities planned for 2024 and the project on Acceleration Interfaces he'll lead.
Nov 14: Our own Achilles Benetopoulos will discuss "A Cloud-Scale Characterization of Remote Procedure Calls" from SOSP'23 link
Nov 7: Open discussion: How transparent is transparent page placement? What are its hidden costs?
Oct 31: Our own Yiwei Yang will present SOSP paper titled "Memtis: Efficient Memory Tiering with Dynamic Page Classification and Page Size Determination" covering profile guided and hardware assisted tiering. link
Oct 24: OCP Global Summit and Samsung Memory Tech Day recap
Oct 17: No meeting due to OCP
Oct 10: Yiwei Yang will present about Partial Failure Resilient Distributed Memory link. From the paper's abstract: CXL-SHM is an automatic distributed memory management system based on reference counting. Reference-count maintenance in CXL-SHM is implemented with a special era-based non-blocking algorithm, so there are no blocking-synchronization, memory-leak, double-free, or wild-pointer problems, even if some participating clients fail unexpectedly without freeing the memory references they hold.
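The era idea can be caricatured in a few lines: each client stamps the references it takes with its era, so when a client dies without releasing, the manager can drop that era's counts wholesale instead of leaking. This is a loose sketch of the intuition only, not CXL-SHM's actual non-blocking algorithm:

```python
# Toy model of era-tagged reference counting (illustrative, not the paper's algorithm).
class EraRefCount:
    def __init__(self):
        self.per_era = {}                  # era id -> outstanding refs taken in that era

    def acquire(self, era):
        self.per_era[era] = self.per_era.get(era, 0) + 1

    def release(self, era):
        self.per_era[era] -= 1

    def total(self):
        return sum(self.per_era.values())

    def reclaim_dead(self, dead_eras):
        # A crashed client never calls release(); dropping its era's counts
        # wholesale prevents the leak without blocking live clients.
        for era in dead_eras:
            self.per_era.pop(era, None)

rc = EraRefCount()
rc.acquire(era=1); rc.acquire(era=2)       # two clients reference one object
rc.release(era=1)                          # client 1 exits cleanly
rc.reclaim_dead({2})                       # client 2 crashed; sweep its era
print(rc.total())                          # 0 -> object can be freed
```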
Oct 3: Yiwei Yang will present her work on CXLMemUring: A Hardware Software Co-design Paradigm for Asynchronous and Flexible Parallel CXL Memory Pool Access. Link to the paper
Sep 26: A Sea of Accelerators? Whether and what should be accelerated for data-intensive work of some of the largest services in the world. We will explore the characteristics of these workloads and the potential for accelerating them with Vidushi Dadu of Google to open our Fall Quarter meetings. Link to the paper.
==
June 13 : Pankaj led the discussion of the "Unified Buffer: Compiling Image Processing and Machine Learning Applications to Push-Memory Accelerators" paper from Stanford. We discussed trade-offs of push and pull memory approaches and how taking the view of the push memory approach could help simplify data movement in the systems.
May 30: Discussion of "A case for CXL-centric Server Processors" paper from Georgia Tech led by Pankaj. From the paper: "replacing all DDR interfaces to the processor with the more pin-efficient CXL interface."
We haven't done this in a while so we -- Yiwei and Pankaj -- will take a moment to review the bandwidth -- and therefore capacity -- per pin advantage of CXL versus DDR interfaces while also considering Serdes area. Interestingly, one of these links claims DDR memory uses 380 pins per channel and the other, 288 (the right answer), even though both are posted on CXL Consortium website.
Pankaj will briefly recap the latency advantages of the new native CXL IPs versus traditional PCIe, which reduce RTT latency to 40ns (down from 100ns) by adopting a different Serdes implementation.
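To make the per-pin comparison concrete, here is a rough calculation with commonly cited figures. The CXL pin budget is a loose assumption (exact counts vary by package, clocking, and sideband design):

```python
# Back-of-envelope bandwidth per pin: DDR5 channel vs x8 CXL link (illustrative).
ddr5_pins = 288                 # pins per DDR5 DIMM channel (the "right answer" above)
ddr5_gbs = 38.4                 # GB/s for a 64-bit DDR5-4800 channel

cxl_lanes = 8                   # a x8 CXL link on the PCIe 5.0 PHY
cxl_gbs = 32.0                  # ~4 GB/s per lane per direction at 32 GT/s
cxl_pins = cxl_lanes * 4 + 6    # 2 differential pairs per lane + clock/sideband (rough)

print(f"DDR5:   {ddr5_gbs / ddr5_pins:.3f} GB/s per pin")
print(f"CXL x8: {cxl_gbs / cxl_pins:.3f} GB/s per pin (per direction)")
```

Even with generous pin accounting, the serial link comes out several times more pin-efficient, which is the core of the argument for CXL-attached capacity.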
May 23: Pankaj previewed his International Supercomputing Conference 2023 (Hamburg) Exacomm Workshop talk on "Principles for Optimizing Data Movement in Emerging Memory Hierarchies."
May 16 Discussion of IEEE Micro paper "Design Tradeoffs in CXL Based Memory Pools for Cloud Platforms" by Berger and Ernst
May 2 (Tuesday) We will have the lead author of the ASPLOS23 TMTS (Transparent Memory Tiering System) paper, Priya Duraiswamy (Google), lead the discussion of their new work on a two-tier memory system in which the slow tier is able to hold about 25 percent of the memory with minimal impact on performance. They use job classification to identify those jobs that can effectively use slower memory, and proactively and stably move data into the cold tier, with a demonstrated ability to maintain a low promotion rate and thus a low expected access latency across tiers.
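The expected-access-latency argument reduces to simple expected-value arithmetic. The latencies below are illustrative placeholders, not TMTS measurements:

```python
# Expected memory access latency for a two-tier system (illustrative numbers).
def expected_latency_ns(fast_ns, slow_ns, slow_access_frac):
    return (1 - slow_access_frac) * fast_ns + slow_access_frac * slow_ns

local_ns, cxl_ns = 100, 250      # assumed local-DRAM vs far-memory latencies
# If classification keeps hot pages local, 25% of *capacity* in the slow tier
# can translate to only a few percent of *accesses* going there:
print(f"{expected_latency_ns(local_ns, cxl_ns, 0.02):.1f}")   # 103.0 -> ~3% slowdown
print(f"{expected_latency_ns(local_ns, cxl_ns, 0.25):.1f}")   # 137.5 -> naive placement
```

The gap between those two lines is exactly what a low promotion rate buys.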
Apr 25 (No meeting due to moderator on travel) --
Apr 18 Discussion of asynchronous access to far memory, led by Yiwei
April 11 (Tuesday): Demystifying CXL Memory with Genuine CXL-Ready Systems and Devices by Yan Sun, Yifan Yuan, Zeduo Yu, Reese Kuper, Ipoom Jeong, Ren Wang, and Nam Sung Kim
The paper carefully evaluates how best to use page interleaving between a large DDR DRAM and a small CXL DRAM. It advocates using the more pipelinable non-temporal stores on SPR processors, as well as offloading far-memory manipulation to the new DSA. We compared this with more proactive approaches, such as the one Priya will describe at our May 2 meeting, and concluded the latter is the more likely path for hyperscalers.
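The intuition behind DDR+CXL page interleaving is to place pages on each tier in proportion to its bandwidth so both are kept busy. A minimal sketch, with assumed bandwidth figures (two DDR5 channels vs one x8 CXL device):

```python
# Pages placed on each tier out of every `total_pages`, proportional to
# that tier's bandwidth (figures below are assumptions for illustration).
def interleave_split(bw_gbs, total_pages=16):
    total_bw = sum(bw_gbs.values())
    return {node: round(total_pages * b / total_bw) for node, b in bw_gbs.items()}

print(interleave_split({"ddr": 76.8, "cxl": 25.6}))   # {'ddr': 12, 'cxl': 4}
```

A 12:4 placement keeps accesses hitting each tier at roughly the rate it can serve, rather than saturating DDR while the CXL link idles.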
Mar 20 (Monday): An update from Andrew/Pooneh on how taking a tiered view of data heat and latency tolerance shows that data-intensive applications may be able to utilize Pond-style lower tiers quite well.
Mar 13
Build upon the excellent discussion of in-memory key-value stores suitable for disaggregated memory by continuing to characterize which ideas from recent work suit CXL (meaning they can exploit the hardware load-store data path) and which work primarily with RDMA.
Mar 6
Continue discussion on remoteable pointers by deep diving on Fusee (FAST'23) and WASM (Web Assembly), led by Yiwei Yang
We will be weighing implementation ideas against three critical requirements of remoteable pointers:
Must work from the source as pointers even when the memory is far (this requires essentially no extra implementation with CXL)
Must work at the device for offloading pointer chasing to CXL memory device or pre-CXL memory node
Must work at newly started compute without the friction of serialization-deserialization for independent scaling of memory and compute
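One minimal way to see the third requirement is to represent pointers as (segment, offset) pairs instead of raw virtual addresses, so a freshly attached process can chase them with no serialization-deserialization step. This is a hypothetical sketch, not any particular system's design:

```python
import struct

# Remoteable pointers as (segment id, offset) instead of virtual addresses.
# Any process that maps the segment can chase them; no serde needed.
SEGMENT = bytearray(64)                  # stands in for a shared CXL memory segment
PTR = struct.Struct("<IQ")               # 4-byte segment id, 8-byte offset

def store_node(off, value, next_off):
    # node layout: value (8 bytes) followed by a remoteable pointer (12 bytes)
    SEGMENT[off:off+8] = struct.pack("<q", value)
    SEGMENT[off+8:off+20] = PTR.pack(0, next_off)

def chase(off):
    # A newly started process can run this directly against the segment.
    values = []
    while off != 0:                      # offset 0 serves as NULL here
        values.append(struct.unpack_from("<q", SEGMENT, off)[0])
        _seg, off = PTR.unpack_from(SEGMENT, off + 8)
    return values

store_node(20, 42, 40)                   # node at offset 20 -> node at offset 40
store_node(40, 7, 0)
print(chase(20))                         # [42, 7]
```

Because nothing in the chain depends on where the segment is mapped, memory and compute can scale independently: new compute attaches and follows the same offsets.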
Feb 27
We focused on remoteable pointers seen in prior art such as Carbink and AIFM
We went around the room to see what other works have recently shown good implementations, and Fusee from Huawei and WASM were brought up.
Feb 20 (Monday) 1pm
Grad Student Researcher Lokesh Jaliminche led our discussion on "Impact of CXL on Computational Storage"
Feb 13 (Monday) 1pm
We discussed the Usenix OSDI'22 Carbink paper from the bottom of this page
Feb 6 (Monday) 1 pm
Continuing her talk, Yiwei discussed the rooflines from one of the SC22 presentations.
Then, we talked about Computational CXL-Memory Solution for Accelerating Memory-Intensive Applications by Sim et al., which appeared in IEEE Computer Architecture Letters Vol 22 No 1 (2023). It addresses how best to combine near-data processing and memory interleaving by architecting a simple load balancer behind low-bandwidth CXL links, achieving the best of both data-processing bandwidth and performance/Watt, in the context of k-Nearest Neighbor as the representative memory-intensive workload.
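The load-balancing idea reduces to a small calculation: split a memory-bound scan between the host (limited by the CXL link) and the near-data processors (limited by internal device bandwidth) so both streams finish together. The bandwidth figures below are assumptions for illustration, not the paper's numbers:

```python
# Splitting a memory-bound scan between host and near-data processing
# behind a low-bandwidth CXL link (illustrative bandwidths).
def best_split(link_gbs, ndp_gbs):
    # Give each side work proportional to its bandwidth so both finish together;
    # the two streams overlap, so aggregate throughput is their sum.
    frac_ndp = ndp_gbs / (ndp_gbs + link_gbs)
    return frac_ndp, link_gbs + ndp_gbs

frac, gbs = best_split(link_gbs=25.6, ndp_gbs=100.0)
print(f"{frac:.0%} of data processed near-memory, {gbs:.1f} GB/s aggregate")
```

With those assumed numbers, shipping everything over the link would cap the scan at 25.6 GB/s; balancing recovers nearly 5x that.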
Jan 30 (Monday) 1 pm
Yiwei continued presenting about the CXL booth at Supercomputing 2022.
Jan 23 (Monday) 1 pm
Agenda: a quick update of CXL news and a quick roundtable to hear suggestions about talks people want to present and papers they want discussed this quarter (remaining 7 meetings).
Jan 16 (Monday) at 1pm
Graduate student Yiwei Yang (advised by Andrew Quinn) will discuss the design of her CXL Memory simulator and her learnings at Supercomputing 2022.
Jan 10 (Tues) Antonio Barbalace talk at 1PM (Zoom link above)
TITLE
Rethinking Systems Software for Emerging Data Center Hardware
ABSTRACT
Today’s data center hardware is increasingly heterogeneous, including several special-purpose and reconfigurable accelerators that sit alongside the central processing unit (CPU). Emerging platforms also include heterogeneous memory – directly attached, NUMA, and over a peripheral bus. Furthermore, processing units (CPUs and/or accelerators) pop up in storage devices, network cards, and along the memory hierarchies (near-data-processing architectures), introducing hardware topologies that didn’t exist before!
Existing, traditional systems software has been designed and developed with the assumption that a single computer hosts a single CPU complex with directly attached memory, or NUMA. Therefore, there is one operating system running per computer, and software is compiled to run on a specific CPU complex. However, on emerging platforms this no longer applies, because every different processing unit requires its own operating system and applications, which are not compatible with each other, making a single platform look like a distributed system – even when CPU complexes are tightly coupled. This makes programming hard and hinders a whole set of performance optimizations. Therefore, this talk argues that new systems software is needed to better support emerging non-traditional hardware topologies, and introduces new operating system and compiler design(s) to achieve easier programming and full exploitation of system performance.
BIO
Antonio Barbalace is a Senior Lecturer (Associate Professor) at the School of Informatics of the University of Edinburgh, Scotland. Before, he was an Assistant Professor in the Computer Science Department, at Stevens Institute of Technology, New Jersey. Prior to that, he was a Principal Research Scientist and Manager at Huawei, German Research Center, based in Munich, Germany. He was a Research Assistant Professor, and before a Postdoc, at the ECE Department, Virginia Tech, Virginia. He earned a PhD in Industrial Engineering from the University of Padova, Italy, and an MS and BS in Computer Engineering from the same University.
Antonio Barbalace’s research interests include all aspects of system software, embracing hypervisors, operating systems, runtime libraries, and compilers/linkers, for emerging highly-parallel and heterogeneous computer architectures, including near data processing platforms and new generation interconnects with coherent shared memory. His research seeks answers about how to architect or re-architect the entire software stack to ease programmability, portability, enable improved performance and energy efficiency, determinism, fault tolerance, and security. His research work appeared at top systems venues including EuroSys, ASPLOS, VEE, ICDCS, Middleware, EMSOFT, HotOS, HotPower, and OLS.
WEBSITE
http://www.barbalace.it/antonio/
CXL SIG celebrates Daniel's graduation. Congratulations, Dr. Bittman!
Rare talent. Dissertation-Award-level work. Foundational. Bold. Superlatives just flew in the closed-door session of the committee. I have watched Daniel take an idea and commit to it. He has been an inspiration to his fellow grad students. And he has done justice to the often ignored "Ph." part of the Ph.D. degree.
Daniel's contribution to the rapidly evolving world of memory was recognized at the prestigious Usenix ATC in 2020 with a Best Presentation award. But I have realized the importance of his work in action as I work closely with major SaaS analytics vendors, major semiconductor memory suppliers, and the world's leading virtualization researchers.
As "memoryness" spreads beyond RDMA in space through disaggregation and in time through persistent memory, the need to rescue translation contexts from process abstraction has become paramount. Daniel has done that by placing a foreign object table in every memory object and done for memory what S3 did for storage. We at Elephance -- where Daniel is a cofounder -- believe in this idea deeply and are committed to bring it to the world of CXL, working closely with our customers and partners in industry and government and with our growing network of academic and research collaborators.
Pankaj Mehra, President and CEO
Elephance Memory, Inc.
2022
So how bad is CXL latency, really? Find out in our newsfeed.
We will kick off 2023 with a talk by Antonio Barbalace from University of Edinburgh.
The final meeting of CXL SIG for 2022 was on Dec 6 where Yiwei Yang summarized near-memory processing for genomics.
Learnings from Existing Disaggregated Memory Systems
On Nov 29, Pankaj will discuss the recently published Samsung study of simulating a CXL attached DRAM-swapped-to-SSD as far memory and how well workloads cope with it. See a quick summary in our CXL in the News subpage. (slides)
On Nov 22, Andrew Quinn discussed Pond. Pond builds on a previously released manuscript from the same authors. The system studies memory pooling for increasing DRAM utilization in data centers and thereby reducing the cost of using and maintaining main memory. In particular, Pond looks at implementing memory pools using the CXL standard. The paper first analyzes cloud production traces to show that small-scale memory pooling (i.e., across only 8-16 sockets) is sufficient to achieve most of the cost benefits from memory pooling. They then show that a machine learning model can accurately predict the memory allocation size required for a black-box application. Pond would decrease DRAM costs by 7% with performance that is within 1-5% of standard systems (i.e., same-NUMA-node allocations).
On Nov 8th, Pankaj led the discussion on this Microsoft Research position paper presented at HotOS 2020 titled Disaggregation and the Application by Sebastian Angel from UPenn and Mihir Nanavati and Siddhartha Sen from MSR. paper
On Oct 25th, Pankaj led the first of several discussions on existing disaggregated memory systems used by cloud service providers
Workloads
On Nov 1st, we had another presentation from Pooneh about results from her research on workloads and working sets first shared by her briefly on Sep 6
Microarchitecture Co-evolution with CXL
On October 4th and 18th, we revisited a topic (see this folder) we started discussing on Sep 13 when the "A Case against (most) context switches link" paper was brought up by Andrew. From reading that paper we learned how CPU microarchitecture could evolve in response to longer memory latencies. That made me wonder what else is out there and one new and one old paper jumped to the top of my reading pile.
Linux HMM scope, co-evolution with CXL, and limitations
Discussion of Sep 20 led by Yiwei (SMDK & HMSDK)
Yiwei discussed the stack below.
On Sep 27, 2022 Yiwei updated with details showing where jemalloc and libnuma hook into SMDK. We need to run this by Samsung collaborators for accuracy and to better understand their roadmap (link to Yiwei's SMDK deep dive presentation)
Frank Hady (Intel) asked whether the Linux stack needs to be modified to use CXL. HMM is evolving to get there. Current stacks use a variety of emulation techniques and I/O mechanisms, as well as PMDK and DAX remnants. A native CXL Linux stack will emerge from these experiences.
Discussion of Sep 13 led by James/Pankaj/Andrew (Future CPUs; CXL software ecosystem more broadly than just HMM)
Paper discussed: A Case Against (Most) Context Switches link
UCSC-only content discussed: Workloads spreadsheet UCSC-internal link
Discussion of Sep 6 led by Pooneh (HMM)
Past Topics
Why CXL? Will it make sense despite added latency and cost in the memory access path relative to DDR DRAM?
Document WIP