CXL SIG

Special Interest Group on the new Compute Express Link (CXL) protocols built atop the memory-optimized PCIe 5.0 PHY. We focus broadly on topics such as disaggregated memory and coherent accelerators, studying the related evolution of processors and operating systems, and understanding how workloads will evolve to benefit from the rise of far memory and computational memory enabled by CXL.

This group includes external (industry) participants.

See the site banner above for information about our next meeting. No CXL SIG meeting on Apr 16 due to the OCP face-to-face.

[02/07/2024] SPECIAL INDUSTRY SESSION ON MEMORY POOLING with speakers from Google/Stanford and Microsoft/UW

[06/06/2023] Call for Participation: Disaggregated Memory Workshop at the Symposium on Operating Systems Principles (SOSP '23) link

[12/13/2022] Notes and slides from our Industry Panel on Memory Disaggregation held Nov 16, 2022 are now online.

CXL SIG Google Drive folder


2024

We meet in the Tuesday 2-3pm Pacific time slot during the Winter quarter at UCSC.


>> Apr 23: Yiwei will lead a discussion of updated results, recently reported by industry authors, on interleaving DDR and CXL pooled memory on Astera Labs hardware.


Apr 9: We discussed the latest updates on CXL products and prototypes.


Apr 2: Tim Pezarro (Senior Product Manager at Microchip in Burnaby, Canada) joined us remotely to speak about their smart memory controllers (https://www.microchip.com/en-us/products/memory/smart-memory-controllers).


Mar 26: Yiwei Yang led the discussion of "Salus: Efficient Security Support for CXL-Expanded GPU Memory" and "PIMFlow: Compiler and Runtime Support for CNN Models on Processing-in-Memory DRAM".


Mar 5: Pankaj and/or Jayjeet will discuss CXL-ANNS, a KAIST paper on billion-scale Approximate Nearest Neighbor Search from USENIX ATC 2023. link

Feb 20: Pankaj walked the team through GPU memory usage during long-context inference in Transformer-based LLMs.
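The dominant long-context memory cost is usually the KV cache. As a back-of-the-envelope illustration (model shape and numbers are assumed for the example, not taken from the talk), its size can be estimated as:

```python
# Rough KV-cache size estimate for Transformer inference.
# Model parameters below are illustrative (roughly 7B-class, no grouped-query
# attention), not figures from the presentation.

def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, batch, bytes_per_elem=2):
    """Keys and values are each [batch, n_kv_heads, seq_len, head_dim] per layer."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch * bytes_per_elem

gib = kv_cache_bytes(n_layers=32, n_kv_heads=32, head_dim=128,
                     seq_len=32768, batch=1) / 2**30
print(f"{gib:.1f} GiB")  # prints "16.0 GiB"
```

At 32K context this hypothetical model's KV cache alone is 16 GiB in fp16, which is why long-context inference pressures GPU memory capacity rather than compute.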

Feb 13: Lokesh followed up his Dec 5 presentation with a related, more recent paper from Apple, "LLM in a flash: Efficient Large Language Model Inference with Limited Memory": link

Feb 7: SPECIAL INDUSTRY SESSION ON MEMORY POOLING

Jan 30: Allen led a discussion on Arm's CMS presentation about area considerations of snoop filters in CXL SoCs.

Jan 23: We discussed Database Kernels: Seamless Integration of Database Systems and Fast Storage via CXL link

Jan 16: Yiwei discussed SDM: Sharing-enabled Disaggregated Memory System with Cache Coherent Compute Express Link link

Jan 9: Pooneh S. presented the GShard paper from Hot Chips 32 (2020) link. We discussed the compute-memory tradeoff that causes nearly 40 percent of activations to be recomputed.

2023

Dec 5: Lokesh will present a paper from Kioxia about using XL-FLASH in the GPU's memory hierarchy, since GPU algorithms for graph traversal are supposedly more latency tolerant than CPU-oriented algorithms. Shintaro Sano, Yosuke Bando, Kazuhiro Hiwada, Hirotsugu Kajihara, Tomoya Suzuki, Yu Nakanishi, Daisuke Taki, Akiyuki Kaneko, and Tatsuo Shiozawa. 2023. "GPU Graph Processing on CXL-Based Microsecond-Latency External Memory." In Proceedings of the SC '23 Workshops of The International Conference on High Performance Computing, Network, Storage, and Analysis (SC-W '23). Association for Computing Machinery, New York, NY, USA, 962–972. https://doi.org/10.1145/3624062.3624173 link

Nov 28: Pankaj will catch the team up on OCP CMS activities planned for 2024 and the project on Acceleration Interfaces he'll lead.

Nov 14: Our own Achilles Benetopoulos will discuss "A Cloud-Scale Characterization of Remote Procedure Calls" from SOSP'23   link

Nov 7: Open discussion: how transparent is transparent page placement, and what are its hidden costs?

Oct 31: Our own Yiwei Yang will present the SOSP paper "Memtis: Efficient Memory Tiering with Dynamic Page Classification and Page Size Determination", covering profile-guided and hardware-assisted tiering. link

Oct 24: OCP Global Summit and Samsung Memory Tech Day recap

Oct 17: No meeting due to OCP

Oct 10: Yiwei Yang will present Partial Failure Resilient Distributed Memory link. From the paper's abstract: "CXL-SHM, an automatic distributed memory management system based on reference counting. The reference count maintenance in CXL-SHM is implemented with a special era-based non-blocking algorithm. Thus, there are no blocking synchronization, memory leak, double free, and wild pointer problems, even if some participating clients unexpectedly fail without freeing their possessed memory references."
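To make the failure-tolerance idea concrete, here is a minimal single-process sketch of reference counting that survives client failure. This is our own illustration of the general idea, not the paper's era-based non-blocking algorithm (which is lock-free and distributed over CXL shared memory); the class and method names are invented for the example.

```python
# Illustrative sketch only: a coordinator tracks, per client, how many
# references that client holds, so a crashed client's leaked references
# can be subtracted without its cooperation.

class FailureTolerantRefCount:
    def __init__(self):
        self.era = 0                  # advanced on each detected failure
        self.count = 0                # total live references to the object
        self.per_client = {}          # client_id -> references held

    def acquire(self, client):
        self.count += 1
        self.per_client[client] = self.per_client.get(client, 0) + 1

    def release(self, client):
        self.count -= 1
        self.per_client[client] -= 1

    def client_failed(self, client):
        # Advance the era and drop every reference the dead client still
        # held, so its memory references are reclaimed even though it
        # never called release().
        self.era += 1
        leaked = self.per_client.pop(client, 0)
        self.count -= leaked
        return self.count == 0        # True if the object is now reclaimable

rc = FailureTolerantRefCount()
rc.acquire("client_a"); rc.acquire("client_b")
rc.release("client_a")
print(rc.client_failed("client_b"))  # prints "True": object reclaimable
```

The real system achieves the same effect without a blocking coordinator by tagging counter updates with eras, so surviving clients never wait on a failed one.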

Oct 3: Yiwei Yang will present his work on CXLMemUring: A Hardware Software Co-design Paradigm for Asynchronous and Flexible Parallel CXL Memory Pool Access. Link to the paper

Sep 26: A Sea of Accelerators? Vidushi Dadu of Google opens our Fall Quarter meetings by exploring whether and what should be accelerated for the data-intensive work of some of the largest services in the world, and the characteristics of these workloads. Link to the paper.

==

June 13 : Pankaj led the discussion of the "Unified Buffer: Compiling Image Processing and Machine Learning Applications to Push-Memory Accelerators" paper from Stanford. We discussed trade-offs of push and pull memory approaches and how taking the view of the push memory approach could help simplify data movement in the systems.    

May 30: Discussion of "A case for CXL-centric Server Processors" paper from Georgia Tech led by Pankaj. From the paper: "replacing all DDR interfaces to the processor with the more pin-efficient CXL interface."

May 23: Pankaj previewed his International Supercomputing Conference 2023 (Hamburg) Exacomm Workshop talk on "Principles for Optimizing Data Movement in Emerging Memory Hierarchies."

May 16: Discussion of the IEEE Micro paper "Design Tradeoffs in CXL-Based Memory Pools for Cloud Platforms" by Berger and Ernst.

May 2 (Tuesday): Priya Duraiswamy (Google), lead author of the ASPLOS '23 TMTS (Transparent Memory Tiering System) paper, will lead the discussion of their new work on a two-tier memory system in which the slow tier holds about 25 percent of the memory with minimal impact on performance. They use job classification to identify jobs that can effectively use slower memory, and proactively and stably move data into the cold tier, with a demonstrated ability to maintain a low promotion rate and thus a low expected access latency across tiers.

Apr 25: No meeting (moderator traveling)

Apr 18: Discussion of asynchronous access to far memory, led by Yiwei

April 11 (Tuesday): "Demystifying CXL Memory with Genuine CXL-Ready Systems and Devices" by Yan Sun, Yifan Yuan, Zeduo Yu, Reese Kuper, Ipoom Jeong, Ren Wang, and Nam Sung Kim

The paper carefully evaluates how best to use page interleaving between a large DDR DRAM and a small CXL DRAM. It advocates using the more pipelinable non-temporal stores on Sapphire Rapids (SPR) processors, as well as offloading far-memory manipulation to the new Data Streaming Accelerator (DSA). We compared this against more proactive approaches, such as the one Priya will describe at our May 2 meeting, and found the proactive approach to be the more likely path for hyperscalers.
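The interleaving tradeoff the paper studies can be approximated with a simple weighted-latency model. The latency numbers below are round-number assumptions for illustration, not measurements from the paper:

```python
# Expected load latency when accesses are split between local DDR and
# CXL-attached memory. Latencies are illustrative assumptions (ns).

def avg_latency_ns(cxl_fraction, ddr_ns=100, cxl_ns=250):
    """Weighted average latency for a given fraction of accesses on CXL."""
    return (1 - cxl_fraction) * ddr_ns + cxl_fraction * cxl_ns

for frac in (0.0, 0.25, 0.5):
    print(f"{frac:.0%} of accesses on CXL -> {avg_latency_ns(frac):.1f} ns")
```

Even this crude model shows why interleaving ratios and page placement matter: moving half the accesses to CXL raises average latency substantially, which is what both the hardware-assisted (non-temporal store / DSA) and proactive-tiering approaches try to mitigate.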

Mar 20 (Monday): An update from Andrew/Pooneh on how a tiered view of data heat and latency tolerance shows that data-intensive applications may be able to utilize Pond-style lower tiers quite well.

Mar 13

Mar 6

Feb 27

Feb 20 (Monday) 1pm

Feb 13 (Monday) 1pm

Feb 6 (Monday) 1 pm

Jan 30 (Monday) 1 pm

Jan 23 (Monday) 1 pm

The agenda is a quick update on CXL news and a quick roundtable to hear suggestions about talks people want to present and papers they want discussed this quarter (remaining 7 meetings).

Jan 16 (Monday) at 1pm

Graduate student Yiwei Yang (advised by Andrew Quinn) will discuss the design of his CXL Memory simulator and his learnings at Supercomputing 2022.

Jan 10 (Tues) Antonio Barbalace talk at 1PM (Zoom link above)

TITLE

Rethinking Systems Software for Emerging Data Center Hardware     

ABSTRACT

Today’s data center hardware is increasingly heterogeneous, including several special-purpose and reconfigurable accelerators that sit alongside the central processing unit (CPU). Emerging platforms also include heterogeneous memory: directly attached, NUMA, and over a peripheral bus. Furthermore, processing units (CPUs and/or accelerators) pop up in storage devices, network cards, and along the memory hierarchy (near-data processing architectures), introducing hardware topologies that didn't exist before.

Existing, traditional systems software has been designed and developed under the assumption that a single computer hosts a single CPU complex with directly attached or NUMA memory. Therefore, one operating system runs per computer, and software is compiled to run on a specific CPU complex. Within emerging platforms this no longer applies, because each different processing unit requires its own operating system and applications, which are not compatible with each other, making a single platform look like a distributed system, even when CPU complexes are tightly coupled. This makes programming hard and hinders a whole set of performance optimizations. This talk therefore argues that new systems software is needed to better support emerging non-traditional hardware topologies, and introduces new operating system and compiler designs to achieve easier programming and full exploitation of system performance.

BIO

Antonio Barbalace is a Senior Lecturer (Associate Professor) at the School of Informatics of the University of Edinburgh, Scotland. Before, he was an Assistant Professor in the Computer Science Department, at Stevens Institute of Technology, New Jersey. Prior to that, he was a Principal Research Scientist and Manager at Huawei, German Research Center, based in Munich, Germany. He was a Research Assistant Professor, and before a Postdoc, at the ECE Department, Virginia Tech, Virginia. He earned a PhD in Industrial Engineering from the University of Padova, Italy, and an MS and BS in Computer Engineering from the same University.

Antonio Barbalace’s research interests include all aspects of system software, embracing hypervisors, operating systems, runtime libraries, and compilers/linkers, for emerging highly-parallel and heterogeneous computer architectures, including near data processing platforms and new generation interconnects with coherent shared memory. His research seeks answers about how to architect or re-architect the entire software stack to ease programmability, portability, enable improved performance and energy efficiency, determinism, fault tolerance, and security. His research work appeared at top systems venues including EuroSys, ASPLOS, VEE, ICDCS, Middleware, EMSOFT, HotOS, HotPower, and OLS.

WEBSITE

http://www.barbalace.it/antonio/



CXL SIG celebrates Daniel's graduation. Congratulations, Dr. Bittman!

Rare talent. Dissertation-award-level work. Foundational. Bold. Superlatives just flew in the closed-door session of the committee. I have watched Daniel take an idea and commit to it. He has been an inspiration to his fellow grad students, and he has done justice to the often-ignored "Ph." part of the Ph.D. degree.

Daniel's contribution to the rapidly evolving world of memory was recognized at the prestigious USENIX ATC in 2020 with a Best Presentation award. But I have realized the importance of his work in action as I work closely with major SaaS analytics vendors, major semiconductor memory suppliers, and the world's leading virtualization researchers.

As "memoryness" spreads beyond RDMA, in space through disaggregation and in time through persistent memory, the need to rescue translation contexts from the process abstraction has become paramount. Daniel has done that by placing a foreign object table in every memory object, doing for memory what S3 did for storage. We at Elephance -- where Daniel is a cofounder -- believe in this idea deeply and are committed to bringing it to the world of CXL, working closely with our customers and partners in industry and government and with our growing network of academic and research collaborators.

Pankaj Mehra, President and CEO
Elephance Memory, Inc.

2022

Learnings from Existing Disaggregated Memory Systems

Workloads

Microarchitecture Co-evolution with CXL

Linux HMM scope, co-evolution with CXL, and limitations

On Sep 27, 2022, Yiwei gave an update with details showing where jemalloc and libnuma hook into SMDK. We need to run this by our Samsung collaborators for accuracy and to better understand their roadmap (link to Yiwei's SMDK deep-dive presentation).

Past Topics

Why CXL? Will it make sense despite added latency and cost in the memory access path relative to DDR DRAM? 

Document WIP