About the MSHR of the LLC miss with CXL.mem devices

In [1], the authors describe the Asynchronous Memory Unit (AMU), which needs co-designed support from both the CPU and the memory controller.

The overhead of hardware consistency checking is one reason that limits the capacity of traditional load/store queues and MSHRs. The AMU leaves the consistency issue to the software. They argue that software and hardware cooperation is the right way to exploit the memory parallelism over large latency for AMU.

As shown in the figure of sensitivity tests in [2], the decomposition analysis of DirectCXL shows a completely different result: no software overhead and no data copy overhead. As the payload grows, the main component of the DirectCXL latency becomes the LLC (CPU cache). This is because the Miss Status Holding Registers (MSHRs) in the CPU LLC can handle only 16 concurrent misses, so with a large payload many 64B memory requests are suspended on the CPU, and when processing a 4KB payload this accounts for 67% of the total latency.
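A back-of-the-envelope sketch of the stall, using only the two numbers from the text (64B requests, 16 MSHR entries). It assumes a simplified model in which misses retire in full batches; real MSHRs free entries one by one, so treat the round count as an upper-level intuition rather than a measurement. The function name `mshr_rounds` is made up for illustration.

```python
import math

CACHE_LINE_BYTES = 64   # size of one memory request, from the text
MSHR_ENTRIES = 16       # concurrent LLC misses, from the text


def mshr_rounds(payload_bytes, mshr_entries=MSHR_ENTRIES, line_bytes=CACHE_LINE_BYTES):
    """Return (cache-line requests, serialized rounds) for one payload.

    Simplifying assumption: the MSHRs fill up, drain completely, and
    then accept the next batch of misses.
    """
    lines = math.ceil(payload_bytes / line_bytes)
    rounds = math.ceil(lines / mshr_entries)
    return lines, rounds


for payload in (64, 1024, 4096):
    lines, rounds = mshr_rounds(payload)
    print(f"{payload:5d}B payload -> {lines:3d} requests, {rounds} round(s)")
```

Under this model a 4KB payload needs 64 requests, four times what the MSHRs can track at once, which matches the intuition that the remaining requests wait on the CPU side.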

The conclusion is that the MSHRs inside the CPU are not enough to handle the memory load in the CXL.mem world, where both latency and bandwidth vary widely across the serial PCIe 5.0 lanes. As for how this compares with the controller-side RDMA SRQ approach, we think the PMU and the coherency semantics still matter. Judging from Huawei's approach and the SRQ approaches, the future path for persistency will fall back to ld/st, but with smarter support in the MC that issues the loads and stores asynchronously.


  1. Asynchronous memory access unit for general purpose processors
  2. Direct Access, High-Performance Memory Disaggregation with DirectCXL

Copy-on-Pin: The Missing Piece for Correct Copy-on-Write @ASPLOS’23

Nadav has been working through the Intel extensions that support virtualization for VMware, and building security mitigations and debugging support on top of those extensions. He has also worked on things like userspace remote memory paging [2] to give VMware better disaggregation technology. The group has investigated IOMMU vulnerabilities under DMA [1] and remote TLB shootdown performance bugs (updating the page table incurs a TLB shootdown) [3], introducing concurrent flushing, early acknowledgment, cacheline consolidation, and in-context TLB flushes.

This paper examines the interaction between COW and pinned pages, which are pages that cannot be moved or paged out to allow the OS or I/O devices to access them directly.

Basically, we need to prevent COW sharing of pinned pages. The paper considers how COW interacts with other prevalent OS mechanisms such as POSIX shared mappings, caching references, and page pinning. It defines an invariant: if a page has any private writable mapping, that mapping must be the single exclusive mapping of the page. It also provides test cases that evaluate COW and page pinning via O_DIRECT read()/write() in combination with fork() and write accesses.
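The invariant can be written down as a tiny predicate. This is a toy model, not kernel code: `Mapping` and `cow_invariant_holds` are hypothetical names, and a page is reduced to the list of mappings that point at it.

```python
from collections import namedtuple

# Hypothetical model of one mapping of a physical page.
Mapping = namedtuple("Mapping", ["private", "writable"])


def cow_invariant_holds(mappings):
    """Check the invariant from the text for a single page.

    If any mapping of the page is private and writable, it must be
    the only mapping of that page (a single exclusive mapping).
    """
    has_private_writable = any(m.private and m.writable for m in mappings)
    return (not has_private_writable) or len(mappings) == 1


# After fork(): parent and child share the page read-only, which is fine.
shared_after_fork = [Mapping(private=True, writable=False),
                     Mapping(private=True, writable=False)]
# Broken state: one side is writable while the page is still shared.
broken = [Mapping(private=True, writable=True),
          Mapping(private=True, writable=False)]

print(cow_invariant_holds(shared_after_fork), cow_invariant_holds(broken))
```

A pinned page that ends up in the `broken` state is exactly the case where a device or the other process can observe writes it should not see.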

For the implementation, they built a tool similar to dynamic taint analysis that marks an exclusive flag on each page (CXL could possibly turn this into a hardware-software co-design, at cacheline or page granularity). The design also introduces refinements that avoid unnecessary copies and handle swapping, migration, and read-only pinning correctly. It was integrated into upstream Linux 5.19 with 747 added and 340 removed lines of code. Compared with two prior COW handling schemes, RelCOP introduces no noticeable overhead, and it outperforms PreCOP by up to 23% in the sequential access benchmark and 6% in the random access benchmark.


  1. Characterizing, Exploiting, and Detecting DMA Code Injection Vulnerabilities in the Presence of an IOMMU @EuroSys'20
  2. https://patentimages.storage.googleapis.com/74/32/e2/d300f0489ffc90/US20220398199A1.pdf
  3. Don't shoot down TLB shootdowns!

Moving Disaggregation to CXL

Today, after listening to talks on the latest pre-CXL RDMA work such as Carbink, AIFM, CompuCache, Infiniswap, Fastswap, MemLiner/Clover, Dinomo/RACE Hashing, Sherman, and FUSEE, I'm wondering how much disaggregated memory has actually been deployed in the RNIC manner.

We will be weighing implementation ideas from research papers against 3 critical requirements of Remoteable Pointers:

  1. Must work from the source as pointers even when the memory is far (requires zero implementation in CXL for the most part)
  2. Must work at the device for offloading pointer chasing to CXL memory device or pre-CXL memory node
  3. Must work at newly started compute without the friction of serialization-deserialization for independent scaling of memory and compute

Is Phantom address a good solution?

Is wasm a good solution?


  1. InfiniFilter: Expanding Filters to Infinity and Beyond @SIGMOD'23
  2. Sherman: A Write-Optimized Distributed B+Tree Index on Disaggregated Memory @SIGMOD'22

Is MMAP still good for the post-CXL era?

The short answer is no.

  1. MMAPing a huge file requires the OS to reserve a virtual address range to map the file onto. Once any access to the file is made, a page fault loads that part of the file from disk into private DRAM, sets up the va_to_pa mapping, and buffers the file part in DRAM, with the TLB possibly caching the translation for the next read. Every CXL device has its own mapping of memory; if you MMAP memory that was swapped onto a CXL.mem device such as a memory-semantic SSD, the SSD controller may decide whether to place the data in on-SSD DRAM or on the SSD media and, in the backend, writes everything through to the physical media. CXL vendors are eager to implement deferred allocation, which lazily binds physical memory to virtual memory, and this overlaps with the MMAP mechanism.
  2. MMAP + madvise/numabind to certain CXL-attached memory may cause migration effort. Once you dirty the pages, there is no transaction support for this in the CXL protocol yet, so the process has to take pains to implement the mechanism correctly. Instead, we can do something like TPP or CXLSwap and make everything transparent to applications. Or we can build 3D memory and extend the compute capability of the CXL controller so that it decides where to put the data and maintains the transaction beneath the physical memory.
  3. MMAP was originally designed for a fast memory tier paired with a slower disk tier such as HDDs. Say you are loading graph edges from a large HDD-backed pool: the frequently accessed part is defined in software as a stream pool for cold/hot data management. There MMAP can transparently leverage the OS page cache semantics, but this is not the case with more and faster endpoints. As the topology of CXL NUMA devices grows more complex, we can afford to handle fewer faults at a time and must serve requests closer to the speed of the main bus. Thus, we should not stall on page faults; they need to be handled on the endpoint side.
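To make the fault-driven path in point 1 concrete, here is a minimal sketch using Python's standard `mmap` module. The helper names (`make_sparse_file`, `read_byte_via_mmap`) are invented for this example. Mapping the file only reserves a virtual range; the kernel brings a page in when it is first touched, which is exactly the stall the list above argues against for fast CXL endpoints.

```python
import mmap
import os
import tempfile

PAGE = mmap.PAGESIZE


def make_sparse_file(num_pages):
    """Create a temp file of num_pages pages with a marker in the last byte."""
    fd, path = tempfile.mkstemp()
    with os.fdopen(fd, "wb") as f:
        f.seek(num_pages * PAGE - 1)
        f.write(b"\x7f")
    return path


def read_byte_via_mmap(path, offset):
    """Map the whole file read-only and touch a single byte.

    The mmap call itself does not read the file; the first access to
    m[offset] triggers a page fault and the kernel loads that page.
    """
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as m:
            return m[offset]


path = make_sparse_file(1024)
try:
    print(read_byte_via_mmap(path, 1024 * PAGE - 1))  # touches only the last page
finally:
    os.remove(path)
```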

Thus, we still need a management layer such as SMDK that combines jemalloc + libnuma + CXLSwap for CXL.mem. For interfacing with CXL.cache devices, I think deferred allocation and managing everything through virtual memory would be fine. We then don't need programming models like CUDA; rather, we can use static analysis through MLIR to give good data movement hints to every CXL controller's MMU and TLB. We could also treat the CXL.cache cacheline state as a streaming buffer, so that every endpoint reads it first and then applies its updates on the next write.


  1. https://db.cs.cmu.edu/mmap-cidr2022/
  2. https://blog.csdn.net/juS3Ve/article/details/90153094

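Oblivious RAM survey

I sat in on an Applied Cryptology class. The professor, like every computer architecture professor, pushes you hard to work on what he wants done. Cryptography research roughly falls into three kinds: defining a complete end-to-end system like CryptDB, weakening a cryptographic property to gain performance, or reducing the complexity of encryption. The usual approach is to find a dummy implementation and then optimize it step by step.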

  1. More secure
  2. More Efficient
  3. More Expressive and functional
    1. Various query types (Boolean, point, range, join, group-by,...)
    2. Dynamic query workloads
    3. Specialized for various DB scenarios (relational, graph, array DBs, ...)

Cryptography Recap

Crypto Property

  • One-time pad: the simplest instance is XOR (⊕)

  • RSA
  • Pseudo-Random Function
  • Privacy-Preserving
    • if (a==b) then Enc(a) = Enc(b) - Deterministic encryption
    • if (a<=b) then Enc(a) <= Enc(b) - Order-preserving encryption
  • CPA-security (Chosen-plaintext attack)
    • The CPA indistinguishability experiment $PrivK^{cpa}_{A,\pi}$, a weaker guarantee than perfect secrecy
  • Dynamic searchable encryption is CPA-secure
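Two of the items above are easy to show in a few lines. The sketch below implements the XOR one-time pad, and uses HMAC-SHA256 as a stand-in pseudo-random function to illustrate the equality leakage of a deterministic scheme (equal inputs give equal outputs). The PRF tag is an assumption made for illustration: it is not a decryptable deterministic cipher, it only demonstrates the `a==b ⇒ Enc(a)=Enc(b)` property.

```python
import hashlib
import hmac
import secrets


def otp_xor(key, message):
    """One-time pad: XOR each message byte with a key byte of equal length.

    Applying the same function again with the same key decrypts.
    """
    if len(key) != len(message):
        raise ValueError("one-time pad key must be as long as the message")
    return bytes(k ^ m for k, m in zip(key, message))


def prf(key, value):
    """HMAC-SHA256 used as a pseudo-random function (deterministic per key)."""
    return hmac.new(key, value, hashlib.sha256).digest()


message = b"hello"
pad = secrets.token_bytes(len(message))   # fresh random key, used once
ciphertext = otp_xor(pad, message)
print(otp_xor(pad, ciphertext))           # round-trips to the message

k = secrets.token_bytes(32)
print(prf(k, b"alice") == prf(k, b"alice"))   # equal inputs are linkable
```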

- SDa
- High-level idea: organize $N$ updates in a collection of at most $\log_2 N$ independent encrypted indexes

The amortized cost can be de-amortized by constructing dummy entries and sorting them obliviously.
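A minimal sketch of the amortized version of this idea, assuming the usual binary-counter layout: level `i` is either empty or holds exactly `2**i` updates, and an insert merges full levels upward like a carry. Plain Python lists stand in for the encrypted indexes, and `insert_update` is a hypothetical name. The de-amortization with dummies and oblivious sorting is not shown.

```python
def insert_update(levels, update):
    """Insert one update into a list of levels, binary-counter style.

    levels[i] is None (empty) or a list of exactly 2**i updates. A real
    scheme would rebuild and re-encrypt an index at each merge; here a
    plain list stands in for an encrypted index.
    """
    carry = [update]
    i = 0
    while True:
        if i == len(levels):
            levels.append(None)
        if levels[i] is None:
            levels[i] = carry
            return levels
        # Level is full: merge it into the carry and move one level up.
        carry = levels[i] + carry
        levels[i] = None
        i += 1


levels = []
for u in range(11):
    insert_update(levels, u)
# 11 = 0b1011, so levels 0, 1 and 3 are occupied.
print([len(level) if level else 0 for level in levels])
```

The number of occupied levels follows the binary representation of the update count, which is where the logarithmic bound on the number of indexes comes from.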

  • Oblivious algorithm

    • The sorting algorithm must not break the oblivious property; for example, Bitonic Sort costs $O(N\log^2 N)$
    • Oblivious sorted multimaps (AVL tree)
  • MPC

  • A TEE does not need to download everything each time; instead it places the client inside the enclave and exchanges data with outside devices, which saves transmission time.

  • ORAM hides the access pattern to the RAM from the attacker

  • Backward Privacy
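For the oblivious-algorithm item, a bitonic sorting network is the standard example: the sequence of index pairs it compares depends only on the input length, never on the values. The sketch below is a plain iterative bitonic sort for power-of-two lengths that also counts compare-exchange steps. Note that the Python `if` around the swap is itself data-dependent; a real oblivious implementation would use a constant-time conditional swap.

```python
def bitonic_sort(values):
    """Sort a power-of-two-length sequence with a bitonic network.

    Returns (sorted list, number of compare-exchange steps). The pairs
    (i, partner) visited are a function of len(values) only.
    """
    a = list(values)
    n = len(a)
    if n & (n - 1):
        raise ValueError("length must be a power of two")
    comparisons = 0
    k = 2
    while k <= n:                 # size of the bitonic sequences being merged
        j = k // 2
        while j >= 1:             # distance between compared elements
            for i in range(n):
                partner = i ^ j
                if partner > i:
                    ascending = (i & k) == 0
                    lo, hi = a[i], a[partner]
                    if (lo > hi) == ascending:
                        a[i], a[partner] = hi, lo
                    comparisons += 1
            j //= 2
        k *= 2
    return a, comparisons


print(bitonic_sort([3, 1, 2, 0]))
```

Each of the `log N (log N + 1) / 2` stages performs `N / 2` compare-exchanges, which gives the $N\log^2 N$ growth mentioned above.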


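Encrypting and decrypting a search scheme over ciphertext is, as you can imagine, painfully slow. There are many optimizations around rebalancing and reordering operations.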

Searchable Encryption

How do we guarantee that no information is leaked during the search, especially the access pattern?


Security people also like writing simulators.

A simulator makes it easier to test whether an algorithm is effective and to prove properties like adaptive indistinguishability.

Would constructing all the keys and padding each result to a power-of-2 size reduce the database result-size leakage?

What about adding random (key, value) pairs to $T$, so that the total number of elements is $2N$?
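The first question can be made concrete with a short sketch: pad every result list with dummies up to the next power of two, so an observer only learns the size bucket instead of the exact size. The names `next_power_of_two` and `pad_result` are illustrative, and whether this leakage reduction is sufficient is exactly the open question above.

```python
def next_power_of_two(n):
    """Smallest power of two that is >= n (and at least 1)."""
    return 1 if n <= 1 else 1 << (n - 1).bit_length()


def pad_result(ids, dummy=None):
    """Pad a result list with dummy entries up to the next power of two."""
    ids = list(ids)
    target = next_power_of_two(len(ids))
    return ids + [dummy] * (target - len(ids))


# Result sizes 5, 6, 7 and 8 all look the same after padding.
print({n: len(pad_result(range(n))) for n in (1, 3, 5, 6, 7, 8, 9)})
```

With $N$ real entries the padded total stays below $2N$, which is the same storage budget as the random-pair idea in the second question.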

Optimal or Quasi-Optimal Search Time

Use ORAM Map + Backward privacy

I/O Efficient Searchable Encryption

Define Locality and Read Efficiency

One-choice Allocation

Relax the property and accept a small probability of leaking the access pattern: Page-Efficient SSE.

  • Uniform random document/tuple id reassignment
  • Compress the index
  • Encrypt the compressed index

Private Range Queries

The key idea is to transform range queries into point queries.
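One common way to do this transformation (an assumption here, since the notes do not say which construction the lecture used) is a dyadic decomposition: every value is indexed under all of its aligned power-of-two ancestors, and a range is covered by a few such blocks, each of which becomes a point query. Both function names below are made up for the sketch.

```python
def dyadic_cover(lo, hi):
    """Cover the inclusive range [lo, hi] with aligned power-of-two blocks.

    Each block is (level, index) and spans
    [index << level, ((index + 1) << level) - 1].
    """
    nodes = []
    while lo <= hi:
        level = 0
        # Grow the block while it stays aligned at lo and inside the range.
        while lo % (1 << (level + 1)) == 0 and lo + (1 << (level + 1)) - 1 <= hi:
            level += 1
        nodes.append((level, lo >> level))
        lo += 1 << level
    return nodes


def labels_for_value(x, bits):
    """All dyadic blocks containing x in a domain of size 2**bits."""
    return [(level, x >> level) for level in range(bits + 1)]


# A range query over [2, 7] turns into two point queries.
print(dyadic_cover(2, 7))
```

At index time each record is stored under every label from `labels_for_value`; at query time the client issues one point lookup per block returned by `dyadic_cover`.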

Leakage Abuse Attacks


The basic idea is to put an aside ORAM next to your ORAM and use a hash function to split it $2^n$ ways.

Can leak the private join operation

ORAM introduction

A simple ORAM using Shuffling

This oblivious shuffling basically tells us that the mapping costs O(log N) for O(1) storage, O(N) for O(log N) storage, and O(N log N) for storage(N1)


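Shuffling has many variants: double-buffered, amortized square-root shuffle, and cubic. The security proof is that, over $\sqrt N$ operations, the expectation of searching the whole set of data slots works out exactly, so the buffer can grow to $\sqrt N$.

On every read, each level except the first is looked up with the PRF under key $K_i$, and the result is then put into the first level. Level $i$ is reshuffled once every $2^i$ operations.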
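The reshuffle schedule is easy to tabulate. The sketch below is a literal reading of the rule that level i is reshuffled once every 2^i operations; the function names are invented, and real hierarchical ORAMs fold cascading reshuffles into a single merge into the deepest affected level.

```python
def levels_to_reshuffle(op_count, num_levels):
    """Levels i in 1..num_levels whose 2**i period ends at this operation."""
    return [i for i in range(1, num_levels + 1) if op_count % (1 << i) == 0]


def reshuffles_per_level(total_ops, num_levels):
    """How many times each level is reshuffled over total_ops operations."""
    return {i: total_ops // (1 << i) for i in range(1, num_levels + 1)}


for t in range(1, 9):
    print(t, levels_to_reshuffle(t, num_levels=4))
```

Level i holds on the order of 2^i items and is touched once per 2^i operations, so each level contributes a similar amortized cost per access.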

Existing ORAM database




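This DB is written in Rust. The idea is to place the client inside SGX and make it doubly oblivious, i.e. ORAM clients make data-independent accesses to client memory (within the enclave), so that neither the server nor the client can see the other's access pattern. Path ORAM keeps the stash and the position map in the enclave client and the items in an untrusted binary tree, and obtains the result by tracing a path; this operation is singly oblivious ORAM.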
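A compact sketch of the Path ORAM access loop described here: position map and stash live on the trusted client, buckets live in an untrusted binary tree. It is a teaching model under several assumptions: no encryption, no dummy padding of buckets, an unbounded stash, and client-side data structures that are not themselves oblivious (so this is the singly oblivious variant). The class name and parameters are hypothetical.

```python
import random


class PathORAM:
    def __init__(self, height, bucket_size=4, rng=None):
        self.num_leaves = 1 << height
        self.bucket_size = bucket_size
        # Untrusted storage: a complete binary tree in heap layout.
        self.tree = [[] for _ in range(2 * self.num_leaves - 1)]
        self.position = {}   # trusted: block id -> assigned leaf
        self.stash = {}      # trusted: block id -> value
        self.rng = rng if rng is not None else random.Random()

    def _path(self, leaf):
        """Node indices from the root down to the given leaf."""
        node = leaf + self.num_leaves - 1
        path = [node]
        while node > 0:
            node = (node - 1) // 2
            path.append(node)
        return path[::-1]

    def access(self, block_id, new_value=None):
        """Read (and optionally overwrite) a block; returns the old value."""
        leaf = self.position.get(block_id, self.rng.randrange(self.num_leaves))
        # Remap the block to a fresh random leaf before touching storage.
        self.position[block_id] = self.rng.randrange(self.num_leaves)
        path = self._path(leaf)
        # Read the whole path into the stash.
        for node in path:
            for bid, val in self.tree[node]:
                self.stash[bid] = val
            self.tree[node] = []
        value = self.stash.get(block_id)
        if new_value is not None:
            self.stash[block_id] = new_value
        # Write back from leaf to root, pushing blocks as deep as allowed.
        for node in reversed(path):
            fits = [b for b in self.stash
                    if node in self._path(self.position[b])]
            chosen = fits[:self.bucket_size]
            self.tree[node] = [(b, self.stash.pop(b)) for b in chosen]
        return value


oram = PathORAM(height=3, rng=random.Random(0))
oram.access("k", new_value="v")
print(oram.access("k"))
```

The server only ever sees one full root-to-leaf path read and rewritten per access, and the leaf is re-randomized each time, which is what hides the access pattern.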

DORAM reads the trace on the server side by separating blocks from fetched blocks.
Read: fetch a path once, replace the dummies, add the block, and add a dummy path into the buckets.
Write Back:



Attacks cast on making ORAM non-malicious



  1. https://people.eecs.berkeley.edu/~raluca/oblix.pdf
  2. https://arxiv.org/pdf/2106.09966.pdf
  3. https://www.cs.umd.edu/~jkatz/papers/sqoram.pdf
  4. https://cseweb.ucsd.edu//~cdcash/oram-slides.pdf
  5. https://keystone-enclave.org/open-source-enclaves-workshop/slides/OSEW19_RohitSinha_VisaResearch.pdf

Paper List

  1. RingORAM---Constants Count: Practical Improvements to Oblivious RAM
  2. Path Oblivious Heap: Optimal and Practical Oblivious Priority Queue
  3. Bucket Oblivious Sort: An Extremely Simple Oblivious Sort
  4. Fast Fully Oblivious Compaction and Shuffling
  5. Oblix: An Efficient Oblivious Search Index
  6. Snoopy: Surpassing the Scalability Bottleneck of Oblivious Storage
  7. Opaque: An Oblivious and Encrypted Distributed Analytics Platform
  8. Efficient Oblivious Database Joins
  9. Pancake: Frequency smoothing for encrypted data stores
  10. Snapshot-Oblivious RAMs: Sub-Logarithmic Efficiency for Short Transcripts
  11. SHORTSTACK: Distributed, Fault-tolerant, Oblivious Data Access
  12. Meltdown: Reading Kernel Memory from User Space
  13. Observing and Preventing Leakage in MapReduce


The comparison between RMA based memory disaggregation and CXL.mem based memory disaggregation.

The span plus coherency state in Carbink is much like cacheline coherency in CXL.mem, except that when two threads contend on one span, the span bounces back and forth. That is the charm of cacheability: CXL does not need the cacheline to be transmitted, because the lines are registered in the window of the local LLC.

Many of the software optimizations exist because the penalty of transmitting small chunks over RDMA is so large. If we replace RDMA with CXL, we no longer need to care about pointer serialization and relinking, because everything lives in the same memory space. Maintaining per-page metadata is still a huge overhead. The local page map is a two-level radix tree. The lookup process is similar to a page table walk: the first 20 bits of the object's virtual address index the first-level radix tree table, and the next 15 bits index the second-level table. The same mapping method allows Carbink to map the virtual address of a locally-resident span to its metadata. Thus, in the CXL era, this paper offers little to refer to.
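The two-level lookup can be sketched as follows. The 20-bit and 15-bit widths come from the text; the 48-bit address width (and therefore the 13-bit in-page offset) is an assumption, and `PageMap` with nested dictionaries is a stand-in for the real radix tree tables.

```python
ADDRESS_BITS = 48   # assumption: x86-64 user-space virtual addresses
L1_BITS = 20        # from the text: first-level index
L2_BITS = 15        # from the text: second-level index
OFFSET_BITS = ADDRESS_BITS - L1_BITS - L2_BITS   # 13 under the assumption above


def split_address(vaddr):
    """Split a virtual address into (level-1 index, level-2 index, offset)."""
    l1 = (vaddr >> (L2_BITS + OFFSET_BITS)) & ((1 << L1_BITS) - 1)
    l2 = (vaddr >> OFFSET_BITS) & ((1 << L2_BITS) - 1)
    offset = vaddr & ((1 << OFFSET_BITS) - 1)
    return l1, l2, offset


class PageMap:
    """Two-level map from virtual address to per-page metadata."""

    def __init__(self):
        self.root = {}   # level-1 index -> {level-2 index -> metadata}

    def set(self, vaddr, metadata):
        l1, l2, _ = split_address(vaddr)
        self.root.setdefault(l1, {})[l2] = metadata

    def get(self, vaddr):
        l1, l2, _ = split_address(vaddr)
        return self.root.get(l1, {}).get(l2)


pm = PageMap()
pm.set(0x7F00_1234_6000, {"span": "local"})
print(pm.get(0x7F00_1234_6010))   # same page, different offset
```

Every object access pays for this walk, which is the metadata overhead the paragraph above refers to.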

The difference between EC-Split (their implementation of Hydra) and EC-Batch lies in the critical path of the memory transaction. To reconstruct a single span, a compute node must contact multiple memory nodes to pull in all the required fragments. This requirement makes the swap operation vulnerable to stragglers and thus increases tail latency. Their compaction and defragmentation approach saves remote memory usage but brings no real performance gain when the local-to-remote ratio is above 50%. They gain only about 10% more on the local side by hiding the span swap operations.


  1. https://www.usenix.org/conference/osdi22/presentation/zhou-yang
  2. https://www.google.com/search?q=hydra+fast+21&oq=hydra+fast+21&aqs=chrome..69i57j33i299l3j33i22i29i30i625l6.4597j1j4&sourceid=chrome&ie=UTF-8

WebAssembly Micro Runtime Internals

This document is WIP.


First, initialize with the memory allocation options.

You can define the serialized memory data section in the same place and first initialize it into the desired memory format.


The Wasm layer provides an interface for TensorFlow to call into, which acts as a very lightweight ABI that a better backend can codegen against.

graph-builder: list<u8>
graph-builder-array: list<graph-builder>


  1. https://tfhub.dev/tensorflow/bert_en_uncased_L-12_H-768_A-12/4

R2: An Application-Level Kernel for Record and Replay

When I reviewed the paper in the past, I was surprised that recently proposed techniques like JIT, MLIR, and eBPF could be a great fit for legacy tools such as record and replay, security live patching, and kernel modeling.


  1. https://www.usenix.org/legacy/event/osdi08/tech/full_papers/guo/guo.pdf
  2. https://nimrodpar.github.io/assets/publications/rr.pdf
  3. https://iacoma.cs.uiuc.edu/iacoma-papers/hpca18.pdf