The comparison between RMA based memory disaggregation and CXL.mem based memory disaggregation.

The span+coherency state in Carbink is just like cacheline coherency in CXL.mem but except that if two threads contention on one span it will go back and forth, that's the charm of cachable that CXL don't need the cacheline be transmitted but they are registered in the window of local LLC.

A lot of the software optimization is based on the panelty of small chunks transmission of RDMA is too huge that if we replace with CXL, we don't need to care ptr serialization and relinking because they are in the same memory space. maintaining a metadata of pages is still a huge overhead. The local page map is a two-level radix tree. The lookup process is similar to a page table walk: the first 20 bits of the object's virtual address are indexed to the first level radix tree table, and the next 15 bits are indexed to the second level table. The same mapping method allows Carbink to map the virtual address of a locally-resident span to its metadata. Thus this paper in era of CXL is useless, nothing to refer.

The difference of EC-Split(their implementation of Hydra) and EC-Batch is the critical path of the memory transaction. To reconstruct a single span, a compute node must contact multiple memory nodes to pull in all the required fragments. This requirement to contact multiple memory nodes makes the swap operation vulnerable to deviators, thus increasing the tail latency. And their compaction and de-fragmentation approach is to save the remote data usage but has no upgain for performance actually for their local vs remote upper than 50%. They only gain 10% for more on local side by the hiding of the span swap operations.



Dagger: Efficient and Fast RPCs in Cloud Microservices with Near-Memory Reconfigurable NICs @ASPLOS21

The current RPC layer between nodes can be scaled by utilizing NUMA variants like NEBULA [1] or soNUMA, which provides on-chip buffer locality awareness of the data movement. A lot of RDMA RPC Accelerator/ SmartNIC implements these kinds of locality information when sending on LLC or on RDMA Shared Receive Queue(to enable the inter-endpoint buffer sharing).

For tail latency that newly arriving RPCs will inevitably violate the SLO and are eagerly NACKed, informing the client early about increased load conditions on the server, NEBULA will reject or retry(fail-fast approach) the requests predicted. The NIC to core policy is implemented naturally using DDIO+fast packet reassembly, but this dispatch is not cached into L1 directly.

Dagger implemented FPGA reconfigurable RPC stack-integrated microservices. Their eventual goal is to enable NUMA(future support CXL) over FPGA RPC. Their initiative is still if the microservice like NGINX/MongoDB is not latency bound and memory bound, just disaggregate using their NIC to the same memory space. The RPC is the prevailing IDL gRPC. The CPU/NIC communication is very similar to the approach in the HFT industry DDIO+cache hash function+avx stream load cl, WQE-by-MMIO.

Their load balancer+ flow scheduler on FPGA is very similar to a CXL.cache device bias CHA. That CHA normally distributes Req on-chip or cross Socket through UPI.

When CXL FPGA NIC's in, I think it can cache cl onto L1 and use hw to replace the cc. I still think there's more OS-level sync that can leverage the cc of CXL.cache.



[CSE231 Paper Reading] Filesystem Reflection

SSDs and HDDs design difference.

We described two approaches to building file systems, clustering, and logging, for Hard Disk Drives (HDDs). How should we design a file system for a Solid State Drive (SSD)? Your answer should identify two hardware properties of SSDs that contrast with the properties of HDDs and describe how these properties inform your design decisions.

In the first FS prototype of UNIX, FFS, we used inode and dirent for interpreting the abstraction for file, those designs are aware of the physical media of the HDDs of how addressing a space will be faster and the page size for balancing swapping the physical memory onto the disks. This type of filesystem is not designed for SSDs. The SSDs have NAND QLC/SLC/MLC as physical media to store files and the on-chip controller is equipped with the ability of LBA to page map translation, drive channels, write-behind, allocation, and wear leveling (considered harmful in [1]). Nowadays, the firmware FS is designed similarly to Log-structured FS because it can fast recover from physical media failure with low latency cost.

EXT4 is always evolving with new media and smarter design. Ramzi group from the University of Wisconsin still publishes papers on EXT4 core design like leveraging device mapper deduplication [2], and how data path can be more critical [3].

Starting from PMFS, people are getting close to how FS can be designed if the persistent memory is attached as memory or more recently as a CXL.mem. 1. with the hardware speeding up, we require software to be low latency as well. e.g. tracer, hypervisor, and multi-tenancy isolation management 2. the ordering and durability of a single cacheline or a single page, or a single pointer should be dealt in the firmware level, either resolved by locks, lock-free hardware mechanism, or other smart API support. 3. workloads aware optimization by providing performance hints and advise interface like madvise() in Linux. The storage of ML model serialization is completely different from those configured to be plain memory(no ptr or hierarchical logic) dump. 4. optimize consistency using a combination of atomic in-place updates, logging at cacheline granularity (fine-grained journaling), and copy-on-write (CoW)

Replace POSIX?

In our discussion of single-level stores, we identified a few problems with using this design for all interactions with persistent storage. In this question, describe one potential issue that arises when using a single-level store for persistent storage. How does using a POSIX/file system interface simplify/eliminate the issue that you described?

The VFS is long discussed overhead inside the kernel that causes higher latency by random scheduler overhead or lower bandwidth by blk or other mechanisms of kernel occupying the queue. But we don't currently have a proposal to replace it since the kernel must rely on the blk pread/pwrite & bdi write back all this stuff. [4] provides

The semantics for POSIX is too strong for distributed FS. If you could put something on the fly in the CXL cache layer, that is bounded for data appearance for already ready. that will be

/ Software-defined SmartSSD/ CXLSSD



Achieving 100Gbps Intrusion Prevention a Single Server @OSDI' 20

FPGA offload 入侵检测, performance/power efficiency balance。这篇的insight就是要做above 100Gbps TCP的on chip计算,比如负载均衡、能量负载、安全等如果放在CPU或者PIM上算都太慢了,所以搞了这个near NIC的computation。在交易所网络包发送的过程中也有类似的需要更改简单逻辑的场景,运用smartNIC在保证volatility的条件下可以大大减少延时。这篇的第二个insight是用到了intel hyperscan尽可能software 提速匹配IDS/IPSA。

第三个insight是硬件调度优化三个操作Regex matching rules/TCP reassemble/other task。

其中full matcher轮询一个由DMA engine填充的环形缓冲器。每个数据包都携带元数据,包括MSPM确定为部分匹配的规则ID(hyper scan的后半部分)。对于每个规则ID,完全匹配器检索完整的规则(包括正则表达式)并检查是否完全匹配。

TCP resembler 是一个ooo的设计。packets会先渠道fast path, 再到一个bram的cuckoo hashing table(flow table), insertion table 会弥补不同执行时间的ooo engine。

为了减少在FPSM的hash table lookup,其还写进去了一个SIMD shift or matching(显然应该不会比商用的intel的fpu写的快。(不过在FPGA上塞这么多逻辑.

täko: A Polymorphic Cache Hierarchy forGeneral-Purpose Optimization of Data Movement

(Rust-like naming: phantom data(is used to label undeclared type.) Here used to get you the data at object level movement into software-defined data movement for different workloads.

They have a callback function for how it communicates with the dataflow fabric. The hardware scheduler allows the engine to invoke onMiss() onEviction() writeback(). They simply manifest each operation with SHARED and PRIVATE state changes and I don't think simply these three callbacks can make a memory order correct Morph.

In terms of power saving, my view of saving energy by using PIM or modified iMC means you don't need to communicate well between the core and the MC, while the dataflow-based analysis inside the iMC of NoC may intrinsically reduce traffic and thus can provide an energy-efficient solution.

However, this type of design fully exposes the memory to the attacker by speculation and row hammer, which will definitely give the user a black box if they want to make it commercially available.



[CSE231 Paper Reading] Log Structure Filesystem


Computer hardware performance began to explode: CPUs were getting faster and RAM was getting bigger; however, while sequential read and write speeds of hard disks were increasing, random read and write speeds, which were limited by physical seek times, were hardly shorter than 10 ms. On the other hand, file systems at the time, whether Unix File System or FFS, had a large number of random reads and writes (at least 5 random writes are required to create a new file in FFS), thus becoming a performance bottleneck for the whole system. At the same time, because of Page cache, the authors argue that random reads are not the main problem: with more and more memory, most of the reads can be cached, so the main problem of LFS is to reduce random writes to the hard disk.


  1. File System as a Log
  2. Segment-based bulk writing solves the random write problem, but how does LFS implement read operations? Similar to UFS/FFS, LFS stores the contents of a file within a segment, and also stores the index of the file. Specifically: in Segment, the file contents are stored in a fixed-size data block. Segment0 stores the two data blocks of file2, and the subsequent inode2 stores the indexes of these two data blocks. However, unlike UFS/FFS, LFS inodes are dynamically allocated, so LFS stores an index to the inode at the end of each Segment, called the inode map. in LFS, all inode map contents are cached in the contents, which speeds up reads.
  3. Garbage collection: As mentioned earlier, LFS requires a garbage collection mechanism designed to remove old data. In LFS, multiple segments containing obsolete data blocks are compacted into new data segments, and the old data in them is deleted.
  4. Failure recovery is obviously important for any file system to be able to recover data from a failure. Unlike UFS, which uses the fsck command to recover from a failure, LFS stores a checkpoint for the entire drive: because each Segment of LFS stores the address of the next Segment, the entire file system is organized like a chain. In Checkpoint, LFS stores the address of the first Segment and the last Segment of this chain, so the entire file system can be recovered by reading Checkpoint. lfs updates the data in Checkpoint every 30 seconds.


Figure 3/4 shows the disk capacity utilization and fraction alive the segment cleaned with write cost are better than FFS today for uniform data and 10% hot data and 90% cold data. Figure 5 shows that in terms of segment utilization, the hot-and-cold data are better than uniform. Figure 6 shows that the segment distribution occurs when the cost-benefit policy is used to select segments to clean and live blocks grouped by age before being re-written.

Evaluation effectiveness

This evaluation is designed sophistically.


Rosenblum is a co-founder of Vmware and Ousterhout is one of the authors of Raft. I don't have any comments on their novelty, just they are tooooo ahead of time.

SoTA Connection

The Segment-based design of LFS coincides with the physical characteristics of SSDs and is therefore widely used in SSD firmware; the memory table/compaction of LSM is in line with the memory buffer and GC of LFS, and new file systems such as btrfs are based on the append-only feature of LSM to implement copy-on-write or multiversion. newer file systems such as btrfs also implement copy-on-write or multi-version features based on the LSM append-only feature. The EXT3/4 also has redo logging with journal/ordered/writeback mode.

Also, the Parallel Log-Structure FS follows the LFS to make parallel FS tailored to N-N write which speeds up the MPI write.



[CSE231 Paper Reading] The UNIX fast file system


  1. The original UNIX operating system was FFS's forerunner and was straightforward and user-friendly. But there were some serious issues with it:
  • Poor performance was the central issue: The disk bandwidth utilization in some circumstances, as measured in this paper, was only 3%. Moreover, the throughput was poor.
  • Low locality was the cause of this: Inodes and data blocks were dispersed around the disk because the original UNIX file system viewed the disk as a random-access memory.
  • Fragmentation of the file system was another issue due to careless management of the available space. The file system became fragmented as more and more data came in and went out; as a result, the free list eventually pointed to blocks scattered around the disk, making it challenging to access logically contiguous files without repeatedly traversing the disk. Tools for disk defragmentation initially provided a solution to this issue.
  • A more petite size was excellent for eliminating internal fragmentation (waste within the block) but poor for transfers because each block required a placement overhead. The original block size was too small (512 bytes).


FFS guarantees that accessing two files simultaneously won't necessitate extensive disk search times by grouping

  • The superblock's location is rotated, replicated, and stored in several cylinder groups. This lessens the possibility of data loss as a result of superblock corruption.

Layout guidelines

Simple is the basic mantra: Keep similar things close together and irrelevant things separate.

  • FFS looks for the cylinder group with the freest inodes and the fewest allocated directories to place directories.
  • Setting up files: A file's data blocks are first allocated to the same group as its inode. In the same cylinder group as other files within the same directory, file inodes are also given.
  • A file system block now has a 4096-byte size instead of the previous 512-byte size (4 KB). This raises performance due to
  • More data is accessed with each disk transfer.
  • It is possible to describe more files without having to visit indirect blocks (since the direct blocks now contain more data)

Block Size

A file system block now has a 4096-byte size instead of the previous 512-byte size (4 KB). This raises performance due to more data being accessed with each disk transfer. It is possible to describe more files without having to visit indirect blocks (since the direct blocks now contain more data)


Table 2b shows the reads and writes are around 3% to 47% faster in disk bandwidth.

Besides, FFS provided specific functionality improvements to the file system that have become standard in modern operating systems and will probably help FFS expand its user base:

  1. Long file names: File names are now almost limitless in length (previously: 8 characters. now: 255 characters)
  2. Programs can now lock files using advised shared or exclusive locks down to the individual file level.
  3. Symbolic links: Users now have considerably greater flexibility because they may construct "aliases" for any other file or directory on a system.
  4. Renaming files: Operation of atomic rename()
  5. Quotas: Limit the number of inodes and disk blocks a user may receive from the file system.
  6. Crash Consistency: FFS relies on the clean unmount of the filesystem to avoid consistency issues. In case of crash or power shortages, file system users have to invoke and wait for the lengthy file system consistency checker, fsck, which will detect consistency issues and attempt to recover without guarantee.

Evaluation effectiveness

I generally believe the paper since it adds up to 47% of the performance compared to the previous paper. I have a question for this paper 1. read-write lock performance compared with mandatory lock in different workloads 2. block wear leveling. 3. directory inode slow traverse. (here EXT4 fast start resolves this HERMES)


Spreading file blocks on disk can hurt performance, especially when accessing files sequentially. But this problem can be solved by carefully choosing the block size. Specifically, if the block size is large enough, the file system will spend most of its time transferring data from disk and only (relatively) little time looking for data between blocks of blocks. This process of reducing overhead by increasing the amount of work per operation is called amortization and is a common technique in computer systems.

The second clever mechanism is a disk layout that is optimized for performance. The problem occurs when files are placed in contiguous sectors. Suppose FFS first reads block 0, and when the read is complete, requests the first block, but by this time, the head has already crossed over block 1, and if it wants to read block one, it must reevaluate again. FFS solves this problem with a different data layout. By spreading the data out, FFS has enough time to request the next block before the head crosses the block. FFS can calculate for different disks how many blocks should be jumped for placement to avoid additional spins; this technique is called parametrization. Such a layout gets only 50% of the peak bandwidth since you have to bypass each track twice to read each block once. Fortunately, modern disks are smarter: the disk reads the entire way as a cache, and then, for subsequent reads of the track, only the required data is returned from the store. So the file system doesn't have to worry about these low-level details.

SoTA connection

The latest paper, HTMFS in FAST 22, still cites this paper for discussing a clean fcsk. FFS has no consistency guarantee if a crash happens before a clean unmount.

EXT4, F2FS, and so many PM FSes are obviously the descendent of FFS. However, the current FS more focus on the performance of the software with regard to the fast media of storage CXL attached persistent memory. We've seen hacks in software stacks like mapping all the operations like writing combining operations to firmware or wraps into a fast data structure like ART. We see the SoTA FS stack more referring to LFS or PLFS rather than FFS in terms of journaling and firmware FS.


  1. HTMFS: Strong Consistency Comes for Free with Hardware TransactionalMemory in Persistent Memory File Systems

Scheduler Activations

For managing the concurrency using threads, we have 1. the user-level library to make management in the application's address space. + performant and flexible - less functional 2. kernel modification + high functionality - pool performance, flexibility. Thus we demand a kernel interface combined with a user-level thread package, which can gain the above 2 pros.

User-level Threads are usually lightweight, the current design of the fiber library, can predict the process well and squeeze the context switch to the full.


The upcall performance drains down the performance (a better message call interface in the kernel using eBPF(XDP for network) or io_uring to asynchronously submit the activation may resolve this problem.



Lottery Scheduler


deterministically fairly to give time tick to another process. Current scheduler designs like CFS and ULE's drawback 1. At best, priority schemes are ad hoc. Top ranking always prevails. A feedback loop is used to change priorities to attain "fair share" over the (very) long run 2. P priorities on Unix are dynamic. 3. Lower-priority t due to priority inversion. 4. Schedulers are intricate, and other tasks may stop high-priority tasking to debate.


A probabilistic process scheduling algorithm is lottery scheduling. Each process is given a small number of lottery tickets, and the scheduler holds a drawing to select a key at random. The process with the winning ticket is given the CPU. The distribution of the ticket keys could not be uniform. A process has a greater chance of winning and being chosen for execution if it has more tickets than the competition. The problem of famine is resolved, and probabilistic fairness is assured because a process with at least one ticket key will ultimately win a lottery. The distribution and allocation mechanism utilized determines the system's throughput and responsiveness.

Ticket transfers: This technique is used when a client is held up while holding a resource due to some dependencies while another client is waiting for the first client to release the shared resource. The concept behind this is to shift the tickets to the blocked client to be more likely to execute, accomplish the task faster, and release the resource sooner to benefit both clients. Any client they depend on must be able to transfer their tickets to one or more other clients.

Ticket Inflation: The Cost of Tickets A client can produce new tickets for themselves in this case, which is seen as an alternative to direct ticket transfers. On the one hand, this strategy may cause issues because a client may monopolize a resource by printing several lottery tickets. However, inflation/deflation can be utilized to modify resource allocations without explicit client-to-client communication, resulting in higher throughput. Tickets for Compensation If a client only consumes a portion f of the resource unit allotted to it (for instance, CPU cycles/time), it may be given compensation that increases the number of tickets it receives by a proportion of 1/f. This guarantees equity and improved responsiveness when it comes to process scheduling. Interactive processes commonly show this behavior.

Experiment for Significance

They try to prove the effectiveness of fairness and flexible control for multiple resources to contend. They also discussed the concurrency control, but they didn't bring up the searching for ticket time which is O(log n)

Experiment Fallacy

The Monte Carlo experiment provides the dynamic control feature but not giving a good idea of the inner functionality's percentage. I think compensation tickets are unnecessary because, in the long term, the odds should be balanced between the clients. Because we have to keep track of how many compensation tickets each client has, there is a cost associated with this. On the other hand, I believe this is a good illustration of the fresh possibilities the system's dynamic presents.
Also, the graphs were unreadable, and it would have been better to use normalized charts.


The novelty is it describes a different scheduling system while you study this document regarding time-sharing tools. Probably the most straightforward scheduling algorithm is round-robin. It just keeps the resource occupied and has a high client throughput. But it's too easy. It does not offer the system any control over how much of the resourcehelp is allocated to each client. To address this, multilevel queue scheduling is used. The user has the option to group customers and assigns each group a distinct priority. We get control in exchange for new issues like priority inversion and famine. To address this issue further, mechanisms like priority inheritance and aging have been included. The scheduling algorithms have become more complex and challenging to comprehend.


It's a good approach, but not fast in today's operating system because of the complexity. If you write a distributed system, you definitely will use Round Robin + Consistent Hash and will not consider the scheduling with the nonconstant algorithm. But in other user applications, this scheduling will take place.
Also, we need to consider other metrics like the workload and context switch time to decide whether we need to use this, like implementing CFS/REAL_TIMERT/lottery and fallback to the different schedulers by different workloads.



ESESC Project Proposal

CSE220的project是写个模拟器,暂时想写的是(类承影的 RVV GPU SM subsystem。) 没空了,不写了.这个emulated和sniper最大的区别是这个模拟器是timesampling based,精度会高很多。

project organization

    qemuStarted     = true;
    QEMUArgs *qdata = (struct QEMUArgs *)threadargs;

    int    qargc = qdata->qargc;
    char **qargv = qdata->qargv;

    MSG("Starting qemu with");
    for(int i = 0; i < qargc; i++)
      MSG("arg[%d] is: %s", i, qargv[i]);

    qemuesesc_main(qargc, qargv, NULL);

    MSG("qemu done");



上层先走qemu先写了个bootloader,拿到Physical memory(其在qemu里插了trace记录memory。qemu 走完了(所以这里为毛不用JIT,可能是当时开发的局限吧。然后这边接一个指令parser。


  // 4 possible states:
  // rabbit  : no timing or warmup, go as fast as possible skipping stuff
  // warmup  : no timing but it has system warmup (caches/bpred)
  // detail  : detailed modeling, but without updating statistics
  // timing  : full timing modeling


-exec bt
#0  CCache::CCache (this=0x555558ae4360, gms=0x555558ae4280, section=0x555558ae42c0 "IL1_core", name=0x555558ae4340 "IL1(0)") at /home/victoryang00/Documents/esesc/simu/libmem/CCache.cpp:78
#1  0x0000555555836ccd in MemorySystem::buildMemoryObj (this=0x555558ae4280, device_type=0x5555589fa820 "icache", dev_section=0x555558ae42c0 "IL1_core", dev_name=0x555558ae4340 "IL1(0)") at /home/victoryang00/Documents/esesc/simu/libmem/MemorySystem.cpp:61
#2  0x000055555586c99e in GMemorySystem::finishDeclareMemoryObj (this=0x555558ae4280, vPars=..., name_suffix=0x0) at /home/victoryang00/Documents/esesc/simu/libcore/GMemorySystem.cpp:320
#3  0x000055555586c497 in GMemorySystem::declareMemoryObj (this=0x555558ae4280, block=0x5555589ee260 "tradCORE", field=0x555555a7e573 "IL1") at /home/victoryang00/Documents/esesc/simu/libcore/GMemorySystem.cpp:242
#4  0x000055555586bb0e in GMemorySystem::buildMemorySystem (this=0x555558ae4280) at /home/victoryang00/Documents/esesc/simu/libcore/GMemorySystem.cpp:148
#5  0x000055555581c764 in BootLoader::createSimuInterface (section=0x5555589ee260 "tradCORE", i=0) at /home/victoryang00/Documents/esesc/simu/libsampler/BootLoader.cpp:281
#6  0x000055555581c1c4 in BootLoader::plugSimuInterfaces () at /home/victoryang00/Documents/esesc/simu/libsampler/BootLoader.cpp:219
#7  0x000055555581cba1 in BootLoader::plug (argc=1, argv=0x7fffffffd888) at /home/victoryang00/Documents/esesc/simu/libsampler/BootLoader.cpp:320
#8  0x000055555572274d in main (argc=1, argv=0x7fffffffd888) at /home/victoryang00/Documents/esesc/main/esesc.cpp:37

CPU Model

指令被qemu issue了以后会有个taskhandler接,prefetcher和address predictor 也实现在libcore里面.有LSQ和storeSet的实现,就是那个RMO的Load Store forward queue. OoO/prefetch实现也是最全的.

Memory Model

触发ld/st/mv等memory指令的时候会直接call memorysubsystem 核心内的L1/L2都会有MemObj记录现在的地址,自动机抽象成MemRequest,noblocking先查询MSHR,到了不同MemObj更新状态.到Memory侧有controller分配dim和Xbar分配load的路径.

GPU Model

这个走的是cuda模型,编译器会生成riscv host代码和gpu device代码,SM 会执行GPU代码,然后他们的SM流处理器记录的东西和CPU部分一样的metrics.

power emulation

用了一个pwr+McPAT的power model来完成libpeq的模拟,Component category可以模拟L1I等等,还有实时更新的McPT和Gstat的counter。只是现在暂时只有SRAM和cache的model,也没有row buffer的模拟和leakage的模拟,假装所有东西都是正确运行的。