## LATTE

### Exploring Performance of Cache-Aware Tiling Strategies in MLIR Infrastructure

Intel's oneDNN approach on top of MLIR.

Versal ACAP HLS

### A Scalable Formal Approach for Correctness-Assured Hardware Design

By the master Jin Yang; he gave this talk at AHA before.

## Session 1B: Shared Memory/Mem Consistency

### Probabilistic Concurrency Testing for Weak Memory Programs

Hits bugs faster.

The heuristic for h is good enough for data-structure tests, and the assertion tests look great. When I was at ShanghaiTech, people were using the same tool on PM (persistent memory).
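
For context, the underlying PCT idea as I understand it (the standard formulation: random distinct thread priorities plus d-1 random priority-change points; the weak-memory variant in this paper additionally randomizes which buffered values a load may observe, omitted here), in a minimal sketch with hypothetical names:

```python
import random

def pct_schedule(threads, k, d):
    """Minimal PCT-style randomized scheduler.

    threads: dict thread_id -> iterator; advancing an iterator runs one
             step of that thread.
    k: estimate of the total number of steps; d: target bug depth.
    PCT guarantees a >= 1/(n * k**(d-1)) chance of hitting any depth-d
    bug, where n is the number of threads.
    """
    ids = list(threads)
    random.shuffle(ids)
    prio = {t: d + i for i, t in enumerate(ids)}       # distinct, all >= d
    change_points = sorted(random.sample(range(1, k + 1), d - 1))
    used, step = 0, 0
    while threads:
        t = max(threads, key=prio.__getitem__)         # run highest priority
        try:
            next(threads[t])                           # one step of t
        except StopIteration:
            del threads[t]                             # t finished
            continue
        step += 1
        if used < d - 1 and step == change_points[used]:
            used += 1
            prio[t] = d - used                         # drop t below the rest
```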

### MC Mutants: Evaluating and Improving Testing for Memory Consistency Specifications

Transforms disallowed memory behaviors into weak-memory-labeled tests (the mutants).

## Coding and Cloud Storage

### Practical Design Considerations for Wide Locally Recoverable Codes (LRCs)

• MR-LRC: every group has a local redundancy as well as global redundancy.
• Use Cauchy LRC to predict the distance.
• 96-105 chunks per stripe; reliability compared via >=6-failure tests, time to recover, and mean time to data loss.

VAST Data's approach? LDC: the workload matters.
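
To make the local/global redundancy point concrete, a minimal sketch (my illustration, not the paper's construction) of single-failure local repair: each local group carries an XOR parity, so one lost block is rebuilt from its small group rather than from the whole wide stripe:

```python
from functools import reduce

def xor(blocks):
    """Bytewise XOR of equal-length blocks."""
    return bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*blocks))

# Each local group of data blocks carries one XOR parity; a single lost
# block is rebuilt from its small group, not from the whole wide stripe.
group = [bytes([i] * 8) for i in range(4)]   # one local group, 4 blocks
parity = xor(group)
lost = group[2]
survivors = group[:2] + group[3:]
assert xor(survivors + [parity]) == lost     # local repair touches 4 blocks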

### ParaRC: Embracing Sub-Packetization for Repair Parallelization in MSR-Coded Storage

Repair penalty of RS codes:

• Reduce repair bandwidth: the amount of traffic transferred over the network.

For the state-of-the-art (4,2) Clay code, the block size is 256MB while the repair bandwidth and MRL (maximum repair load) are both 384MB.

Repair can be pipelined by chunking each block into sub-blocks, giving (bandwidth, MRL) => (512MB, 256MB); the drawback is that this relies on the repair computation being additively associative. But we can instead sub-packetize the block.
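
A toy sketch of the pipelining point (my illustration; names are hypothetical): each helper's block is chunked, and partial XOR sums are forwarded stage by stage, which is only valid because XOR is associative:

```python
def pipelined_repair(helper_blocks, chunk):
    """Rebuild a lost block as the XOR of helper blocks, chunk by chunk.

    Assumes the chunk size divides the block size. Each chunk's partial
    sum is forwarded helper-to-helper like a pipeline, so repair latency
    approaches one block transfer while no helper link ever carries more
    than one chunk at a time.
    """
    n = len(helper_blocks[0])
    out = bytearray(n)
    for off in range(0, n, chunk):
        partial = bytes(chunk)                   # zeroed running sum
        for blk in helper_blocks:                # one pipeline stage each
            piece = blk[off:off + chunk]
            partial = bytes(a ^ b for a, b in zip(partial, piece))
        out[off:off + chunk] = partial
    return bytes(out)
```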

pECDAG to parallelize the repair of Clay codes.

### InftyDedup: Scalable and Cost-Effective Cloud Tiering with Deduplication

Cloud tiering of backup data requires deduplication at the cloud-hypervisor level. They get the fingerprints and semantics from the cloud tier and send the requests to the local tier.

The GC algorithm is basically BatchDedup.

Fingerprint indexing is kept on SSD.
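
A minimal sketch of the fingerprint-driven flow (hypothetical API; fixed-size chunking for brevity, where real systems usually use content-defined chunking):

```python
import hashlib

def dedup_upload(data, index, store, chunk=4 << 20):
    """Split data into fixed-size chunks and upload only new fingerprints.

    index: dict fingerprint -> refcount (the refcounts are what a
           BatchDedup-style batch GC would later consume).
    store: dict fingerprint -> bytes, standing in for the cloud tier.
    Returns the recipe (fingerprint list) needed to restore the file.
    """
    recipe = []
    for off in range(0, len(data), chunk):
        c = data[off:off + chunk]
        fp = hashlib.sha256(c).hexdigest()
        if fp not in index:
            store[fp] = c          # new chunk: write to the cloud tier
            index[fp] = 0
        index[fp] += 1             # duplicate: bump the reference count
        recipe.append(fp)
    return recipe
```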

## Key-Value Stores

### ADOC: Automatically Harmonizing Dataflow Between Components in Log-Structured Key-Value Stores for Improved Performance

Does RocksDB increase the memtable size by itself?

Does RDO control compaction? Compaction across all levels.

Could RDO use an MMO encoder / Transformer? Future work.

### FUSEE: A Fully Memory-Disaggregated Key-Value Store

Client-centric index replication.
Remote memory allocation: RACE hashing + SNAPSHOT.

RACE hashing: one-sided RDMA hashing.

Primary and write-write conflicts are resolved by last-writer-wins (the majority writer), which then updates the other replicas accordingly.

Compared with an MDS-based design: hashing the key gives direct RDMA access to the result, so a read takes only 1 RTT.

embedded operation log
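
My loose reading of the conflict-resolution idea, as a toy sketch (the dicts stand in for index replicas reachable via one-sided RDMA; this is a simplification, not FUSEE's actual SNAPSHOT protocol):

```python
from collections import Counter

def resolve_write_conflict(replicas, my_val):
    """Toy last-writer(majority)-wins resolution over index replicas.

    replicas: list of dicts standing in for remote index replicas that
    clients reach via one-sided RDMA. Each client writes its value into
    any empty slot, reads all replicas back, takes the value held by the
    majority as the winner, and repairs the divergent replicas.
    """
    for r in replicas:
        r.setdefault('slot', my_val)        # write if the slot is empty
    winner, _ = Counter(r['slot'] for r in replicas).most_common(1)[0]
    for r in replicas:
        r['slot'] = winner                  # repair losers to the winner
    return winner
```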

### ROLEX: A Scalable RDMA-oriented Learned Key-Value Store for Disaggregated Memory Systems

Learned-index-based data movement and allocation requires recomputing decoupled learned indexes. The data leaves are fixed-size thanks to the retraining-decoupled algorithm (derived mathematically), which makes migration easier.
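
A sketch of why fixed-size leaves help (my illustration, hypothetical class): a linear model predicts the leaf number, the error bound caps the search span, and because leaves are fixed-size the predicted leaf number translates to a remote address with one multiply:

```python
import bisect

class LearnedLeafIndex:
    """Toy learned index over sorted keys with fixed-size leaves.

    Fixed-size leaves mean a predicted leaf number maps to a remote
    address with one multiply, so a client could fetch the candidate
    range [pred - err, pred + err] with a single one-sided RDMA read.
    """
    def __init__(self, keys, leaf_size):
        self.leaves = [keys[i:i + leaf_size]
                       for i in range(0, len(keys), leaf_size)]
        n = len(keys)
        ys = [i // leaf_size for i in range(n)]          # true leaf numbers
        mx, my = sum(keys) / n, sum(ys) / n
        var = sum((x - mx) ** 2 for x in keys) or 1e-9
        self.a = sum((x - mx) * (y - my) for x, y in zip(keys, ys)) / var
        self.b = my - self.a * mx
        # Worst-case prediction error bounds the search span.
        self.err = max(abs(round(self.a * x + self.b) - y)
                       for x, y in zip(keys, ys))

    def lookup(self, key):
        pred = round(self.a * key + self.b)
        lo = max(pred - self.err, 0)
        hi = min(pred + self.err, len(self.leaves) - 1)
        for li in range(lo, hi + 1):                     # one ranged read
            i = bisect.bisect_left(self.leaves[li], key)
            if i < len(self.leaves[li]) and self.leaves[li][i] == key:
                return (li, i)                           # leaf, offset
        return None
```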

Consistency guarantee: it feels like quite a few RDMA operations are needed to keep the models synchronized.

A bit as a lock? How do you recover from a stuck lock? Future work.

Is the model refreshed on every read? First check the first entry of the SLT, and update the model only if it has changed.

## File Systems

### CJFS: Concurrent Journaling for Better Scalability

Compound flush: when a transaction finishes, the cache (DMA; could this be hijacked via CXL?) barrier sends the interrupt to the storage device.

### Unsafe at Any Copy: Name Collisions from Mixing Case Sensitivities

Name collisions; there was a git CVE about this before.

### ConfD: Analyzing Configuration Dependencies of File Systems for Fun and Profit

Compared with Hydra

### Fisc: A Large-scale Cloud-native-oriented File System

RPC serialization? There are no pointers: the docker/container metadata are all flat (offsets only), so their RPC is almost RDMA over memcpy and very lightweight. The load balancer works only at file granularity. If the DPU hardware fails, they directly migrate the docker container via RDMA while other requests fall back to a slow path; it is fast and users don't notice. If users feel IO has slowed down, they are given a full-path trace.
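
A toy sketch of what offset-only flat serialization looks like (hypothetical layout, not Fisc's wire format): the header stores offsets and lengths, so the receiver uses fields in place after one memcpy/RDMA transfer:

```python
import struct

HDR = struct.Struct('<IIII')   # path_off, path_len, data_off, data_len

def pack_flat(path, data):
    """Pack a request into one flat buffer: offsets in the header, then
    the payloads. No pointers, so the receiver can use every field in
    place after a single memcpy/RDMA transfer; nothing to deserialize."""
    p = path.encode()
    header = HDR.pack(HDR.size, len(p), HDR.size + len(p), len(data))
    return header + p + data

def read_path(buf):
    off, length = struct.unpack_from('<II', buf, 0)
    return buf[off:off + length].decode()

assert read_path(pack_flat('/vol/a.txt', b'hello')) == '/vol/a.txt'
```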

## Persistent Memory Systems

### MadFS: Per-File Virtualization for Userspace Persistent Memory Filesystems

file + dummy (2MB granularity): it rearranges file writes on DAX PMem, which introduces write amplification. Concurrency control uses CAS. Concurrent metadata writes did not seem to be handled; I later understood that the semantics of open() guarantee writes won't contend.

### On Stacking a Persistent Memory File System on Legacy File Systems

A stackable memory filesystem on top of a normal filesystem. The upper in-memory filesystem is fairly lightweight: an extent RB-tree. The sync-filesystem factor design resembles dirty-page writeback.

Extent hashing is kept in sync with the layer below.

Ext-CAS+Ext-TAA

## IO Stacks

### λ-IO: A Unified IO Stack for Computational Storage

Why not native? For a unified framework. They found that the mmap part is all computation time, so: load from the device via mmap + page faults and trigger userspace WebAssembly computation? That would make mmap very fast.

CXL + eBPF, with a small page table per transaction? Quickly gather metrics from the cacheline network and decide between load/store and DMA?

### SMRSTORE: A Storage Engine for Cloud Object Storage on HM-SMR Drives

15% reduction in time.

## yyw's 2022 Year-End Summary

On May 25 I went to Guangzhou for a visa appointment and ran into many high-school classmates. What my high-school classmates taught me is that it was a zero-sum game: someone who can clearly score higher on the gaokao is taking resources out of your hands, yet they were a crowd that would stop at nothing to grab those resources, aloof and arrogant, never sharing with you. I don't think that was healthy competition. But after going through the crucibles of Tsinghua, the UM-SJTU Joint Institute, and Tongji, they too have learned that everyone has their own path. I am indeed the only one from my high school doing a CS PhD. After Guangzhou I went to Zhuhai, Shenzhen, Xiangtan, Huaihua, and Zhangjiajie. Was it really a roots-seeking journey for me? It was the last time I would see my grandfather. I'm glad I went to see him once more.

## täko: A Polymorphic Cache Hierarchy for General-Purpose Optimization of Data Movement

(Rust-like naming: PhantomData is used to mark a type that is otherwise never used.) Here the idea is lifted to the cache hierarchy: object-level data movement is turned into software-defined data movement tailored to different workloads.

There are callback functions defining how the hierarchy communicates with the dataflow fabric: the hardware scheduler lets the engines invoke onMiss(), onEviction(), and writeback(). They simply manifest each operation with SHARED and PRIVATE state changes, and I don't think these three callbacks alone can make a memory-order-correct morph.
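
Roughly what such a software-defined policy could look like (the callback names onMiss()/onEviction()/writeback() are from the talk; everything else is my hypothetical sketch, with no coherence handling, which is exactly my worry above):

```python
class PrefetchMorph:
    """Toy software-defined data-movement policy ("morph").

    A next-line prefetcher expressed purely via the three callbacks
    mentioned in the talk; note there is no coherence handling here.
    """
    def __init__(self, cache, memory, line=64):
        self.cache, self.memory, self.line = cache, memory, line

    def onMiss(self, addr):
        # Fetch the missing line and, speculatively, the next one.
        for a in (addr, addr + self.line):
            self.cache[a] = self.memory.get(a, 0)

    def onEviction(self, addr, dirty):
        if dirty:
            self.writeback(addr)
        self.cache.pop(addr, None)

    def writeback(self, addr):
        # Push the cached value back to backing memory.
        self.memory[addr] = self.cache[addr]
```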

In terms of power, my view is that saving energy via PIM or a modified iMC means the core and the MC no longer need to communicate as much, while dataflow-based analysis inside the iMC or NoC may intrinsically reduce traffic and thus provide an energy-efficient solution.

However, this type of design fully exposes memory to attackers through speculation and Rowhammer, which will surely force vendors to hand users a black box if they ever want to commercialize it.

## The Linux Scheduler: a Decade of Wasted Cores

The background is that Linux uses CFS, a weighted-fair-queueing-style algorithm. CFS time-slices the CPU using a red-black tree to pick the running thread; without heavy I/O this can perform better than the old O(1) scheduler. Each CPU's run queue cfs_rq maintains a min_vruntime field recording the minimum vruntime over all processes in that queue. A new process's initial vruntime is set based on the min_vruntime of the run queue it joins, keeping it within a reasonable gap of the older processes.
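
A sketch of the vruntime bookkeeping (weights follow the kernel's convention that nice 0 maps to 1024 and each nice step scales by roughly 1.25x; simplified, with a list standing in for the red-black tree):

```python
NICE_0_WEIGHT = 1024  # kernel convention: weight of a nice-0 task

def weight(nice):
    # Each nice level changes the weight by roughly 1.25x.
    return int(NICE_0_WEIGHT / (1.25 ** nice))

def update_vruntime(se, delta_exec_ns):
    """vruntime advances more slowly for heavier (lower-nice) tasks, so
    the leftmost (minimum-vruntime) task is the one that has had the
    least *weighted* CPU time."""
    se['vruntime'] += delta_exec_ns * NICE_0_WEIGHT // weight(se['nice'])

def pick_next(cfs_rq):
    # CFS picks the leftmost node of the red-black tree; a plain min()
    # over a list stands in for the rbtree here.
    return min(cfs_rq, key=lambda se: se['vruntime'])
```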

### Pros

A dormant process gets compensated in vruntime when it wakes up, so it is quite likely to grab the CPU on wakeup. That is the intent of the CFS algorithm: to guarantee the responsiveness of interactive processes, which sleep frequently while waiting for user input.

Imagine each interactive operation, such as hitting the keyboard or moving the mouse: to the system a new task arrives -> its runtime is 0 -> its vruntime is effectively 0 (relative to min_vruntime) -> it is placed at the leftmost node of the scheduling red-black tree -> the leftmost node is reached through a special cached pointer.
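
Roughly what the wakeup compensation looks like, modeled loosely on the kernel's place_entity() (simplified; the exact credit differs across kernel versions):

```python
def place_entity(min_vruntime, se, initial, sched_latency_ns=6_000_000):
    """Place a new or freshly woken task on the run queue.

    A sleeper is credited at most half a scheduling latency below
    min_vruntime, so it likely wins the CPU on wakeup without being
    able to starve everyone; a brand-new task starts at min_vruntime
    so it cannot monopolize the CPU either.
    """
    if initial:
        se['vruntime'] = min_vruntime
    else:
        se['vruntime'] = max(se['vruntime'],
                             min_vruntime - sched_latency_ns // 2)
```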

### Cons

1. Priority failure. For example, take two processes: process A has nice 0 and an allocated time slice of 100ms; process B has nice +20 and a slice of 5ms. If the scheduler switches every 5ms, then after A runs 5ms it switches to B, which also runs 5ms, so within 10ms A and B run for the same amount of time, and so on: within 100ms A and B get equal time, which is obviously inconsistent with the slices we allocated. (This is how I understand it; I read it several times and still find it confusing.)
2. Relative nice values. If two processes have nice 0 and nice 1, they are allocated slices of 100ms and 95ms respectively; but at nice 18 and 19 they get 10ms and 5ms, so the same one-level difference yields a 2x ratio instead of ~1.05x, and the tiny slices also cause more context switches, which is less efficient for the whole CPU (CFS's weight-based fix is sketched after this list).
3. Absolute time slices need to align with the timer tick. When a slice expires the process must be switched, which requires a timer interrupt to trigger it, so slices cannot be split finer than the timer tick. (This could be computed using a separate value decoupled from the timer tick.)
4. Priority-adjustment problems: the priority of a newly woken process has to be raised so that it gets running faster.
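
CFS's weight-based answer to points 1 and 2, sketched: shares depend only on nice differences, so nice (0, 1) and nice (18, 19) now split the CPU identically:

```python
def weight(nice):
    # ~1.25x per nice level, nice 0 -> 1024 (kernel convention)
    return 1024 / (1.25 ** nice)

def shares(nices):
    total = sum(weight(n) for n in nices)
    return [round(weight(n) / total, 3) for n in nices]

print(shares([0, 1]))    # [0.556, 0.444]
print(shares([18, 19]))  # [0.556, 0.444] -- same ratio, unlike 10ms vs 5ms
```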

There are no crashes or out-of-memory situations, and tools like htop, sar, or perf cannot detect the missing short-term idle periods, which makes the issues reported in this paper hard to find. The authors' first tool is a "sanity checker": it checks that no core is idling while waiting threads sit in the run queue of a different core, permits this condition for a brief time, and issues an alert if it persists (a sketch of this invariant appears at the end of this section). The second tool is a visualizer that displays scheduling activity over time, so the number of run queues, their overall load, and the cores considered during periodic load balancing and thread wakeups can all be profiled and plotted.

Scheduling, in the sense of allocating CPU time among threads, was once considered a solved problem; the paper shows this is untrue. A straightforward scheduling principle grew into a highly intricate, bug-prone implementation to accommodate the complexity of modern hardware. Scheduling waiting threads onto idle cores breaks a fundamental work-conserving invariant: runnable threads can be held in run queues for seconds while cores sit idle, causing significant application slowdowns. By their nature these bugs are hard to find with standard tools; the authors repair them, identify their root causes, and provide tools that make finding and correcting such bugs much easier.

As for the future: the Linux kernel, as a codebase, is universal, so it is difficult for the community to focus on multi-core issues alone; the community cares most about maintainability, not performance. If a new Linux feature runs fine on a 128MB-memory i386 machine, that is considered OK. As long as no more than 80% of people hit a new problem, the community never cares, and precisely because of this the community keeps introducing bugs, which is lamentable. My opinion is that the community is just a community of programmers who take the code as the criterion; it does not pay much attention to architecture and new features, which are all left to vendors. Likewise with TCP, which I have always criticized: people focus on the TCP implementation code itself, which makes it ever more complex and then ever more fragile. You might call that evolution, but isn't there a chance to go back to the drawing board before all hell breaks loose? It has not evolved to the point where it must keep evolving, has it? If someone stood outside the process and had the power to mandate changes, that garbage TCP would presumably have been gone long ago.
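
Back to the sanity checker mentioned above, here is a toy sketch of the work-conserving invariant it enforces (my reconstruction, not the authors' tool):

```python
import time

def sanity_check(runqueues, idle_cores, state, tolerance_s=0.1):
    """Alert when a core idles while another core has waiting threads.

    runqueues: dict core -> number of runnable-but-waiting threads.
    idle_cores: set of currently idle cores.
    state: dict carried across calls to time how long the condition
    persists; a brief overlap is tolerated, a persistent one is a bug.
    """
    waiting = any(n > 0 for core, n in runqueues.items()
                  if core not in idle_cores)
    if not (idle_cores and waiting):
        state.pop('since', None)
        return False
    since = state.setdefault('since', time.monotonic())
    return time.monotonic() - since > tolerance_s   # True => report
```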