OSDI/ATC22 Attendance


ATC's capacity is probably about 1.5x OSDI's — something like two tracks over two days. ATC has a heavier industry concentration, while OSDI tends to pick the novel directions; its selection logic really is a bit like the Oscars'.

OSDI

Keynote from Google




Distributed Storage and Far Memory

BlockFlex: Enabling Storage Harvesting with Software-Defined Flash in Modern Cloud Platforms

A paper from Jian Huang's group; the presenter seems to be the same person who did the secure-HDD work before. It is a learning-based storage harvesting framework.



MemLiner: Lining up Tracing and Application for a Far-Memory-Friendly Runtime

The idea can be understood as: every time a reference to a distant object is discovered, delay it and hold it for a few rounds (e.g., put it into another queue). All of the discovered references still get traced eventually, until no unmarked object remains, so the closure is still reached.
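A minimal sketch of that delay-and-hold idea, with made-up names rather than MemLiner's actual code: references to remote objects get parked in a separate queue for a few rounds, but are never dropped, so marking still reaches the full closure.

```python
from collections import deque

DELAY_ROUNDS = 3  # hypothetical: how long a remote reference is held back

def mark_closure(roots, is_remote, children):
    """Trace the object graph; remote refs are deferred, never dropped."""
    local_q = deque(roots)
    deferred = deque()          # (object, rounds_remaining)
    marked = set()

    while local_q or deferred:
        # Drain local references first.
        while local_q:
            obj = local_q.popleft()
            if obj in marked:
                continue
            marked.add(obj)
            for ref in children(obj):
                if ref in marked:
                    continue
                if is_remote(ref):
                    deferred.append((ref, DELAY_ROUNDS))  # hold it back
                else:
                    local_q.append(ref)

        # Age deferred references; re-inject the ones whose delay expired.
        still_waiting = deque()
        for ref, rounds in deferred:
            if rounds <= 0:
                local_q.append(ref)
            else:
                still_waiting.append((ref, rounds - 1))
        deferred = still_waiting
        # If nothing local remains, force the deferred refs through so we terminate.
        if not local_q and deferred:
            local_q.extend(ref for ref, _ in deferred)
            deferred.clear()

    return marked
```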






Carbink: Fault-Tolerant Far Memory

Fault tolerance for far memory. For high availability a node can tie three replicas together with Raft; erasure coding is the other option. This paper uses the spanset approach for monitoring, plus some heuristic parameters, to guarantee fault tolerance.
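To see why erasure coding is attractive versus three full replicas, here is a toy XOR-parity sketch (nothing Carbink-specific, all names are mine): one parity span per group survives the loss of any single span at a fraction of the memory cost.

```python
def xor_bytes(a, b):
    return bytes(x ^ y for x, y in zip(a, b))

def make_parity(spans):
    """One parity span over a group of equally sized data spans."""
    parity = bytes(len(spans[0]))
    for s in spans:
        parity = xor_bytes(parity, s)
    return parity

def recover(spans, parity, lost_idx):
    """Rebuild the span at lost_idx from the survivors plus the parity."""
    rebuilt = parity
    for i, s in enumerate(spans):
        if i != lost_idx:
            rebuilt = xor_bytes(rebuilt, s)
    return rebuilt

# 3 data spans + 1 parity span, vs. keeping 3 full replicas of each span.
spans = [b"\x01" * 8, b"\x02" * 8, b"\xff" * 8]
parity = make_parity(spans)
assert recover(spans, parity, 1) == spans[1]
```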




Bugs

Metastable Failures in the Wild




The logic for finding them is the same as profiling.

Demystifying and Checking Silent Semantic Violations in Large Distributed Systems













The inference rules are produced at post-scan time from what was collected during the scan.

PM

ListDB: Union of Write-Ahead Logs and Persistent SkipLists for Incremental Checkpointing on Persistent Memory

It writes an index-unified log on PM; a NUMA-aware skiplist stores pointers, merged by a write-optimized zipper compaction, and a lookup cache is kept in DRAM to reduce NUMA overhead. (Once CXL brings external memory controllers, will we even need to care about this anymore?)
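A rough sketch of the "DRAM lookup cache in front of NUMA-local indexes" idea; purely illustrative — the real ListDB uses persistent skiplists and zipper compaction, and every name below is made up.

```python
class NumaShardedIndex:
    """Toy model: per-NUMA-node indexes behind a DRAM-resident lookup cache."""

    def __init__(self, num_nodes):
        self.num_nodes = num_nodes
        self.per_node = [dict() for _ in range(num_nodes)]  # stand-ins for PM skiplists
        self.lookup_cache = {}                              # DRAM cache: key -> NUMA node id

    def put(self, key, value):
        node = hash(key) % self.num_nodes        # whatever placement policy
        self.per_node[node][key] = value
        self.lookup_cache[key] = node            # remember the placement in DRAM

    def get(self, key):
        node = self.lookup_cache.get(key)
        if node is not None:                     # fast path: one targeted probe
            return self.per_node[node].get(key)
        for node, index in enumerate(self.per_node):  # slow path: probe every node
            if key in index:
                self.lookup_cache[key] = node
                return index[key]
        return None
```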

ODinFS





An in-kernel FS that takes NUMA, bandwidth, and thread thrashing into consideration on one device.

Uses thread delegation: decoupled daemon threads handle the local PM accesses.
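A toy model of the delegation idea (my own simplification, not ODinFS code): application threads never touch PM directly; they enqueue requests, and a few daemon threads pinned near the PM perform the actual accesses, which bounds the number of threads hammering each PM DIMM.

```python
import queue
import threading

class PMDelegator:
    """Toy model: client threads enqueue PM writes; daemon threads apply them."""

    def __init__(self, num_daemons=2):
        self.requests = queue.Queue()
        self.pm = {}                               # stand-in for a PM region
        for _ in range(num_daemons):
            threading.Thread(target=self._daemon, daemon=True).start()

    def _daemon(self):
        # In the real system this thread would be pinned to the PM's NUMA node.
        while True:
            offset, data, done = self.requests.get()
            self.pm[offset] = data                 # the only place PM is touched
            done.set()

    def write(self, offset, data):
        done = threading.Event()
        self.requests.put((offset, data, done))
        done.wait()                                # synchronous completion for simplicity

d = PMDelegator()
d.write(0x1000, b"hello")
```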




MMAP: like NOVA MMAP propagate and then MUNMAP

Durinn: Adversarial Memory and Thread Interleaving for Detecting Durable Linearizability Bugs






Construct the test:







Serverless

ORION: Optimized Execution Latency for Serverless DAGs




Metrics

Immortal Threads


Under low battery, threads save their stack and some RAM state to DRAM; many data structures can be ported on top of this.

Debugging the OmniTable Way



Storage

XRP: In-Kernel Storage Functions with eBPF



TriCache

Machine Learning 2

Microsecond-scale Preemption for Concurrent GPU-accelerated DNN Inferences

This one is impressive. REEF modifies the ROCm driver. We know that for low latency we need a bounded scheduler, and the existing GPU SIMT scheduler does not provide stable, controllable latency. If I were building this, I would take over scheduling at the application layer: observe the DNN's patterns and set up FIFO affinity between stream processors and SMs, designing the scheduler based on Zhe Jia's NVIDIA microbenchmark results. REEF goes further than that. For real-time tasks REEF uses a FIFO, non-preemptive policy, which makes it easy to predict a real-time task's response latency: its own execution time plus the execution times of the tasks ahead of it in the queue. In most real-time applications latency predictability matters, and REEF ensures a real-time task always has a predictable response time. If we allowed concurrent execution of real-time tasks, this predictability could be violated.
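The predictability argument is just arithmetic over the FIFO queue; a tiny illustration (my own, not REEF's code):

```python
# With a non-preemptive FIFO queue, a newly submitted real-time kernel's
# response time is simply the work ahead of it plus its own execution time.
def predicted_response_us(queued_exec_times_us, own_exec_time_us):
    return sum(queued_exec_times_us) + own_exec_time_us

# Example: three kernels already queued (200us, 150us, 50us); the new kernel needs 100us.
assert predicted_response_us([200, 150, 50], 100) == 500
# With concurrent execution of real-time tasks this bound no longer holds,
# because SM interference makes each task's execution time itself unpredictable.
```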

Serverless

CAP-VM



KSplit: Automating Device Driver Isolation









The main point is the shared field analysis here.
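A toy version of what a shared field analysis computes (my own simplification, hypothetical field names): only the struct fields accessed by both the kernel and the driver need to be synchronized across the isolation boundary.

```python
def shared_fields(kernel_accesses, driver_accesses):
    """Fields accessed on both sides must be marshalled across the boundary;
    the rest can stay private to one domain."""
    return sorted(set(kernel_accesses) & set(driver_accesses))

# Hypothetical access sets for some device struct's fields:
kernel_side = {"state", "irq", "stats.rx_packets", "private_ptr"}
driver_side = {"state", "irq", "dma_addr"}
print(shared_fields(kernel_side, driver_side))   # ['irq', 'state']
```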




Orbit



Finds the optimal way to isolate a subprocess.






(Delegating objects in the kernel seems like a good technique.)

It really does work well. I wonder whether MySQL's deadlock-detection logic could be pulled out as a FaaS, with the memory read-only mapped over to check the state.

From Dynamic Loading to Extensible Transformation: An Infrastructure for Dynamic Library Transformation






Application-Informed Kernel Synchronization Primitives

Seems to be an even more impressive take than the earlier libASL: eBPF observation of the kernel plus live-patching the lock heuristics.





Managed Languages

UPGRADVISOR: Early Adopting Dependency Updates Using Production Traces




Dynamically traces the function diffs.




Casts bytecode jumps into hardware jumps.

ATC

Storage 1

ZNSwap: un-Block your Swap

Swap between DRAM and the underlying block device on a ZNS SSD.

Building a High-performance Fine-grained Deduplication Framework for Backup Storage with High Deduplication Ratio

Distributed System

uKharon: A Membership Service for Microsecond Applications







KRCore



ZCOT

PrivBox




eBPF? Non-intrusive Linux code / a different safety model.

Bugs

KSG: Augmenting Kernel Fuzzing with System Call Specification Generation













Uses eBPF and kprobes to collect type/constraint information and feed it to Syzkaller.
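The runtime side roughly boils down to: sample concrete argument values at syscall entry and fold them into per-argument constraints for the spec. A toy, eBPF-free sketch of that folding step (names and thresholds are mine; the real pipeline gets the samples from kprobes):

```python
from collections import defaultdict

observed = defaultdict(list)   # (syscall, arg_index) -> sampled values

def record(syscall, args):
    for i, v in enumerate(args):
        observed[(syscall, i)].append(v)

def infer_constraints():
    spec = {}
    for (syscall, i), values in observed.items():
        uniq = set(values)
        if len(uniq) <= 4:
            spec[(syscall, i)] = ("flags", sorted(uniq))       # small value set -> flag-like
        else:
            spec[(syscall, i)] = ("range", min(uniq), max(uniq))
    return spec

record("dup3", (3, 4, 0))
record("dup3", (5, 6, 0))
print(infer_constraints())
```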

Coverage can hardly reach 100% because interrupts are not simulated in hardware; what exists today is only syscall-triggered.

DLOS: Effective Static Detection of Deadlocks in OS Kernels
















In the end a Z3-based path verifier is still needed to filter the reports.
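That filtering step is essentially "are this path's branch conditions satisfiable?"; a tiny example with Z3's Python bindings (my own illustration, not DLOS code):

```python
from z3 import Solver, Int, sat

# A reported deadlock path is only real if its branch conditions can hold
# simultaneously. Here the path needs flag > 0 on one branch and flag == 0
# on another, so the report is a false positive and can be dropped.
flag = Int("flag")
s = Solver()
s.add(flag > 0)      # condition guarding the lock A -> lock B path
s.add(flag == 0)     # condition guarding the lock B -> lock A path
print("feasible" if s.check() == sat else "infeasible, drop the report")
```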



Modulo: Finding Convergence Failure Bugs in Distributed Systems with Divergence Resync Models

Also feeds path conditions derived from runtime state into a solver to look for bugs.















Disaggregated Systems

Sibylla: To Retry or Not To Retry on Deep Learning Job Failure

Disaggregation for DL GPU jobs: if a failure is predictable from the logs, the job can simply be killed and restarted. Counts as ML for systems, too.

Speculative Recovery: Cheap, Highly Available Fault Tolerance with Disaggregated Storage


Direct Access, High-Performance Memory Disaggregation with DirectCXL

CXL.mem disaggregation, focused on direct access and high performance. This is the first work to do CXL.mem disaggregation.

  • Prior work
    • RDMA (one-sided communication without making the remote CPU aware)
      • requires local pages to be registered in the memory translation table as an MR (Memory Region)
    • Swap, i.e. a page-based memory pool: the kernel swap daemon caches data locally at a finer granularity
    • KVS (PRISM-style), i.e. an object-based memory pool: these systems create two MRs on each side, host and memory node, one for buffering data and one for the submission/completion queues (SQ/CQ). Typically they keep a KV hash table whose entries point to the corresponding (remote) memory objects. Whenever an application issues a Put (or Get), the system places the value into the host's buffer MR and RDMA-writes the request into the remote SQ MR; since the memory node keeps polling the SQ MR, it picks up the request.
  • Connect device and host memory into one pool over CXL.
    • CXL devices and controllers. A CXL device can have multiple memory controllers; send/recv goes over PCIe lanes, and the device-side CXL controller parses the PCIe packets (PCIe flits) and does two things:
      • Converts their information (address and length) to DRAM requests
      • Serves them from the underlying DRAMs using the DRAM controller
    • Integrating devices into system memory. For one-sided communication there is a root port, plus endpoints for additional CXL devices. The host-side kernel driver first enumerates CXL devices by querying the size of their base address register (BAR) and of their internal memory, called host-managed device memory (HDM), through PCIe transactions. When the CPU touches HDM with load/store instructions, the request first goes through the root complex as with ordinary PCIe and is translated into CXL flits; since the mapped address spaces differ, this step only translates the HDM base address to the underlying DRAM (see the sketch after this list). The authors claim this is what makes it fast.
    • CXL network switch
    • Software runtime: direct memory access through /dev/directcxl, which is a minimized version of PMDK (a file system for crash consistency, or DAX for 64B atomicity)
  • Design: four PCIe slots, two devices (endpoint controller plus DRAM host/controller, both on the RISC-V ISA), and one switch speaking CXL flits; the protocol is CXL 2.0. ACPI information is integrated into the device tree for Linux and the driver, which covers MMIO and the CXL-reserved area (HDM). The baseline is RDMA memory pooling, i.e. a pre-allocated pool managed by a daemon that owns the RNIC, similar to a KVS (PRISM-style).
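A back-of-the-envelope model of that load path (purely illustrative; all constants and names are made up): the only translation on the way is rebasing the HDM address onto the device's DRAM.

```python
# Toy model of the HDM address translation described above (values invented).
HDM_BASE  = 0x4000_0000_0000   # where the driver mapped the device's HDM
HDM_SIZE  = 64 << 30           # 64 GiB exposed by the endpoint
DRAM_BASE = 0x0                # base of the device's underlying DRAM

def cpu_load(hdm_addr):
    """CPU load/store to HDM -> root complex -> CXL flit -> device DRAM request."""
    assert HDM_BASE <= hdm_addr < HDM_BASE + HDM_SIZE, "not an HDM address"
    dram_addr = hdm_addr - HDM_BASE + DRAM_BASE   # the only translation needed
    return ("cxl.mem read", dram_addr)

print(cpu_load(HDM_BASE + 0x1234))
```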


CXL here is access to a local device rather than going through a CXL-attached NIC, so the comparison feels rather unfair: DMA (PCIe on both sides + memory registration + network) naturally takes twice the CXL time. The paper claims the CXL device sits outside the host and can be accessed through multiple CXL switches.

The CXL load-latency CDF here shows roughly twice the latency. The different CPUs are labeled in order to examine the behavior across timings.

They also measured the gap between the KVS and DirectCXL; I think the difference is basically negligible, since this is the case where the remote side carries no load, and a NIC is still much more flexible. This part is just a bandwidth-oriented optimization.

Overall the prototype is fairly minimal: only memory-pooling disaggregation, and it does not seem to scale to 32 nodes either. It would be nice to build a simulator on Intel machines — Linux already has the support — and the paper would be stronger with support for existing devices (though the Korean team probably cannot get Intel pre-production hardware).

Security

SoftTRR


A tracer is added on the page tables to watch memory accesses; the benchmark uses deviation-1 and deviation-6 tests to evaluate robustness against Rowhammer.

Investigating Managed Language Runtime Performance





I feel like GCC's malloc shouldn't be this slow, and languages with GC should be a bit slower still. That said, this still counts as a fair comparison.

NVM

FlatFS

A flattened namespace architecture.

An index tree optimized for range queries.

Writes compressed keys.

The optimization here basically comes down to the path walk.
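A sketch of the flattened-namespace idea (not FlatFS's actual structures; names are mine): full paths are the keys of a single sorted index, so lookup is one probe instead of a component-by-component walk, and readdir becomes a range scan.

```python
import bisect

class FlatNamespace:
    """Toy flattened namespace: full paths are keys in one sorted index."""

    def __init__(self):
        self.keys = []          # sorted full paths
        self.inodes = {}        # path -> metadata

    def create(self, path, meta):
        if path not in self.inodes:
            bisect.insort(self.keys, path)
        self.inodes[path] = meta

    def lookup(self, path):
        # One index probe instead of walking /a, /a/b, /a/b/c, ...
        return self.inodes.get(path)

    def readdir(self, dirpath):
        # Listing a directory is a range scan over the prefix "dirpath/"
        # (a real readdir would additionally filter to direct children).
        prefix = dirpath.rstrip("/") + "/"
        lo = bisect.bisect_left(self.keys, prefix)
        out = []
        for k in self.keys[lo:]:
            if not k.startswith(prefix):
                break
            out.append(k)
        return out

fs = FlatNamespace()
fs.create("/a/b/c.txt", {"size": 1})
fs.create("/a/b/d.txt", {"size": 2})
fs.create("/a/z.txt", {"size": 3})
print(fs.readdir("/a/b"))   # ['/a/b/c.txt', '/a/b/d.txt']
```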

StRAID




Vinter



Yet another dynamic tracer plus static tester.





TestCase Auto Generation? No
Only Journal Layer? No
How to capture the Crash State?

AlNiCo: SmartNIC-accelerated Contention-aware Request Scheduling for Transaction Processing







Overhead is not introduced by the SmartNIC

FPGA NIC








Better than ARM-based SmartNICs because of the bandwidth.
Better than GPUDirect, which only maps registers and drives the SmartNIC; at that point the CPU becomes the boss and the parallelism is lost.

software enabled eBPF

Seems to be basically an optimizer? Reducing the memcpys from hXDP to device memory?










The context restoration unit is just a jump table. The program is analyzed into a PC automaton, and the Warp Engine does not stall here.


This is exactly where an FPGA can interpret instructions better than a CPU.
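The "jump table" remark is basically table-driven dispatch on the program counter; a tiny interpreter-style sketch (my own, not the Warp Engine's actual design):

```python
# Table-driven dispatch keyed by PC: each entry names the handler and the
# next PC, so execution is one table lookup per step with no decode logic --
# the kind of structure that maps well onto an FPGA pipeline.
def op_add(regs):   regs["r0"] += regs["r1"]
def op_shl(regs):   regs["r0"] <<= 1
def op_halt(regs):  regs["halt"] = True

# Hypothetical "PC automaton": pc -> (handler, next_pc)
program = {
    0: (op_add, 1),
    1: (op_shl, 2),
    2: (op_halt, None),
}

def run(regs, pc=0):
    while not regs.get("halt"):
        handler, next_pc = program[pc]   # one table lookup, no branch prediction
        handler(regs)
        pc = next_pc
    return regs

print(run({"r0": 1, "r1": 2}))   # {'r0': 6, 'r1': 2, 'halt': True}
```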


Crash Consistency

NVMe SSD Failures in the Field: the Fail-Stop and the Fail-Slow
















Feels like an exercise in probability theory. The fail-stop part is measured well, but for fail-slow, what is the point of a trace that only measures latency? You still need eBPF to find the root cause.

CacheSack












DynamoDB