OSDI22 - vickieGPT’s blog

October 19, 2022 (updated October 22, 2022)

Achieving 100Gbps Intrusion Prevention on a Single Server @OSDI '20

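This paper offloads intrusion detection to an FPGA, balancing performance against power efficiency. The first insight is that TCP processing above 100Gbps has to be computed on chip: load balancing, energy load, security, and the like are all too slow when computed on the CPU or in PIM, hence the near-NIC computation. Exchanges have a similar scenario, where simple logic needs to be changed while network packets are being sent; using a SmartNIC there can greatly reduce latency while preserving volatility. The second insight is using Intel Hyperscan to accelerate IDS/IPS matching in software as far as possible.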

The third insight is hardware scheduling that optimizes three operations: regex rule matching, TCP reassembly, and other tasks.

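The full matcher polls a ring buffer filled by the DMA engine. Each packet carries metadata, including the IDs of the rules that the MSPM identified as partial matches (the second half of Hyperscan). For each rule ID, the full matcher retrieves the complete rule (including its regular expression) and checks whether the packet fully matches.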
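The full-matcher loop described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `RULES` table, the `(payload, candidate_ids)` tuple layout, and the `full_match` function are hypothetical names, and a `deque` stands in for the DMA-filled ring buffer.

```python
import re
from collections import deque

# Hypothetical rule table: rule ID -> complete rule (here just a compiled regex).
RULES = {
    1: re.compile(rb"GET /admin"),
    2: re.compile(rb"cmd\.exe"),
}

def full_match(ring, rules=RULES):
    """Drain the ring and return (packet_index, rule_id) for each confirmed match."""
    confirmed = []
    index = 0
    while ring:                          # poll until the ring is empty
        payload, candidate_ids = ring.popleft()
        for rule_id in candidate_ids:    # IDs flagged as partial matches upstream
            rule = rules[rule_id]        # retrieve the complete rule
            if rule.search(payload):     # check for a full match
                confirmed.append((index, rule_id))
        index += 1
    return confirmed

# Each entry is (payload, rule IDs reported as partial matches).
ring = deque([
    (b"GET /admin HTTP/1.1", [1, 2]),
    (b"GET /index.html", [1]),
])
```

Only the rules listed in a packet's metadata are evaluated, which is why the expensive regex check runs on a small fraction of the rule set.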

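The TCP reassembler is an out-of-order (OoO) design. Packets first go to the fast path, then to a cuckoo hash table in BRAM (the flow table); an insertion table compensates for the different execution times of the OoO engines.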
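For readers unfamiliar with cuckoo hashing, here is a minimal software sketch of a two-table cuckoo flow table. It is not the paper's hardware design: the class name, table size, kick limit, and the use of Python's `hash()` with a salt are all assumptions made for illustration.

```python
class CuckooTable:
    """Two-table cuckoo hash: every key has exactly one candidate slot per table."""

    def __init__(self, size=64, max_kicks=16):
        self.size = size
        self.max_kicks = max_kicks
        self.tables = [[None] * size, [None] * size]

    def _slot(self, which, key):
        # Two hash functions, derived by salting Python's hash() (an assumption).
        return hash((which, key)) % self.size

    def lookup(self, key):
        # At most two probes, which is what makes the scheme hardware friendly.
        for which in (0, 1):
            entry = self.tables[which][self._slot(which, key)]
            if entry is not None and entry[0] == key:
                return entry[1]
        return None

    def insert(self, key, value):
        """Return None on success, or the displaced (key, value) left without a slot."""
        for which in (0, 1):             # update in place if the key exists
            slot = self._slot(which, key)
            entry = self.tables[which][slot]
            if entry is not None and entry[0] == key:
                self.tables[which][slot] = (key, value)
                return None
        item, which = (key, value), 0
        for _ in range(self.max_kicks):
            slot = self._slot(which, item[0])
            # Place the item and pick up whatever occupied the slot before.
            item, self.tables[which][slot] = self.tables[which][slot], item
            if item is None:
                return None
            which ^= 1                   # the evicted entry tries the other table
        return item                      # caller must rehash or use a slow path
```

Lookups are bounded at two probes, while inserts may trigger a variable-length chain of evictions, which is one plausible reason to handle insertions separately from the lookup path.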

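To reduce hash table lookups in the FPSM, they also built in a SIMD shift-or matcher (which presumably will not be faster than what can be written for a commercial Intel FPU). (Still, that is a lot of logic to cram onto an FPGA.)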
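As background, the classic shift-or algorithm is a bit-parallel exact string matcher. The sketch below is the textbook scalar version, not the paper's SIMD variant; the function name `shift_or_search` is an assumption.

```python
def shift_or_search(pattern: bytes, text: bytes):
    """Return the start offsets of every occurrence of pattern in text."""
    m = len(pattern)
    if m == 0:
        return []
    all_ones = (1 << m) - 1
    # masks[c] has bit i cleared exactly where pattern[i] == c.
    masks = {}
    for i, c in enumerate(pattern):
        masks[c] = masks.get(c, all_ones) & ~(1 << i)
    # Bit i of state is 0 iff pattern[0..i] matches the text ending here.
    state = all_ones
    hits = []
    for pos, c in enumerate(text):
        state = ((state << 1) | masks.get(c, all_ones)) & all_ones
        if state & (1 << (m - 1)) == 0:  # the whole pattern matched
            hits.append(pos - m + 1)
    return hits
```

Each input byte costs one shift, one OR, and one table read, with no hashing involved, so the matcher can act as a cheap filter in front of a hash table.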

August 3, 2022 (updated September 17, 2022)

TMO: Transparent Memory Offloading in datacenters

Both of the papers are from Dimitrios

Memory offloading

Because the memory footprint on a single node is huge, data has to be offloaded to far memory.

They have to model what the memory footprint looks like. The previous work, built on zswap, has only a single slow memory tier (compressed memory) and relies on offline application profiling, whose only metric is the page-promotion rate.

Transparent memory offloading

Memory tax can be triggered by infrastructure-level functions such as packaging, logging, and profiling, and by microservices such as routing and proxying. The primary target of offloading is the memory tax SLA.

TMO essentially looks at the resulting performance signals, such as pressure stall information (PSI), to decide how much memory to offload.

They then use PSI tracking to limit memory/IO/CPU through cgroups; this component is called Senpai.
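A rough sketch of such a PSI-driven control step follows. It assumes the PSI text format exposed by Linux (for example in a cgroup v2 `memory.pressure` file). The control law, the function names, and the default threshold and ratio are illustrative assumptions, not Senpai's actual code or tuning; the sketch only computes an amount and does not write to any cgroup file.

```python
def parse_psi(text: str) -> dict:
    """Parse PSI lines such as 'some avg10=0.05 avg60=0.00 avg300=0.00 total=123'."""
    result = {}
    for line in text.strip().splitlines():
        kind, *fields = line.split()
        result[kind] = {k: float(v) for k, v in (f.split("=") for f in fields)}
    return result

def reclaim_bytes(current_bytes, psi_some, psi_threshold=0.1, reclaim_ratio=0.0005):
    """Hypothetical control law: reclaim less as pressure approaches the threshold,
    and nothing once pressure reaches or exceeds it."""
    headroom = max(0.0, 1.0 - psi_some / psi_threshold)
    return int(current_bytes * reclaim_ratio * headroom)

sample = (
    "some avg10=0.05 avg60=0.01 avg300=0.00 total=4242\n"
    "full avg10=0.00 avg60=0.00 avg300=0.00 total=17\n"
)
psi = parse_psi(sample)
step = reclaim_bytes(8 * 2**30, psi["some"]["avg10"])
```

Run periodically, a loop like this backs off on its own as soon as the workload starts stalling on memory, which is the point of using PSI as feedback instead of a fixed offline target.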

IOCost reclaims infrequently used pages to SSD.

Reference

Jing Liu's blog
Software-Defined Far Memory in Warehouse-Scale Computers
Cerebros: Evading the RPC Tax in Datacenters
Beyond malloc efficiency to fleet efficiency: a hugepage-aware memory allocator

July 29, 2022 (updated August 9, 2022)

Application-Informed Kernel Synchronization Primitives

A lock primitive guided by runtime dynamic profiling/loading that can handle both the high-contention and the no-contention case.


July 18, 2022 (updated September 14, 2022)

IOring Windows at first sight and migration to `monoio`

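Recently I have been writing a cross-platform downloader with LemonHX. We wanted a coroutine scheduler with deterministic latency, so we settled on monoio, open-sourced by ByteDance, and plan to contribute the Windows part.

The main parts that need cross-platform abstraction are already written, and GATs have just landed in mainline, so contributing this feels like the more economical choice. ByteDance does not seem to have started using it internally either; they have only run some tests.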

July 18, 2022 (updated August 5, 2022)

Twizzler: a Data-Centric OS for Non-Volatile Memory

A distributed, NVM-modeled OS runtime for arranging NVM data structures that:

Does not require explicit (un)loading.
Needs less serialization (context-aware persistent pointers reduce memcpy).
Has basic support for security, sharing, and crash atomicity.
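To make the persistent-pointer point concrete, here is a toy sketch of the general idea: a pointer stored as (index into a per-object table of referenced objects, offset) stays valid wherever the objects are mapped, so data can be shared without serializing it. Every name here (`PObject`, `PPtr`, `resolve`) is hypothetical, and this is not Twizzler's API.

```python
from dataclasses import dataclass, field

@dataclass
class PObject:
    obj_id: str
    data: dict = field(default_factory=dict)     # offset -> stored value
    foreign: list = field(default_factory=list)  # slot -> ID of a referenced object

@dataclass(frozen=True)
class PPtr:
    foreign_index: int  # 0 means "inside the holder object itself"
    offset: int

def resolve(store, holder, ptr):
    """Follow a persistent pointer held by `holder`, using only IDs and offsets."""
    if ptr.foreign_index == 0:
        target_id = holder.obj_id
    else:
        target_id = holder.foreign[ptr.foreign_index - 1]
    return store[target_id].data[ptr.offset]

# Object "a" points at offset 16 of object "b" without any virtual address.
b = PObject("b", data={16: "payload"})
a = PObject("a", data={0: PPtr(1, 16)}, foreign=["b"])
store = {"a": a, "b": b}
value = resolve(store, a, a.data[0])
```

Because the pointer holds no virtual address, the same bytes remain meaningful after a restart or on another machine, which is where the reduction in serialization and copying comes from.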


June 10, 2022 (updated September 14, 2022)

First-generation Memory Disaggregation for Cloud Platforms @Arxiv

Why CXL disaggregation:

Memory inefficiency: platform-level memory stranding.
Constraints on cloud vendors' current attempts at memory disaggregation: the system must require no modifications to customer workloads or the guest OS, must be compatible with virtualization acceleration techniques, and must be available as commodity hardware.
