Pointer Chase

Remote (heterogeneous) memory combined with a random-access memory workload leads to interleaved execution; this produces data dependencies (the pointer-chasing phenomenon) and poor reuse of the data loaded into the cache. To analyze how each of these two factors affects system performance, one can try sequential reads separately against the L1 cache, L2 cache, L3 cache, local memory, and remote (heterogeneous) memory, and compare them with a dependent-load traversal as sketched below.
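
Alongside that sequential-read baseline, the data-dependency factor itself can be measured with a pointer-chase loop. Below is a minimal sketch in Rust (my own illustration, not from the referenced papers): the buffer is shuffled into a single random cycle so every load depends on the previous one, and the nanoseconds per load expose the latency of whichever tier the working set fits in (L1/L2/L3/local DRAM, or far memory when run against it). The function name chase, the LCG constants, and the working-set sizes are assumptions.

use std::time::Instant;

// Dependent-load (pointer-chase) latency probe. Illustrative sketch only.
fn chase(n: usize, steps: usize) -> f64 {
    // Shuffle 0..n into a single cycle (Sattolo's algorithm) so the chase
    // visits every slot and the next index is never predictable.
    let mut next: Vec<usize> = (0..n).collect();
    let mut seed: u64 = 0x9E3779B97F4A7C15;
    for i in (1..n).rev() {
        seed = seed
            .wrapping_mul(6364136223846793005)
            .wrapping_add(1442695040888963407);
        let j = (seed % i as u64) as usize; // j in [0, i)
        next.swap(i, j);
    }
    let start = Instant::now();
    let mut idx = 0usize;
    for _ in 0..steps {
        idx = next[idx]; // data-dependent load: cannot be overlapped or prefetched
    }
    std::hint::black_box(idx); // keep the loop from being optimized away
    start.elapsed().as_nanos() as f64 / steps as f64 // ns per load
}

fn main() {
    // Working sets sized roughly for L1 / L2 / L3 / DRAM on a typical server.
    for &bytes in &[32usize << 10, 256 << 10, 16 << 20, 1 << 30] {
        let n = bytes / std::mem::size_of::<usize>();
        println!("{:>11} B: {:.1} ns/load", bytes, chase(n, 10_000_000));
    }
}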

Pointer chasing is a technique that can be exploited to find the next block in remote memory or to track free blocks; the inode is the best example.

Applications

  1. It has recently shown up several times in reverse-engineering write-ups on Optane memory, suggesting that MMU information can be leveraged through pointer chasing.
  2. The data-dependency property can be exploited to implement logic in a handler, for example transferring a data structure or raising an OOM exception.

Reference

  1. https://www.ssrc.ucsc.edu/media/pubs/329b041d657e2c2225aa68fb33e72ecca157e6df.pdf
  2. https://arxiv.org/pdf/2204.03289.pdf

NFSlicer: Data Movement Optimization for Shallow Network Functions

I have been thinking about SmartNIC network latency lately and noticed that Alex has published this paper, which is roughly NSDI'23-level work. NFs (network functions) are abstractions of things like firewalls and load balancers; their semantics are recorded on the SmartNIC, which also performs some offload operations. Ang Chen has done a lot of work in this area.

Traditional NICs only perform hardwired operations such as TSO/LRO/checksum offload. With a SmartNIC, the natural next step is to abstract these operations into network functions, and by now the race has become: whoever can push more serverless/ML functions onto the SmartNIC gets the top-tier paper.

NFSlicer uses a SmartNIC + NIC offload pipeline to drive the latency of slice & splice lower, but this comes with a condition: the vast majority of NFs must be shallow, otherwise the pipeline's tail latency becomes extremely high.
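
As a rough illustration of what slice & splice means here (my own sketch, not the NFSlicer implementation), the idea is that a shallow NF only needs the packet header, so the NIC can "slice" the payload off and park it locally, hand only the header to the NF, and "splice" the payload back on transmit. HDR_LEN, the token scheme, and the Slicer type below are assumptions.

use std::collections::HashMap;

// Bytes of header a shallow NF (firewall, load balancer, ...) needs to see.
const HDR_LEN: usize = 128;

struct Slicer {
    next_token: u64,
    stashed_payloads: HashMap<u64, Vec<u8>>, // payloads parked on the NIC
}

impl Slicer {
    fn new() -> Self {
        Slicer { next_token: 0, stashed_payloads: HashMap::new() }
    }

    // Slice: keep the payload on the NIC, forward only the header plus a token.
    fn slice(&mut self, frame: &[u8]) -> (u64, Vec<u8>) {
        let cut = frame.len().min(HDR_LEN);
        let token = self.next_token;
        self.next_token += 1;
        self.stashed_payloads.insert(token, frame[cut..].to_vec());
        (token, frame[..cut].to_vec())
    }

    // Splice: after the NF has (possibly) rewritten the header, reattach the
    // parked payload before the frame goes back on the wire.
    fn splice(&mut self, token: u64, header: Vec<u8>) -> Option<Vec<u8>> {
        let payload = self.stashed_payloads.remove(&token)?;
        let mut out = header;
        out.extend_from_slice(&payload);
        Some(out)
    }
}

fn main() {
    let mut nic = Slicer::new();
    let frame = vec![0u8; 1500];
    let (token, header) = nic.slice(&frame);
    // ... a shallow NF inspects/rewrites `header` only; payload bytes never
    // cross the DMA path to the host ...
    let out = nic.splice(token, header).unwrap();
    assert_eq!(out.len(), 1500);
}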

They argue that, with CAT + DDIO in use, the private caches, LLC, and memory are still not the bottleneck; the focus is on the pipeline data path and its latency (TTL) below.

Reference

  1. AlNiCo: SmartNIC-accelerated Contention-aware Request Scheduling for Transaction Processing
  2. Automated SmartNIC Offloading Insights for Network Functions
  3. https://borispis.github.io/files/2022-nicmem-slides.pdf

Rust pitfall of bindgen with Variable-Length Data Structures

I was writing a Rust version of tokio-rs/io-uring together with @LemonHX. First I tried the official windows-rs, attempting to port the Nt API generated from ntdll, but that seemed to duplicate the effort of simply running bindgen against the C API. So instead I ran bindgen against libwinring, which generated bindings containing a variable-length data structure (VLDS).

Here is the original struct; the Flags field should not be 0x08 bytes in size. I don't know whether that is a bug in the struct dump or something else.

typedef struct _NT_IORING_SUBMISSION_QUEUE
{
    /* 0x0000 */ uint32_t Head;
    /* 0x0004 */ uint32_t Tail;
    /* 0x0008 */ NT_IORING_SQ_FLAGS Flags; /*should be i32 */
    /* 0x0010 */ NT_IORING_SQE Entries[];
} NT_IORING_SUBMISSION_QUEUE, * PNT_IORING_SUBMISSION_QUEUE; /* size: 0x0010 */
static_assert (sizeof (NT_IORING_SUBMISSION_QUEUE) == 0x0010, "");

The above struct should be aligned as in the generated binding below.

Generated struct

#[repr(C)]
#[derive(Default, Clone, Copy)]
pub struct __IncompleteArrayField<T>(::std::marker::PhantomData<T>, [T; 0]);
impl<T> __IncompleteArrayField<T> {
    #[inline]
    pub const fn new() -> Self {
        __IncompleteArrayField(::std::marker::PhantomData, [])
    }
    #[inline]
    pub fn as_ptr(&self) -> *const T {
        self as *const _ as *const T
    }
    #[inline]
    pub fn as_mut_ptr(&mut self) -> *mut T {
        self as *mut _ as *mut T
    }
    #[inline]
    pub unsafe fn as_slice(&self, len: usize) -> &[T] {
        ::std::slice::from_raw_parts(self.as_ptr(), len)
    }
    #[inline]
    pub unsafe fn as_mut_slice(&mut self, len: usize) -> &mut [T] {
        ::std::slice::from_raw_parts_mut(self.as_mut_ptr(), len)
    }
}
#[repr(C)]
#[derive(Clone, Copy)]
pub struct _NT_IORING_SUBMISSION_QUEUE {
    pub Head: u32,
    pub Tail: u32,
    pub Flags: NT_IORING_SQ_FLAGS,
    pub Entries: __IncompleteArrayField<NT_IORING_SQE>,
}

The implemented __IncompleteArrayField looks right in its semantics of converting between slices and pointers. However, when I called the NtSubmitIoRing API, the data returned inside the Entries field was garbage: the same random result no matter how far from Head I placed the field.
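
Nothing in the generated binding verifies that the Rust-side layout actually matches the offsets in the dump, which is where this kind of pitfall bites: if the Flags type has a different size or alignment than the kernel assumes, Entries lands at the wrong offset and everything read through it is garbage. Below is a minimal sketch (my own, not part of the project) of asserting the dump's offsets with std::mem::offset_of! (stable since Rust 1.77); the NT_IORING_SQE placeholder and the explicit _pad field are assumptions for illustration.

#![allow(non_camel_case_types, non_snake_case, dead_code)]
use std::mem::{offset_of, size_of};

// Stand-in for the bindgen enum type (i32). The dump's offsets imply the
// field occupies 8 bytes in the kernel's layout, which would explain the
// mismatch, but that is only a guess here.
#[repr(C)]
pub struct NT_IORING_SQ_FLAGS(pub i32);

// Placeholder; the real SQE fields are omitted.
#[repr(C)]
pub struct NT_IORING_SQE {}

#[repr(C)]
pub struct _NT_IORING_SUBMISSION_QUEUE {
    pub Head: u32,                   // 0x0000
    pub Tail: u32,                   // 0x0004
    pub Flags: NT_IORING_SQ_FLAGS,   // 0x0008
    pub _pad: u32,                   // one way to pin Entries at 0x0010 (assumption)
    pub Entries: [NT_IORING_SQE; 0], // 0x0010, variable length
}

fn main() {
    // If these asserts fail, the binding's layout disagrees with the dump and
    // any data read through Entries will be misaligned garbage.
    assert_eq!(offset_of!(_NT_IORING_SUBMISSION_QUEUE, Entries), 0x0010);
    assert_eq!(size_of::<_NT_IORING_SUBMISSION_QUEUE>(), 0x0010);
}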

TMO: Transparent Memory Offloading in Datacenters

Both of these papers are from Dimitrios.

Memory offloading

Because memory consumption on a single node is huge, we are required to offload part of it into far memory.

They have to model what the memory footprint looks like. As shown in the earlier zswap-based work, that system has only a single slow memory tier backed by compressed memory and relies on offline application profiling, whose sole metric is the page-promotion rate.

Transparent memory offloading

The memory tax can be triggered by infrastructure-level functions such as packaging, logging, and profiling, and by microservices such as routing and proxying. This memory tax, being less constrained by application SLAs, is the primary target of offloading.

TMO essentially looks at the resulting performance signals, namely pressure stall information (PSI), to predict how much memory to offload.

They then use PSI tracking to limit memory/IO/CPU usage per cgroup; the agent that drives this is called Senpai.
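
For intuition, here is a minimal sketch (my own, assuming cgroup v2 with PSI enabled; not Meta's actual Senpai code) of such a control loop: read the cgroup's memory.pressure, and keep nudging memory.high down while the stall ratio stays below a small target, so that cold pages get reclaimed toward the offload backend. The cgroup path, the 0.1% pressure target, and the 1% step size are illustrative assumptions.

use std::fs;
use std::thread::sleep;
use std::time::Duration;

const CGROUP: &str = "/sys/fs/cgroup/workload"; // hypothetical target cgroup
const TARGET_SOME_AVG10: f64 = 0.1;             // tolerate ~0.1% stall over 10s

// Parse the "some avg10=..." value out of a PSI file, e.g.
// "some avg10=0.12 avg60=0.08 avg300=0.05 total=123456".
fn some_avg10(pressure: &str) -> Option<f64> {
    pressure
        .lines()
        .find(|l| l.starts_with("some"))?
        .split_whitespace()
        .find_map(|f| f.strip_prefix("avg10="))?
        .parse()
        .ok()
}

fn main() -> std::io::Result<()> {
    loop {
        let pressure = fs::read_to_string(format!("{CGROUP}/memory.pressure"))?;
        let current: u64 = fs::read_to_string(format!("{CGROUP}/memory.current"))?
            .trim()
            .parse()
            .unwrap_or(0);
        if let Some(avg10) = some_avg10(&pressure) {
            // Below the pressure target: squeeze memory.high by ~1% so reclaim
            // kicks in; above it: back off and let the workload breathe.
            let new_high = if avg10 < TARGET_SOME_AVG10 {
                current.saturating_sub(current / 100)
            } else {
                current + current / 100
            };
            fs::write(format!("{CGROUP}/memory.high"), new_high.to_string())?;
        }
        sleep(Duration::from_secs(10));
    }
}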

IOCost reclaims infrequently used pages to the SSD.

Reference

  1. Jing Liu's blog
  2. Software-Defined Far Memory in Warehouse-Scale Computers
  3. Cerebros: Evading the RPC Tax in Datacenters
  4. Beyond malloc efficiency to fleet efficiency: a hugepage-aware memory allocator