Pointer Chase

Remote (heterogeneous) memory combined with a random-access memory workload leads to interleaved execution; this produces data dependencies (the pointer-chasing phenomenon) and poor reuse of the data loaded into the cache. To analyze how each of these two factors affects system performance, one can try sequential reads separately against the L1 cache, L2 cache, L3 cache, local memory, and remote (heterogeneous) memory, and compare them with a dependent-load traversal as sketched below.
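
Alongside that sequential-read baseline, the data-dependency factor itself can be measured with a pointer-chase loop. Below is a minimal sketch in Rust (my own illustration, not from the referenced papers): the buffer is shuffled into a single random cycle so every load depends on the previous one, and the nanoseconds per load expose the latency of whichever tier the working set fits in (L1/L2/L3/local DRAM, or far memory when run against it). The function name chase, the LCG constants, and the working-set sizes are assumptions.

use std::time::Instant;

// Dependent-load (pointer-chase) latency probe. Illustrative sketch only.
fn chase(n: usize, steps: usize) -> f64 {
    // Shuffle 0..n into a single cycle (Sattolo's algorithm) so the chase
    // visits every slot and the next index is never predictable.
    let mut next: Vec<usize> = (0..n).collect();
    let mut seed: u64 = 0x9E3779B97F4A7C15;
    for i in (1..n).rev() {
        seed = seed
            .wrapping_mul(6364136223846793005)
            .wrapping_add(1442695040888963407);
        let j = (seed % i as u64) as usize; // j in [0, i)
        next.swap(i, j);
    }
    let start = Instant::now();
    let mut idx = 0usize;
    for _ in 0..steps {
        idx = next[idx]; // data-dependent load: cannot be overlapped or prefetched
    }
    std::hint::black_box(idx); // keep the loop from being optimized away
    start.elapsed().as_nanos() as f64 / steps as f64 // ns per load
}

fn main() {
    // Working sets sized roughly for L1 / L2 / L3 / DRAM on a typical server.
    for &bytes in &[32usize << 10, 256 << 10, 16 << 20, 1 << 30] {
        let n = bytes / std::mem::size_of::<usize>();
        println!("{:>11} B: {:.1} ns/load", bytes, chase(n, 10_000_000));
    }
}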

Pointer chasing is a technique that can be exploited to find the next block in remote memory or to track free blocks; the inode is the best example.

Applications

  1. It has recently shown up several times in reverse-engineering write-ups on Optane memory, suggesting that MMU information can be leveraged through pointer chasing.
  2. The data-dependency property can be exploited to implement logic in a handler, for example transferring a data structure or raising an OOM exception.

Reference

  1. https://www.ssrc.ucsc.edu/media/pubs/329b041d657e2c2225aa68fb33e72ecca157e6df.pdf
  2. https://arxiv.org/pdf/2204.03289.pdf

NFSlicer: Data Movement Optimization for Shallow Network Functions

I have been thinking about SmartNIC network latency lately and noticed that Alex has published this paper, which is roughly NSDI'23-level work. NFs (network functions) are abstractions of things like firewalls and load balancers; their semantics are recorded on the SmartNIC, which also performs some offload operations. Ang Chen has done a lot of work in this area.

Traditional NICs only perform hardwired operations such as TSO/LRO/checksum offload. With a SmartNIC, the natural next step is to abstract these operations into network functions, and by now the race has become: whoever can push more serverless/ML functions onto the SmartNIC gets the top-tier paper.

NFSlicer uses a SmartNIC + NIC offload pipeline to drive the latency of slice & splice lower, but this comes with a condition: the vast majority of NFs must be shallow, otherwise the pipeline's tail latency becomes extremely high.
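
As a rough illustration of what slice & splice means here (my own sketch, not the NFSlicer implementation), the idea is that a shallow NF only needs the packet header, so the NIC can "slice" the payload off and park it locally, hand only the header to the NF, and "splice" the payload back on transmit. HDR_LEN, the token scheme, and the Slicer type below are assumptions.

use std::collections::HashMap;

// Bytes of header a shallow NF (firewall, load balancer, ...) needs to see.
const HDR_LEN: usize = 128;

struct Slicer {
    next_token: u64,
    stashed_payloads: HashMap<u64, Vec<u8>>, // payloads parked on the NIC
}

impl Slicer {
    fn new() -> Self {
        Slicer { next_token: 0, stashed_payloads: HashMap::new() }
    }

    // Slice: keep the payload on the NIC, forward only the header plus a token.
    fn slice(&mut self, frame: &[u8]) -> (u64, Vec<u8>) {
        let cut = frame.len().min(HDR_LEN);
        let token = self.next_token;
        self.next_token += 1;
        self.stashed_payloads.insert(token, frame[cut..].to_vec());
        (token, frame[..cut].to_vec())
    }

    // Splice: after the NF has (possibly) rewritten the header, reattach the
    // parked payload before the frame goes back on the wire.
    fn splice(&mut self, token: u64, header: Vec<u8>) -> Option<Vec<u8>> {
        let payload = self.stashed_payloads.remove(&token)?;
        let mut out = header;
        out.extend_from_slice(&payload);
        Some(out)
    }
}

fn main() {
    let mut nic = Slicer::new();
    let frame = vec![0u8; 1500];
    let (token, header) = nic.slice(&frame);
    // ... a shallow NF inspects/rewrites `header` only; payload bytes never
    // cross the DMA path to the host ...
    let out = nic.splice(token, header).unwrap();
    assert_eq!(out.len(), 1500);
}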

They argue that, with CAT + DDIO in use, the private caches, LLC, and memory are still not the bottleneck; the focus is on the pipeline data path and its latency (TTL) below.

Reference

  1. AlNiCo: SmartNIC-accelerated Contention-aware Request Scheduling for Transaction Processing
  2. Automated SmartNIC Offloading Insights for Network Functions
  3. https://borispis.github.io/files/2022-nicmem-slides.pdf

Rust pitfall of bindgen with Variable-Length Data Structures

I was writing a Rust version of tokio-rs/io-uring together with @LemonHX. First I tried the official windows-rs, attempting to port the Nt API generated from ntdll, but that seemed to duplicate the effort of simply running bindgen against the C API. So instead I ran bindgen against libwinring, which generated bindings containing a variable-length data structure (VLDS).

Here is the original struct; the Flags field should not be 0x08 bytes in size. I don't know whether that is a bug in the struct dump or something else.

typedef struct _NT_IORING_SUBMISSION_QUEUE
{
    /* 0x0000 */ uint32_t Head;
    /* 0x0004 */ uint32_t Tail;
    /* 0x0008 */ NT_IORING_SQ_FLAGS Flags; /*should be i32 */
    /* 0x0010 */ NT_IORING_SQE Entries[];
} NT_IORING_SUBMISSION_QUEUE, * PNT_IORING_SUBMISSION_QUEUE; /* size: 0x0010 */
static_assert (sizeof (NT_IORING_SUBMISSION_QUEUE) == 0x0010, "");

The above struct should be aligned as in the generated binding below.

Generated struct

#[repr(C)]
#[derive(Default, Clone, Copy)]
pub struct __IncompleteArrayField<T>(::std::marker::PhantomData<T>, [T; 0]);
impl<T> __IncompleteArrayField<T> {
    #[inline]
    pub const fn new() -> Self {
        __IncompleteArrayField(::std::marker::PhantomData, [])
    }
    #[inline]
    pub fn as_ptr(&self) -> *const T {
        self as *const _ as *const T
    }
    #[inline]
    pub fn as_mut_ptr(&mut self) -> *mut T {
        self as *mut _ as *mut T
    }
    #[inline]
    pub unsafe fn as_slice(&self, len: usize) -> &[T] {
        ::std::slice::from_raw_parts(self.as_ptr(), len)
    }
    #[inline]
    pub unsafe fn as_mut_slice(&mut self, len: usize) -> &mut [T] {
        ::std::slice::from_raw_parts_mut(self.as_mut_ptr(), len)
    }
}
#[repr(C)]
#[derive(Clone, Copy)]
pub struct _NT_IORING_SUBMISSION_QUEUE {
    pub Head: u32,
    pub Tail: u32,
    pub Flags: NT_IORING_SQ_FLAGS,
    pub Entries: __IncompleteArrayField<NT_IORING_SQE>,
}

The implemented __IncompleteArrayField looks right in its semantics of converting between slices and pointers. However, when I called the NtSubmitIoRing API, the data returned inside the Entries field was garbage: the same random result no matter how far from Head I placed the field.
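
Nothing in the generated binding verifies that the Rust-side layout actually matches the offsets in the dump, which is where this kind of pitfall bites: if the Flags type has a different size or alignment than the kernel assumes, Entries lands at the wrong offset and everything read through it is garbage. Below is a minimal sketch (my own, not part of the project) of asserting the dump's offsets with std::mem::offset_of! (stable since Rust 1.77); the NT_IORING_SQE placeholder and the explicit _pad field are assumptions for illustration.

#![allow(non_camel_case_types, non_snake_case, dead_code)]
use std::mem::{offset_of, size_of};

// Stand-in for the bindgen enum type (i32). The dump's offsets imply the
// field occupies 8 bytes in the kernel's layout, which would explain the
// mismatch, but that is only a guess here.
#[repr(C)]
pub struct NT_IORING_SQ_FLAGS(pub i32);

// Placeholder; the real SQE fields are omitted.
#[repr(C)]
pub struct NT_IORING_SQE {}

#[repr(C)]
pub struct _NT_IORING_SUBMISSION_QUEUE {
    pub Head: u32,                   // 0x0000
    pub Tail: u32,                   // 0x0004
    pub Flags: NT_IORING_SQ_FLAGS,   // 0x0008
    pub _pad: u32,                   // one way to pin Entries at 0x0010 (assumption)
    pub Entries: [NT_IORING_SQE; 0], // 0x0010, variable length
}

fn main() {
    // If these asserts fail, the binding's layout disagrees with the dump and
    // any data read through Entries will be misaligned garbage.
    assert_eq!(offset_of!(_NT_IORING_SUBMISSION_QUEUE, Entries), 0x0010);
    assert_eq!(size_of::<_NT_IORING_SUBMISSION_QUEUE>(), 0x0010);
}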

TMO: Transparent Memory Offloading in Datacenters

Both of these papers are from Dimitrios.

Memory offloading

Because memory consumption on a single node is huge, we are required to offload part of it into far memory.

They have to model what the memory footprint looks like. As shown in the earlier zswap-based work, that system has only a single slow memory tier backed by compressed memory and relies on offline application profiling, whose sole metric is the page-promotion rate.

Transparent memory offloading

The memory tax can be triggered by infrastructure-level functions such as packaging, logging, and profiling, and by microservices such as routing and proxying. This memory tax, being less constrained by application SLAs, is the primary target of offloading.

TMO essentially looks at the resulting performance signals, namely pressure stall information (PSI), to predict how much memory to offload.

They then use PSI tracking to limit memory/IO/CPU usage per cgroup; the agent that drives this is called Senpai.
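
For intuition, here is a minimal sketch (my own, assuming cgroup v2 with PSI enabled; not Meta's actual Senpai code) of such a control loop: read the cgroup's memory.pressure, and keep nudging memory.high down while the stall ratio stays below a small target, so that cold pages get reclaimed toward the offload backend. The cgroup path, the 0.1% pressure target, and the 1% step size are illustrative assumptions.

use std::fs;
use std::thread::sleep;
use std::time::Duration;

const CGROUP: &str = "/sys/fs/cgroup/workload"; // hypothetical target cgroup
const TARGET_SOME_AVG10: f64 = 0.1;             // tolerate ~0.1% stall over 10s

// Parse the "some avg10=..." value out of a PSI file, e.g.
// "some avg10=0.12 avg60=0.08 avg300=0.05 total=123456".
fn some_avg10(pressure: &str) -> Option<f64> {
    pressure
        .lines()
        .find(|l| l.starts_with("some"))?
        .split_whitespace()
        .find_map(|f| f.strip_prefix("avg10="))?
        .parse()
        .ok()
}

fn main() -> std::io::Result<()> {
    loop {
        let pressure = fs::read_to_string(format!("{CGROUP}/memory.pressure"))?;
        let current: u64 = fs::read_to_string(format!("{CGROUP}/memory.current"))?
            .trim()
            .parse()
            .unwrap_or(0);
        if let Some(avg10) = some_avg10(&pressure) {
            // Below the pressure target: squeeze memory.high by ~1% so reclaim
            // kicks in; above it: back off and let the workload breathe.
            let new_high = if avg10 < TARGET_SOME_AVG10 {
                current.saturating_sub(current / 100)
            } else {
                current + current / 100
            };
            fs::write(format!("{CGROUP}/memory.high"), new_high.to_string())?;
        }
        sleep(Duration::from_secs(10));
    }
}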

IOCost reclaims infrequently used pages to the SSD.

Reference

  1. Jing Liu's blog
  2. Software-Defined Far Memory in Warehouse-Scale Computers
  3. Cerebros: Evading the RPC Tax in Datacenters
  4. Beyond malloc efficiency to fleet efficiency: a hugepage-aware memory allocator