NFSlicer: Data Movement Optimization for Shallow Network Functions

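I have been thinking about SmartNIC latency lately and found that Alex had published this paper, which is roughly at the level of NSDI '23. A network function (NF) is an abstraction for functions such as a firewall or load balancer. The semantics of these functions are recorded on the SmartNIC, which also performs some offloading. Ang Chen has done a lot of work in this area.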

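Traditional NICs only perform hardwired offloads such as TSO, LRO, and checksumming. With a SmartNIC, the natural first step is to abstract these operations as network functions. The competition has since evolved to the point where whoever can fit more serverless/ML functions onto a SmartNIC gets the top-conference paper.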

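The paper uses a SmartNIC + NIC offload pipeline to perform slice & splice at lower latency, but with one condition: the vast majority of NFs must be shallow, otherwise the tail latency of the pipeline becomes extremely high.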
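To make the slice & splice idea concrete, here is a minimal sketch. It assumes that "slice" means splitting a packet into header and payload, that a shallow NF touches only the header, and that "splice" rejoins the two afterwards. The names `slice`, `splice`, and `decrement_ttl`, and the fixed TTL offset, are hypothetical and not taken from the paper.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

using Bytes = std::vector<std::uint8_t>;

// Hypothetical helper: split a packet into (header, payload) at header_len.
std::pair<Bytes, Bytes> slice(const Bytes& packet, std::size_t header_len) {
    assert(header_len <= packet.size());
    auto split = packet.begin() + static_cast<std::ptrdiff_t>(header_len);
    return {Bytes(packet.begin(), split), Bytes(split, packet.end())};
}

// Hypothetical shallow NF: it only touches one header byte (a TTL-like field)
// and never reads the payload.
void decrement_ttl(Bytes& header, std::size_t ttl_offset) {
    assert(ttl_offset < header.size());
    if (header[ttl_offset] > 0) {
        --header[ttl_offset];
    }
}

// Rejoin the (possibly modified) header with the untouched payload.
Bytes splice(const Bytes& header, const Bytes& payload) {
    Bytes out(header);
    out.insert(out.end(), payload.begin(), payload.end());
    return out;
}
```

In this sketch only the header would need to travel to wherever the NF runs, while the payload can stay put until the splice step. That is the data movement a shallow NF lets you avoid.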

They argue that with CAT + DDIO in use, the private caches, LLC, and memory are still not the bottleneck. The emphasis is on the data path of the TTL pipeline.

Reference

  1. AlNiCo: SmartNIC-accelerated Contention-aware Request Scheduling for Transaction Processing
  2. Automated SmartNIC Offloading Insights for Network Functions
  3. https://borispis.github.io/files/2022-nicmem-slides.pdf

A Detailed Look at PCIe/CXL "Network Layer" Communication

Introduction

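First, why do we need a PCIe-attached memory or cache protocol? The key is the limitation of memory channels on the CPU: you cannot keep adding parallel memory buses. Although more channels help random memory reads and writes, the current channel count roughly satisfies the CPU-to-memory ratio, so no matter how many threads issue loads and stores, the CPU's demand is met.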

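Extra threads just spin inside the core and cannot issue more memory instructions than the CPU's frequency and compute capability allow. This is why the x-axis of a roofline model is arithmetic intensity. It also made it inevitable that 3D XPoint, which wastes memory bandwidth, would be phased out. Accessing memory over the serial PCIe protocol therefore makes a lot of sense. Meta's workloads tell us that 80% of internet applications are capacity bound: there is a very large data warehouse, but the data that needs low-latency access, that is, what is about to be displayed on the user's device, is actually small. It only needs to be loaded into private DRAM within a short time.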
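The roofline argument can be written down in a few lines. This is a minimal sketch of the standard roofline formula, attainable throughput = min(peak compute, memory bandwidth × arithmetic intensity). The function names and the example numbers in the note below are illustrative assumptions, not measurements.

```cpp
#include <algorithm>
#include <cassert>

// Attainable throughput in GFLOP/s under the roofline model.
// intensity is arithmetic intensity in FLOP per byte moved from memory.
double roofline_gflops(double peak_gflops, double mem_bw_gbs, double intensity) {
    assert(peak_gflops >= 0.0 && mem_bw_gbs >= 0.0 && intensity >= 0.0);
    return std::min(peak_gflops, mem_bw_gbs * intensity);
}

// Ridge point: the intensity at which a kernel stops being bandwidth bound
// and becomes compute bound.
double ridge_intensity(double peak_gflops, double mem_bw_gbs) {
    assert(mem_bw_gbs > 0.0);
    return peak_gflops / mem_bw_gbs;
}
```

For a hypothetical machine with 1000 GFLOP/s of peak compute and 100 GB/s of memory bandwidth, a kernel at 2 FLOP/byte is capped at 200 GFLOP/s regardless of thread count, and only kernels above 10 FLOP/byte can reach the compute peak.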

Examples

Let us start with two practical examples.

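  1. If someone wanted to build a PCIe-based memory expansion device today and have it expose coherent, byte-addressable memory, there would really be only two viable options. One is to expose the memory as memory-mapped I/O (MMIO) through a Base Address Register (BAR). Without hacks, the only sane approach that works with general CPU support is to map the MMIO as uncached (UC), which carries a clear performance penalty. For more detail on coherent memory access to GPUs, see Nvidia's hack. Access to device memory is not constrained by the protocol, and this goal has not been achieved yet. In fact, the NVMe 1.4 specification introduced the Persistent Memory Region (PMR, not to be confused with C++17's `std::pmr`), which can do this, but it is still limited.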

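  2. If you build a PCIe-based device whose main job is network address translation (NAT) (or some other IP packet modification), the work is done by the CPU and consumes critical memory bandwidth. This is because the CPU has to read the data from the device, modify it, and write it back, and the only way to do that over PCIe is through main memory.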

Transfer Format

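With a serial transport protocol we get non-deterministic memory latency. Beyond extreme cases such as constant packet loss when the machine sits next to a nuclear power plant, latency is affected even more by oversubscription of the CXL switch.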

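Memory and NVDIMMs that use DRAM media attached directly to the CPU have latency under 100 ns. Cache-coherent protocols over serial PCIe links, such as CXL (XMM, NV-XMM modules, and AICs) and CCIX, reach about 350 ns. OpenCAPI DDIMMs are also only 40 ns, while Gen-Z, which goes through an external switch or network, is at the 800 ns level.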
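One way to read these numbers alongside the capacity-bound argument is a simple weighted average. The sketch below assumes a two-tier model in which a fixed fraction of accesses goes to the far (PCIe/CXL-attached) tier. The function name and the model are illustrative assumptions, and it ignores queuing, caching, and switch oversubscription.

```cpp
#include <cassert>

// Average access latency in ns when far_fraction of accesses hit the far tier
// (for example CXL-attached memory) and the rest hit CPU-attached DRAM.
double blended_latency_ns(double local_ns, double far_ns, double far_fraction) {
    assert(far_fraction >= 0.0 && far_fraction <= 1.0);
    return (1.0 - far_fraction) * local_ns + far_fraction * far_ns;
}
```

With the figures above (100 ns local, 350 ns over CXL), sending 10% of accesses to the far tier gives an average of about 125 ns under this model, which is why a small hot set in local DRAM can hide most of the added latency.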

PCIe Transfer Format

The transfer format at each layer, as reflected in the packet headers



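The memory configuration space has a 32-bit BAR limit. Whether addressing is 32-bit or 64-bit must be specified up front, which determines whether the header is 3DW or 4DW.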
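The 3DW versus 4DW choice can be sketched as follows. The sketch relies on two facts about PCIe memory request TLPs: a 32-bit address uses a 3DW header and a 64-bit address uses a 4DW header, and in the Fmt field bit 0 marks a 4DW header while bit 1 marks a data payload. The function names are hypothetical, and this is an illustration rather than a full TLP encoder.

```cpp
#include <cassert>
#include <cstdint>

// Header length in DWs (1 DW = 4 bytes) for a memory request TLP.
// Addresses that fit in 32 bits use the 3DW format; others need 4DW.
int header_dwords(std::uint64_t addr) {
    return (addr >> 32) == 0 ? 3 : 4;
}

// Low two bits of the Fmt field: bit 0 = 4DW header, bit 1 = has data payload.
// 0 = 3DW no data, 1 = 4DW no data, 2 = 3DW with data, 3 = 4DW with data.
std::uint8_t tlp_fmt(std::uint64_t addr, bool with_data) {
    const int four_dw = header_dwords(addr) == 4 ? 0x1 : 0x0;
    const int data = with_data ? 0x2 : 0x0;
    return static_cast<std::uint8_t>(four_dw | data);
}

// Total TLP size in DWs, header plus payload (link-layer framing excluded).
int tlp_dwords(std::uint64_t addr, int payload_dwords) {
    assert(payload_dwords >= 0 && payload_dwords <= 1024);  // 10-bit Length field
    return header_dwords(addr) + payload_dwords;
}
```

For example, a memory write to an address above 4 GiB carries one extra DW of header compared with the same write below 4 GiB.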

Each Ack returned in a Completion corresponds to an earlier Memory request.

Finally, note that the Transaction Descriptor Attributes specify the I/O ordering and the CPU ordering/snooping.

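The Endpoint is usually what interests us most, because that is where high-performance devices sit. In the sample block diagram it is the GPU. In practice it could be a high-speed Ethernet card, a data acquisition/processing card, or an InfiniBand card communicating with storage devices in a large data center. Below is a block diagram that zooms in on the interconnection of these components.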

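Based on this topology, consider a typical scenario in which remote direct memory access (RDMA) lets an endpoint PCIe device write directly into pre-allocated system memory as data arrives, offloading as much CPU involvement as possible. The device initiates a write request carrying the data and sends it to the Root Complex, which is expected to place the data into system memory.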

What CXL Adds

Reference

  1. https://par.nsf.gov/servlets/purl/10078086
  2. https://www.youtube.com/watch?v=Uff2yvtzONc
  3. https://bwidawsk.net/blog/2022/6/compute-express-link-intro/#cxl.mem
  4. https://www.computeexpresslink.org/download-the-specification
  5. https://www.youtube.com/watch?v=fpAFvLhTpqw
  6. https://www.youtube.com/watch?v=caiREMKP0-E&t=7s

IOring Windows at first sight and migration to `monoio`

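I have recently been writing a cross-platform downloader with LemonHX. We wanted a coroutine scheduler with deterministic latency, so we settled on monoio, open-sourced by ByteDance, and plan to contribute the Windows part.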

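The main parts that need cross-platform abstraction are already written. GAT has just landed in mainline, so contributing this feels like the more economical choice. ByteDance does not seem to be using it internally yet, and has only done some testing.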

First-generation Memory Disaggregation for Cloud Platforms @Arxiv

CXL disaggregation because:

  1. Memory inefficiency: platform-level memory stranding.
  2. Cloud vendors' constraints on memory disaggregation: the system must require no modifications to customer workloads or the guest OS; it must be compatible with virtualization acceleration techniques; and it must be available as commodity hardware.


Encountering `::signbit` stuff not passing to `math.h` in MacOS 12.4

TL;DR

/Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin/../include/c++/v1/cmath:317:9: error: no member named 'signbit' in the global namespace
using ::signbit;
      ~~^
/Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin/../include/c++/v1/cmath:318:9: error: no member named 'fpclassify' in the global namespace
using ::fpclassify;
      ~~^
/Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin/../include/c++/v1/cmath:319:9: error: no member named 'isfinite' in the global namespace
using ::isfinite;
      ~~^
/Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin/../include/c++/v1/cmath:320:9: error: no member named 'isinf' in the global namespace
using ::isinf;
      ~~^
/Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin/../include/c++/v1/cmath:321:9: error: no member named 'isnan' in the global namespace
using ::isnan;
      ~~^
/Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin/../include/c++/v1/cmath:322:9: error: no member named 'isnormal' in the global namespace
using ::isnormal;
      ~~^
/Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin/../include/c++/v1/cmath:323:9: error: no member named 'isgreater' in the global namespace
using ::isgreater;
      ~~^
/Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin/../include/c++/v1/cmath:324:9: error: no member named 'isgreaterequal' in the global namespace
using ::isgreaterequal;
      ~~^
/Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin/../include/c++/v1/cmath:325:9: error: no member named 'isless' in the global namespace
using ::isless;
      ~~^
/Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin/../include/c++/v1/cmath:326:9: error: no member named 'islessequal' in the global namespace
using ::islessequal;
      ~~^
/Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin/../include/c++/v1/cmath:327:9: error: no member named 'islessgreater' in the global namespace
using ::islessgreater;
      ~~^
/Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin/../include/c++/v1/cmath:328:9: error: no member named 'isunordered' in the global namespace
using ::isunordered;
      ~~^
/Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin/../include/c++/v1/cmath:329:9: error: no member named 'isunordered' in the global namespace
using ::isunordered;
      ~~^

I hit this while compiling LLVM recently. It may be because my Command Line Tools were outdated, as described on Stack Overflow. I reinstalled them, with the following code added.

using ::signbit _LIBCPP_USING_IF_EXISTS;
using ::fpclassify _LIBCPP_USING_IF_EXISTS;
using ::isfinite _LIBCPP_USING_IF_EXISTS;
using ::isinf _LIBCPP_USING_IF_EXISTS;
using ::isnan _LIBCPP_USING_IF_EXISTS;
using ::isnormal _LIBCPP_USING_IF_EXISTS;
using ::isgreater _LIBCPP_USING_IF_EXISTS;
using ::isgreaterequal _LIBCPP_USING_IF_EXISTS;
using ::isless _LIBCPP_USING_IF_EXISTS;
using ::islessequal _LIBCPP_USING_IF_EXISTS;
using ::islessgreater _LIBCPP_USING_IF_EXISTS;
using ::isunordered _LIBCPP_USING_IF_EXISTS;
using ::isunordered _LIBCPP_USING_IF_EXISTS;

`_LIBCPP_USING_IF_EXISTS` is defined as `# define _LIBCPP_USING_IF_EXISTS __attribute__((using_if_exists))`, so the using-declaration does not error out when the name is not declared in the global namespace.

Then the following code produces an error:

using _Lim = numeric_limits<_IntT>;

Adding another header fixes it:

#include <limits>

Next comes an error on `std::isnan` in llvm/lib/Support/NativeFormatting.cpp: the using-declaration was bypassed, so the name has no definition.

error: expected unqualified-id for std::isnan(N)

Just drop the `std::`.

The full formula for riscv-rvv-llvm is at https://github.com/victoryang00/homebrew-riscv. If you hit any of the errors above, apply the fixes described here.