MOAT: Towards Safe BPF Kernel Extension

MPK only supports up to 16 domains, while the number of BPF programs can far exceed this. MOAT uses a two-layer isolation scheme to support an unlimited number of BPF programs: the first layer deploys MPK to set up lightweight isolation between the kernel and BPF programs. In addition, BPF helper function calls are not protected and can be attacked.

  1. They use two-layer isolation with PCID. In the first layer, the kernel lifts the protection-key permissions for the BPF domain so it can do its work; the only exception is the GDT and IDT, which are always write-disabled. In the second layer, when a malicious BPF program tries to access the memory regions of another BPF program, a page fault occurs and the malicious BPF program is immediately terminated. To avoid TLB flushes, each BPF program gets its own PCID, and the 4096-entry PCID space rarely overflows.

  1. Helper protection: first, protect sensitive objects, securing critical objects at a finer granularity; second, ensure the validity of parameters: Dynamic Parameter Auditing (DPA) leverages information obtained from the BPF verifier to dynamically check whether the parameters are within their legitimate ranges.
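A minimal sketch of what DPA-style checking amounts to, written in Python as a model. The helper name, the ranges, and the function below are illustrative assumptions for the sketch, not MOAT's actual interface:

```python
# Toy model of Dynamic Parameter Auditing (DPA): the verifier derives
# per-argument value ranges statically; at runtime each helper call is
# audited against them. Helper name and ranges here are hypothetical.
HELPER_ARG_RANGES = {
    # dst pointer confined to the program's map area, size proven bounded
    "bpf_probe_read": [(0x1000, 0x2000), (1, 256)],
}

def audit_helper_call(helper, args):
    """Reject a helper call whose runtime arguments fall outside the
    verifier-derived ranges; MOAT would terminate the program instead."""
    for value, (lo, hi) in zip(args, HELPER_ARG_RANGES[helper]):
        if not lo <= value <= hi:
            raise PermissionError(
                f"{helper}: argument {value:#x} outside [{lo:#x}, {hi:#x}]")
    return True

print(audit_helper_call("bpf_probe_read", [0x1800, 64]))  # in range -> True
```

The point of the model: the expensive reasoning (deriving the ranges) happens once in the verifier; the runtime check is a cheap pair of comparisons per argument.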

LibPreemptible

uintr arrives with Sapphire Rapids (RISC-V introduced the N extension in 2019), meaning no context switches compared with signals, providing the lowest IPC latency. Using the APIC directly would raise safety concerns.

uintr usage

  1. general purpose IPC
  2. userspace scheduler (this paper)
  3. userspace network
  4. libevent & liburing

Syscall integration (eventfd-like): the sender initiates and notifies the event; the receiver gets the fd, calls into the kernel, and issues SENDUIPI back to the sender.

They wrote a lightweight runtime for libpreemptible.

  1. Enable lightweight and fine-grained preemption
  2. Separation of mechanism and policy
  3. Scalability
  4. Compatibility

They maintain fine-grained (3 µs) dynamic timers for scheduling rather than relying on kernel timers, which greatly improves the 99th-percentile tail latency. This is a natural design built on SPR's hardware features.
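A minimal sketch of such a userspace deadline timer, assuming a simple min-heap design; the class name and the quantum constant are illustrative, not LibPreemptible's actual runtime API:

```python
import heapq, itertools

# Toy userspace deadline timer: the runtime (not the kernel) keeps a
# min-heap of per-task deadlines and polls it on a fine-grained tick.
QUANTUM_US = 3  # the 3 us preemption granularity mentioned above

class DeadlineTimer:
    def __init__(self):
        self.heap = []
        self.counter = itertools.count()  # tie-breaker for equal deadlines

    def arm(self, deadline_us, task):
        heapq.heappush(self.heap, (deadline_us, next(self.counter), task))

    def expired(self, now_us):
        """Pop every task whose deadline has passed; the runtime would
        deliver a user interrupt (uintr) to preempt each of them."""
        out = []
        while self.heap and self.heap[0][0] <= now_us:
            _, _, task = heapq.heappop(self.heap)
            out.append(task)
        return out

t = DeadlineTimer()
t.arm(3, "A"); t.arm(6, "B"); t.arm(9, "C")
print(t.expired(6))  # deadlines at 3 and 6 have passed
```

Keeping the heap in userspace is what separates mechanism (uintr delivery) from policy (which task to preempt next), and avoids a kernel crossing per timer update.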


OMB-CXL: A Micro-Benchmark Suite for Evaluating MPI Communication Utilizing Compute Express Link Memory Devices

This paper examines how Message Passing Interface (MPI) libraries can utilize CXL memory devices for inter-node communication.

In the HH case, CXL has lower latency than Ethernet in the small-message range, with a 9.5x speedup. As the message size increases, the trend reverses, with Ethernet beating CXL in latency, because the CXL channel has lower bandwidth than Ethernet in the emulated system (two compute nodes, each with its own memory expander).

On slowing down lately: flashbacks, uncanny events, and the philosophy behind them

Slowing down

After the electric shock I have to take a medication that makes me drowsy, which has given me a legitimate excuse for laziness. From 2024/4/9 until 5/21, when I returned home to Shanghai, I felt I passed by death's door about three times. My nervous system is normal when the illness is not flaring up, but once it flares I get a strong feeling of impending death. This heavy blow shattered my predictions about the future. While hospitalized I kept thinking: I haven't finished writing code to leave this world some trace of yyw, how could I go so soon? When drowsiness kept me from doing research, I would think: I am someone with a strong sense of purpose, how could I die halfway? Still, I have learned to slow down and accept that some things cannot be rushed and cannot be obtained; all I can do, when I am down, is examine what is wrong with me and where I can still improve, conserving strength for the next good moment. Even though that next time may well never come. (sad)

Never pry into other people's lives

The pace differs hugely from person to person. I think doing a PhD is mental torture, and comparisons among PhD students are meaningless. Above all, don't become impatient for quick success just because others have something you don't.

I have now gone from a staunch materialist to an idealist.

Because the materialist's mind is itself a vantage point of observation, and God created everything the materialist can be convinced by, making the materialist a materialist at a certain scale. But there are still many things in this world that cannot be explained: for example, I had hallucinations, visual and auditory, during the electric shock, and my nervous system can feel only part of the pain, yet after a while it can again control a particular organ of my body. Current science cannot explain this.

Uncanny events

Ever since my brain was shocked, I first found myself doing things beyond rational control, believing I had done things I had not done, and losing memories that slowly rewound and came back. Since then, a part of my brain seems to have moved into my heart. When something exciting comes up, such as the things I love: Formal Methods, tennis, and computer architecture, my brain gets overexcited, I hear things, and then I faint. Once, at a meal, after one of these heart-brain episodes I lost nearly two years of memory, even the memory of my mtf part. During a walk with my wife past the wharf, I slowly recovered my memories; the fragments returned to my brain from different parts of my body. At that time my memory span was very short, like a goldfish's. I kept asking my wife whether I had recovered yet and why I was there.

Different parts of the brain have different functions

After the shock, I first felt the nerves in my brain scatter to different parts of my body, each carrying part of my life's memories; the last piece entered my heart, which led to many later episodes of the heart-brain connection. I can clearly feel that part of my brain is female and part is male. After each heart-brain episode, I first hear three voices in my body: one hallucinated voice that is resolute and stern; another subconscious voice that speaks only the truth; and a third subconscious voice that is extremely gentle and imaginative. Then my brain flows like liquid around my heart through every part of my body, and once my heart stays away from its original place too long, I suddenly faint. My nervous system breaks down completely once, then recovers; the nerves take charge of the body again, and a new cycle begins.

Reactions to different things differ, and the switch happens in an instant.

I think the reason I have multiple personalities is that the controlling part of the brain is responsible for only part of my body; if that part loses control of the brain and another part takes over the body, the effect is completely different. My male and female parts are not the same. The doctor said mtf may be caused by a gene mutation; I think that is possible: if a brain-controlling part mutates, it can develop toward "having been a girl all along." The whole switch can happen in the blink of an eye.

Personal identity

Personal identity is the criterion for judging whether a person is still the person they used to be. If a brain has endured many blows, and even its neurotransmitter pathways have shifted into an entirely different state, is this person different from the one before? I feel that the present me, on medication, has lost the sharp eye and the courage to argue with people, drifting ever further from the stereotypical yyw.

Is this a common INTJ affliction?

INTJs are a group of driven little purple old men who point out problems incisively. With no resource constraints, an INTJ can command armies; but with scarce resources, or a brain out of control, they fall into an endless loop of "I'm so useless." My personality has changed, but I believe my inner drive has not. If an INTJ loses a robust, articulate brain, they can be defeated by their own past embarrassments; only by staying at an intellectual high point can an INTJ's heart be satisfied. So the brain is the INTJ's weakest link.

Philosophy

I now believe more in philosophy I once dismissed, such as "one cannot step into the same river twice," "personal identity," and "the brain is part of a person, and so is the heart; if a person loses their brain, are they still the same person? If a person is grafted with a pacemaker, are they still the original person?" I now think different nerves in the brain represent control over different organs, and even subtle changes in the brain may alter its neurotransmitter transmission, thereby changing a person's thoughts and behavior.

A person is the sum of their consciousness

All human thought rests on the propagation of neurotransmitters, and neurotransmitters depend on the person's hormone levels and external stimuli at the time. I believe a person is simply the sum of their consciousness.

Humans are interpretive animals; every expression is interpretable, differing only in how hard it is to understand

People are always good at explaining things; only when they cannot fully understand something do they resort to transcendent expression to interpret it. At first I did not even know what type my illness was,

What people from different cultures produce when they collide, communicating on the same dimension, is more meaningful. A person is merely running a foundational, interpretive model, dim or not, strong opinion or not.

The brain's neurotransmitter transmission process is very similar to how a foundational model arises from perceptrons plus neural networks. The training data is the reinforcement built up by the series of neurotransmitters produced in the past, so the two are alike. To produce something new, viewpoints from different sides must collide; that is, the training data must be diverse.

Women really are psychic, though they may not even know it.

My mother understands remarkably well the descriptive utterances I make while losing consciousness. Perhaps this is something a woman's sixth sense perceives? On WeChat calls, my mother and I

Is women's susceptibility to Alzheimer's a matter of being too psychic?

Talking with my mother, I find she can perceive my emotions beyond the words, things that perhaps show in subtle facial details. When my somatization symptoms flared and she heard over the phone that I could not bear it, my mother precisely supplied the information about what I was feeling, information that my father filtered out. I do not know when my mother will develop Alzheimer's as my grandmother did, but the ability to perceive psychically feels closely related to the early signs of Alzheimer's.

2023 year-end summary

Grind too fast and you burn out. I deeply feel the stamina theory wdy told me at the start of my PhD: a PhD is a process of stamina slowly declining, and whether the growth of intelligence can outrun the decline of stamina is a big question. A Sys & Arch PhD may end up with only a few representative works, not because they work slowly, but because much of what gets built cannot be published, or the ideas exist but are blocked by many factors: no machines (no connections), no one familiar with the specific software, no one whose writing pleases the committee, in which case better not to bother. Hence, people have no free will. A person is just a puppet on strings within a circle, merely moved from one place to another. Especially now, chips, apart from codesign and CXL, are a sunset industry. Hardware really is very, very slow to become production-ready. So it is again an arms race: all it takes is not giving you machines. Another lesson: for collaboration, connections must be solid; don't work with people who don't recognize you or whose personalities are unstable. Probably every PhD student's mind is wrecked by rejections. My current sense of research is that much of it will never be used in a lifetime; it is published to explore what industry will not explore, and only when it becomes useful later will people bring it up. For many PhDs, face and money easily break you, because you are genuinely dirt-poor. Beyond that, I feel the North American PhD circle worships bad money driving out good, with too much mutual business flattery. But they are just papers; don't take them too seriously. If each paper proposes something acceptable and the community is persuaded, the outcome is fine either way. Expecting what's built to be usable is a fantasy. Today's MLSys and AI4Code are even more bombastic, essentially reskinning what systems people ground through before. I honestly think doing this work at the lowest pay, work that startups should be doing, is a stable way for successful schools and successful groups to fleece people.

I missed three 2023 ASPLOS deadlines; "next time for sure" easily becomes never. Early in the year I thought through CXLMemSim and wrote it up in a week. I thought it was fun, but at the time the methodology felt too synthetic, so I did not continue. Only at the last ASPLOS deadline did my advisor tell me to push, but I had no machine for cross-validation. The first deadline was SOSP; I spent two days helping my Iranian colleague write a small eBPF memory-allocation tool; her part should have taken two weeks, but it never materialized. For the second ASPLOS deadline, I spent two weeks modifying a kernel so that procfs can adjust user static allocation; memory debugging was quite hard, and my advisor stopped me from doing the rest. Doing migration with DSA would actually have been easy. I measured a lot of eBPF performance and found that the latest SPR has cross-basic-block stack elimination plus ROB optimizations, so kretprobe is already very, very fast. The rest of the early year went into debugging Subpage Write Protection for CXL.cache QEMU, for a security idea; only now is it mostly written. The work now feels so low-level that much architectural design is no longer about coding or programming difficulty but purely hacky debugging. I hope next year I can write more Rust Linux drivers and more MLIR. Although, judging from the several companies I interviewed with, kernels that can use Rust are still years away. MLIR as a data-movement analysis framework, though, already has companies competing. This era is a golden age of computing, and I am fortunate to live in it and contribute some of my observations and ideas.

Mid-year I spent three months on WebAssembly and asked an undergrad schoolmate for help; he is amazing, much better than my current colleagues. Later I mentored an intern through CSAPP. At year's end 云微 approached me to collaborate; it felt like being chosen, though perhaps he then discovered yyw is a scrub. In between we submitted our beloved eBPF+LLM work to FSE, but AI4Code is already flooded; the feedback was that we lacked comparisons with the many low-quality papers. I see two problems: my advisor is unfamiliar with eBPF, so much of what he wrote missed the point. He is a kind person, but persuading him is hard. Later bpftime was submitted to Plumbers and OSDI. At year's end I slacked, and something I really wanted to do was done by MIRA; clearly I need to push harder on debugging. I then poured everything I had been thinking into an NV Fellowship application; naturally it did not land, but I received a great deal of feedback. Caring too much about fame and profit costs you the innocence of exploration. In my comfort zone, HPC SCC, there was still much progress; although ShanghaiTech slacked from start to finish, I helped a bit and we placed. I spent almost every holiday with my wife. Early in the year she had not yet arrived; in the summer I took a week off from my advisor for a trip to Japan, the last chance to leave the US. I traveled to SC in Denver and OSDI in Boston. After my wife arrived, we went to North Carolina at Christmas to buy her a car and visited New York. Our relationship matured this year; she can accept everything about me, and I hope we spend the rest of our lives together.

[CSE211 Reading] MLIRSynth framework

  1. Motivation:
    • Problem Statement
      • The current compilation paths for heterogeneous devices such as CPUs, GPUs, and TPUs are too divergent and not high-performance, because semantics are lost in translation when lowering the IR.
      • MLIR is an infrastructure for developing domain-specific compilers. To aid this, MLIR provides reusable building blocks, especially the abstraction of dialects with a bunch of operators that have knowledge of cross-device memory communication and predefined and shared tools that allow us to define domain-specific languages and their compilation pipelines.
    • SoTA
      • Polygeist already covers multi-dimensional dialects like Affine IR, from which we normally do automatic CGRA/GPU/TPU code generation.
      • Google's HLO feeds XLA, much as JAX does, or as Chris Lattner's Mojo is doing.
      • LLVM Polly can do backend compilation with very good performance insight for a single machine.
      • Linalg IR (incidentally, the linear-algebra extensions accepted for C++26 map their headers to this primitive IR) carries the mathematical insight to transform a matmul plus transpose into a single transpose (together with many other mathematical optimizations) and is best placed to clear away dead linear-algebra primitives.
    • Motivation
      • LLVM IR, Affine IR, and Linalg IR are heterogeneous in different ways. HLO is a better target for raising from C++ to ML DSLs and is especially useful for TPUs. Going from a unified IR to divergent dialects while preserving idempotency of dataflow (especially IO) and semantics, so that code generation to different dialects is easy, is very useful for current compiler development targeting TPU/GPU/CPU extensions.
      • For raising and lowering, it is actually impossible to embed the same logic with no information loss. Say I am writing predefined functions for an application: cross-platform optimization in MLIR is good for memory translation and for complying with different targets' views from a data-movement perspective if you are lowering to XLA.
      • On the dimensions where compatibility is impossible, debug information and performance insight: say, a library that has been optimized may look like nonsense from a lowered IR's perspective. Without knowledge of both IRs, the abstraction is broken.
      • Dataflow may be completely wrong, so a residual IO-spec generator is needed to maintain idempotency.
      • Compared with HAILE and Fortran MLIR, many functionality-wise upgrades are required.
  2. Compiler Solution: The MLIRSynth Framework - A virtual Compilation Phase abstraction:

Heuristics: build a candidate set for extracting the specification φ, matching between the target dialect and the source dialect.

Soundness: CBMC and Z3 are used to determine correctness statically.

To extract φ (phi) with the candidate set, the algorithm follows a bottom-up synthesis approach. Here is a summary of the process:

  1. Initialization: The algorithm starts by creating a candidate set (C) with valid candidates that produce distinct values from each other. This set includes candidates that return the arguments of the reference function (f) and simple constants.
  2. Enumeration: The algorithm iterates through the set of operations in the grammar. For each operation, it identifies sets of possible operands, attributes, and regions based on the operation signature.
  3. Candidate Generation: The algorithm generates possible candidates by taking the Cartesian product of sets of operands, attributes, and regions.
  4. Candidate Checking: Each candidate in the set is validated using a series of static checks, ordered by complexity. These checks include type correctness and additional checks via dialects verification interface.
  5. Equivalence Pruning and Validation: If the static checks succeed, the algorithm uses MLIR's execution engine to compile the candidate. It then checks φ_{obsn} by executing the candidate program (f') on a set of inputs and comparing the output value with the output value produced by the reference function (f).
  6. Specification Checking: The algorithm checks if the candidate satisfies the specifications φ_{obsn} and φ_{obsN} by comparing the outputs of the candidate and the reference function on a small finite set of inputs (In) and a large finite set of inputs (IN), respectively.
  7. Illustrative Example:

This example illustrates raising from one dialect to another.
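The enumerate-check-prune loop in steps 1 to 6 can be sketched as a toy bottom-up synthesizer. This is an illustrative model in Python over arithmetic expressions, not mlirSynth's implementation, which enumerates MLIR operations and validates with the MLIR execution engine; all names, the grammar, and the input sets here are assumptions of the sketch:

```python
import itertools

def synthesize(reference, args_count, ops, test_inputs, max_rounds=3):
    """Bottom-up enumerative synthesis: grow a candidate set from the
    arguments and simple constants, prune observationally equivalent
    candidates, and stop when one matches the reference on all inputs."""
    # Initialization: candidates returning the arguments, plus constants.
    candidates = [(f"x{i}", (lambda i: lambda xs: xs[i])(i)) for i in range(args_count)]
    candidates += [(str(c), (lambda c: lambda xs: c)(c)) for c in (0, 1)]

    def signature(fn):
        # Observational signature: outputs on the finite input set I_n.
        return tuple(fn(xs) for xs in test_inputs)

    target = tuple(reference(*xs) for xs in test_inputs)
    for expr, fn in candidates:
        if signature(fn) == target:
            return expr
    seen = {signature(fn) for _, fn in candidates}

    for _ in range(max_rounds):
        new = []
        for name, op, arity in ops:  # enumeration over the grammar
            # Candidate generation: Cartesian product of operand sets.
            for operands in itertools.product(candidates, repeat=arity):
                expr = f"{name}({', '.join(e for e, _ in operands)})"
                fns = [f for _, f in operands]
                fn = (lambda op, fns: lambda xs: op(*(g(xs) for g in fns)))(op, fns)
                sig = signature(fn)
                if sig == target:          # specification check on I_n
                    return expr
                if sig not in seen:        # equivalence pruning
                    seen.add(sig)
                    new.append((expr, fn))
        candidates += new
    return None

ops = [("add", lambda a, b: a + b, 2), ("mul", lambda a, b: a * b, 2)]
tests = [(1, 2), (3, 4), (5, 6), (0, 7)]
# Reference computes a*b + a; the synthesizer finds an equivalent term.
print(synthesize(lambda a, b: a * b + a, 2, ops, tests))
```

The pruning step is what makes the search tractable: two candidates with the same outputs on the finite input set are interchangeable for the rest of the search, so only one representative is kept.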

  1. Key Results:

The evaluation over PolyBench on an 8700K and a 3990X reports the performance and effectiveness of the mlirSynth algorithm in raising programs to higher-level dialects within MLIR. The TPU beats LLVM across the board (because LLVM IR is not a good IR for heterogeneous accelerators). The evaluation covers synthesis time, validity checks, and the impact of type information and candidate pruning on the synthesis process. It also reports the performance improvement achieved by the raised programs compared to existing compilation flows, as well as the potential for further improvements and future work.

  1. Discussion and Future Directions:
  • Benefits:
    • The bottom-up enumerative synthesis approach in MLIR allows for raising dialect levels within the MLIR framework.
    • The retargetable approach is applied to Affine IR, raising it to the Linalg and HLO IRs.
    • The raised IR code, when compiled to different platforms, outperforms existing compilation flows in terms of performance.
  • Implications:
    • The use of polyhedral analysis in the compilation community has been extensively explored, but MLIR-Synth offers a different approach by using polyhedral analysis to raise dialect levels instead of lowering code.
    • The synthesis process in MLIR-Synth involves type filtering, candidate evaluation, and equivalence checking, which significantly reduces synthesis time compared to a naive algorithm.
  • Future Work:
    • The authors plan to raise programs to multiple target dialects and improve the synthesis search by reusing previous program space explorations.
    • They also aim to integrate model checking into the synthesis process and evaluate raising to new and emerging dialects of MLIR.
    • The scalability of the synthesis algorithm will be improved to handle larger benchmark suites.
  • The middle IR is always there and is easier to develop from different angles, but it is not the killer app justifying a new tool: the speedup the tool delivers basically comes from backends that already exist.

Recent multiverse debugging journey

My colleague made less progress in the past six months than in the last week, because yyw was there to point things out.

First, static instrumentation fixes up the start location and offsets, adding an offset to every instrumented basic block and placing the jmp map metadata at the end. The first problem was that behavior differs between the 6.2.0-36 and 5.4.0 kernels: rtld.c:1306 segfaults because the physical-to-virtual address mapping behavior for p_header differs between the two kernels.

Second, SPEC CPU2017 gcc contains self-modifying code, which we do not support.

Third, perl blows up in its destructor, because libc's handling of dtors is terrible; it is also a self-modifying-code implementation.

Recommended reading