2023 Year-End Summary

When you grind too hard, you burn out easily. I now deeply appreciate the "stamina theory" wdy mentioned to me right when my PhD started: doing a PhD is a process of stamina slowly declining, and whether the growth of your intellect can outrun that decline is a big question. A Sys & Arch PhD will likely end up with only a few representative works, not because they work slowly, but because many things genuinely cannot be published even once built, or the idea is there but is blocked by all kinds of constraints: no machines (no connections), no one familiar with the particular software, and if you don't have someone whose writing pleases the committee, you probably shouldn't even bother. Hence, people have no free will; a person is just a marionette inside the circle, moved from one place to another. Especially since, codesign aside, the chip business, CXL included, is a sunset industry, and hardware really is very, very slow to become production ready. So it is again an arms race, and all it takes is not giving you machines. Another lesson is that collaboration still requires strong connections; don't collaborate with people who don't value you or whose personalities aren't stable. I suspect every PhD student's mindset has been blown apart by rejections. My current feeling about research, though, is that much of it will never be used in a lifetime; it is published to explore things industry will not explore, and only when it becomes useful later will people dig it up and talk about it. For many PhD students, face and money are the things that break them, because they really are poor to the point of desperation. Beyond that, I think the North American PhD circle worships bad money driving out good and indulges in too much mutual flattery. But they are just papers; don't take them too seriously. As long as each paper proposes something acceptable and the community is convinced, the results themselves are all questionable anyway; it would be a miracle if the artifacts were actually usable. Today's MLSys and AI4Code are even more inflated; they feel like the same systems work people were already grinding on, with a thin new skin. I honestly think getting people to do, at the lowest pay, the work a startup should be doing is how successful schools and successful groups reliably farm free labor.

I missed all three 2023 ASPLOS deadlines; it is easy to slide into "next time for sure", forever. At the start of the year I figured out CXLMemSim and spent a week getting it written. I thought it was pretty fun, but at the time I felt the methodology was too synthetic and didn't continue. Only at the last ASPLOS deadline did my advisor let me push on it, and by then there were no machines to cross-validate against. The first deadline was SOSP: I spent two days helping my Iranian colleague write a small eBPF memory-allocation tool; her part should have taken only two weeks, but it simply never materialized. For the second ASPLOS deadline I spent two weeks modifying a kernel so that a user's static allocation can be changed through procfs; the memory debugging was quite hard, and my advisor didn't let me do the rest anyway. Doing migration with DSA is actually easy. I measured a lot of eBPF performance and found that the newest SPR has cross-basic-block stack elimination plus ROB optimizations, so kretprobe is already very, very fast. The rest of the early year went into debugging Subpage Write Protection in the CXL.cache QEMU model, for a security idea; only now is it mostly written. It feels like the work is now so low-level that much of the architecture design involves no coding or programming difficulty; it is just flat-out hacky debugging. I hope next year I can write more Rust Linux drivers and more MLIR. Although, judging from the several companies I interviewed with, kernels that actually contain Rust are still years away, MLIR as a data-movement analysis framework already has companies racing on it. This is a golden age of computing; I am grateful to live in this era and to contribute some of my observations and ideas.

Mid-year I spent three months writing WebAssembly and asked an undergraduate junior from my old school for help; he is excellent, far better than my current colleagues. Later I mentored an intern through CSAPP. At the end of the year Yunwei reached out to collaborate, which felt like being chosen, though perhaps he will suddenly discover that yyw is just a scrub. In between, we clung to eBPF+LLM and submitted to FSE, but AI4Code is already flooded with shallow papers, and the feedback was that we lacked comparisons with many of them. I see two problems: my advisor is not familiar with eBPF, so much of what he wrote missed the point; and although he is a very nice person, convincing him is hard. Later, bpftime was submitted to Plumbers and OSDI. At the end of the year I slacked off, and something I really wanted to do got done by MIRA, so clearly I need to push harder on debugging. I then poured everything I had been thinking about into an NV Fellowship application; it obviously didn't get selected, but I received a great deal of feedback. I think caring too much about fame and profit makes it easy to lose the innocence of exploration. In my comfort zone, the HPC Student Cluster Competition, there was still plenty of progress; although ShanghaiTech slacked from start to finish, I still helped out and we took some placements. I spent almost every holiday with my wife. She hadn't arrived at the start of the year; over the summer I took a week off from my advisor and, on my last chance to leave the US, went to Japan. For travel I went to SC in Denver and OSDI in Boston. After my wife arrived, we went to North Carolina over Christmas to buy her a car and visited New York. I feel our relationship matured over this year; she can accept everything about me now, and I hope we spend the rest of our lives together.

[CSE211 Reading] MLIRSynth framework

  1. Motivation:
    • Problem Statement
      • The current compilation paths for heterogeneous devices like CPUs, GPUs, or TPUs are too divergent and lose performance, because semantics are lost in translation when lowering the IR.
      • image-20231112081221616
      • MLIR is an infrastructure for developing domain-specific compilers. To this end, MLIR provides reusable building blocks, especially the abstraction of dialects, with a set of operators that carry knowledge of cross-device memory communication, plus predefined and shared tools, which let us define domain-specific languages and their compilation pipelines.
    • SoTA
      • Polygeist already covers raising into multi-dimensional dialects like the Affine dialect, from which we normally do automatic CGRA/GPU/TPU code generation.
      • Google's HLO feeds XLA, which is what JAX targets, much as Chris Lattner's Mojo is doing.
      • LLVM Polly can do backend compilation with very good performance insight on a single machine.
      • Linalg IR (by the way, the Linear Algebra extensions accepted for C++26 map the header onto this primitive IR) has the mathematical insight to transform a matmul plus transpose into only a single transpose (together with many other mathematical optimizations), and has the best insight for clearing away dead linear-algebra primitives.
    • Motivation
    • image-20231112080422281
      • LLVM IR, Affine IR, and Linalg IR are heterogeneous in different ways. Say, HLO is a better way of raising from C++ to ML DSLs that is super useful for TPUs. Going from a uniform IR to divergent dialects that stay idempotent in terms of dataflow (especially IO) and semantics, so that we can easily codegen to different dialects, is super useful for today's compiler development for TPU/GPU/CPU extensions.
      • For raising and lowering, it is actually impossible to embed the same logic with no information loss. Say I am writing predefined functions for an application: cross-platform optimization in MLIR is good for memory translation and for conforming to different targets' views from a data-movement perspective if you are lowering to XLA.
      • Some dimensions are simply impossible to keep compatible, such as debug information and performance insight. Say a library that has been optimized may look like nonsense from the lowered IR's perspective; without knowledge of both IRs, the abstraction is useless.
      • The dataflow may be completely wrong, so we need a residual IO spec generator to maintain that idempotency.
      • Compared with HAILE and the Fortran MLIR frontend, a lot of functionality-wise upgrades are required.
  2. Compiler Solution: The MLIRSynth Framework - A virtual Compilation Phase abstraction:
image-20231112081336639

Heuristics: a candidate set is used to extract the specification φ and match between the target dialect and the source dialect.

Soundiness: CBMC/Z3 is used to determine correctness statically.
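As a rough illustration of what such a static check does (a minimal sketch, not the paper's actual CBMC pipeline; the expressions below are made up for the example), an SMT solver can prove a synthesized candidate equivalent to the reference by asking for a counterexample:

```cpp
// Toy equivalence check between a reference f and a candidate f' using Z3.
// unsat for (f != f') means no input distinguishes them.
#include <z3++.h>
#include <iostream>

int main() {
  z3::context c;
  z3::expr x = c.int_const("x");
  z3::expr y = c.int_const("y");
  z3::expr f  = (x + y) * 2;      // reference semantics
  z3::expr fp = x * 2 + y * 2;    // synthesized candidate
  z3::solver s(c);
  s.add(f != fp);                 // look for a distinguishing input
  switch (s.check()) {
    case z3::unsat:   std::cout << "equivalent\n"; break;
    case z3::sat:     std::cout << "counterexample:\n" << s.get_model() << "\n"; break;
    case z3::unknown: std::cout << "unknown\n"; break;
  }
}
```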

To extract φ (phi) with the candidate set, the algorithm follows a bottom-up synthesis approach. Here is a summary of the process:

  1. Initialization: The algorithm starts by creating a candidate set (C) with valid candidates that produce distinct values from each other. This set includes candidates that return the arguments of the reference function (f) and simple constants.
  2. Enumeration: The algorithm iterates through the set of operations in the grammar. For each operation, it identifies sets of possible operands, attributes, and regions based on the operation signature.
  3. Candidate Generation: The algorithm generates possible candidates by taking the Cartesian product of sets of operands, attributes, and regions.
  4. Candidate Checking: Each candidate in the set is validated using a series of static checks, ordered by complexity. These checks include type correctness and additional checks via the dialects' verification interfaces.
  5. Equivalence Pruning and Validation: If the static checks succeed, the algorithm uses MLIR's execution engine to compile the candidate. It then checks φ_obs^n by executing the candidate program (f') on a set of inputs and comparing its output with the output produced by the reference function (f).
  6. Specification Checking: The algorithm checks whether the candidate satisfies the specifications φ_obs^n and φ_obs^N by comparing the outputs of the candidate and the reference function on a small finite input set (I_n) and a large finite input set (I_N), respectively.
  7. Illustrative Example:
image-20231116104246652

The example above raises a program from one dialect to another; a toy sketch of the synthesis loop follows.
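Here is a self-contained C++ sketch of that bottom-up loop (my own illustration: a two-operator integer grammar stands in for MLIR operations, attributes, and regions, and plain lambdas stand in for MLIR's execution engine; all names are illustrative):

```cpp
// Toy bottom-up enumerative synthesis: candidate set, operand enumeration,
// and observational-equivalence pruning against a reference function.
#include <cstdint>
#include <functional>
#include <iostream>
#include <set>
#include <string>
#include <vector>

struct Candidate {
  std::string expr;                                // printable form
  std::function<int64_t(int64_t, int64_t)> eval;   // executable form
};

int main() {
  // Reference function f we want to "raise": f(a, b) = (a + b) * b.
  auto f = [](int64_t a, int64_t b) { return (a + b) * b; };

  // Small finite input set I_n used for the observational spec phi_obs^n.
  const std::vector<std::pair<int64_t, int64_t>> inputs = {
      {0, 0}, {1, 2}, {3, 5}, {-4, 7}, {10, -3}};
  auto signature = [&](const Candidate &c) {
    std::string s;
    for (auto [a, b] : inputs) s += std::to_string(c.eval(a, b)) + ",";
    return s;
  };

  // 1. Initialization: arguments and a simple constant, deduplicated by behavior.
  std::vector<Candidate> C = {
      {"a", [](int64_t a, int64_t) { return a; }},
      {"b", [](int64_t, int64_t b) { return b; }},
      {"1", [](int64_t, int64_t) { return int64_t{1}; }}};
  std::set<std::string> seen;
  for (auto &c : C) seen.insert(signature(c));

  const std::string goal = signature({"f", f});

  // Grammar of binary operations (stand-ins for dialect ops).
  struct Op { std::string name; std::function<int64_t(int64_t, int64_t)> apply; };
  const std::vector<Op> grammar = {
      {"+", [](int64_t x, int64_t y) { return x + y; }},
      {"*", [](int64_t x, int64_t y) { return x * y; }}};

  // 2.-5. Enumerate ops, take the Cartesian product of operands, then prune
  // candidates that are observationally equivalent to one we already have.
  for (int depth = 0; depth < 3; ++depth) {
    std::vector<Candidate> next;
    for (const auto &op : grammar)
      for (const auto &lhs : C)
        for (const auto &rhs : C) {
          Candidate cand{"(" + lhs.expr + " " + op.name + " " + rhs.expr + ")",
                         [op, lhs, rhs](int64_t a, int64_t b) {
                           return op.apply(lhs.eval(a, b), rhs.eval(a, b));
                         }};
          std::string sig = signature(cand);
          if (!seen.insert(sig).second) continue;   // equivalent candidate exists
          if (sig == goal) {                        // phi_obs^n satisfied on I_n
            std::cout << "raised f to: " << cand.expr << "\n";
            return 0;
          }
          next.push_back(cand);
        }
    C.insert(C.end(), next.begin(), next.end());
  }
  std::cout << "no candidate found within the depth bound\n";
}
```

The `seen` set plays the role of equivalence pruning: any candidate whose outputs match an existing candidate on I_n is discarded before it can blow up the search space.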

  3. Key Results:
image-20231116105913960
image-20231116110116097

The evaluation over PolyBench on an 8700K and a 3990X summarizes the performance and effectiveness of the mlirSynth algorithm in raising programs to higher-level dialects within MLIR. The TPU outperforms the LLVM baseline across the board (because LLVM IR is not a good IR for heterogeneous accelerators). The evaluation reports synthesis time, validity checks, and the impact of type information and candidate pruning on the synthesis process. It also reports the performance improvement of the raised programs over existing compilation flows, as well as the potential for further improvements and future work.

  4. Discussion and Future Directions:
  • Benefits:
    • The bottom-up enumerative synthesis approach in MLIR allows for raising dialect levels within the MLIR framework.
    • The retargetable approach is applied to Affine IR, raising it to the Linalg and HLO IRs.
    • The raised IR code, when compiled to different platforms, outperforms existing compilation flows in terms of performance.
  • Implications:
    • The use of polyhedral analysis in the compilation community has been extensively explored, but MLIR-Synth offers a different approach by using polyhedral analysis to raise dialect levels instead of lowering code.
    • The synthesis process in MLIR-Synth involves type filtering, candidate evaluation, and equivalence checking, which significantly reduces synthesis time compared to a naive algorithm.
  • Future Work:
    • The authors plan to raise programs to multiple target dialects and improve the synthesis search by reusing previous program space explorations.
    • They also aim to integrate model checking into the synthesis process and evaluate raising to new and emerging dialects of MLIR.
    • The scalability of the synthesis algorithm will be improved to handle larger benchmark suites.
  • A middle-level IR will certainly always be there, and it is easier to develop against from different angles, but that by itself is not the killer app of a new tool; the speedup from the tool basically comes from backends that already exist.

Recent multiverse debugging journey

My colleague has made more progress in the past week than in the previous six months, because yyw was beside him to point things out.

First, the static instrumentation sets up the start location and offsets, adding an offset to every basic block of the instrumented code and placing the jmp map metadata at the end. The first problem is that the 6.2.0-36 and 5.4.0 kernels behave differently: rtld.c:1306 segfaults because the mapping between p_header's physical and virtual addresses behaves differently on the two kernels.
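A quick way to see what the loader actually mapped (a standalone probe useful for this kind of diff, not part of multiverse itself) is to walk the program headers of the running process and print file offset, link-time p_vaddr, and the runtime address after the load bias; diffing this output across the two kernels shows where the mapping assumptions diverge:

```cpp
// Dump how each PT_LOAD program header of the running binary is mapped.
#include <link.h>
#include <cstdio>

static int dump_phdrs(struct dl_phdr_info *info, size_t, void *) {
  std::printf("object: %s (load bias 0x%lx)\n",
              info->dlpi_name[0] ? info->dlpi_name : "[main executable]",
              (unsigned long)info->dlpi_addr);
  for (int i = 0; i < info->dlpi_phnum; ++i) {
    const ElfW(Phdr) *ph = &info->dlpi_phdr[i];
    if (ph->p_type != PT_LOAD) continue;
    std::printf("  PT_LOAD  offset=0x%lx  vaddr=0x%lx  runtime=0x%lx  memsz=0x%lx\n",
                (unsigned long)ph->p_offset, (unsigned long)ph->p_vaddr,
                (unsigned long)(info->dlpi_addr + ph->p_vaddr),
                (unsigned long)ph->p_memsz);
  }
  return 0;  // keep iterating over all loaded objects
}

int main() { dl_iterate_phdr(dump_phdrs, nullptr); }
```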

Next, the SPEC CPU2017 gcc benchmark has self-modifying code, which we don't support.

Next, perl blows up in its destructors, because libc's handling of dtors is braindead and is also effectively a self-modifying-code implementation.

Recommended Reading

On gender, as yyw sees it

As everyone knows, the most important thing in yyw's eyes is career, the second is my wife, and only third comes an endless supply of little dresses. But I suddenly realized the latter two are not really that different: shaving my head changed nothing compared to before, and I can still put on a wig and little dresses. I don't care all that much about my body either, because I think the way of transitioning that does the least damage to the body has not appeared yet, so I won't let myself slide too far into a situation I cannot control.
Steve Jobs managed to hang on for another six years after his pancreatic-cancer transplant surgery, so yyw can likewise hope that medicine in this world advances soon. Each day today is different from yesterday, especially in California, the place that lets yyw's ideas run free to the fullest.
So I think gender, in any sense, doesn't matter; what matters is the heart, and living on in a way that feels comfortable. If my skin can keep its smooth, feminine quality, that is enough to hold yyw's soul. I believe that when a problem isn't solved, it is purely because you haven't observed this world from a higher plane, not because that plane doesn't exist. I try to observe the world through drastic experiments to reach a higher plane, while hoping the people around me benefit as well.
So whatever my wife wants of yyw is fine, as long as she does not leave yyw and agrees to have children with yyw.

GPU Slicing Proposal Using CXLMemUring

The current GPU slicing by MIG is far too coarse-grained.

The state of the art needs the GPU to pick up service-mesh requests before you can serve them, and normally the kernel launch time and the data movement take up most of the execution. Pre-execution through an MLIR JIT can statically optimize away the launch while treating the GPU context as a coroutine.
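For calibration, one existing mechanism in the same spirit as "optimize out the launch" is CUDA Graphs: record the per-request work once, then replay it with a single launch call. This is a minimal runtime-API sketch (only async copies are captured so it stays kernel-free), not the MLIR-JIT pre-execution proposed above:

```cpp
// Capture a stream of work into a CUDA graph once, replay it per request.
#include <cuda_runtime.h>
#include <cstdio>
#include <vector>

int main() {
  const size_t n = 1 << 20;
  std::vector<float> host(n, 1.0f);
  float *a = nullptr, *b = nullptr;
  cudaMalloc(&a, n * sizeof(float));
  cudaMalloc(&b, n * sizeof(float));

  cudaStream_t stream;
  cudaStreamCreate(&stream);

  // Record the per-request work once...
  cudaGraph_t graph;
  cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal);
  cudaMemcpyAsync(a, host.data(), n * sizeof(float), cudaMemcpyHostToDevice, stream);
  cudaMemcpyAsync(b, a, n * sizeof(float), cudaMemcpyDeviceToDevice, stream);
  cudaStreamEndCapture(stream, &graph);

  cudaGraphExec_t exec;
  cudaGraphInstantiate(&exec, graph, 0);  // CUDA 12 signature (flags = 0)

  // ...then replay it per request with a single launch call.
  for (int request = 0; request < 100; ++request)
    cudaGraphLaunch(exec, stream);
  cudaStreamSynchronize(stream);
  std::printf("replayed the captured graph 100 times\n");

  cudaGraphExecDestroy(exec);
  cudaGraphDestroy(graph);
  cudaStreamDestroy(stream);
  cudaFree(a);
  cudaFree(b);
}
```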

[CSE211 Reading] UNIT framework

  • Motivation:
    • Problem Statement
      • With the advent and proliferation of Deep Neural Networks (DNNs), there's a significant increase in computational demands. Managing these demands efficiently is crucial for the performance and scalability of DNNs.
      • Traditional compilation processes for tensorized instructions, which are central to DNN computations, can be cumbersome and may not fully exploit hardware and software optimizations.
      • The goal is to address these computational and memory challenges by unifying the compilation process for tensorized instructions; a unified approach can lead to better utilization of available resources and ease the integration of new instructions.
    • Motivation
      • Reducing multiple low-precision values into a higher-precision one
        • Horizontal reduction
        • Mixed precision
      • A software abstraction over the kernel libraries for tensorized instructions that can be deployed on Intel VNNI, Arm SVE, and NVIDIA Tensor Cores, and possibly AMX now. Today, vendors only ship library support for tensorized instructions each time they upgrade the GPU/TPU/ASIC. (A scalar sketch of this reduction appears after this outline.)
  • Compiler Solution - The UNIT Framework - A virtual ISA abstraction:
    • Framework Overview
      • The authors propose a compiler framework called UNIT to unify the compilation for tensorized instructions.
      • At the heart of UNIT is a unified semantics abstraction which simplifies the integration of new instructions and allows for the reuse of analysis and transformations.
    • Key Features
      • Ease of Integration: The framework is designed to simplify the integration of new tensorized instructions.
      • Reuse of Analysis and Transformations: By adopting a unified semantics abstraction, the framework facilitates the reuse of analysis and transformations, promoting efficiency.
      • Translation of Memory Operations: The framework translates memory operations into tensorized instructions and performs optimizations on top of them.
    • Mixed Precision Data Types
      • A notable approach to reducing computational and memory burden is the use of mixed precision data types. This approach is widely adopted and is integrated within the UNIT framework.
    • Mixed Sized Data Types
      • The authors also propose the use of mixed sized data types to reduce the computational and memory burden. This approach is also widely adopted and is integrated within the UNIT framework.
  • Illustrative Example: We first apply Arithmetic Isomorphism for a single thread using split/tile, reorder, and unroll. Then we apply Memory Isomorphism for Intel VNNI: essentially, mark the loop invariants and lower to tensorized operations of the other precisions and sizes. Finally, we perform the Loop Reorganization transformation for registers. (A sketch of the split-and-reorder step appears after this outline.)
  • Implementation and Key Results:
    • The paper provides an implementation of the UNIT framework and evaluates its performance against traditional compilation processes.
  • Discussion and Future Directions:
    • Benefits
      • The UNIT framework, as per the authors, presents a viable solution to the challenges of compiling tensorized instructions efficiently, thereby addressing a critical aspect of DNN performance optimization.
    • Implications
      • The proposed framework could have broader implications for the field of DNNs, particularly in how tensor computations are handled in both hardware and software domains.
    • Future Work
      • The paper may suggest directions for future work to further enhance the UNIT framework or explore other optimizations in tensorized instruction compilation.
  • Conclusion and Comments:
    • The UNIT framework emerges as a significant contribution towards optimizing the compilation process for tensorized instructions, addressing the computational and memory challenges associated with DNNs.
    • We need a vISA: dynamically JIT-compiling against a vISA would be great for migrating the same code across different hardware, rather than relying on per-vendor library support.
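To make the mixed-precision motivation and the split/reorder example above concrete, here is a minimal scalar sketch (my own illustration with made-up names, not code from the UNIT paper): the first loop is the plain widening reduction, and the second shows the shape the arithmetic-isomorphism rewrite leaves behind, which is exactly the 4 x (u8 x s8) -> s32 group that a VNNI vpdpbusd-style instruction consumes.

```cpp
// Scalar reference for the mixed-precision horizontal reduction and the
// split/reorder step that exposes it to a tensorizing backend.
#include <cstdint>
#include <cstdio>
#include <vector>

// Reference loop: acc += a[k] * b[k], widening every product to 32 bits.
int32_t dot_naive(const std::vector<uint8_t>& a, const std::vector<int8_t>& b) {
  int32_t acc = 0;
  for (size_t k = 0; k < a.size(); ++k)
    acc += static_cast<int32_t>(a[k]) * static_cast<int32_t>(b[k]);
  return acc;
}

// Same loop after splitting the reduction dimension by 4 and reordering
// (assumes the length is a multiple of 4): the inner k2 loop now has exactly
// the shape of one vpdpbusd lane, so it can be replaced by the instruction.
int32_t dot_tensorized_shape(const std::vector<uint8_t>& a,
                             const std::vector<int8_t>& b) {
  int32_t acc = 0;
  for (size_t k1 = 0; k1 < a.size(); k1 += 4) {
    int32_t lane = 0;                       // one 32-bit accumulator lane
    for (int k2 = 0; k2 < 4; ++k2)          // 4 x (u8 * s8) horizontal reduce
      lane += static_cast<int32_t>(a[k1 + k2]) * static_cast<int32_t>(b[k1 + k2]);
    acc += lane;
  }
  return acc;
}

int main() {
  std::vector<uint8_t> a = {1, 2, 3, 4, 5, 6, 7, 8};
  std::vector<int8_t>  b = {1, -1, 1, -1, 2, -2, 2, -2};
  std::printf("%d %d\n", dot_naive(a, b), dot_tensorized_shape(a, b));  // -6 -6
}
```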

ExaScale: Rethinking Von Neumann for modern GPUs compared with DSAs (TPU, CGRA, brain-like, PIM)

The current trend of developing AI accelerators does not follow Von Neumann's view. What did Von Neumann's model buy us? Multi-tenancy, virtualization, fine-grained scheduling, mapping back to the compiler, and cross-platform live migration. Why are these properties deprecated in a lot of so-called Von Neumann architectures? Because the current microarchitectural state is too complicated to expose fully to programmers, which kills a lot of people's interest. I think Professor Jiang Yanyan's abstraction of the operating system as an automaton is incorrect because of the explosion of state that is invisible to the OS; the GPU is not fully debuggable, let alone coarser-grained architectures like the TPU. So if you cannot fine-tune your scheduling, the outcome is that when the workload keeps changing, your chip and infrastructure will never beat Nvidia, because they have better infrastructure and their TFLOPS are already close to the extreme of what any chip can do. Tomorrow, if I want to deploy LLM+HPC, all the DSAs will simply die because of this. I think the abstraction of CUDA, or abstraction at the C++ language level, is good for programmers to program against, but it has drifted far from all the Von Neumann properties above. If academic proposals try to commercialize any of the DSAs, such as TPU, CGRA, brain-like, or PIM, they may lose any of the above Von Neumann properties, and they will not be useful if those architectures lack the 10x speedup and the agility that CUDA and GPUs provide.

In terms of virtualization, GPUs have never been ready, because the current virtualization technique on GPUs is still VFIO, which is CPU-dominated and slow. Ray, as I mentioned before, hits a Von Neumann memory wall, and epoch-based scheduling is not fine-grained. And we should never merely adapt to the front end like PyTorch or CUDA, because that changes nothing in the meaningless abstraction and only works for the monopoly; we need a revolution from the architecture back up through the abstraction to the language. We need to get back to normal: why did we lose this property? In the realm of modern GPU architectures, there is an emerging sentiment: as long as we utilize CUDA's Just-In-Time (JIT) compilation capabilities, we can achieve a faster virtual Instruction Set Architecture (ISA), for instance something akin to WebGPU/Vulkan/PTX. This could lead to virtualization speeds surpassing traditional methods like VFIO with no semantic or performance sacrifice.
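As a concrete datapoint for that "JIT to a virtual ISA" claim, here is a minimal sketch of the existing CUDA path: NVRTC compiles a kernel source string to PTX at run time, and the driver API loads and launches that PTX, so PTX effectively plays the role of the portable vISA. This is standard NVRTC/driver-API usage, not ExaScale code; error handling is omitted.

```cpp
// Minimal NVRTC -> PTX -> driver-API launch sketch (error checks omitted).
#include <cuda.h>
#include <nvrtc.h>
#include <cstdio>
#include <string>
#include <vector>

static const char *kSrc = R"(
extern "C" __global__ void scale(float *x, float s, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) x[i] *= s;
})";

int main() {
  // 1. JIT the CUDA C source to PTX: PTX is the virtual ISA here.
  nvrtcProgram prog;
  nvrtcCreateProgram(&prog, kSrc, "scale.cu", 0, nullptr, nullptr);
  const char *opts[] = {"--gpu-architecture=compute_70"};
  nvrtcCompileProgram(prog, 1, opts);
  size_t ptxSize = 0;
  nvrtcGetPTXSize(prog, &ptxSize);
  std::string ptx(ptxSize, '\0');
  nvrtcGetPTX(prog, ptx.data());
  nvrtcDestroyProgram(&prog);

  // 2. Load the PTX with the driver API; the driver JITs it to the local SASS.
  cuInit(0);
  CUdevice dev;  cuDeviceGet(&dev, 0);
  CUcontext ctx; cuCtxCreate(&ctx, 0, dev);
  CUmodule mod;  cuModuleLoadDataEx(&mod, ptx.c_str(), 0, nullptr, nullptr);
  CUfunction fn; cuModuleGetFunction(&fn, mod, "scale");

  // 3. Launch through the same vISA-level handle.
  const int n = 1024;
  std::vector<float> host(n, 1.0f);
  CUdeviceptr dptr; cuMemAlloc(&dptr, n * sizeof(float));
  cuMemcpyHtoD(dptr, host.data(), n * sizeof(float));
  float s = 2.0f; int nn = n;
  void *args[] = {&dptr, &s, &nn};
  cuLaunchKernel(fn, (n + 255) / 256, 1, 1, 256, 1, 1, 0, nullptr, args, nullptr);
  cuCtxSynchronize();
  cuMemcpyDtoH(host.data(), dptr, n * sizeof(float));
  std::printf("host[0] = %f\n", host[0]);  // 2.0 if everything worked

  cuMemFree(dptr);
  cuModuleUnload(mod);
  cuCtxDestroy(ctx);
}
```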

Am I saying DSAs are no longer useful? No; if everything in a space is very mature, I guess the DSA will eventually win, but things change every day. Speculative decoding is mathematically the same as full decoding while saving you roughly 10x, so a TPU is not agile enough to tailor itself to this change, whereas a GPU can quickly adopt the new mathematical advancement. The TPU does have an inference market: if Google Gemini takes everyone over in the next month, the TPU behind it will save a great deal in electricity cost, something only Google can do in the entire universe. Other technologies, like CGRA or brain-like hardware, are unsolvable in the near future.

ExaScale aims to beat Nvidia not by breaking Nvidia's monopoly directly, but first by enabling transparent migration across different GPUs and connecting them into a memory pool that is not Nvidia's alone. This would push price competition, because Nvidia would no longer have its competitive edge. The second step is to hack the interconnect through CXL or another, faster fabric that beats NVLink with software-hardware codesign like CXLMemUring. I guess this movement will be the future of how we integrate everything!