[CSE211 Reading] MLIRSynth framework

  1. Motivation:
    • Problem Statement
      • The current compilation flow for heterogeneous devices such as CPUs, GPUs, and TPUs is too divergent and delivers poor performance, because semantics are lost when lowering the IR.
      • MLIR is an infrastructure for developing domain-specific compilers. To aid this, MLIR provides reusable building blocks, especially the dialect abstraction, whose operators carry knowledge of cross-device memory communication, and predefined, shared tools that let us define domain-specific languages and their compilation pipelines.
    • SoTA
      • Polygeist has already covered multi-dimensional dialects such as the Affine IR, from which we normally do automatic CGRA/GPU/TPU code generation.
      • Google's HLO is the input to XLA, the route that JAX takes and that Chris Lattner's Mojo is pursuing.
      • LLVM Polly can do backend compilation with very good performance insight on a single machine.
      • Linalg IR (incidentally, the Linear Algebra extensions accepted for C++26 map their header onto this primitive IR) has the mathematical insight to transform a matmul plus transpose into a single transpose (together with many other mathematical optimizations), and it is best placed to clear away dead linear-algebra primitives.
    • Motivation
      • LLVM IR, Affine IR, and Linalg IR are heterogeneous in different ways. HLO is a better raising target from C++ to ML DSLs and is super useful for TPUs. Taking a uniform IR to divergent dialects while preserving idempotency of dataflow (especially IO) and semantics, so that codegen to the different dialects is easy, is super useful for today's compiler development for TPU/GPU/CPU extensions.
      • For raising and lowering, it is actually impossible to embed the same logic with no information loss. Say I'm writing the predefined functions for an application: cross-platform optimization in MLIR is good for memory translation and for complying with different targets' views of data movement if you are lowering to XLA.
      • Some dimensions are impossible to keep compatible: debug information and performance insight. Say a library that has been hand-optimized may look like nonsense from the lowered IR's perspective; if you don't have knowledge of both IRs, the abstraction is broken.
      • The dataflow may be completely wrong, so we need a residual IO spec generator to maintain idempotency.
      • Compared with Halide and Flang (Fortran's MLIR frontend), a lot of functionality-wise upgrades are required.
  2. Compiler Solution: The MLIRSynth Framework - A virtual Compilation Phase abstraction:

Heuristics: a candidate set from which φ is extracted, matching programs between the source dialect and the target dialect.

Soundiness: CBMC/Z3 for determining correctness statically.

To extract φ via the candidate set, the algorithm follows a bottom-up synthesis approach. Here is a summary of the process:

  1. Initialization: The algorithm starts by creating a candidate set (C) with valid candidates that produce distinct values from each other. This set includes candidates that return the arguments of the reference function (f) and simple constants.
  2. Enumeration: The algorithm iterates through the set of operations in the grammar. For each operation, it identifies sets of possible operands, attributes, and regions based on the operation signature.
  3. Candidate Generation: The algorithm generates possible candidates by taking the Cartesian product of sets of operands, attributes, and regions.
  4. Candidate Checking: Each candidate in the set is validated using a series of static checks, ordered by cost. These checks include type correctness and additional checks via the dialect's verification interface.
  5. Equivalence Pruning and Validation: If the static checks succeed, the algorithm uses MLIR's execution engine to compile the candidate, then checks the observational specification by executing the candidate program (f') on a set of inputs and comparing its output with the output produced by the reference function (f).
  6. Specification Checking: The algorithm checks that the candidate satisfies the specifications φ_obs(I_n) and φ_obs(I_N) by comparing the outputs of the candidate and the reference function on a small finite input set (I_n) and a large finite input set (I_N), respectively.
  7. Illustrative Example:

The above raises a program from one dialect to the other; a minimal sketch of the loop appears below.
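
To make the loop concrete, here is a self-contained toy version of the bottom-up enumerative synthesis described above, over arithmetic expressions instead of MLIR operations. The grammar, the constants, and the input sets are illustrative assumptions; mlirSynth itself enumerates MLIR ops and validates candidates through the execution engine rather than with Python lambdas.

```python
from itertools import product

# Toy grammar: the operations the enumerator may apply.
OPS = {"add": lambda a, b: a + b, "mul": lambda a, b: a * b}

def synthesize(f_ref, inputs_small, inputs_large, rounds=3):
    # 1. Initialization: candidates returning the argument of f and simple
    #    constants, pruned so that no two behave identically on I_n.
    seen = {}

    def consider(fn, desc):
        sig = tuple(fn(x) for x in inputs_small)  # observational signature on I_n
        if sig not in seen:                       # equivalence pruning
            seen[sig] = (fn, desc)

    consider(lambda x: x, "x")
    for c in (0, 1, 2):
        consider(lambda x, c=c: c, str(c))

    for _ in range(rounds):
        for name, op in OPS.items():
            # 2./3. Enumeration: Cartesian product of possible operands.
            for (f1, d1), (f2, d2) in product(list(seen.values()), repeat=2):
                cand = lambda x, f1=f1, f2=f2, op=op: op(f1(x), f2(x))
                desc = f"{name}({d1}, {d2})"
                # 5./6. Compare against the reference on the small set I_n,
                #       then validate on the larger set I_N.
                if all(cand(x) == f_ref(x) for x in inputs_small) and \
                   all(cand(x) == f_ref(x) for x in inputs_large):
                    return desc
                consider(cand, desc)
    return None

# Reference function to raise: f(x) = x^2 + x.
print(synthesize(lambda x: x * x + x, [0, 1, 2], range(-50, 50)))
# -> a program equivalent to x^2 + x, e.g. "mul(add(x, 1), x)"
```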

  3. Key Results:

The evaluation on PolyBench, run on an Intel i7-8700K and an AMD Threadripper 3990X, summarizes the performance and effectiveness of the mlirSynth algorithm in raising programs to higher-level dialects within MLIR. The TPU outperforms LLVM across all benchmarks (because LLVM IR is not a good IR for heterogeneous accelerators). The evaluation reports synthesis time, validity checks, and the impact of type information and candidate pruning on the synthesis process. It also covers the performance improvement achieved by the raised programs compared to existing compilation flows, as well as the potential for further improvements and future work.

  4. Discussion and Future Directions:
  • Benefits:
    • The bottom-up enumerative synthesis approach in MLIR allows for raising dialect levels within the MLIR framework.
    • The retargetable approach is applied to Affine IR, raising it to the Linalg and HLO IRs.
    • The raised IR code, when compiled to different platforms, outperforms existing compilation flows in terms of performance.
    • Polyhedral analysis has been extensively explored in the compilation community, but mlirSynth takes a different angle, raising dialect levels instead of lowering code.
    • The synthesis process in mlirSynth involves type filtering, candidate evaluation, and equivalence checking, which significantly reduce synthesis time compared to a naive algorithm.
  • Future Work:
    • The authors plan to raise programs to multiple target dialects and improve the synthesis search by reusing previous program space explorations.
    • They also aim to integrate model checking into the synthesis process and evaluate raising to new and emerging dialects of MLIR.
    • The scalability of the synthesis algorithm will be improved to handle larger benchmark suites.
  • The middle-level IR will surely always be there, and it is easier to develop against from different angles, but it is not the killer app of a new tool; the speedup from the tool basically comes from backends that already exist.

Recent multiverse debugging journey


First, static instrumentation set up the start location and the offsets, adding an offset to every instrumented basic block, and placed the jmp map metadata at the end. The first problem was that the 6.2.0-36 and 5.4.0 kernels behave differently: rtld.c:1306 segfaults because the physical-to-virtual address mapping of p_header differs between the two kernels.

Second, SPEC CPU2017's gcc has self-modifying code, which we don't support.

Also, perl blows up in its destructor, because libc's handling of dtors is idiotic; that, too, is a self-modifying-code implementation.


If Steve Jobs could hang on for six more years after his pancreatic-cancer transplant surgery, yyw can likewise hope this world's medicine advances soon. Every day is different from yesterday, especially in California, the place that unleashes what yyw has in mind to the fullest.

GPU Slicing Proposal Using CXLMemUring

The current GPU slicing offered by MIG is far too coarse in granularity.

The state of the art needs the GPU to serve service-mesh requests as they arrive; normally the kernel launch time and the data movement take up most of the execution. Pre-execution via an MLIR JIT can statically optimize the launches away while treating GPU contexts as coroutines.

[CSE211 Reading] UNIT framework

  • Motivation:
    • Problem Statement
      • With the advent and proliferation of Deep Neural Networks (DNNs), there's a significant increase in computational demands. Managing these demands efficiently is crucial for the performance and scalability of DNNs.
      • Traditional compilation processes for tensorized instructions, which are central to DNN computations, can be cumbersome and may not fully exploit hardware and software optimizations.
      • The paper addresses these computational and memory challenges by unifying the compilation process for tensorized instructions; a unified approach can lead to better utilization of available resources and ease the integration of new instructions.
    • Motivation
      • Reducing multiple low-precision values into a high-precision result
        • Horizontal reduction
        • Mixed precision
      • A software abstraction of the kernel library for tensorized instructions that can be deployed on Intel VNNI, ARM SVE, and NVIDIA Tensor Cores, and possibly AMX now. Today, vendors only ship kernel-library support for tensorized instructions each time they upgrade their GPU/TPU/ASIC.
  • Compiler Solution - The UNIT Framework - A virtual ISA abstraction:
    • Framework Overview
      • The authors propose a compiler framework called UNIT to unify the compilation for tensorized instructions.
      • At the heart of UNIT is a unified semantics abstraction which simplifies the integration of new instructions and allows for the reuse of analysis and transformations.
    • Key Features
      • Ease of Integration: The framework is designed to simplify the integration of new tensorized instructions.
      • Reuse of Analysis and Transformations: By adopting a unified semantics abstraction, the framework facilitates the reuse of analysis and transformations, promoting efficiency.
      • Translation of Memory Operations: The framework translates memory operations into tensorized instructions and performs optimizations on top of them.
    • Mixed Precision Data Types
      • A notable approach to reducing computational and memory burden is the use of mixed precision data types. This approach is widely adopted and is integrated within the UNIT framework.
    • Mixed-Size Data Types
      • The authors also propose mixed-size data types to reduce the computational and memory burden. This approach is likewise widely adopted and integrated within the UNIT framework.
  • Illustrative Example: We first use Arithmetic Isomorphism for a single thread, using split/tile, reorder, and unroll. Then we apply Memory Isomorphism for Intel VNNI: basically, mark the loop invariants and lower to tensorized operations of other precisions and sizes. Finally, we perform the Loop Reorganization transformation for registers. (A sketch of the VNNI-style computation follows this list.)
  • Implementation and Key Results:
    • The paper provides an implementation of the UNIT framework and evaluates its performance against traditional compilation flows.
  • Discussion and Future Directions:
    • Benefits
      • The UNIT framework, as per the authors, presents a viable solution to the challenges of compiling tensorized instructions efficiently, thereby addressing a critical aspect of DNN performance optimization.
    • Implications
      • The proposed framework could have broader implications for the field of DNNs, particularly in how tensor computations are handled in both hardware and software domains.
    • Future Work
      • Future work could further enhance the UNIT framework or explore other optimizations in tensorized-instruction compilation.
  • Conclusion and Comments:
    • The UNIT framework emerges as a significant contribution towards optimizing the compilation process for tensorized instructions, addressing the computational and memory challenges associated with DNNs.
    • We need a vISA: dynamically JITting through a virtual ISA would be great for migrating the same code across different hardware, rather than relying on per-vendor library support.
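
To make the mixed-precision, horizontal-reduction pattern above concrete, here is a minimal NumPy sketch of what Intel VNNI's vpdpbusd computes over a 512-bit register, as I understand the instruction; the function name and the array shapes are mine, not UNIT's API. It multiplies unsigned 8-bit by signed 8-bit operands and reduces each group of four adjacent products into one of sixteen 32-bit accumulators; UNIT's job is to pattern-match loop nests onto instructions with these semantics.

```python
import numpy as np

def vpdpbusd_reference(acc, a_u8, b_s8):
    """acc: (16,) int32; a_u8: (64,) uint8; b_s8: (64,) int8."""
    # Mixed precision: widen the 8-bit inputs to 32 bits before multiplying,
    # so the products cannot overflow, and accumulate at 32 bits.
    prods = a_u8.astype(np.int32) * b_s8.astype(np.int32)   # 64 products
    # Horizontal reduction: each of the 16 int32 lanes sums 4 adjacent products.
    return acc + prods.reshape(16, 4).sum(axis=1, dtype=np.int32)

rng = np.random.default_rng(0)
acc = np.zeros(16, dtype=np.int32)
a = rng.integers(0, 256, size=64, dtype=np.uint8)    # unsigned 8-bit operand
b = rng.integers(-128, 128, size=64, dtype=np.int8)  # signed 8-bit operand
print(vpdpbusd_reference(acc, a, b))
```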

ExaScale: Rethinking Von Neumann for modern GPUs compared with DSAs (TPU, CGRA, brain-like, PIM)

The current trend of developing AI accelerators does not follow the Von Neumann view. What did the Von Neumann model give us? Multi-tenancy, virtualization, fine-grained scheduling, mapping back to the compiler, and cross-platform live migration. Why are these properties abandoned in a lot of so-called Von Neumann architectures? Because the current microarchitectural state is too complicated to expose fully for programmers to understand, which kills a lot of people's interest. I think Professor Jiang Yanyan's abstraction of the operating system as an automaton is incorrect because of the explosion of state that is transparent to the OS; the GPU is not fully debuggable, let alone the coarser-grained architectures like the TPU. So if you cannot fine-tune your scheduling, the outcome is that, when the workload is constantly changing, your chip and infrastructure will never beat Nvidia, because Nvidia has better infrastructure and its TFLOPS are close to the limit of what any chip can do. Tomorrow, if I want to deploy LLM+HPC, all the DSAs will simply die because of this. I think the abstraction of CUDA, or the C++ language-level abstraction, is good for programmers to program against, but it has drifted far from the Von Neumann properties listed above. If academic proposals want to commercialize any one of the DSAs, like TPU, CGRA, brain-like, or PIM, they may lose any of the above Von Neumann properties and won't be useful unless those architectures offer the 10x speedup and the agility that CUDA and GPUs provide.

In terms of virtualization, GPUs have never been ready for it, because the current virtualization techniques on GPUs are still VFIO, which is CPU-dominated and slow. Ray, as I mentioned before, hits a Von Neumann memory wall, and its epoch-based scheduling is not fine-grained. And we should not keep adapting to frontends like PyTorch or CUDA, because that changes nothing in the meaningless abstraction and only serves the monopoly; we need a revolution that starts from the architecture and works back up through the abstraction to the language. We need to return to normal: why did we lose these properties? In the realm of modern GPU architectures, there is an emerging sentiment: as long as we utilize CUDA's Just-In-Time (JIT) compilation capabilities, we can achieve a faster virtual Instruction Set Architecture (ISA), for instance something akin to WebGPU/Vulkan/PTX. This could push virtualization speeds past traditional methods like VFIO with no semantic or performance sacrifice.

Am I saying DSAs are no longer useful? No: if everything in a space is very mature, I guess the DSA will eventually win, but things change every day. Speculative decoding is mathematically equivalent to full decoding while saving you 10x, so a TPU is not agile enough to tailor itself to this change, but a GPU can quickly adopt the new mathematical advance. The TPU does have an inference market: if Google Gemini takes everybody in the next month, the TPU behind it will save a great deal in electricity cost, which only Google can do in the entire universe. Other technologies, like CGRA or brain-like computing, are unsolvable in the near future.

ExaScale aims to beat Nvidia, not by breaking the monopoly that Nvidia has, but first by making migration transparent across different GPUs and connecting them within a memory pool that is not Nvidia's alone. This will drive price competition, because Nvidia will no longer have a competitive edge. The second step is to hack the interconnect through CXL or another, faster fabric that beats NVLink via software-hardware codesign like CXLMemUring. I believe this movement will be the future of how we integrate everything!

My view of Mojo's success and the fallacies of Computer Architecture: A Quantitative Approach

David Patterson's Computer Architecture

I think the TPU is wrong, RVV is wrong, Google's WSC price calculation is deprecated, and x86 is not as bad as RISC-V, so I guess we need to revisit Computer Architecture: A Quantitative Approach. The main fallacy is that most of the added material is David's own work, which neither guides anything in the architecture space nor has a profound impact that endures the test of time. I think architecture should have codesign, but not VLIW, and should not redo things that were discussed long ago. The ideology of the whole book has misled architects seeking new ideas to thrive in this golden age. I'm saying this because I found T-Head's fallacy in RVV, and many other fallacies, and programmers' views are shaped by these misleading books.


Implemented in Go, with codegen to MLIR and a standard library implemented in C++. I would say it is currently just a Python frontend that codegens to MLIR with cohesion to the Python FFI and CFFI, like what I did for ChocoPy-LLVM [6]. I think Chris's idea is to map Python semantics, especially the memory model, onto Rust or C++, so that all memory can be managed as RAII with shared_ptr plus some workarounds, without a GC. Suddenly, I feel the transition from LLVM to MLIR is a very natural thing: instead of defining a separate set for each of AMX, AVX512, and NVVM, it's better to integrate them.

Static Analysis

  • Classes are not implemented yet; no multiple inheritance.
  • "Try" is needed to map onto the C++ exception model.
  • To increase speed, use the syntactic sugar for calling the vector MLIR ops in [4], plus the parallel call primitives. This connects seamlessly and can easily be called from WASM/WebGPU.

Implementation of LLDB and MLIR

Debug with Location info

  • Basically, C++ with MLIR, mapping DWARF back to Mojo.
  • C++ ABI
  • The current mapping to the LLDB debugger is not ready.

MLIR lowering to GPU/CPU heterogeneous code

var y : __mlir_type.i1
if x:
    y = __mlir_op.`index.bool.constant`[value : __mlir_attr.`true`]()
else:
    y = __mlir_op.`index.bool.constant`[value : __mlir_attr.`false`]()
  • -mcpu=sapphirerapids with avx512
  • -mcpu=amdgpu, calling from CPU to GPU

Currently, there's no MLIR code generated for this, and I don't want to reverse-engineer the binary to dump it. You can write some MLIR implementation against the amdgpu dialect to force heterogeneous code.


  1. https://github.com/victoryang00/CS131-discussion/blob/main/11-discussion.tex
  2. https://github.com/modularml/mojo/issues/3
  3. https://mlir.llvm.org/docs/Dialects/AMDGPU/
  4. https://mlir.llvm.org/docs/Dialects/MathOps/
  5. https://mlir.llvm.org/docs/Dialects/IndexOps/
  6. https://github.com/Chocopy-LLVM/chocopy-llvm