ExaScale: Rethinking Von Neumann for Modern GPUs Compared with DSAs (TPU, CGRA, brain-like, PIM)

The current trend in AI accelerator design has drifted away from the Von Neumann view. What did the Von Neumann model buy us? Multi-tenancy, virtualization, fine-grained scheduling, the ability to map execution state back to the compiler, and cross-platform live migration. Why are these properties being abandoned in many so-called Von Neumann architectures? Because the microarchitectural state has become too complicated to expose fully to programmers, which kills much of their interest in reasoning about it. I think Professor Jiang Yanyan's abstraction of the operating system as an automaton breaks down here because of the explosion of state that is opaque to the OS: the GPU is not fully debuggable, let alone coarser-grained architectures like the TPU. If you cannot fine-tune your scheduling, then whenever the workload keeps changing, your chip and infrastructure will never beat Nvidia, because Nvidia has the better infrastructure and TFLOPS is already close to the limit of what any chip can deliver. Tomorrow, if I want to deploy LLM+HPC together, every DSA dies for exactly this reason. The abstraction of CUDA, or the abstraction at the C++ language level, is good for programmers to program against, but it has drifted far from the Von Neumann properties listed above. If academic proposals want to commercialize any of the DSAs, whether TPU, CGRA, brain-like, or PIM, they risk losing those Von Neumann properties, and they will not be useful unless they deliver the 10x speedup and the agility that CUDA and GPUs already provide.

In terms of virtualization, GPUs have never really been ready: the mainstream technique is still VFIO passthrough, which is CPU-dominated and slow. Ray, as I mentioned before, runs into the Von Neumann memory wall, and its epoch-based checkpointing is not fine-grained. We should not keep adapting to front ends like PyTorch or CUDA either, because that changes nothing in a meaningless abstraction and only works for the monopoly; we need a revolution that starts from the architecture and works back up through the abstraction to the language. We need to get back to normal and ask why we lost these properties in the first place. In the realm of modern GPU architectures, there is an emerging sentiment: as long as we exploit CUDA's Just-In-Time (JIT) compilation path, we can build a faster virtual instruction set architecture (ISA), something akin to WebGPU, Vulkan, or PTX, and reach virtualization speeds surpassing traditional methods like VFIO with no semantic or performance sacrifice.
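As a minimal sketch of why PTX already behaves like a virtual ISA, assuming a hypothetical saxpy.ptx file containing a kernel named saxpy: the driver JIT-compiles the same PTX text into native code for whatever GPU it lands on, and that is exactly the level a virtualization layer could intercept at, instead of sitting down at VFIO's PCIe level.

#include <cuda.h>
#include <cassert>
#include <fstream>
#include <sstream>
#include <string>

static void check(CUresult rc) { assert(rc == CUDA_SUCCESS); }  // terse error handling

int main() {
    std::ifstream in("saxpy.ptx");           // virtual-ISA payload (hypothetical file)
    std::stringstream ss;
    ss << in.rdbuf();
    const std::string ptx = ss.str();

    check(cuInit(0));
    CUdevice dev;
    check(cuDeviceGet(&dev, 0));
    CUcontext ctx;
    check(cuCtxCreate(&ctx, 0, dev));

    // The driver JIT-compiles the same PTX into native code for whichever
    // GPU generation is present; this is what makes the ISA "virtual".
    CUmodule mod;
    check(cuModuleLoadData(&mod, ptx.c_str()));
    CUfunction fn;
    check(cuModuleGetFunction(&fn, mod, "saxpy"));

    // ... set up device buffers and launch `fn` with cuLaunchKernel as usual ...

    check(cuModuleUnload(mod));
    check(cuCtxDestroy(ctx));
    return 0;
}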

Am I saying DSAs are no longer useful? No. If everything in a problem space is very mature, I expect the DSA to eventually win, but things change every day. Speculative decoding, for example, is mathematically equivalent to ordinary full decoding while saving roughly 10x of the work, so a TPU is not agile enough to tailor itself to that kind of change, whereas a GPU can adopt the new mathematical advance quickly. The TPU does have an inference market: if Google Gemini takes everyone in the next month, the TPUs behind it will save a great deal of money in electricity cost, something only Google can pull off in the entire universe. Other technologies, like CGRA or brain-like architectures, will not be solved in the near future.
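A minimal sketch of why speculative decoding is distribution-preserving, assuming toy target and draft distributions p and q over a small vocabulary: the target accepts a drafted token x with probability min(1, p(x)/q(x)) and otherwise resamples from the normalized residual max(p - q, 0), so the emitted token is an exact sample from p.

#include <algorithm>
#include <cstddef>
#include <random>
#include <vector>

// One speculative-decoding step for a single drafted token.
// p = target model distribution, q = draft model distribution.
int speculative_step(const std::vector<double>& p,
                     const std::vector<double>& q,
                     int drafted, std::mt19937& rng) {
    // Accept the drafted token with probability min(1, p/q).
    std::uniform_real_distribution<double> u(0.0, 1.0);
    if (u(rng) < std::min(1.0, p[drafted] / q[drafted]))
        return drafted;

    // Otherwise resample from the residual max(p - q, 0), renormalized by
    // std::discrete_distribution; the emitted token is then an exact sample
    // from p, which is why the method is "mathematically the same".
    std::vector<double> residual(p.size());
    for (std::size_t i = 0; i < p.size(); ++i)
        residual[i] = std::max(p[i] - q[i], 0.0);
    std::discrete_distribution<int> resample(residual.begin(), residual.end());
    return resample(rng);
}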

ExaScale aims to beat Nvidia, not by breaking Nvidia's monopoly head-on, but first by enabling transparent migration across different GPUs and connecting them through a memory pool that is not Nvidia's alone. That alone will drive price competition, because Nvidia will no longer hold the competitive edge. The second step is to attack the interconnect through CXL or another, faster fabric that beats NVLink via software-hardware codesign such as CXLMemUring. I believe this movement will be the future of how we integrate everything!
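A rough sketch of the data-plane half of such migration, assuming two visible CUDA devices and a host-side staging buffer standing in for a CXL memory pool; the hard part that matters for transparent migration, moving live execution state, is deliberately not captured here.

#include <cuda_runtime.h>
#include <cassert>
#include <cstddef>
#include <vector>

static void check(cudaError_t rc) { assert(rc == cudaSuccess); }  // terse error handling

// Move a device buffer from GPU 0 to GPU 1 through a host staging buffer
// (a stand-in for a CXL-attached memory pool). Data plane only: no kernel,
// stream, or driver state is migrated here.
void migrate_buffer(void* src_dev_ptr, std::size_t bytes, void** dst_dev_ptr_out) {
    std::vector<char> staging(bytes);          // "memory pool" stand-in

    check(cudaSetDevice(0));
    check(cudaMemcpy(staging.data(), src_dev_ptr, bytes, cudaMemcpyDeviceToHost));
    check(cudaFree(src_dev_ptr));

    check(cudaSetDevice(1));                   // restore onto a second GPU
    check(cudaMalloc(dst_dev_ptr_out, bytes));
    check(cudaMemcpy(*dst_dev_ptr_out, staging.data(), bytes, cudaMemcpyHostToDevice));
}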

My view of Mojo's success and the fallacies of Computer Architecture: A Quantitative Approach

David Patterson's Computer Architecture: A Quantitative Approach

I think the TPU is wrong, RVV is wrong, Google's WSC price calculation is outdated, and x86 is not as bad as RISC-V, so I guess we need to revisit Computer Architecture: A Quantitative Approach. The main fallacy is that much of the added material is David's own work, which neither guides anything in the architecture space nor has had the kind of profound impact that stands the test of time. I think architecture should embrace codesign, but not VLIW, and it should not redo things that were settled long ago. The ideology of the whole book has misled architects about where the new ideas of this golden age should come from. I say this because I found T-Head's fallacy in its RVV implementation, along with many other fallacies, and programmers' views are shaped by these misleading books.

Mojo

The compiler is implemented in Go and generates MLIR, with a standard library implemented in C++. I would say Mojo is currently just a Python front end that lowers to MLIR, with glue to Python FFI and CFFI, much like what I did for ChocoPy-LLVM [6]. I think Chris's idea is to map Python semantics, especially the memory model, onto Rust or C++ semantics, so that every allocation can be managed as RAII with shared_ptr plus some workarounds, and without a GC. Seen that way, the transition from LLVM to MLIR suddenly feels very natural: instead of defining separate sets for AMX, AVX-512, and NVVM, it is better to integrate them.
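A minimal sketch of that mapping, assuming nothing about Modular's actual implementation: a Python-style reference becomes a shared_ptr-managed C++ object, so the lifetime ends deterministically at the last reference drop instead of at a GC pause.

#include <memory>
#include <string>
#include <iostream>

// Sketch: Python-style reference semantics expressed as C++ RAII.
// A "PyObject"-like value is shared, refcounted, and destroyed exactly when
// the last owner goes out of scope; no garbage collector is involved.
struct Object {
    std::string repr;
    explicit Object(std::string r) : repr(std::move(r)) {}
    ~Object() { std::cout << "dealloc " << repr << "\n"; }  // deterministic, like CPython refcounting
};

using Ref = std::shared_ptr<Object>;    // `a = b` in Python ~ copying a Ref

int main() {
    Ref a = std::make_shared<Object>("x = [1, 2, 3]");
    {
        Ref b = a;                      // second reference, refcount = 2
        std::cout << "refs: " << a.use_count() << "\n";
    }                                   // b dies here, refcount back to 1
    return 0;                           // a dies here, Object deallocated
}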

Static Analysis

  • Classes are not implemented yet, and there is no multiple inheritance.
  • "try" is needed for mapping onto the C++ exception model; a rough sketch of that mapping follows this list.
  • To gain speed, use the syntactic sugar for calling the vector MLIR ops in [4] and the parallel-call primitives; they connect seamlessly and can easily be targeted to WASM/WebGPU.
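A rough sketch of that try mapping, assuming a hypothetical lowering rather than Mojo's real codegen: a Python raise/except pair becomes a thrown and caught C++ exception type.

#include <stdexcept>
#include <iostream>

// Hypothetical lowering sketch: Python
//     try:
//         raise ValueError("bad input")
//     except ValueError as e:
//         print(e)
// expressed with the C++ exception model that a "try" construct would map to.
struct ValueError : std::runtime_error {
    using std::runtime_error::runtime_error;
};

int main() {
    try {
        throw ValueError("bad input");          // `raise ValueError("bad input")`
    } catch (const ValueError& e) {             // `except ValueError as e:`
        std::cout << e.what() << "\n";
    }
    return 0;
}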

Implementation of LLDB and MLIR

Debug with Location info

  • Basically C++ with MLIR, mapping DWARF back to Mojo source locations.
  • It follows the C++ ABI.
  • The current mapping into the LLDB debugger is not ready yet.

MLIR lowering to GPU/CPU heterogeneous code

var y : __mlir_type.i1
if x:
    y = __mlir_op.`index.bool.constant`[value : __mlir_attr.`true`]()
else:
    y = __mlir_op.`index.bool.constant`[value : __mlir_attr.`false`]()
  • Build with -mcpu=sapphirerapids to get AVX-512 code on the CPU side.
  • Build with -mcpu=amdgpu for the CPU-to-GPU call path.

Currently no MLIR code is dumped for this, and I don't want to reverse-engineer the compiler to extract it. You can hand-write some MLIR against the amdgpu dialect to force heterogeneous code generation.

References

  1. https://github.com/victoryang00/CS131-discussion/blob/main/11-discussion.tex
  2. https://github.com/modularml/mojo/issues/3
  3. https://mlir.llvm.org/docs/Dialects/AMDGPU/
  4. https://mlir.llvm.org/docs/Dialects/MathOps/
  5. https://mlir.llvm.org/docs/Dialects/IndexOps/
  6. https://github.com/Chocopy-LLVM/chocopy-llvm

[A Turing Award level idea] Slug Architecture: Breaking the Von Neumann Great Memory Wall in performance, debuggability, and security

I'm publishing this because, as a single Ph.D. student, I'm too weak at making connections to the people and big companies with the resources to get me CXL machines. So I open-source all my ideas and wait for everyone to contribute, NDAs notwithstanding. I'm not making this prediction for today's machines: I think room-temperature superconductors may come true someday, core clocks could reach 300 GHz, and the memory infrastructure we have for that vision is probably wrong. I think CXL.mem alone is a little backward, but CXL.cache plus CXL.mem together point the way for future computation. Here I want to formalize the definition of the Slug Architecture, which could possibly break through the Von Neumann architecture wall.

Von Neumann is the god of computer systems: the CPU takes an arbitrary input and turns it into an arbitrary output. The Von Neumann abstraction is that all control flow and data flow happen inside the CPU, which uses memory only for loads and stores. So if we snapshot all the state inside the CPU, we can replay it elsewhere.
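A toy model of that claim, assuming a machine whose entire state is registers, a program counter, and memory: a snapshot is a plain copy of that struct, and replay is just resuming execution from the copy on any other host.

#include <array>
#include <cstdint>
#include <vector>

// Toy Von Neumann machine: the whole execution state is registers + PC +
// memory. Snapshot = copy the struct; replay = keep stepping from the copy,
// possibly on a different host. Real CPUs/GPUs break this by hiding state
// (caches, queues, driver state) that is not captured here.
struct MachineState {
    std::array<uint64_t, 16> regs{};
    uint64_t pc = 0;
    std::vector<uint8_t> memory;
};

// One architectural step of a made-up single-instruction ISA:
// add the byte at pc into reg0, then advance pc.
void step(MachineState& s) {
    s.regs[0] += s.memory[s.pc];
    s.pc = (s.pc + 1) % s.memory.size();
}

int main() {
    MachineState m;
    m.memory = {1, 2, 3, 4};
    for (int i = 0; i < 3; ++i) step(m);

    MachineState snapshot = m;                    // the entire visible state, nothing hidden
    for (int i = 0; i < 5; ++i) step(snapshot);   // "replay elsewhere"
    return 0;
}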

Now consider heterogeneous systems. The endpoint may be a PCIe-attached device, or an ISA extension integrated into the SoC alongside the CPU, like Intel DSA, IAA, AVX, or AMX. The former is a standalone Von Neumann machine that behaves just as described above; the latter is integrated into the CPU and merely adds register state for those extensions. If the GPU wants to access memory that lives on the CPU side, the CPU has to offload the control flow and synchronize all the data flow before you can record and replay anything inside the GPU. The control flow is what we are familiar with: CUDA. It relies on the UVM driver on the CPU to carry out the offloaded control flow and move the memory, and when everything is done, UVM puts the data back in the right place on the CPU side, for example by leveraging DMA or the DSA engine in recent CPUs. Then we have to ask: is that enough? We see solutions like Ray that use exactly this style of data movement to virtualize certain GPU operations, such as epoch-wise snapshots of AI workloads, but the overhead is far too high.
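A small sketch of what that CPU-mediated flow looks like today, assuming CUDA unified memory on a single visible device: the host allocates managed memory and asks the driver to prefetch it, and the UVM driver migrates pages between CPU and GPU on demand, invisible to any record-and-replay tool sitting on either side alone.

#include <cuda_runtime.h>
#include <cassert>
#include <cstddef>

static void check(cudaError_t rc) { assert(rc == cudaSuccess); }  // terse error handling

int main() {
    const std::size_t n = 1 << 20;
    float* data = nullptr;

    // Managed allocation: one pointer valid on both CPU and GPU; the UVM
    // driver migrates pages on demand, outside any tracer's view.
    check(cudaMallocManaged(&data, n * sizeof(float)));
    for (std::size_t i = 0; i < n; ++i) data[i] = 1.0f;   // first touch on the CPU

    int dev = 0;
    check(cudaGetDevice(&dev));
    // Control flow stays on the CPU: it explicitly asks the driver to move
    // the pages to the GPU ahead of a kernel launch.
    check(cudaMemPrefetchAsync(data, n * sizeof(float), dev, 0));
    check(cudaDeviceSynchronize());

    // ... launch kernels that read/write `data` here ...

    check(cudaFree(data));
    return 0;
}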

That is where the Slug Architecture comes in. Every endpoint that has a caching/home agent (CHA), which in the diagram above means both the CPU and the GPU, is itself a Von Neumann machine. The difference is the green parts we add. Inside the CPU we already have mechanisms like Intel PT or Arm CoreSight to record the CPU's Von Neumann operations, and the GPU has nsys, with private protocols inside its profiler, to hack together a recording of the GPU's Von Neumann operations; that is fine inside the Slug Architecture. What Slug requires in addition is that every endpoint has an External Memory Controller (EMC) that does more than serve load and store instructions: it handles memory offload requests (data flow and control flow that is not just ld/st) and can monitor every memory request into or out of that Von Neumann endpoint, much like PEBS does, and it can be switched on and off in software. Likewise, inside the EMC of every traditional memory component, like CXL 3.0 switches, DRAM, and NAND, we have the same recording machinery. The question then becomes: if we decouple all the components, each with its own state, is it enough to record and replay only the EMC's CXL fabric state? I think the answer is yes. The current offloading of code, and the monitoring of which cycle does what, is event-driven; it is doable by leveraging the J extension, which exposes memory-operation bubbles at compile time, so you can stall the CPU's world and have it wait for the next event!
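A hypothetical sketch of what "the EMC records every request" could mean, with every name here (EmcRecord, RingLog, the fields) being illustrative rather than taken from any CXL spec: one fixed-size record per fabric-level request, appended to a ring buffer that software can drain and later re-issue in order.

#include <cstdint>
#include <cstddef>
#include <vector>

// Hypothetical EMC-side trace: one record per fabric-level request.
// Replaying the fabric then means re-issuing these records in order.
enum class EmcOp : uint8_t { Read, Write, CacheSnoop, Offload };

struct EmcRecord {
    uint64_t timestamp;    // cycle or wall-clock tick at the EMC
    uint64_t address;      // host physical / fabric address
    uint32_t size;         // bytes touched
    uint16_t src_endpoint; // which CHA-bearing endpoint issued it
    EmcOp    op;           // plain ld/st vs. richer offload traffic
};

class RingLog {
    std::vector<EmcRecord> buf_;
    std::size_t head_ = 0;
public:
    explicit RingLog(std::size_t capacity) : buf_(capacity) {}
    void append(const EmcRecord& r) {            // called on every request
        buf_[head_] = r;
        head_ = (head_ + 1) % buf_.size();
    }
    // Replay hook: hand records back in arrival order for re-issue.
    template <typename F>
    void replay(F&& reissue) const {
        for (std::size_t i = 0; i < buf_.size(); ++i)
            reissue(buf_[(head_ + i) % buf_.size()]);
    }
};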

It should also work without shared memory for the state: the CPU does not have to absorb every technology it needs. It can, for example, decouple the DSA into a separate UCIe-packaged RISC-V core that fetches data more efficiently, or into a UCIe-packaged AMX-style vector engine. These do not necessarily go through ordinary memory requests, but they can still be decoupled for record and replay by relying on their internal Von Neumann state plus the EMC monitoring the link state.

In a nutshell, the Slug Architecture is defined by leaving less residual data-flow and control-flow offloading than CUDA or Ray, by making virtualization and record & replay first-class priorities, and by staying lightweight, with no need for big kernel changes.

How does this compare with the networking view? There are surely SDN solutions chasing a similar vision, but they do not scale well in terms of metadata storage and switch-fabric limitations. CXL will resolve this across consumer electronics, data centers, and HPC. Our metadata can be serialized to distributed storage or to CXL memory pools for persistence, and then recorded and replayed on another, new GPU, for instance in an LLM workflow, with only the Intel PT (or GPU-side equivalent) overhead, which is 10% at most.
