Entertainment 🤮 - vickieGPT’s blog

May 31, 2025

How to Map the SIMT Model onto a Tenstorrent Device

1. Overview of the SIMT Model

In SIMT, a group of “threads” (a warp) executes the same instruction in lock-step, and divergence is handled through masking and predication. Each warp shares a single program counter, and a per-lane active mask disables individual lanes (threads) on divergent branches. Wikipedia
On GPUs (e.g., NVIDIA), programmers launch a kernel as a grid of thread blocks; each block contains multiple warps. The runtime hardware scheduler assigns warps to Streaming Multiprocessors (SMs), and within each SM all lanes advance in lock-step until divergence occurs. CliffsNotes
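
To make the masking idea concrete, here is a minimal Python sketch of one warp running `if x < threshold: x*2 else: x+100`. It is a toy model, not how any real GPU is implemented: `run_warp` and its pass counter are names made up for illustration. The point it shows is that a diverged warp pays for both sides of the branch, while a uniform warp pays for one.

```python
def run_warp(values, threshold):
    """Emulate one warp executing `if x < threshold: x*2 else: x+100`."""
    lanes = len(values)
    out = [None] * lanes
    # Every lane evaluates the condition; the result is the active mask.
    mask = [v < threshold for v in values]
    passes = 0
    # Pass 1: the "then" side runs with only the mask=True lanes enabled.
    if any(mask):
        for lane in range(lanes):
            if mask[lane]:
                out[lane] = values[lane] * 2
        passes += 1
    # Pass 2: the "else" side runs with the complementary lanes enabled.
    if not all(mask):
        for lane in range(lanes):
            if not mask[lane]:
                out[lane] = values[lane] + 100
        passes += 1
    return out, passes


uniform, uniform_passes = run_warp([1, 2, 3, 4], threshold=10)
diverged, diverged_passes = run_warp([1, 20, 3, 40], threshold=10)
print(uniform, uniform_passes)    # one pass: no lane took the else side
print(diverged, diverged_passes)  # two passes: both sides were executed
```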

2. Tenstorrent’s Many-Core Architecture

Tensix Cores and Vector Units
- Each Tensix core combines small RISC-V control processors with local SRAM, a matrix unit, and a vector unit (called the Vector Processing Unit, or VPU, in this post; Tenstorrent's own documentation calls it the SFPU). There is no built-in concept of “warps” across multiple cores; each core fetches its own instruction stream. EE Times Martin's website/blog thingy
- Grayskull (the first generation) has a 64-lane VPU with 19-bit floating point; Wormhole and Blackhole narrow this to 32 lanes but widen each lane to 32-bit floating point. Because the RISC-V cores are not inherently “SIMT-aware,” programmers rely on explicit masking within the VPU to emulate lock-step execution at the lane level. Martin's website/blog thingy
Spatial Tiling and Core Grid
- Tenstorrent devices are built as a 2D grid of Tensix cores connected by a high-bandwidth, low-latency on-chip network (NoC). Each core has its own slice of local (L1) SRAM and reaches DRAM and other cores' L1 over the NoC; there is no hardware-managed shared L2 cache. EE Times docs.tenstorrent.com
- Unlike CUDA’s implicit warp scheduling, Tenstorrent programs must explicitly partition work across cores (often via TT-Metalium or TT-MLIR). The compiler then places compute operators (e.g., matrix multiplies) onto specific cores according to a logical “core grid” (e.g., a 7×1 grid of cores for one operator). docs.tenstorrent.com

3. Why SIMT → Tenstorrent Requires Rethinking

No Hardware Warp Scheduler
There is no hardware unit that automatically groups threads into warps or handles divergence scheduling across multiple Tensix cores. Each Tensix core operates independently unless the program explicitly coordinates them. EE Times Hacker News
Divergence Is Explicit
In SIMT, divergence is hidden behind hardware masks; on Tenstorrent, one must set mask registers manually to disable lanes within the VPU when conditional branches occur. There is no implicit warp-level stack. Martin's website/blog thingy EE Times
Memory Hierarchy Differences
GPUs expose shared and global memory with well-understood semantics (e.g., `__shared__` in CUDA). Tenstorrent’s cores each have local scratch (L1), with explicit data-movement (DMA-style) transfers to fetch data from DRAM into local SRAM. One must orchestrate when and how each core reads and writes memory. docs.tenstorrent.com

4. Strategy 1: Emulate Warp-Level Execution Inside a Single Tensix Core

Lane-Level Parallelism via Vector Instructions
- Map one GPU warp (e.g., 32 or 64 threads) directly onto the VPU lanes within a single Tensix core. Each vector instruction then operates on all lanes at once: vector registers (e.g., v0–v31) hold one value per thread, and mask registers (e.g., m0) select which lanes an instruction affects. Martin's website/blog thingy GitHub
- Whenever a SIMT thread block would have launched a 1D warp of N threads, rewrite that kernel loop so that each loop iteration populates all N lanes in a vector register (plus any required masking bits for divergence).
Handling Divergence with Predication
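- Use the VPU’s mask (predicate) state, written vmask here, to enable or disable specific lanes when encountering an if condition. In CUDA this happens implicitly; on Tenstorrent you must express it yourself. In illustrative pseudo-assembly (these mnemonics are placeholders, not actual Tensix instructions):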
```
vfcmpeq vmask, vcond, 0          // set mask where condition is false
vmov vout[vmask] = vfalse_value   // masked move: only lanes where mask=1 are updated
```
- On divergent branches, split execution paths into lane-active and lane-inactive sets, updating masks accordingly. This is manual predication rather than hardware-driven. Martin's website/blog thingy EE Times

Example: Vector Add with Divergence

// Suppose GPU kernel: if (threadIdx.x < N/2) out[i] = A[i]+B[i];
// else out[i] = A[i]-B[i];
// Illustrative pseudo-assembly; vector lanes = warp size (e.g., 32).
load_vector vA, [baseA]            // loads 32 elements of A
load_vector vB, [baseB]            // loads 32 elements of B
set_vector_lane_idx vidx           // each lane holds its index 0..31
vcmpoge vmask, vidx, (N/2)         // mask=1 in lanes where idx >= N/2
vadd vout, vA, vB [~vmask]         // add where mask=0 (idx < N/2)
vsub vout, vA, vB [vmask]          // sub where mask=1 (idx >= N/2)
store_vector [baseOut], vout

In the above:

vcmpoge sets a lane's mask bit to 1 where idx >= N/2, i.e., where the else branch applies.
vadd … [~vmask] writes the sum only to lanes where vmask=0.
vsub … [vmask] writes the difference only to lanes where vmask=1.
Because the two masks are complementary, every lane of vout is written exactly once, so no separate merge step is needed. Merging two temporaries with a bitwise OR would be correct only if the inactive lanes of each temporary had been zeroed first.
Martin's website/blog thingy GitHub
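
The same masked add/subtract pattern can be checked in plain Python. This is only an emulation of the lane semantics under the assumptions of the example (a masked instruction leaves inactive lanes untouched); `masked_op` and `divergent_kernel` are hypothetical helper names, not Tenstorrent APIs.

```python
def masked_op(dst, a, b, mask, op):
    """Write op(a[i], b[i]) into lane i only where mask[i] is True."""
    return [op(x, y) if m else d for d, x, y, m in zip(dst, a, b, mask)]


def divergent_kernel(a, b, n):
    lanes = len(a)
    idx = list(range(lanes))                    # per-lane index 0..lanes-1
    mask = [i >= n // 2 for i in idx]           # True where the else branch applies
    out = [0] * lanes
    inverse = [not m for m in mask]
    out = masked_op(out, a, b, inverse, lambda x, y: x + y)  # idx <  n/2
    out = masked_op(out, a, b, mask, lambda x, y: x - y)     # idx >= n/2
    return out


a = [10, 20, 30, 40, 50, 60, 70, 80]
b = [1, 2, 3, 4, 5, 6, 7, 8]
result = divergent_kernel(a, b, n=8)
print(result)  # first half holds sums, second half holds differences
```

Both masked operations run over the full vector, which is why divergence costs the sum of both paths even though each lane keeps only one result.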

Pros & Cons of Single-Core SIMT Emulation
- Pros:
  - Reuses the VPU’s 32/64 lanes to mimic warp parallelism.
  - No need to coordinate across multiple cores for simple kernels.
- Cons:
  - Divergence overhead is higher than real GPU SIMT because predication and merging must be explicit.
  - Performance varies with VPU width, which differs across chip generations (Grayskull vs. Wormhole), so a kernel must either detect the width at run time or be recompiled for each target. Martin's website/blog thingy

5. Strategy 2: Map “Warps” Across Multiple Tensix Cores

Partition the Thread-Block Over Multiple Cores
- Instead of mapping an entire warp into one core’s VPU, split the warp into sub-groups across adjacent cores in the 2D grid. For instance, treat a 32-thread warp as 4 lanes on each of 8 cores, or 8 lanes on each of 4 cores. Each core then uses only part of its 32- or 64-lane VPU, so the split makes sense only when something other than lane count (for example, per-core L1 capacity) is the bottleneck. docs.tenstorrent.com
- Each core executes the same instruction stream (SPMD style) but uses only its local lanes. To keep the cores in lock-step, broadcast the control-flow decisions (the “program counter”) via explicit barriers or very fast mesh synchronization.
Explicit Synchronization & Broadcast
- In CUDA, warp divergence is handled by an internal stack and warp-level barrier. On Tenstorrent, you must insert an explicit core-level barrier after the divergence test. For example, each core computes its local boolean predicate; then all cores exchange one bit of information (e.g., via a reduction or broadcast instruction) to decide if any lane took the “true” path. Only after sharing this single bit can every core agree on which sub-vector to execute next. EE Times Hacker News
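- Cross-core synchronization in Metalium is typically built from semaphores in L1 that cores update and poll over the NoC, rather than from a single hardware barrier instruction. A “rendezvous” or “barrier” primitive can be layered on top of them so that it blocks until all designated cores have reached the same barrier ID. Use this to synchronize control flow across cores. EE Times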
Memory Layout Across Cores
- Distribute input arrays (A, B, …) such that each core reads a contiguous chunk from DRAM into its L1 scratch via DMA. This is analogous to CUDA’s coalesced loads, but done explicitly by computing each core’s tile coordinates in the global index space. docs.tenstorrent.com docs.tenstorrent.com
- After computing, store each core’s partial outputs back to DRAM using DMA, again using explicit address calculations to avoid bank conflicts.

Example: Vector Dot-Product Over Four Cores
Imagine you want to compute a dot product of two length-1024 vectors using a warp of 32 “threads.” On Tenstorrent, choose 4 cores and use 8 lanes of each core’s VPU.

Tile Assignment:
- Core (0,0) processes lanes 0–7 (indices 0..7).
- Core (0,1) processes lanes 8–15.
- Core (0,2) processes lanes 16–23.
- Core (0,3) processes lanes 24–31.

Kernel Sketch on Each Core:

// Illustrative pseudocode. Each core owns warpSize / numCores = 32 / 4 = 8 lanes.
// Cores (0,0)..(0,3) differ in their column index, so the base uses core_col.
baseIdx = core_col * 8
vzero vAcc                       // Clear the 8-lane accumulator
// Loop over 1024/32 = 32 iterations to cover the full vector:
for (i = 0; i < 32; ++i) {
  globalIdx = baseIdx + i * 32   // Each iteration jumps by warp size
  dma_load vA, [A + globalIdx]   // Load 8 elements into vector register
  dma_load vB, [B + globalIdx]   // Load 8 elements
  vfmadd vAcc, vA, vB, vAcc      // vAcc += vA * vB, lane by lane
}
// Tree-reduce within the VPU lanes to get a single scalar in lane 0:
vreduce_sum vSum, vAcc
// Blocking barrier plus cross-core reduction of the four partial sums:
barrier_and_reduce_sum globalAcc, vSum
// After the barrier every core holds the final result; only core (0,0) stores it.
if (core_row == 0 && core_col == 0) {
  store [DotResult], globalAcc
}

vfmadd: fused multiply-add applied lane by lane (vAcc = vA * vB + vAcc).
vreduce_sum: tree-reduce across the 8 lanes within the VPU.
barrier_and_reduce_sum: a hypothetical primitive (not an existing Metalium call) that sums vSum from all participating cores, (0,0) through (0,3), and broadcasts the final 32-bit scalar to each core’s local register. EE Times docs.tenstorrent.com
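
The index arithmetic in this example is easy to get wrong, so here is a small Python model that checks it. It assumes the layout described above (4 cores, 8 lanes each, stride of 32 per iteration); `core_partial` and `dot_product` are illustrative names, and the final `sum` stands in for the hypothetical cross-core reduction.

```python
WARP_SIZE = 32
NUM_CORES = 4
LANES_PER_CORE = WARP_SIZE // NUM_CORES  # 8 lanes of the warp per core


def core_partial(a, b, core_col):
    """Partial dot product computed by the core in column `core_col`."""
    base = core_col * LANES_PER_CORE
    acc = [0] * LANES_PER_CORE                 # per-lane accumulators
    for i in range(len(a) // WARP_SIZE):       # 1024 / 32 = 32 iterations
        start = base + i * WARP_SIZE           # jump by the warp size
        for lane in range(LANES_PER_CORE):
            acc[lane] += a[start + lane] * b[start + lane]
    return sum(acc)                            # in-core lane reduction


def dot_product(a, b):
    partials = [core_partial(a, b, c) for c in range(NUM_CORES)]
    return sum(partials)                       # stand-in for the cross-core reduce


a = list(range(1024))
b = [2] * 1024
result = dot_product(a, b)
print(result == sum(x * y for x, y in zip(a, b)))  # every element covered once
```

Each element index is visited by exactly one (core, lane, iteration) triple, which is what makes the four partial sums add up to the full dot product.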

Pros & Cons of Multi-Core SIMT Mapping
- Pros:
  - Spreads the warp across multiple VPUs to use all cores, not just one.
  - When divergence is rare, cores can keep running largely in lock-step with minimal barrier overhead.
- Cons:
  - Each divergence requires a full inter-core barrier and mask recalculation—costlier than intra-warp predication on a real GPU.
  - Tightly coupling control flow across cores may reduce spatial reuse if one lane group diverges heavily.
  - Code complexity increases: one must explicitly orchestrate tile ownership, synchronization, and mask calculation for each branching point.

6. Strategy 3: Adopt Native Spatial Programming (Avoid SIMT Semantics Altogether)

Reframe Problem as Dataflow Over Core Grid
- Instead of forcing a SIMT mindset, decompose the algorithm into independent “operators” and pipeline them across cores. For example, for a convolution kernel, assign each 2D tile of the output to a small sub-grid of cores (e.g., a 2×2 block). Each core handles only its tile’s multiplications and reductions, communicating partial sums to neighbors via the mesh network. GitHub docs.tenstorrent.com
- Tenstorrent’s TT-MLIR dialect natively understands meshShape and tensor layouts—so rather than thinking “threads in a warp,” think “operators on a 2D grid.”
Use Metalium for Custom Kernels
- Write kernels in Metalium that directly issue vector and scalar instructions to each core. There is no warp abstraction; each core runs its own code path. If occasional synchronization is needed, insert explicit barrier() calls. EE Times GitHub
- Use BUDA’s performance analyzer to visualize how different operators tile onto the core grid, then apply compiler “placement overrides” to fix any hot spots or data reblocking inefficiencies. docs.tenstorrent.com YouTube
Advantages of Native Spatial Mapping
- No need to emulate warp-level divergence; each core can naturally follow its own control flow and only coordinate when data dependencies demand it.
- Memory and data movement are explicit, so there is greater predictability in latency and bandwidth usage.
- The architecture shines for workloads that are naturally expressible as 2D/3D tensor partitions (CNNs, transformers).
When to Avoid SIMT-Style Emulation
- If your kernel has highly irregular control flow (e.g., graph algorithms, tree traversals), forcing warp semantics will introduce excessive barriers and mask management. In such cases, write each core’s code to handle a subset of elements independently and only synchronize at the end of large phases.

7. Putting It All Together: A Practical Roadmap

Choose Your Level of Abstraction
- High-Level (TT-MLIR/TT-NN): Let the Tenstorrent compiler decide how to map tensor operations to the core grid. Simply describe your computation in MLIR or use TT-NN for standard layers (convs, matmuls). docs.tenstorrent.com
- Metalium (Bare-Metal): Write custom kernels when you need fine-grained control (e.g., specialized GEMM microkernels). Decide if you really need SIMT semantics or if a dataflow approach is better. EE Times
If You Must Emulate SIMT
- Within One Core (Strategy 1): Keep “warps” within a VPU. Use vector masking for divergence. Recompile kernels for each VPU width (32-lane vs. 64-lane).
- Across Multiple Cores (Strategy 2): Partition warps across a row or column of cores. Add explicit barriers to synchronize control flow. Distribute memory tiles carefully to ensure coalesced DMA.
Testing & Performance Tuning
- Use BUDA’s “placement report” to verify that your operators are balanced (i.e., each core does similar work) and that reblocking between producers and consumers is minimized. docs.tenstorrent.com
- Profile to check for stalls caused by excessive inter-core barriers. If barriers dominate, consider switching to a fully dataflow pattern where each core is more autonomous.
- Tune vector unrolling, tile sizes, and the shape of core grids to maximize local reuse and minimize mesh traffic.

8. Summary

Tenstorrent does not provide a native SIMT warp scheduler; instead, you must either:
1. Emulate warp-level parallelism inside a single VPU (using vector mask registers), or
2. Partition a warp across multiple cores and manage synchronization explicitly, or
3. Re-architect your algorithm in spatial/dataflow terms so that each core is given a distinct tile of work with minimal branching.
Which approach to pick depends on your application’s control-flow characteristics and performance priorities. For highly regular kernels (dense linear algebra, convolution), vector-lane emulation (Strategy 1) or even better, pure spatial tiling (Strategy 3) typically yields the best performance. For irregular kernels where warp divergence would be severe, avoid SIMT emulation and instead write independent code per core.

By following this roadmap—carefully selecting between single-core vector masking or multi-core synchronization, or better yet, native spatial tiling—you can successfully map a SIMT-style kernel onto Tenstorrent hardware and exploit its massive 2D core mesh for high throughput.

April 19, 2025

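Operating System Design for CXL 3.0: Challenges and Opportunities

1. Introduction

Data center architecture is undergoing a profound transformation, driven by the explosive growth of emerging workloads such as artificial intelligence (AI), machine learning (ML), and large-scale data analytics [1]. These workloads place unprecedented demands on compute, memory capacity, and bandwidth, pushing data centers toward heterogeneous computing and disaggregated infrastructure. Traditional server architectures and interconnects such as PCI Express (PCIe) increasingly fall short of these demands. CPU core counts have grown far faster than per-core memory bandwidth and capacity, producing the so-called “memory wall,” where system performance is severely constrained by memory access speed and capacity [5]. In addition, PCIe is primarily an I/O interconnect with no native cache-coherency support, which limits efficient, low-latency data sharing between CPUs, accelerators, and expansion memory [18].

Compute Express Link (CXL) emerged against this backdrop. CXL is an open, cache-coherent interconnect standard built on the PCIe physical layer and designed to break the bottlenecks of traditional architectures [1]. Its core goal is to provide low-latency, high-bandwidth connectivity and to maintain memory coherency between the CPU and attached devices (accelerators, memory buffers, smart I/O devices), enabling efficient resource sharing, memory expansion, memory pooling, and memory sharing [1].

The release of the CXL 3.0 specification marks an important milestone for the technology [1]. Building on earlier generations, it significantly strengthens fabric capabilities, switching, memory sharing, and peer-to-peer communication, laying the foundation for larger, more flexible, and more efficient disaggregated and composable systems [1]. These powerful features also bring new challenges and opportunities for operating system design. As the bridge between hardware resources and applications, the operating system must adapt and innovate to realize the full potential of CXL 3.0.

CXL, and in particular the fabric, memory-sharing, and P2P features introduced in CXL 3.0, is more than an extension of or a speed bump over the PCIe bus. It signals a fundamental shift in computer architecture from processor-centric to memory-centric, and from intra-node resource management to resource management across a fabric [3]. This shift requires OS designers to rethink core mechanisms such as memory management, resource scheduling, I/O handling, and the security model. Patching an existing operating system may fail to exploit CXL's advantages and may even create performance bottlenecks. The operating system therefore needs a paradigm shift to match the new hardware architecture.

This report examines the concrete impact of CXL 3.0 on operating system design and analyzes the adaptation and restructuring required in memory management, resource scheduling, the I/O subsystem, device management, and security mechanisms. Drawing on the key features of CXL 3.0, it analyzes the resulting performance benefits and challenges, surveys current academic and industrial research and implementation work on CXL-aware operating systems (especially support in the Linux kernel), and looks ahead to the long-term influence of CXL and similar fabric technologies on future OS architecture. The report is organized around the eight key questions posed by the user and aims to provide comprehensive, in-depth technical insight for understanding and designing next-generation operating systems for CXL 3.0.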

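2. CXL 3.0 Technical Deep Dive

To understand the far-reaching impact of CXL 3.0 on operating system design, we first need to understand its key technical features and how they evolved from earlier versions.

2.1 Evolution from CXL 1.x/2.0

The CXL standard has iterated rapidly since its first release in 2019.

CXL 1.x (1.0/1.1): The initial versions focused on point-to-point connections between processors and accelerators or memory expansion modules [25]. They defined three protocols, CXL.io, CXL.cache, and CXL.mem, supporting devices that cache host memory (Type 1 devices), hosts that access device memory (Type 3 devices), and devices that do both (Type 2 devices) [25]. CXL 1.1 was used mainly for memory expansion, letting the CPU access CXL memory devices attached to PCIe slots and easing server memory capacity bottlenecks [9]. Connections at this stage were direct, with no support for switching or pooling.
CXL 2.0: Released in 2020, it introduced the key capability of single-level switching [5]. This lets a single CXL 2.0 host connect to multiple CXL 1.x/2.0 devices below a switch and, more importantly, enables memory pooling [11]. Through CXL switches and Multi-Logical Devices (MLDs), where one physical device can be partitioned into up to 16 logical devices, memory resources can be shared among multiple hosts (although a logical device can be assigned to only one host at any given time) [5]. CXL 2.0 also introduced Global Persistent Flush and link-level Integrity and Data Encryption (IDE) [5]. Its bandwidth, however, is still based on PCIe 5.0 (32 GT/s), and switching is limited to a single level within a tree topology [5].
CXL 3.0: Released in 2022, CXL 3.0 is a major upgrade aimed at further improving scalability, flexibility, and resource utilization [1]. Its key advances include:
Doubled bandwidth: Based on the PCIe 6.0 physical layer and PAM-4 signaling, the data rate rises to 64 GT/s, doubling theoretical bandwidth (for example, an x16 link reaches 256 GB/s of raw bidirectional bandwidth) [1].
Zero added latency: Despite the doubled rate, optimizations such as the LOpt flit mode [2] keep link-layer added latency unchanged relative to CXL 2.0 [1].
Fabric capabilities: The fabric concept is introduced, with support for multi-level switching and non-tree topologies (such as mesh, ring, and spine/leaf), greatly expanding how systems can be connected [1].
Enhanced memory pooling and sharing: On top of CXL 2.0 pooling, true memory sharing is added, letting multiple hosts access the same memory region simultaneously and coherently through hardware coherency mechanisms [1].
Enhanced coherency: A new symmetric/enhanced coherency model is introduced, in particular the Back-Invalidation (BI) mechanism, which replaces CXL 2.0's bias-based coherency and improves the efficiency and scalability of Host-managed Device Memory (HDM) [2].
Peer-to-peer (P2P) communication: CXL devices can communicate directly within the fabric without relaying through the host CPU [1].
Backward compatibility: CXL 3.0 is fully backward compatible with CXL 2.0, 1.1, and 1.0 [1].
CXL 3.1/3.2 follow-ons: CXL 3.1 (November 2023) and CXL 3.2 (December 2024) continue to build on 3.0. CXL 3.1 focuses on fabric scalability (such as PBR extensions), security (introducing the TEE Security Protocol, TSP, for confidential computing), and memory expander functionality (such as metadata support and RAS enhancements) [22]. CXL 3.2 further improves monitoring and management of memory devices (such as the CXL Hot-Page Monitoring Unit, CHMU, for memory tiering), enhances functionality for the OS and applications, and extends TSP security [23]. These later versions fall outside the core scope of this report (CXL 3.0), but they indicate where CXL is heading and matter for understanding the future of the CXL ecosystem.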

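2.2 Key Architectural Features in Detail

The following sections look closely at the core architectural features introduced in CXL 3.0 and their effect on system design.

Fabric capabilities and multi-level switching:
One of the most revolutionary changes in CXL 3.0 is the introduction of fabric capabilities [1]. This removes the tree-structure restriction of traditional PCIe and allows more flexible and scalable network topologies such as mesh, ring, fat tree, or spine/leaf [4]. The flexibility comes from multi-level switching: CXL switches can be cascaded, so a switch can connect to another switch rather than only to hosts and endpoint devices [1]. This contrasts sharply with CXL 2.0, which supports only a single switching level [5].
To manage such a large and complex fabric, CXL 3.0 introduces Port Based Routing (PBR), a scalable addressing scheme that in theory supports up to 4096 nodes [2]. These nodes can be host CPUs, CXL accelerators (with or without memory, i.e., Type 1/2 devices), CXL memory devices (Type 3 devices), Global Fabric Attached Memory (GFAM) devices, and even traditional PCIe devices [2]. In addition, CXL 3.0 allows each host root port to connect multiple devices of different types (Type 1/2/3), further increasing topological flexibility [5]. Multi-headed devices are another CXL 3.0 fabric feature, allowing a single device (especially a memory device) to connect directly to multiple hosts or switch ports [1].
Memory pooling and sharing:
CXL 2.0 introduced memory pooling, which treats CXL-attached memory as a fungible resource that can be assigned flexibly to different hosts on demand [2]. This is implemented mainly through MLDs: one physical device is partitioned into multiple logical devices (LDs), each assigned to one host at a time [5].
CXL 3.0 builds on this with memory sharing [1]. Unlike pooling, sharing lets multiple hosts access the same region of CXL memory simultaneously and coherently [2]. This is achieved through CXL 3.0's hardware coherency mechanisms (described below), which ensure that all hosts see the latest data without software coordination [2].
Global Fabric Attached Memory (GFAM) is the key device type for large-scale memory sharing and pooling in CXL 3.0 [2]. A GFAM device resembles a Type 3 device, but it can be accessed flexibly via PBR by multiple nodes in the fabric (up to 4095), forming a large shared memory pool that decouples memory resources from processing units [2].
Coherency:
One of CXL's core strengths is its ability to maintain memory coherency [1]. This is done through the CXL.cache and CXL.mem protocols [4]. CXL.cache lets devices (such as Type 1/2 accelerators) coherently cache host memory, while CXL.mem lets the host coherently access device memory (such as the memory of Type 2/3 devices).
CXL 3.0 introduces an enhanced/symmetric coherency mechanism that replaces the less efficient bias-based coherency of CXL 2.0 [2]. The key is the Back-Invalidation (BI) protocol [2]. In CXL 2.0, if a device modified its Host-managed Device Memory (HDM), it could not directly invalidate copies in host CPU caches and had to rely on a complex bias-flipping mechanism. With BI in CXL 3.0, a Type 2/3 device whose memory (HDM-D or HDM-DB) has been modified can proactively send invalidation requests, through the host, to other devices or to the host itself that cached the data, thereby maintaining coherency [2]. This lets the device implement a snoop filter and manage and map larger HDM capacities more efficiently [2]. The symmetry also lays the groundwork for hardware-managed memory sharing [2].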
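
The back-invalidation idea can be illustrated with a toy Python model. It is a deliberately simplified sketch, not the CXL.mem protocol: the `DeviceMemory` and `Host` classes, their fields, and the single-writer policy are all assumptions made for illustration. What it shows is the role of a device-side snoop filter: the device remembers which hosts may hold a copy of a line and invalidates those copies before a new value becomes visible.

```python
class Host:
    def __init__(self, name):
        self.name = name
        self.cache = {}                 # line address -> cached value


class DeviceMemory:
    """Toy memory device with a snoop filter tracking host-cached lines."""

    def __init__(self):
        self.data = {}
        self.sharers = {}               # line address -> set of host names

    def read(self, host, addr):
        # Record the reader in the snoop filter, then return the line.
        self.sharers.setdefault(addr, set()).add(host.name)
        value = self.data.get(addr, 0)
        host.cache[addr] = value
        return value

    def write(self, writer, addr, value, hosts):
        # Back-invalidate every other host that may hold a stale copy.
        stale = sorted(self.sharers.get(addr, set()) - {writer.name})
        for name in stale:
            hosts[name].cache.pop(addr, None)
        self.data[addr] = value
        self.sharers[addr] = {writer.name}
        writer.cache[addr] = value
        return stale


hosts = {name: Host(name) for name in ("h0", "h1", "h2")}
dev = DeviceMemory()
dev.read(hosts["h0"], 0x40)
dev.read(hosts["h1"], 0x40)
invalidated = dev.write(hosts["h2"], 0x40, 7, hosts)
print(invalidated)                      # hosts whose copies were invalidated
print(dev.read(hosts["h0"], 0x40))      # h0 re-reads and sees the new value
```

The snoop filter is what keeps invalidation traffic proportional to the number of actual sharers rather than to the number of hosts in the fabric.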
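Peer-to-peer (P2P) communication:
CXL 3.0 enables direct P2P communication between devices, so data transfers do not have to pass through the host CPU, reducing latency and CPU overhead [1]. P2P communication takes place within a CXL-defined Virtual Hierarchy (VH), the set of associated devices that form a coherency domain [5].
CXL 3.0 uses Unordered I/O (UIO) flows in the CXL.io protocol for P2P access to device memory (HDM-DB) [5]. UIO borrows from PCIe and allows the strict PCIe transaction ordering rules to be relaxed in certain cases to improve performance and enable P2P [30]. When the target memory (HDM-DB) of a P2P access may be cached by the host or by other devices, the target device (Type 2/3) issues a BI request to the host over CXL.mem to guarantee I/O coherency, ensuring that any conflicting copy cached on the host side is invalidated [5].
Bandwidth and latency:
As noted above, CXL 3.0 raises the link rate to 64 GT/s based on the PCIe 6.0 PHY [1]. To preserve signal integrity at the higher speed, it uses PAM-4 modulation and forward error correction (FEC) [2]. CXL 3.0 uses a 256-byte flit (Flow Control Unit) format [2], unlike the 68-byte flit of CXL 1.x/2.0.
The “zero added latency” claim [1] means that, compared with CXL 2.0 (32 GT/s), CXL 3.0 (64 GT/s) adds no extra latency at the link layer itself. CXL 3.0 even offers a latency-optimized (LOpt) flit mode that halves the CRC granularity (to 128 bytes) to cut store-and-forward overhead at the physical layer, saving 2-5 ns of link latency at the cost of some link efficiency and error tolerance [2]. This does not mean that end-to-end access latency for CXL memory is zero or equal to local DRAM. The CXL interconnect itself, any switch hops, and the CXL memory controller all add significant latency, typically tens to hundreds of nanoseconds more than a local DRAM access [12]. So although CXL 3.0 provides higher bandwidth, latency management remains a key challenge for the operating system.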
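
The bandwidth figure quoted earlier follows from simple arithmetic, sketched below in Python. The first function computes raw link bandwidth only (one bit per lane per transfer, ignoring FEC, CRC, and flit overhead). The second is a back-of-the-envelope average latency model; the 100 ns and 250 ns inputs are assumed values chosen for illustration, not measurements from the report.

```python
def raw_link_bandwidth_gb_per_s(rate_gt_per_s, lanes):
    """Raw one-direction bandwidth in GB/s, ignoring protocol overhead."""
    return rate_gt_per_s * lanes / 8      # 8 bits per byte


def average_latency_ns(dram_ns, cxl_ns, dram_fraction):
    """Average access latency when `dram_fraction` of accesses hit local DRAM."""
    return dram_fraction * dram_ns + (1 - dram_fraction) * cxl_ns


cxl2_x16 = raw_link_bandwidth_gb_per_s(32, 16)   # PCIe 5.0 rate
cxl3_x16 = raw_link_bandwidth_gb_per_s(64, 16)   # PCIe 6.0 rate
cxl3_x16_bidirectional = 2 * cxl3_x16

print(cxl2_x16, cxl3_x16, cxl3_x16_bidirectional)

# Assumed latencies: 100 ns local DRAM, 250 ns CXL-attached memory.
for fraction in (1.0, 0.9, 0.5):
    print(fraction, round(average_latency_ns(100, 250, fraction), 1))
```

Even with 90% of accesses served from DRAM, the assumed 150 ns penalty on the remaining 10% raises the average by roughly 15 ns, which is why data placement matters more than link bandwidth for latency-sensitive workloads.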

The table below summarizes the key feature differences between the major CXL versions:

Table 1: CXL feature comparison (versions 1.x, 2.0, 3.x)

Feature	CXL 1.0 / 1.1 (2019)	CXL 2.0 (2020)	CXL 3.0 (2022)	CXL 3.1/3.2 (2023/2024)
Max link rate	32 GT/s (PCIe 5.0)	32 GT/s (PCIe 5.0)	64 GT/s (PCIe 6.0)	64 GT/s (PCIe 6.x)
Flit size	68B	68B	68B & 256B (standard & LOpt)	68B & 256B
Switching levels	Not supported	Single-level	Multi-level	Multi-level
Memory pooling	Not supported	Supported (via MLD)	Enhanced (fabric, GFAM)	Enhanced (e.g., DCD)
Memory sharing	Not supported	Not supported (hardware coherency)	Supported (hardware coherency)	Supported
Coherency mechanism	CXL.cache/mem	CXL.cache/mem (bias-based)	CXL.cache/mem (enhanced/symmetric, BI)	Enhanced/symmetric, BI
P2P communication	Not supported	Not supported	Supported (UIO + BI)	Enhanced (e.g., CXL.mem P2P)
Fabric topology	Point-to-point	Tree-based	Non-tree (mesh, ring, etc.)	Enhanced fabric (PBR scale-out)
Max nodes	2	Limited (by single-level switch ports)	4096 (via PBR)	4096+ (PBR scale-out)
Multiple devices per root port	Not supported	Not supported	Supported (Type 1/2)	Supported
Link encryption (IDE)	Not supported	Supported (CXL IDE)	Supported (CXL IDE)	Supported (CXL IDE)
Confidential computing	Not supported	Not supported	Not supported	Supported (TSP)
Hot-page monitoring	Not supported	Not supported	Not supported	Supported (CHMU)
Backward compatibility	-	Compatible with 1.x	Compatible with 2.0, 1.x	Compatible with 3.0, 2.0, 1.x

Source: [1]

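The fabric, memory-sharing, and P2P features of CXL 3.0 do not exist in isolation; they depend on one another and together make up its core value. The fabric architecture [1] is the infrastructure for large-scale memory pooling and sharing [1], supporting flexible topologies and multi-level switching [1]. Memory sharing relies on CXL 3.0's enhanced hardware coherency mechanisms (such as BI) to guarantee data correctness [2]. P2P communication likewise benefits from the flexible routing the fabric provides and, when accessing shared device memory (HDM-DB), needs UIO and BI to work together to maintain coherency [5]. This interdependence means that the operating system must treat these features as a whole when designing management mechanisms and consider their interactions and dependencies, rather than handling any one of them in isolation. For example, managing memory sharing requires understanding the fabric topology and coherency rules, while managing P2P requires accounting for fabric routing and potential coherency effects.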

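3. Restructuring OS Memory Management for CXL 3.0

The memory pooling, sharing, and fabric capabilities of CXL 3.0 pose serious challenges to the traditional OS memory management subsystem while also offering new optimization opportunities. The operating system needs to fundamentally redesign its memory management strategy to fit this new memory hierarchy and topology.

3.1 Integrating CXL Memory: NUMA/zNUMA Models and Latency

The operating system first needs to recognize and integrate CXL memory. The prevailing approach is to abstract CXL memory devices (especially Type 3 memory expanders) as CPU-less NUMA (Non-Uniform Memory Access) nodes, often called zNUMA (zero-core NUMA) nodes [27]. This abstraction lets CXL memory fit into existing OS memory management frameworks with relative ease, and applications can in principle access CXL memory the same way they access memory on a remote NUMA node [39].

The operating system discovers and interprets the topology and memory attributes of CXL devices through ACPI (Advanced Configuration and Power Interface) tables. The key tables are:

SRAT (System Resource Affinity Table): Defines the affinity between system physical address (SPA) ranges and NUMA nodes, including CXL zNUMA nodes [24].
CEDT (CXL Early Discovery Table): Provides CXL fabric topology information, including CXL host bridges (CHBs), switches, ports, and the connections between them. It also contains CXL Fixed Memory Window (CFMW) structures, which describe the HPA (Host Physical Address) windows pre-allocated by the platform for mapping CXL memory, along with their attributes [24].
HMAT (Heterogeneous Memory Attribute Table): Provides performance characteristics of the different memory domains (including local DRAM and CXL memory), such as read/write latency and bandwidth, helping the OS make better-informed memory placement decisions [24].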
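
On Linux, the result of this discovery is visible in sysfs, where each NUMA node appears as `/sys/devices/system/node/nodeN` with a `cpulist` file. The Python sketch below lists nodes whose `cpulist` is empty. It is a minimal illustration: the function name is made up, and a CPU-less node is only a candidate, since other memory types (for example persistent memory or HBM) can also show up without CPUs.

```python
import os
import re


def cpuless_numa_nodes(node_root="/sys/devices/system/node"):
    """Return sorted ids of NUMA nodes that expose memory but no CPUs."""
    found = []
    if not os.path.isdir(node_root):
        return found                      # not Linux, or sysfs not mounted
    for entry in os.listdir(node_root):
        match = re.fullmatch(r"node(\d+)", entry)
        if not match:
            continue                      # skip files such as "online"
        cpulist_path = os.path.join(node_root, entry, "cpulist")
        try:
            with open(cpulist_path) as handle:
                cpus = handle.read().strip()
        except OSError:
            continue
        if cpus == "":                    # empty cpulist means no local CPUs
            found.append(int(match.group(1)))
    return sorted(found)


if __name__ == "__main__":
    print(cpuless_numa_nodes())
```

The `node_root` parameter exists so the function can be pointed at a fake directory tree for testing on machines without CXL hardware.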

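Although the zNUMA model offers a way to integrate CXL memory, the latency characteristics of CXL memory differ markedly from those of conventional NUMA nodes. Accessing CXL memory usually incurs much higher latency than accessing local DRAM. The exact figures vary with the CXL device type, the connection (direct-attached, single-level switched, multi-level switched), the underlying memory medium, and system load. Studies and measurements report that CXL memory access can be 70-90 ns slower than local DRAM in small pooling scenarios [57], more than 180 ns slower for rack-scale pooling [57], and typically 2-3 times local DRAM latency [46], with measured values ranging from 140 ns to 410 ns or higher [12]. Some studies have also observed significant tail latency on CXL devices, where a small fraction of accesses take far longer than the average, which can severely affect latency-sensitive applications [104].

This latency gap makes traditional NUMA management policies that rely mainly on node distance (such as Linux's default NUMA Balancing) perform poorly in CXL environments, and unnecessary page migrations can even hurt performance [27]. For example, the NUMA hinting fault mechanism that NUMA Balancing depends on may fail or become inefficient with CXL [39]. The operating system therefore needs to go beyond the simple zNUMA abstraction and manage CXL memory at a finer granularity.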

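3.2 Advanced Memory Tiering Strategies

Given the significant performance gap between CXL memory and local DRAM, memory tiering has become the key strategy for managing CXL memory [12]. The core idea is to place frequently accessed “hot” data in the fast local DRAM tier and rarely accessed “cold” data in the larger but slower CXL tier, expanding memory capacity while minimizing the effect on application performance [12].

Efficient memory tiering has to solve two core problems: identifying hot and cold data accurately, and migrating data with low overhead.

Hotness identification (profiling):
Traditional methods: Many early or simple tiering systems rely on recency-based methods, such as the accessed bit in page tables. This is not accurate enough, because a recently accessed page is not necessarily a truly hot page, and with limited local DRAM capacity it can lead to wrong eviction decisions [120].
Improved methods: Frequency-based methods identify hot pages more accurately, but conventional frequency tracking (such as a counter per page) carries large memory and runtime overhead, especially when managing terabytes of memory [120].
OS-level techniques: The Linux kernel provides mechanisms such as periodically scanning the accessed bits of PTEs (Page Table Entries) or sampling with NUMA hint faults, but these are costly and may be unaware of LLC (Last-Level Cache) misses [27]. Hardware performance counters (for example via the perf tool or Intel TMA) give more precise information about CPU behavior, but mapping it directly to page hotness remains difficult [100].
Hardware assistance: To overcome the overhead and accuracy limits of OS-level profiling, researchers have proposed offloading profiling to hardware. The NeoMem project, for example, proposes integrating a NeoProf unit in the CXL device controller to monitor accesses to CXL memory directly and report page hotness statistics to the OS [96]. The CXL 3.2 specification also introduces the CHMU (CXL Hot-Page Monitoring Unit), which aims to standardize device-side hot-page tracking and give the OS more efficient hotness information [23]. FreqTier instead uses a probabilistic data structure (a counting Bloom filter) to approximate access frequency in software at low overhead [120].
Page migration:
Basic operations: Memory tiering moves pages between tiers. Promotion moves a hot page from the slow tier (CXL) to the fast tier (local DRAM); demotion moves a cold page from the fast tier to the slow tier [27].
Overhead and challenges: Page migration has a cost of its own, involving page table unmapping, data copying, and remapping [119]. Frequent or ill-judged migration can cause memory thrashing and reduce performance instead of improving it [101].
Optimization techniques: Researchers have proposed several ways to reduce migration overhead. Asynchronous migration moves migration off the application's critical execution path [119]. Transactional migration ensures that a migration is atomic [119]. Page shadowing (as used by the NOMAD system) keeps a copy in the slow tier after a page is promoted to the fast tier, so that when memory pressure in the fast tier later forces a demotion, the shadow copy can be used directly and the data copy is avoided [119]. FreqTier dynamically adjusts the intensity of tiering operations according to the application's memory access behavior, reducing unnecessary migration traffic and interference with the application [120].
Implementations and research:
TPP (Transparent Page Placement): Developed by Meta and partially merged into the Linux kernel (v5.18+), TPP is an OS-level transparent page placement mechanism [27]. It uses a lightweight reclaim mechanism to proactively demote cold pages to CXL memory, reserving headroom in local DRAM for newly allocated (usually hot) pages. It can also quickly promote misjudged or newly hot pages from CXL memory back to local DRAM while keeping sampling overhead and unnecessary migration low [27].
FreqTier: Uses frequency analysis based on hardware counters and a counting Bloom filter to identify hot pages accurately with low memory overhead, and adjusts migration intensity dynamically [120].
NeoMem: Proposes a hardware/software co-design that implements the NeoProf hardware profiling unit on the CXL device controller side to give the OS accurate, low-overhead hotness information [96].
NOMAD: Proposes non-exclusive memory tiering, using page shadowing and transactional migration to mitigate memory thrashing and migration overhead [119].
DAMON (Data Access MONitor): A general data access monitoring framework in the Linux kernel that can be used for memory management optimization. Recent patches propose adding DAMOS_MIGRATE_HOT/COLD actions to support DAMON-based memory tiering [130].
Intel Flat Memory Mode: A hardware-managed tiering scheme that manages data placement between local DRAM and CXL memory at cache-line granularity inside the memory controller (MC), transparently to the OS [24]. It simplifies things for the OS but lacks flexibility and may cause contention in multi-tenant environments [105].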
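
To show why a counting Bloom filter is attractive for frequency-based profiling, here is a small Python sketch. It is a generic counting Bloom filter with a minimum-of-counters estimate, not FreqTier's actual implementation; the class, its parameters, and the `hot_pages` helper are assumptions made for illustration. Memory use is fixed by `num_counters` regardless of how many pages are tracked, and the estimate can overcount on hash collisions but never undercounts.

```python
import hashlib


class CountingBloomFilter:
    def __init__(self, num_counters=1024, num_hashes=3):
        self.counters = [0] * num_counters
        self.num_hashes = num_hashes

    def _slots(self, page):
        # Derive num_hashes counter indices from the page number.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{page}".encode()).digest()
            yield int.from_bytes(digest[:8], "little") % len(self.counters)

    def record_access(self, page):
        for slot in self._slots(page):
            self.counters[slot] += 1

    def estimate(self, page):
        # The smallest counter is the tightest upper bound on the true count.
        return min(self.counters[slot] for slot in self._slots(page))


def hot_pages(accesses, threshold, num_counters=1024, num_hashes=3):
    """Return pages whose estimated access count reaches the threshold."""
    cbf = CountingBloomFilter(num_counters, num_hashes)
    for page in accesses:
        cbf.record_access(page)
    return sorted(p for p in set(accesses) if cbf.estimate(p) >= threshold)


trace = [1] * 50 + [2] * 3 + [3] * 40 + [4]
print(hot_pages(trace, threshold=10))
```

Because collisions only inflate estimates, a truly hot page is never missed; the cost of a collision is an occasional cold page promoted by mistake, which a tiering policy can tolerate.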

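3.3 Effects on Virtual Memory and Page Tables

The heterogeneous memory hierarchy introduced by CXL also creates new challenges for the virtual memory system and page table management.

Page table placement: In traditional NUMA systems and systems with NVMM (Non-Volatile Main Memory), it has already been observed that placing page table pages (PTPs) in a slower memory tier significantly increases page walk latency and hurts application performance, especially for large-memory applications with high TLB (Translation Lookaside Buffer) miss rates [131]. The latency of CXL memory makes this problem worse. If the OS allocates PTPs to CXL memory indiscriminately, address translation slows down severely.
Solution: The OS needs an explicit page table placement policy that treats PTPs differently from ordinary data pages and places them preferentially in the fastest tier (usually local DRAM) [131]. Even when local DRAM is under pressure, PTPs should not be evicted to CXL memory, or they should be migrated back as soon as DRAM space becomes available. Research such as Mitosis proposes transparently replicating and migrating page tables across NUMA nodes to mitigate NUMA effects on page walks, and similar ideas can be applied to CXL [131].
CXL shared memory and virtual memory: The hardware-coherent memory sharing introduced in CXL 3.0 [2] (or software-coherent sharing built on CXL 2.0 pooled memory [33]) allows different hosts, or different processes on the same host, to map and access the same physical memory region. This places new requirements on the virtual memory system:
Cross-domain mapping management: The OS must be able to establish virtual address mappings to the same CXL shared physical memory region for different hosts and processes.
Coherency maintenance: Although CXL 3.0 provides hardware coherency, the OS still has to ensure that mappings and permissions at the virtual memory level work together with the underlying hardware coherency state.
Address space management: In a shared-memory environment, virtual address space must be managed carefully to avoid conflicts, and effective synchronization primitives must be provided (possibly using CXL's support for atomic operations) [33].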

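3.4 OS Mechanisms: CXL Memory Pooling and Sharing

The operating system needs explicit mechanisms to support and manage CXL memory pooling and sharing.

Memory pooling (CXL 2.0+):
Resource discovery and allocation: The OS needs to interact with the Fabric Manager (FM) to discover available pool resources and request memory according to the needs of applications or virtual machines [5]. This involves understanding MLDs and integrating the assigned logical-device memory into the OS memory view (usually as a zNUMA node).
Dynamic capacity management: CXL 3.0/3.1 introduces Dynamic Capacity Devices (DCDs), which allow a device's available capacity to grow or shrink at run time without a reboot or reconfiguration [79]. The OS needs to work with the FM/orchestrator to handle such capacity changes smoothly, adjusting memory mappings and management structures.
Efficient allocation and release: The OS needs efficient mechanisms to manage memory allocated from the pool and to return it to the pool when it is no longer needed, so that resources are used efficiently [49].
Memory sharing (CXL 3.0+):
Shared region mapping: The OS needs to provide interfaces that let processes, or applications spanning hosts, map a specified CXL shared physical memory region.
Using hardware coherency: The OS should use the hardware coherency mechanisms of CXL 3.0 (such as Back-Invalidation) to simplify the shared-memory programming model and avoid complex software coherency protocols [2].
Contrast with CXL 2.0: Hardware-coherent sharing in CXL 3.0 must be distinguished from software-coherent sharing built on CXL 2.0 pooled memory [33]. The latter leaves more of the coherency burden to the OS or the application.
Interface design: The OS could extend existing IPC shared-memory interfaces (such as System V SHM and POSIX SHM) or borrow ideas from HPC models such as OpenSHMEM to provide access to CXL shared memory [33].
Performance versus coherency trade-off: Hardware coherency simplifies programming, but its protocol overhead (such as BI traffic and snoop filter lookups) can become a performance bottleneck, especially with large-scale sharing or high contention [73].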
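
As a point of reference for the interface-design item, this is what the existing POSIX-style shared-memory model looks like from Python's standard `multiprocessing.shared_memory` module. The sketch runs on a single machine with ordinary RAM and says nothing about how a CXL-backed region would actually be exposed; the `share_and_read` helper and the two "host" roles are assumptions made for illustration. What it shows is the name-based create/attach pattern that a CXL-aware extension might reuse.

```python
from multiprocessing import shared_memory


def share_and_read(payload: bytes) -> bytes:
    """Write `payload` through one mapping and read it through another."""
    # Role A: create a named region and write into it.
    region = shared_memory.SharedMemory(create=True, size=len(payload))
    try:
        region.buf[:len(payload)] = payload
        # Role B: attach to the same region by name and read it back.
        peer = shared_memory.SharedMemory(name=region.name)
        try:
            return bytes(peer.buf[:len(payload)])
        finally:
            peer.close()
    finally:
        region.close()
        region.unlink()       # remove the name once all users are done


if __name__ == "__main__":
    print(share_and_read(b"cxl-shared"))
```

On a single host the CPU caches keep both mappings coherent. Across hosts, that guarantee would have to come from CXL 3.0 hardware coherency or from a software protocol, which is exactly the distinction drawn in the list above.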

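The zNUMA abstraction is a convenient route for initial integration of CXL memory, but it is too coarse to reflect the complexity and heterogeneity of CXL memory systems [27]. The real performance of CXL memory (latency, bandwidth, tail latency) is strongly affected by topology (direct-attached, number of switch levels), device type (ASIC/FPGA), the underlying medium, and even the workload pattern [39]. A simple NUMA distance cannot capture these nuances, so default policies built on it (such as Linux NUMA Balancing) perform poorly [27]. To make effective placement and migration decisions, the operating system needs to go beyond the basic NUMA model and use finer-grained information, such as performance data from the ACPI HMAT, device characteristics from the CXL CDAT (Coherent Device Attribute Table) [67], or real-time access statistics from hardware monitoring units such as the CXL 3.2 CHMU [23]. This means the OS needs richer interfaces and internal models to understand the topology of the CXL fabric and the performance characteristics of each part.

Effective CXL memory tiering is more than moving cold pages to the slow tier. To protect latency-sensitive applications and workloads with bursty allocation patterns, the fast tier (local DRAM) must be managed proactively. Demoting pages only reactively, once memory pressure appears, can force new and probably hot allocations into the slow CXL tier and cause performance loss [27]. Meta's TPP design explicitly stresses proactive demotion to keep enough free space (headroom) in the fast tier for new allocations [27]. The NOMAD system likewise works to move migration off the critical path [119]. OS tiering algorithms should therefore include mechanisms that actively maintain free space in the fast tier, for example by predicting future allocation demand or by demoting colder pages more aggressively, while carefully weighing migration cost.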
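
The headroom idea can be sketched as a small planning function in Python. This is not TPP's algorithm: `plan_tiering`, its arguments, and the "hottest pages win" policy are simplifying assumptions, and the sketch ignores migration cost entirely. It only illustrates that reserving headroom changes the DRAM budget the policy works with.

```python
def plan_tiering(dram, cxl, heat, dram_capacity, headroom):
    """Plan migrations so DRAM holds the hottest pages and keeps free slots.

    dram, cxl: sets of page ids currently in each tier.
    heat: dict mapping page id -> observed access count.
    Returns (promote, demote) as sorted lists of page ids.
    """
    budget = dram_capacity - headroom          # pages DRAM is allowed to hold
    # Rank all pages hottest first; break ties by page id for determinism.
    ranked = sorted(dram | cxl, key=lambda p: (-heat.get(p, 0), p))
    want_in_dram = set(ranked[:budget])
    promote = sorted(want_in_dram - dram)      # CXL -> DRAM
    demote = sorted(dram - want_in_dram)       # DRAM -> CXL
    return promote, demote


dram_pages = {1, 2, 3, 4}
cxl_pages = {5, 6}
heat = {1: 90, 2: 5, 3: 70, 4: 1, 5: 80, 6: 2}
promote, demote = plan_tiering(dram_pages, cxl_pages, heat,
                               dram_capacity=4, headroom=1)
print(promote, demote)
```

With one slot of headroom, two cold pages are demoted to make room for one hot promotion, leaving a free DRAM page for the next allocation. A real policy would also apply hysteresis so that pages with similar heat do not bounce between tiers.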

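The hardware-coherent memory sharing provided by CXL 3.0 [2] greatly simplifies the programming model for sharing data across hosts or processes [49]. This convenience has a price. The underlying hardware coherency protocol, in particular Back-Invalidation and the snoop filter, adds communication overhead and potential scalability bottlenecks, especially with large-scale sharing or high contention [73]. Research such as CtXnL [73] shows that for some kinds of data access (for example, metadata access in transaction processing), strict hardware coherency can be overkill, and forcing it may sacrifice performance. Future operating systems may therefore need more flexible coherency management options, such as letting applications selectively relax coherency guarantees for specific shared memory regions, or offering interfaces that let applications or middleware manage coherency explicitly (in the style of software DSM). The aim is a better balance between ease of use and performance than a one-size-fits-all hardware coherency model provides.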

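4. OS Resource Management and Scheduling in a CXL Fabric

The fabric architecture introduced in CXL 3.0 extends the scope of resource management from a single server node to an interconnect spanning multiple nodes, switches, and devices. This requires the operating system to be fabric-aware and to adopt new resource management and scheduling strategies.

4.1 Fabric-Aware OS: Interacting with the Fabric Manager

The central management entity of a CXL fabric is the Fabric Manager (FM) [5]. The FM is a logical concept responsible for high-level system operations such as configuring CXL switches, allocating pooled and shared resources (for example assigning an MLD's logical devices to hosts and binding switch ports to a host's Virtual Hierarchy, VH), managing device hot-plug, and setting security policy [5]. FM implementations vary: the FM can be embedded in switch firmware, run as management software on a host, or be integrated into a baseboard management controller (BMC) [6].

The operating system needs to interact with the FM to manage fabric resources effectively. This interaction includes:

Discovery and topology awareness: The OS must be able to discover the FM and obtain from it the fabric topology (which devices are attached to which switch ports, how switches are interconnected, and so on) as well as resource availability.
Resource request and release: When the OS needs to allocate fabric resources (such as memory from a CXL memory pool or a shared accelerator) to an application or virtual machine, it sends a request to the FM. When the resource is no longer needed, the OS should notify the FM so that it can be released.
Dynamic configuration management: For devices that support dynamic capacity (DCDs) [79], the OS needs to work with the FM/orchestrator to handle capacity change events. The OS also needs to go through the FM to manage lifecycle events such as hot-plug and reset of devices in the fabric [137].

The communication interface between the OS and the FM is the key to a fabric-aware OS. The CXL specification defines an FM API that is accessed through the Component Command Interface (CCI), and the CCI can be carried over a mailbox (memory-mapped I/O) or over MCTP (Management Component Transport Protocol), which is typically used for out-of-band management, for example over I2C or VDM [6]. For in-band management the OS usually uses the mailbox CCI. Some external FM implementations may also offer a REST API or a GUI [134].

The main challenge for OS-FM interaction today is the lack of a unified and robust standard interface [11]. Different FM implementations may use different interfaces and protocols, forcing the OS to adapt to multiple mechanisms, which adds complexity and can lead to vendor lock-in. Other open problems include how to draw a clear boundary between OS resource management and FM/orchestrator resource orchestration, how to keep the OS view consistent with the actual fabric state, and how to handle FM failure or unavailability [64]. A standardized, full-featured OS-FM API covering resource discovery, requests, configuration, status monitoring, and event notification is essential for broad adoption of CXL fabrics.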

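4.2 Advanced Scheduling Algorithms

Traditional OS schedulers focus on CPU and memory resources within a single node and base their decisions on the local NUMA topology and process/thread state. In a CXL fabric, however, memory and accelerators are spread across the fabric, and access latency and bandwidth vary with the path and the device. New fabric-aware scheduling algorithms are therefore needed [27].

Latency-aware scheduling: The scheduler should place a task (process or thread) on the compute node that can reach its memory (local DRAM, a CXL memory pool, or a shared region) and accelerators with the lowest latency [84]. This requires the scheduler to know the fabric topology (for example, how many switch hops an access to a given CXL memory takes) [84] and to obtain latency information for the different paths (possibly from the HMAT or the FM) [104]. Static NUMA distance alone is not enough.
Bandwidth-aware scheduling: The scheduler needs to account for the bandwidth limits of CXL links, switch ports, and the memory devices themselves [26]. It should avoid placing too many bandwidth-intensive tasks where they contend for the same link or device and cause congestion. For tasks that need heavy P2P communication, the scheduler should try to place them close together in the fabric or choose paths with ample bandwidth. Research such as Tiresias proposes using technologies like Intel RDT to give different workload types (latency-sensitive vs. throughput-sensitive) differentiated memory bandwidth allocations, with CXL memory as a supplementary bandwidth resource [124].
Locality optimization: One of CXL's core strengths is cache coherency, which lets compute units (CPUs or accelerators) cache remote data and reduce data movement. The scheduler should exploit this by placing tasks as close as possible to their working-set data (whether in local DRAM, a CXL memory pool, or a shared region) or to the accelerators they need [27]. For example, the Apta system designs an object-location-aware scheduling policy for FaaS [144], and CXL-ANNS schedules and prefetches according to the access patterns of graph data [148].
Integration with memory tiering: Scheduling decisions should be closely coordinated with the tiering policy [27]. For example, when the tiering system promotes a task's hot pages into the local DRAM of some node, the scheduler should consider migrating the task to that node for best performance. Conversely, when a task is scheduled onto a node, the memory manager should prioritize migrating that task's hot data into the node's fast tier.

Several research projects have begun to explore these directions. Microsoft's Pond project uses machine learning models to predict a VM's latency sensitivity and memory usage pattern, in order to decide whether to place it in local DRAM or CXL pooled memory and in what proportion [57]. EDM proposes an in-network scheduling mechanism that optimizes message completion time in disaggregated memory systems [143]. These studies suggest that future schedulers will need to be smarter, using fabric topology information, real-time performance telemetry (possibly from the FM or CDAT), and an understanding of workload characteristics (possibly from online profiling or offline-trained models) to make complex placement decisions.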
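
A very small Python sketch can show how latency and bandwidth awareness combine in a placement decision. Everything here is hypothetical: the `pick_host` function, the dictionary layout, and the numbers are illustrative assumptions rather than any scheduler's real interface. The cost model weights each memory region's latency by the share of the task's accesses that go to it, and hosts without enough spare link bandwidth are filtered out first.

```python
def pick_host(task, hosts):
    """Choose the host with the lowest access-weighted memory latency.

    task:  {"bandwidth_gb_per_s": float, "footprint": {region: fraction}}
    hosts: {name: {"latency_ns": {region: ns}, "free_bandwidth_gb_per_s": float}}
    Returns (host name, expected latency in ns), or (None, inf) if none fits.
    """
    best_name, best_cost = None, float("inf")
    for name in sorted(hosts):                 # sorted for deterministic ties
        host = hosts[name]
        if host["free_bandwidth_gb_per_s"] < task["bandwidth_gb_per_s"]:
            continue                           # would congest this host's link
        cost = sum(fraction * host["latency_ns"][region]
                   for region, fraction in task["footprint"].items())
        if cost < best_cost:
            best_name, best_cost = name, cost
    return best_name, best_cost


hosts = {
    "A": {"latency_ns": {"pool0": 180, "pool1": 400}, "free_bandwidth_gb_per_s": 5},
    "B": {"latency_ns": {"pool0": 250, "pool1": 250}, "free_bandwidth_gb_per_s": 50},
    "C": {"latency_ns": {"pool0": 170, "pool1": 420}, "free_bandwidth_gb_per_s": 50},
}
task = {"bandwidth_gb_per_s": 20, "footprint": {"pool0": 0.75, "pool1": 0.25}}
choice, expected_ns = pick_host(task, hosts)
print(choice, expected_ns)
```

Host A is rejected for lack of bandwidth even though its latencies are competitive, and host C wins over host B because the task's accesses are concentrated in the pool that C reaches fastest. A static NUMA distance would not capture either effect.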

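4.3 Managing Heterogeneous Accelerators over CXL (Type 1/2 Devices)

CXL is used not only for memory expansion and pooling (Type 3 devices); it also provides a unified, high-performance interface for attaching and managing heterogeneous accelerators (Type 1 and Type 2 devices such as GPUs, FPGAs, DPUs, and ASICs) [4].

The operating system plays a key role in managing these accelerators over CXL:

Discovery and configuration: Use the CXL.io protocol to discover attached Type 1/2 devices, read their capabilities, and configure them through the mailbox CCI or other mechanisms [24]. Load the corresponding device drivers.
Memory management:
For Type 2 devices, the OS needs to manage the device's own memory (HDM-D or HDM-DB), map it into the host physical address space via CXL.mem, and possibly include it in memory tiering or use it as a P2P target [2].
Using CXL.cache, the OS can let Type 1/2 devices coherently access and cache host memory, reducing data copy overhead and enabling closer cooperation between host and accelerator [3].
Resource allocation in the fabric: In a CXL fabric, accelerators may also be pooled and attached through switches. The OS needs to interact with the FM to dynamically assign specific accelerator resources to the hosts or tasks that need them [6]. CXL 3.0 supports multiple Type 1/2 devices under a single root port, which increases connection density and flexibility and also raises the bar for OS management [5].
Scheduling considerations: The OS scheduler needs to co-schedule compute tasks with the accelerators they require, which may sit at different places in the fabric. It also needs to optimize data placement, for example deciding whether to keep input data in host memory and let the accelerator access it through CXL.cache, or to load the data directly into the accelerator's HDM (if available and faster).

Latency-aware scheduling in a CXL fabric is more complex than traditional NUMA-aware scheduling. Physical distance or NUMA node ID no longer reflects true access cost. The scheduler must weigh both static topology (such as switch hop count [38]) and dynamic factors such as current link load and congestion, the type and internal state of the target CXL device, and the overhead of the CXL protocol itself, especially the coherency protocol [5]. The performance of CXL memory and devices can itself vary significantly [89]. Future OS schedulers therefore cannot rely on simplified models. They need stronger awareness: access to detailed CXL fabric topology information and real-time performance telemetry (possibly provided through CDAT [67] or the FM [135]), combined with an understanding of workload latency sensitivity (possibly from online profiling or predictive models [57]), in order to make effective scheduling decisions that adapt to a dynamic fabric.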

5. 适配 OS I/O 子系统与设备管理

CXL 3.0 的 Fabric 拓扑和 P2P 通信能力对操作系统的 I/O 子系统和设备管理框架提出了新的要求。OS 需要能够发现、枚举、配置和管理在复杂、动态拓扑中的 CXL 设备，并支持新的通信模式。

5.1 复杂拓扑中的设备发现、枚举与配置

CXL 设备的发现和初始配置在很大程度上依赖于 CXL.io 协议，该协议基于并扩展了 PCIe 的机制 5。OS 通过标准的 PCIe 枚举流程扫描总线，并通过设备类代码 (Class Code)（例如 CXL 内存设备有特定类代码）和 CXL 定义的 DVSEC (Designated Vendor-Specific Extended Capabilities) 来识别 CXL 设备及其能力 24。需要注意的是，CXL 1.1 设备通常被枚举为根联合体集成端点 (RCiEP)，而 CXL 2.0 及更高版本的设备则被枚举为标准的 PCIe 端点，这影响了 OS 如何访问其配置空间和寄存器 67。

CXL 3.0 的 Fabric 架构给设备枚举带来了新的复杂性。在包含多级交换机的非树形拓扑中，OS 可能无法直接通过传统的 PCIe 扫描发现所有连接的设备 4。Fabric Manager (FM) 在这里扮演了重要角色，它可以提供 Fabric 的拓扑信息给 OS，帮助 OS 构建完整的设备视图 11。此外，大规模 Fabric 需要可扩展的寻址机制，PBR (Port Based Routing) 因此被引入，允许 Fabric 中的任意节点（最多 4096 个）相互寻址 2。OS 需要能够理解和使用 PBR 地址来进行设备定位和通信。

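On Linux, users can run commands such as lspci and cxl list, or inspect the /sys filesystem, to view CXL device and topology information 24. The in-kernel CXL subsystem (including modules such as cxl_core, cxl_pci, and cxl_acpi) parses ACPI tables (in particular the CEDT), discovers CXL components (host bridges, root ports, switches, endpoints), and builds the kernel's internal topology representation 24. The cxl_test kernel module can emulate CXL topologies for testing when no real hardware is available 137. Recent patches for the AMD Zen5 platform also deal with CXL address translation (HPA to SPA) 155.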
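
For the /sys inspection mentioned above, a small script can group the objects that the CXL bus exposes. This is a sketch under assumptions: it expects the CXL core to publish its objects under `/sys/bus/cxl/devices`, and it classifies entries purely by the alphabetic prefix of their names (for example `mem0` or `decoder0.0`). The function name `group_cxl_devices` is hypothetical, and on a machine without CXL support it simply returns an empty result.

```python
import re
from collections import defaultdict
from pathlib import Path


def group_cxl_devices(sysfs_dir: str = "/sys/bus/cxl/devices") -> dict[str, list[str]]:
    """Group entries of the CXL bus directory by the alphabetic prefix of their name."""
    root = Path(sysfs_dir)
    if not root.is_dir():
        return {}  # no CXL bus registered (or the path is wrong)
    groups: dict[str, list[str]] = defaultdict(list)
    for entry in sorted(root.iterdir()):
        match = re.match(r"[a-z_-]+", entry.name)
        kind = match.group(0) if match else "other"
        groups[kind].append(entry.name)
    return dict(groups)


if __name__ == "__main__":
    devices = group_cxl_devices()
    if not devices:
        print("no CXL devices visible under /sys/bus/cxl/devices")
    for kind, names in devices.items():
        print(f"{kind}: {', '.join(names)}")
```

The `cxl list` command reports the same objects with far more detail; the script is only meant to show that the kernel's topology representation is visible as ordinary sysfs entries.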

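5.2 Managing CXL.io and the Control Interface (Mailbox CCI)

The CXL.io protocol is used not only for initial discovery and configuration; it also carries runtime control and management traffic 24. The OS issues non-coherent load/store commands over CXL.io to access CXL device registers, report errors, and use the Mailbox mechanism for more complex interactions 5.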

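The Component Command Interface (CCI) is the standard interface defined by the CXL specification for managing CXL components (devices, switches, and so on) 6. The CCI defines a series of command sets (such as generic commands, memory device commands, and FM API commands) 6. The CCI can be implemented over two transport mechanisms:

Mailbox CCI: A register interface based on memory-mapped I/O (MMIO), usually located in the device's PCIe BAR space. The OS mainly uses it for in-band management 6. Mailboxes are typically divided into Primary and Secondary, have command/status registers and payload registers, and optionally support interrupt (MSI/MSI-X) notification of completion. For time-consuming operations, the CCI supports a background command (Background Operations) mechanism 6.
MCTP-based CCI: Encapsulates CCI commands in MCTP messages and carries them over out-of-band channels such as I2C or VDM (Vendor Defined Message). This is mainly used by a BMC or an external Fabric Manager for out-of-band management 6.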
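
The doorbell-and-poll flow of a mailbox, including the background-operation case, can be sketched with a toy in-memory model. This is not the register layout from the CXL specification: the `ToyMailbox` class, the `ReturnCode` values, and the opcodes used in the demo are all hypothetical stand-ins chosen to show the control flow only.

```python
from enum import Enum


class ReturnCode(Enum):
    SUCCESS = "success"
    BACKGROUND_STARTED = "background started"
    UNSUPPORTED = "unsupported"


class ToyMailbox:
    """In-memory stand-in for a device mailbox (hypothetical, not the real layout)."""

    def __init__(self, handlers, background_ops):
        self._handlers = handlers              # opcode -> function(payload) -> bytes
        self._background_ops = background_ops  # opcode -> steps needed to finish
        self.doorbell = False
        self.command = None
        self.payload = b""
        self.return_code = None
        self.background_percent = 100
        self._bg_total = 0
        self._bg_remaining = 0

    def ring_doorbell(self):
        self.doorbell = True
        if self.command in self._background_ops:
            self._bg_total = self._bg_remaining = self._background_ops[self.command]
            self.background_percent = 0
            self.return_code = ReturnCode.BACKGROUND_STARTED
        elif self.command in self._handlers:
            self.payload = self._handlers[self.command](self.payload)
            self.return_code = ReturnCode.SUCCESS
        else:
            self.return_code = ReturnCode.UNSUPPORTED
        self.doorbell = False  # the device clears the doorbell once it has responded

    def tick(self):
        """Advance background work by one step (stands in for elapsed time)."""
        if self._bg_remaining:
            self._bg_remaining -= 1
            done = self._bg_total - self._bg_remaining
            self.background_percent = 100 * done // self._bg_total


def send_command(mbox, opcode, payload=b"", max_polls=100):
    """Write payload and opcode, ring the doorbell, then poll for completion."""
    if mbox.doorbell:
        raise RuntimeError("mailbox busy")
    mbox.payload = payload
    mbox.command = opcode
    mbox.ring_doorbell()
    polls = 0
    while mbox.doorbell:  # foreground completion: wait for the doorbell to clear
        polls += 1
        if polls > max_polls:
            raise TimeoutError("doorbell never cleared")
    if mbox.return_code is ReturnCode.BACKGROUND_STARTED:
        polls = 0
        while mbox.background_percent < 100:  # background completion: poll progress
            if polls == max_polls:
                raise TimeoutError("background operation did not finish")
            mbox.tick()
            polls += 1
        return ReturnCode.SUCCESS, b""
    return mbox.return_code, mbox.payload


if __name__ == "__main__":
    mbox = ToyMailbox({0x01: lambda p: b"id:" + p}, {0x02: 3})
    print(send_command(mbox, 0x01, b"x"))  # short command, answered immediately
    print(send_command(mbox, 0x02))        # long command, completed in the background
```

A real driver would typically sleep or wait for an interrupt instead of spinning, and would keep the mailbox free for other commands while a background operation runs; the model only shows why the two completion paths differ.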

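The Linux CXL subsystem supports the Mailbox CCI. The cxl_pci driver enumerates the device's Mailbox register interface and registers it with cxl_core 137. The kernel provides an ioctl interface through which user-space tools (such as cxl-cli, or applications that use the libcxlmi library) send CCI commands 113. To support vendor-specific functions or operations such as firmware updates, the kernel also provides the CONFIG_CXL_MEM_RAW_COMMANDS option, which allows sending raw Mailbox commands that the kernel does not validate 94. QEMU also provides emulation support for the CXL Mailbox 112.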

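5.3 Enabling and Managing Peer-to-Peer I/O (UIO)

The P2P communication capability of CXL 3.0 lets a device directly access the memory of other devices in the Fabric (in particular HDM-DB), which relies on the Unordered I/O (UIO) mechanism 5. UIO allows P2P traffic to bypass strict PCIe ordering rules under certain conditions, which may yield better performance 30.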

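The operating system's role includes:

Capability negotiation and enablement: The OS needs to identify whether the devices and the path support UIO, and perform the configuration required to enable it.
Routing configuration: The OS (possibly in cooperation with the FM) needs to configure the switches and ports in the Fabric so that UIO traffic is routed correctly between P2P endpoints (possibly using PBR) 2.
Coherence management: As noted earlier, when UIO is used to access HDM-DB that may be cached, the OS needs to ensure that coherence is maintained. This may involve coordinating the Back-Invalidation (BI) flow initiated by the target device 5.
Interface provision: The OS needs to offer upper layers (applications or drivers) an interface for initiating and managing P2P UIO transfers.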

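UIO P2P still faces several challenges today. The CXL specification itself does not sufficiently specify protection mechanisms for UIO P2P access 35. Managing P2P routing and coherence in a complex Fabric can be very complicated. According to the Linux kernel's CXL maturity map, support for Fabric and GFAM is still at an early stage ( points), which means that full support for UIO P2P may not yet be implemented 98. In addition, UIO's relaxed ordering rules may add complexity for the OS or applications, which must ensure data consistency and correctness 30.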

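The multiple protocols (.io, .cache, .mem), multiple device types (Type 1/2/3, MLD, GFAM), dynamic Fabric topologies, and new management interfaces (CCI, FM API) 5 that CXL introduces make CXL device management far more complex than traditional PCIe device management. A simple enumeration and configuration model based on a tree-shaped bus no longer applies. The operating system needs a more sophisticated and dynamic device model that understands the Fabric topology, handles the interactions among different protocols and device types, and works together with the Fabric Manager. The design of the Linux CXL subsystem [24, S_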

Works cited

CXL Consortium releases Compute Express Link 3.0 specification to expand fabric capabilities and management, accessed April 19, 2025, https://computeexpresslink.org/wp-content/uploads/2024/01/CXL_3.0-Specification-Release_FINAL-1.pdf
Compute Express Link 3.0 - Design And Reuse, accessed April 19, 2025, https://www.design-reuse.com/articles/52865/compute-express-link-3-0.html
What is Compute Express Link (CXL) 3.0? - Synopsys, accessed April 19, 2025, https://www.synopsys.com/blogs/chip-design/what-is-compute-express-link-3.html
Understanding How CXL 3.0 Links the Data Center Fabric - Industry Articles, accessed April 19, 2025, https://www.allaboutcircuits.com/industry-articles/understanding-how-cxl-3.0-links-the-data-center-fabric/
CXL 3.0: Enabling composable systems with expanded fabric capabilities - Compute Express Link, accessed April 19, 2025, https://computeexpresslink.org/wp-content/uploads/2023/12/CXL_3.0-Webinar_FINAL.pdf
CXL Fabric Management - Compute Express Link, accessed April 19, 2025, https://computeexpresslink.org/wp-content/uploads/2023/12/20220322_CXL_FM_Webinar_Final.pdf
CXL – GAMECHANGER FOR THE DATA CENTER - Dell Learning, accessed April 19, 2025, https://learning.dell.com/content/dam/dell-emc/documents/en-us/2023KS_Jaiswal-CXL_Gamechanger_for_the_Data_Center.pdf
CXL 3.0 and the Future of AI Data Centers | Keysight Blogs, accessed April 19, 2025, https://www.keysight.com/blogs/en/inds/ai/cxl-3-0-and-the-future-of-ai-data-centers
Orchestrating memory disaggregation with Compute Express Link (CXL) - Intel, accessed April 19, 2025, https://cdrdv2-public.intel.com/817889/omdia%E2%80%93orchestrating-memory-disaggregation-cxl-ebook.pdf
Reimagining the Future of Data Computing with Compute Express Link (CXL) Tech-Enabled Interconnects from Amphenol, accessed April 19, 2025, https://www.amphenol-cs.com/connect/reimagining-the-future-of-data-computing-with-cxl-tech-enabled-interconnect.html
Introducing the CXL 3.0 Specification - SNIA SDC 2022, accessed April 19, 2025, https://www.sniadeveloper.org/sites/default/files/SDC/2022/pdfs/SNIA-SDC22-Agarwal-CXL-3.0-Specification.pdf
CXL Memory Expansion: A Closer Look on Actual Platform - Micron Technology, accessed April 19, 2025, https://www.micron.com/content/dam/micron/global/public/products/white-paper/cxl-memory-expansion-a-close-look-on-actual-platform.pdf
Compute Express Link(CXL), the next generation interconnect, accessed April 19, 2025, https://www.fujitsu.com/jp/documents/products/software/os/linux/catalog/NVMSA_CXL_overview_and_the_status_of_Linux.pdf
Memory-Centric Computing - Ethz, accessed April 19, 2025, https://people.inf.ethz.ch/omutlu/pub/onur-IEDM-3-4-Monday-MemoryCentricComputing-InvitedTalk-9-December-2024.pdf
Databases in the Era of Memory-Centric Computing - VLDB Endowment, accessed April 19, 2025, https://www.vldb.org/cidrdb/papers/2025/p6-chronis.pdf
Memory-centric Computing Systems: What's Old Is New Again - SIGARCH, accessed April 19, 2025, https://www.sigarch.org/memory-centric-computing-systems-whats-old-is-new-again/
Next-Gen Interconnection Systems with Compute Express Link: a Comprehensive Survey, accessed April 19, 2025, https://arxiv.org/html/2412.20249v1
How Flexible is CXL's Memory Protection? - ACM Queue, accessed April 19, 2025, https://queue.acm.org/detail.cfm?id=3606014
How Flexible is CXL's Memory Protection? - University of Cambridge, accessed April 19, 2025, https://www.repository.cam.ac.uk/bitstreams/c56e69c4-e7d8-47a8-9cb3-769345eb0f8a/download
CXL 3.0 - Everything You Need To Know [2023] - Logic Fruit Technologies, accessed April 19, 2025, https://www.logic-fruit.com/blog/cxl/cxl-3-0/
CXL 1.0, 1.1. 2.0 3.0 - Compute Express Link - Serverparts.pl, accessed April 19, 2025, https://www.serverparts.pl/en/blog/cxl-10-11-20-30-compute-express-link-1
Compute Express Link (CXL): All you need to know - Rambus, accessed April 19, 2025, https://www.rambus.com/blogs/compute-express-link/
About CXL® - Compute Express Link, accessed April 19, 2025, https://computeexpresslink.org/about-cxl/
Implementing CXL Memory on Linux on ThinkSystem V4 Servers - Lenovo Press, accessed April 19, 2025, https://lenovopress.lenovo.com/lp2184-implementing-cxl-memory-on-linux-on-thinksystem-v4-servers
Compute Express Link - Wikipedia, accessed April 19, 2025, https://en.wikipedia.org/wiki/Compute_Express_Link
Exploring Performance and Cost Optimization with ASIC-Based CXL Memory - OpenReview, accessed April 19, 2025, https://openreview.net/pdf?id=cJOoD0jx6b
TPP: Transparent Page Placement for CXL-Enabled Tiered-Memory - SymbioticLab, accessed April 19, 2025, https://symbioticlab.org/publications/files/tpp:asplos23/tpp-asplos23.pdf
Welcome to the Linux CXL documentation — CXL documentation, accessed April 19, 2025, https://linux-cxl.readthedocs.io/
An Introduction to Compute Express Link (CXL) - MemVerge, accessed April 19, 2025, https://memverge.com/wp-content/uploads/2022/10/CXL-Forum-Wall-Street_MemVerge.pdf
CXL Thriving As Memory Link - Semiconductor Engineering, accessed April 19, 2025, https://semiengineering.com/cxl-thriving-as-memory-link/
Verifying CXL 3.1 Designs with Synopsys Verification IP, accessed April 19, 2025, https://www.synopsys.com/blogs/chip-design/verifying-cxl3-1-designs-with-synopsys-verification-ip.html
Memory Sharing with CXL: Hardware and Software Design Approaches, accessed April 19, 2025, https://hcds-workshop.github.io/edition/2024/resources/Memory-Sharing-Jain-2024.pdf
Memory Sharing with CXL: Hardware and Software Design Approaches - arXiv, accessed April 19, 2025, https://arxiv.org/html/2404.03245v1
Memory Sharing with CXL: Hardware and Software Design Approaches - arXiv, accessed April 19, 2025, https://arxiv.org/pdf/2404.03245
How Flexible Is CXL's Memory Protection? - Communications of the ACM, accessed April 19, 2025, https://cacm.acm.org/practice/how-flexible-is-cxls-memory-protection/
CXL (Compute Express Link) Technology - Scientific Research Publishing, accessed April 19, 2025, https://www.scirp.org/journal/paperinformation?paperid=126038
What is Compute Express Link (CXL)? - Trenton Systems, accessed April 19, 2025, https://www.trentonsystems.com/en-us/resource-hub/blog/what-is-compute-express-link-cxl
Fabric Technology Required for Composable Memory - IntelliProp, accessed April 19, 2025, https://www.intelliprop.com/wp-content/uploads/2022/11/Composable-Memory-requires-a-Fabric-White-Paper.pdf
Exploring and Evaluating Real-world CXL: Use Cases and System Adoption - arXiv, accessed April 19, 2025, https://arxiv.org/html/2405.14209v3
Implementing CXL Memory on Linux on ThinkSystem V4 Servers - Lenovo Press, accessed April 19, 2025, https://lenovopress.lenovo.com/lp2184.pdf
Octopus: Scalable Low-Cost CXL Memory Pooling | Request PDF - ResearchGate, accessed April 19, 2025, https://www.researchgate.net/publication/388067880_Octopus_Scalable_Low-Cost_CXL_Memory_Pooling
Designing for the Future of System Architecture With CXL and Intel in the ATC - WWT, accessed April 19, 2025, https://www.wwt.com/article/designing-for-the-future-of-system-architecture-with-cxl-and-intel-in-the-atc
CXL: The Future Of Memory Interconnect? - Semiconductor Engineering, accessed April 19, 2025, https://semiengineering.com/cxl-the-future-of-memory-interconnect/
[2411.02282] A Comprehensive Simulation Framework for CXL Disaggregated Memory - arXiv, accessed April 19, 2025, https://arxiv.org/abs/2411.02282
Compute Express Link (CXL) - Ayar Labs, accessed April 19, 2025, https://ayarlabs.com/glossary/compute-express-link-cxl/
Architectural and System Implications of CXL-enabled Tiered Memory - arXiv, accessed April 19, 2025, https://arxiv.org/html/2503.17864v1
CXL 2.0 and 3.0 for Storage and Memory Applications | Synopsys, accessed April 19, 2025, https://www.synopsys.com/designware-ip/technical-bulletin/cxl2-3-storage-memory-applications.html
A CXL-Powered Database System: Opportunities and Challenges, accessed April 19, 2025, https://dbgroup.cs.tsinghua.edu.cn/ligl//papers/CXL_ICDE.pdf
Explaining CXL Memory Pooling and Sharing - Compute Express Link, accessed April 19, 2025, https://computeexpresslink.org/blog/explaining-cxl-memory-pooling-and-sharing-1049/
CXL 3.0: Revolutionizing Data Centre Memory - Optimize Performance & Reduce Costs, accessed April 19, 2025, https://www.ruijienetworks.com/support/tech-gallery/cxl3-0-solving-new-memory-problems-in-data-centres-part2
An Open Industry Standard for Composable Computing - Compute Express Link™ (CXL™), accessed April 19, 2025, https://computeexpresslink.org/wp-content/uploads/2023/12/CXL_FMS-2023-Tutorial_FINAL.pdf
NVM Express® Support for CXL, accessed April 19, 2025, https://nvmexpress.org/wp-content/uploads/02_Martin-and-Molgaard_NVMe-Support-for-CXL_Final.pdf
CXL Consortium Releases Compute Express Link 3.0 Specification to Expand Fabric Capabilities and Management - Business Wire, accessed April 19, 2025, https://www.businesswire.com/news/home/20220802005028/en/CXL-Consortium-Releases-Compute-Express-Link-3.0-Specification-to-Expand-Fabric-Capabilities-and-Management
Compute Express Link (CXL) 3.0 Debuts, Wins CPU Interconnect Wars | Tom's Hardware, accessed April 19, 2025, https://www.tomshardware.com/news/cxl-30-debuts-one-cpu-interconnect-to-rule-them-all
CXL 3.0 Specification Released - Doubles The Data Rate Of CXL 2.0 - Phoronix, accessed April 19, 2025, https://www.phoronix.com/news/CXL-3.0-Specification-Released
CXL 3.0: Enabling composable systems with expanded fabric capabilities - YouTube, accessed April 19, 2025, https://www.youtube.com/watch?v=CIjDpazbtUU
Pond: CXL-Based Memory Pooling Systems for Cloud Platforms - Microsoft, accessed April 19, 2025, https://www.microsoft.com/en-us/research/wp-content/uploads/2022/10/Pond-ASPLOS23.pdf
Memory Disaggregation: Open Challenges in the Era of CXL - SymbioticLab, accessed April 19, 2025, https://symbioticlab.org/publications/files/disaggregation-future:hotinfra23/memory-disaggregation-hotinfra23.pdf
Beyond processor-centric operating systems | USENIX, accessed April 19, 2025, https://www.usenix.org/system/files/conference/hotos15/hotos15-paper-faraboschi.pdf
CXL 3.0 and Beyond: Advancements in Memory Management and Connectivity, accessed April 19, 2025, https://www.h3platform.com/blog-detail/58
An Examination of CXL Memory Use Cases for In-Memory Database Management Systems using SAP HANA - VLDB Endowment, accessed April 19, 2025, https://www.vldb.org/pvldb/vol17/p3827-ahn.pdf
Compute Express Link™ (CXL ™) Device Ecosystem and Usage Models, accessed April 19, 2025, https://computeexpresslink.org/wp-content/uploads/2023/12/CXL_FMS-Panel-2023_FINAL.pdf
Panmnesia intros CXL 3.0-enabled memory sharing AI accelerator - Blocks and Files, accessed April 19, 2025, https://blocksandfiles.com/2023/11/27/panmnesia-has-cxl-3-0-enabled-memory-sharing-ai-accelerator/
CXL Fabric Management - Compute Express Link, accessed April 19, 2025, https://computeexpresslink.org/blog/cxl-fabric-management-1089/
CXL 2.0 / PCIe Gen 5 - The Future of Composable Infrastructure - H3 Platform, accessed April 19, 2025, https://www.h3platform.com/blog-detail/29
CXL - Blocks and Files, accessed April 19, 2025, https://blocksandfiles.com/2022/04/20/cxl/
CXL Glossary - Rambus, accessed April 19, 2025, https://www.rambus.com/interface-ip/cxl-glossary/
Integrity and Data Encryption (IDE) Trends and Verification Challenges in CXL® (Compute Express Link®), accessed April 19, 2025, https://computeexpresslink.org/blog/integrity-and-data-encryption-ide-trends-and-verification-challenges-in-cxl-compute-express-link-2797/
Compute Express Link (CXL) 3.0 Announced: Doubled Speeds and Flexible Fabrics, accessed April 19, 2025, https://www.anandtech.com/show/17520/compute-express-link-cxl-30-announced-doubled-speeds-and-flexible-fabrics
CXL 3.0 Scales the Future Data Center - Verification - Cadence Blogs, accessed April 19, 2025, https://community.cadence.com/cadence_blogs_8/b/fv/posts/cxl-3-0-scales-the-future-data-center
Compute Express Link™(CXL™) 3.0: Expanded capabilities for increasing scale and optimizing resource utilization - SNIA CMSS, accessed April 19, 2025, https://www.snia.org/sites/default/files/cmss/2023/SNIA-CMSS23-Rudoff-CXL-Expanded-Capabilities.pdf
Introducing the CXL 3.1 Specification - Compute Express Link, accessed April 19, 2025, https://computeexpresslink.org/wp-content/uploads/2024/03/CXL_3.1-Webinar-Presentation_Feb_2024.pdf
Enabling Efficient Transaction Processing on CXL-Based Memory Sharing - arXiv, accessed April 19, 2025, https://arxiv.org/html/2502.11046v1
Compute Express Link (CXL)*: An open interconnect for HPC and AI applications - NOWLAB, accessed April 19, 2025, http://nowlab.cse.ohio-state.edu/static/media/workshops/presentations/exacomm24/Exacomm_ISC_2024_CXL_Debendra.pdf
SDC2022 – Introducing CXL 3.0: Expanded Capabilities for Increased Scale and Optimized Resource Util - YouTube, accessed April 19, 2025, https://www.youtube.com/watch?v=X1sAyKo_28I
CXL Standard Evolution: From CXL 2.0 to 3.1 | Synopsys Blog, accessed April 19, 2025, https://www.synopsys.com/blogs/chip-design/cxl-3-1-standard.html
Hundreds of servers could share external memory pools across Panmnesia CXL fabric, accessed April 19, 2025, https://blocksandfiles.com/2024/08/01/panmnesia-cxl-fabric/
Linux 6.14 CXL Updates Make Preparations Around Type 2 Support & CXL 3.1 - Phoronix, accessed April 19, 2025, https://www.phoronix.com/news/Linux-6.14-CXL
UnifabriX taking CXL external memory mainstream - Blocks and Files, accessed April 19, 2025, https://blocksandfiles.com/2025/01/15/unifabrix-taking-cxl-external-memory-mainstream/
CXL Update Emphasizes Security [Byline] - Gary Hilson, accessed April 19, 2025, https://hilson.ca/cxl-update-emphasizes-security-byline/
CXL Update Emphasizes Security - EE Times, accessed April 19, 2025, https://www.eetimes.com/cxl-update-emphasizes-security/
Unlocking CXL's Potential: Revolutionizing Server Memory and Performance - SNIA.org, accessed April 19, 2025, https://snia.org/sites/default/files/CMSC/2025-0326_Unlocking_CXL_Webinar_Final.pdf
What are the new features in the CXL 3.0 specification? - ASTERA LABS, INC., accessed April 19, 2025, https://www.asteralabs.com/faqs/what-are-the-new-features-in-cxl-3-0-specification/
A Case Against CXL Memory Pooling - Events, accessed April 19, 2025, https://conferences.sigcomm.org/hotnets/2023/papers/hotnets23_levis.pdf
Octopus: Scalable Low-Cost CXL Memory Pooling - arXiv, accessed April 19, 2025, https://arxiv.org/pdf/2501.09020
CXL GFAM Global Fabric Attached Memory Device - ServeTheHome, accessed April 19, 2025, https://www.servethehome.com/compute-express-link-cxl-3-0-is-the-exciting-building-block-for-disaggregation/cxl-gfam-global-fabric-attached-memory-device/
Logical Memory Pools: Flexible and Local Disaggregated Memory, accessed April 19, 2025, https://conferences.sigcomm.org/hotnets/2023/papers/hotnets23_amaro.pdf
How CXL and Memory Pooling Reduce HPC Latency | Synopsys Blog, accessed April 19, 2025, https://www.synopsys.com/blogs/chip-design/cxl-protocol-memory-pooling.html
A Comprehensive Simulation Framework for CXL Disaggregated Memory - arXiv, accessed April 19, 2025, https://arxiv.org/html/2411.02282v2
Exploring and Evaluating Real-world CXL: Use Cases and System Adoption - arXiv, accessed April 19, 2025, https://arxiv.org/html/2405.14209v1
Glossary — Intel Unified Memory Framework 0.12.0 documentation - GitHub Pages, accessed April 19, 2025, https://oneapi-src.github.io/unified-memory-framework/glossary.html
A Comprehensive Simulation Framework for CXL Disaggregated Memory - arXiv, accessed April 19, 2025, https://arxiv.org/html/2411.02282v5
Arm CMN S3: Driving CXL Storage Innovation - Servers and Cloud Computing blog, accessed April 19, 2025, https://community.arm.com/arm-community-blogs/b/servers-and-cloud-computing-blog/posts/arm-cmn-s3-driving-cxl-storage-innovation
Compute Express Link - Neoverse Reference Design Platform Software - Arm, accessed April 19, 2025, https://neoverse-reference-design.docs.arm.com/en/latest/features/cxl.html
CXL Security (Training) - MindShare, accessed April 19, 2025, https://www.mindshare.com/Learn/CXL_Security
NeoMem: Hardware/Software Co-Design for CXL-Native Memory Tiering - arXiv, accessed April 19, 2025, https://arxiv.org/html/2403.18702v2
Formalising CXL Cache Coherence - Imperial College London, accessed April 19, 2025, https://www.doc.ic.ac.uk/~afd/papers/2025/ASPLOS-CXL.pdf
Compute Express Link Subsystem Maturity Map - The Linux Kernel documentation, accessed April 19, 2025, https://docs.kernel.org/driver-api/cxl/maturity-map.html
Pond: CXL-Based Memory Pooling Systems for Cloud Platforms - Microsoft, accessed April 19, 2025, https://www.microsoft.com/en-us/research/wp-content/uploads/2022/10/2023_Pond_asplos23_official_asplos_version.pdf
Dissecting CXL Memory Performance at Scale: Analysis, Modeling, and Optimization - arXiv, accessed April 19, 2025, https://arxiv.org/html/2409.14317v1
Demystifying CXL Memory with Genuine CXL-Ready Systems and Devices - arXiv, accessed April 19, 2025, https://arxiv.org/pdf/2303.15375
CXL Memory Pooling will Save Millions in DRAM Cost | TechPowerUp Forums, accessed April 19, 2025, https://www.techpowerup.com/forums/threads/cxl-memory-pooling-will-save-millions-in-dram-cost.296786/
Architectural and System Implications of CXL-enabled Tiered Memory - arXiv, accessed April 19, 2025, https://arxiv.org/html/2503.17864v2
Systematic CXL Memory Characterization and Performance Analysis at Scale - People, accessed April 19, 2025, https://people.cs.vt.edu/jinshu/docs/papers/Melody_ASPLOS.pdf
Managing Memory Tiers with CXL in Virtualized Environments - USENIX, accessed April 19, 2025, https://www.usenix.org/system/files/osdi24-zhong-yuhong.pdf
FPGA-based Emulation and Device-Side Management for CXL-based Memory Tiering Systems - arXiv, accessed April 19, 2025, https://arxiv.org/html/2502.19233v2
Architectural and System Implications of CXL-enabled Tiered Memory - arXiv, accessed April 19, 2025, https://arxiv.org/pdf/2503.17864
Pond: CXL-Based Memory Pooling Systems for Cloud Platforms (ASPLOS'23) - GitHub, accessed April 19, 2025, https://github.com/MoatLab/Pond
A Comprehensive Simulation Framework for CXL Disaggregated Memory | Request PDF, accessed April 19, 2025, https://www.researchgate.net/publication/385560282_A_Comprehensive_Simulation_Framework_for_CXL_Disaggregated_Memory
www.eetimes.com, accessed April 19, 2025, https://www.eetimes.com/cxl-update-emphasizes-security/#:~:text=The%20trusted%20security%20protocol%20(TSP,to%20host%20confidential%20computing%20workloads.
Dissecting CXL Memory Performance at Scale: Analysis, Modeling, and Optimization - arXiv, accessed April 19, 2025, https://arxiv.org/pdf/2409.14317
Compute Express Link (CXL) — QEMU documentation, accessed April 19, 2025, https://www.qemu.org/docs/master/system/devices/cxl.html
A Practical Guide to Identify Compute Express Link (CXL) Devices in Your Server, accessed April 19, 2025, https://stevescargall.com/blog/2023/05/a-practical-guide-to-identify-compute-express-link-cxl-devices-in-your-server/
Toward CXL-Native Memory Tiering via Device-Side Profiling - arXiv, accessed April 19, 2025, https://arxiv.org/html/2403.18702v1
Beware, PCIe Switches! CXL Pools Are Out to Get You - arXiv, accessed April 19, 2025, https://arxiv.org/html/2503.23611v1
9.2. Automatic NUMA Balancing | Red Hat Product Documentation, accessed April 19, 2025, https://docs.redhat.com/en/documentation/red_hat_enterprise_linux/7/html/virtualization_tuning_and_optimization_guide/sect-virtualization_tuning_optimization_guide-numa-auto_numa_balancing
TPP: Transparent Page Placement for CXL-Enabled Tiered-Memory - Meta Research, accessed April 19, 2025, https://research.facebook.com/publications/tpp-transparent-page-placement-for-cxl-enabled-tiered-memory/
[2206.02878] TPP: Transparent Page Placement for CXL-Enabled Tiered-Memory - arXiv, accessed April 19, 2025, https://arxiv.org/abs/2206.02878
Nomad: Non-Exclusive Memory Tiering via Transactional Page Migration - USENIX, accessed April 19, 2025, https://www.usenix.org/system/files/osdi24-xiang.pdf
Lightweight Frequency-Based Tiering for CXL Memory Systems - arXiv, accessed April 19, 2025, https://arxiv.org/html/2312.04789v1
CXL Memory Disaggregation and Tiering: Lessons Learned from Storage - SNIA.org, accessed April 19, 2025, https://www.snia.org/educational-library/cxl-memory-disaggregation-and-tiering-lessons-learned-storage-2023
NeoMem: Hardware/Software Co-Design for CXL-Native Memory Tiering - Microsoft, accessed April 19, 2025, https://www.microsoft.com/en-us/research/publication/neomem-hardware-software-co-design-for-cxl-native-memory-tiering/
Managing Memory Tiers with CXL in Virtualized Environments - USENIX, accessed April 19, 2025, https://www.usenix.org/conference/osdi24/presentation/zhong-yuhong
Tiresias: Optimizing NUMA Performance with CXL Memory and Locality-Aware Process Scheduling - Temple CIS, accessed April 19, 2025, https://cis.temple.edu/~wu/research/publications/Publication_files/acmturc24-12.pdf
Using Linux Kernel Tiering with Compute Express Link (CXL) Memory - Steve Scargall, accessed April 19, 2025, https://stevescargall.com/blog/2024/05/using-linux-kernel-tiering-with-compute-express-link-cxl-memory/
Re: [PATCH -V2] cxl/region: Support to calculate memory tier abstract distance - The Linux-Kernel Archive, accessed April 19, 2025, https://lkml.iu.edu/2406.1/04010.html
MemVerge: Homepage, accessed April 19, 2025, https://memverge.com/
Memory Machine – CXL - MemVerge, accessed April 19, 2025, https://memverge.com/memory-machine-cxl/
Breaking Memory Barriers | Compute Express Link, accessed April 19, 2025, https://computeexpresslink.org/wp-content/uploads/2024/12/CXL-Breaking-Memory-Barriers-Webinar.pdf
Re: [PATCH v5 0/8] DAMON based tiered memory management for CXL memory - The Linux-Kernel Archive, accessed April 19, 2025, https://lkml.iu.edu/2406.1/07403.html
[PDF] Page Table Management for Heterogeneous Memory Systems - Semantic Scholar, accessed April 19, 2025, https://www.semanticscholar.org/paper/7f81a1c543e07f892fe10d00e1781eace1592f67
[2103.10779] Page Table Management for Heterogeneous Memory Systems - arXiv, accessed April 19, 2025, https://arxiv.org/abs/2103.10779
Unleashing the Future of Memory Management: Exploring CXL Dynamic Capacity Devices with Docker and QEMU - MemVerge, accessed April 19, 2025, https://memverge.com/unleashing-the-future-of-memory-management/
Jackrabbit Labs - the Future of Memory and Storage, accessed April 19, 2025, https://files.futurememorystorage.com/proceedings/2024/20240806_CXLT-102-1_Mackey.pdf
OCP CMS Logical System Architecture White Paper - Open Compute Project, accessed April 19, 2025, https://www.opencompute.org/documents/ocp-cms-logical-system-architecture-white-paper-pdf-1
Introducing Omega Fabric Based on CXL - IntelliProp, accessed April 19, 2025, https://www.intelliprop.com/products-page
Compute Express Link Memory Devices - The Linux Kernel documentation, accessed April 19, 2025, https://docs.kernel.org/driver-api/cxl/memory-devices.html
LPC2022: Meta's CXL Journey and Learnings - Linux Plumbers Conference, accessed April 19, 2025, https://lpc.events/event/16/contributions/1207/attachments/950/1866/LPC2022_%20Meta's%20CXL%20Journey%20and%20Learnings.pdf
Compute Express Link Memory Devices - The Linux Kernel Archives, accessed April 19, 2025, https://www.kernel.org/doc/html/v6.1/driver-api/cxl/memory-devices.html
CXL Fabric Management Standards | PPT - SlideShare, accessed April 19, 2025, https://www.slideshare.net/slideshow/cxl-fabric-management-standards/262688178
libcxlmi a CXL Management Interace library - Linux Plumbers Conference, accessed April 19, 2025, https://lpc.events/event/18/contributions/1876/attachments/1441/3072/lpc24-dbueso-libcxlmi.pdf
computexpresslink/libcxlmi: CXL Management Interface library - GitHub, accessed April 19, 2025, https://github.com/computexpresslink/libcxlmi
EDM: An Ultra-Low Latency Ethernet Fabric for Memory Disaggregation - arXiv, accessed April 19, 2025, https://arxiv.org/html/2411.08300v4
Āpta: Fault-tolerant object-granular CXL disaggregated memory for accelerating FaaS - cs.utah.edu, accessed April 19, 2025, https://users.cs.utah.edu/~vijay/papers/dsn23.pdf
EDM: An Ultra-Low Latency Ethernet Fabric for Memory Disaggregation - arXiv, accessed April 19, 2025, https://arxiv.org/html/2411.08300v1
arXiv:2203.00241v4 [cs.OS] 21 Oct 2022, accessed April 19, 2025, https://arxiv.org/pdf/2203.00241
Survey of Disaggregated Memory: Cross-layer Technique Insights for Next-Generation Datacenters - arXiv, accessed April 19, 2025, https://arxiv.org/html/2503.20275v1
CXL-ANNS: Software-Hardware Collaborative Memory Disaggregation and Computation for Billion-Scale Approximate Nearest Neighbor Search - USENIX, accessed April 19, 2025, https://www.usenix.org/system/files/atc23-jang.pdf
SkyByte: Architecting an Efficient Memory-Semantic CXL-based SSD with OS and Hardware Co-design - arXiv, accessed April 19, 2025, https://arxiv.org/html/2501.10682v1
CXL Enumeration: How Are Devices Discovered in System Fabric? - Cadence Blogs, accessed April 19, 2025, https://community.cadence.com/cadence_blogs_8/b/fv/posts/cxl-enumeration-how-are-devices-discovered-in-system-fabric
The Fascinating Path of CXL 2.0 Device Discovery | Synopsys Blog, accessed April 19, 2025, https://www.synopsys.com/blogs/chip-design/cxl-2-device-discovery-path.html
Compute Express Link Memory Devices - The Linux Kernel documentation, accessed April 19, 2025, https://docs.kernel.org/6.3/driver-api/cxl/memory-devices.html
cxl-reskit/cxl-reskit: CXL Memory Resource Kit top-level repository - GitHub, accessed April 19, 2025, https://github.com/cxl-reskit/cxl-reskit
cxl(1) - NDCTL User Guide, accessed April 19, 2025, https://docs.pmem.io/ndctl-user-guide/v72.1/cxl-man-pages/cxl-1
CXL Address Translation Support For AMD Zen 5 Sees Linux Patches - Phoronix, accessed April 19, 2025, https://www.phoronix.com/news/AMD-Zen5-CXL-Translation-v1
Compute Express Link™ (CXL™), accessed April 19, 2025, https://computeexpresslink.org/wp-content/uploads/2024/02/20210812_Type3_MGMT_using_MCTP_CCI_ECN-errata-update20211116.pdf

March 17, 2025 / April 19, 2025

GTC Beyond CUDA

1. Introduction

1.1 Setting the Stage: NVIDIA's CUDA and its Dominance in AI Compute

NVIDIA Corporation, initially renowned for its graphics processing units (GPUs) powering the gaming industry, strategically pivoted over the last two decades to become the dominant force in artificial intelligence (AI) computing. A cornerstone of this transformation was the introduction of the Compute Unified Device Architecture (CUDA) in 2006. CUDA is far more than just a programming language; it represents NVIDIA's proprietary parallel computing platform and a comprehensive software ecosystem, encompassing compilers, debuggers, profilers, extensive libraries (like cuDNN for deep learning and cuBLAS for linear algebra), and development tools. This ecosystem unlocked the potential of GPUs for general-purpose processing (GPGPU), enabling developers to harness the massive parallelism inherent in NVIDIA hardware for computationally intensive tasks far beyond graphics rendering.

This strategic focus on software and hardware synergy has propelled NVIDIA to a commanding position in the AI market. Estimates consistently place NVIDIA's share of the AI accelerator and data center GPU market between 70% and 95%, with recent figures often citing 80% to 92%. This market leadership is reflected in steep financial growth: data center revenue reached figures such as $18.4 billion in a single quarter of 2023. High-performance GPUs like the A100, H100, and the upcoming Blackwell series have become the workhorses for training and deploying large-scale AI models, used by virtually all major technology companies and research institutions, including OpenAI, Google, and Meta. Consequently, CUDA has solidified its status as the de facto standard programming environment for GPU-accelerated computing, particularly within the AI domain, underpinning widely used frameworks like PyTorch and TensorFlow.

1.2 The Emerging "Beyond CUDA" Narrative: GTC Insights and Industry Momentum

Despite NVIDIA's entrenched position, a narrative exploring computational pathways "Beyond CUDA" is gaining traction, even surfacing within NVIDIA's own GPU Technology Conference (GTC) events. The GTC video segment discussed here, which starts at the 5 minute 27 second mark, focuses on alternatives. That focus signals that diversifying the AI compute stack is a relevant and acknowledged topic within the broader ecosystem.

This internal discussion is mirrored and amplified by external industry movements. Notably, the "Beyond CUDA Summit," organized by TensorWave (a cloud provider utilizing AMD accelerators) and featuring prominent figures like computer architects Jim Keller and Raja Koduri, explicitly aimed to challenge NVIDIA's dominance. This event, strategically held near NVIDIA's GTC 2025, centered on dissecting the "CUDA moat" and exploring viable alternatives, underscoring a growing industry-wide desire for greater hardware flexibility, cost efficiency, and reduced vendor lock-in.

1.3 Report Objectives and Structure

This report aims to provide an expert-level analysis of the evolving AI compute landscape, moving beyond the CUDA-centric view. It will dissect the concept of the "CUDA moat," examine the strategies being employed to challenge NVIDIA's dominance, and detail the alternative hardware and software solutions emerging across the AI workflow – encompassing pre-training, post-training (optimization and fine-tuning), and inference.

The analysis will draw upon insights derived from the specified GTC video segment, synthesizing this information with data and perspectives gathered from recent industry reports, technical analyses, and market commentary found in the provided research materials. The report is structured into the following key sections:

Crossing the Moat: Deconstructing CUDA's competitive advantages and analyzing industry strategies for diversification.
Pre-training Beyond CUDA: Examining alternative hardware and software for large-scale model training.
Post-training Beyond CUDA: Investigating non-CUDA tools and techniques for model optimization and fine-tuning.
Inference Beyond CUDA: Detailing the diverse hardware and software solutions for deploying models outside the CUDA ecosystem.
Industry Outlook and Conclusion: Assessing the current market dynamics, adoption trends, and the future trajectory of AI compute heterogeneity.

2. Crossing the Moat: Understanding and Challenging CUDA's Dominance

2.1 Historical Context and the Rise of CUDA

NVIDIA's journey to AI dominance was significantly shaped by the strategic introduction of CUDA in 2006. This platform marked a pivotal shift, enabling developers to utilize the parallel processing power of NVIDIA GPUs for general-purpose computing tasks, extending their application far beyond traditional graphics rendering. NVIDIA recognized the potential of parallel computing on its hardware architecture early on, developing CUDA as a proprietary platform to unlock this capability. This foresight, driven partly by academic research demonstrating GPU potential for scientific computing and initiatives like the Brook streaming language developed by future CUDA creator Ian Buck, provided NVIDIA with a crucial first-mover advantage.

CUDA was designed with developers in mind, abstracting away much of the underlying hardware complexity and allowing researchers and engineers to focus more on algorithms and applications rather than intricate hardware nuances. It provided APIs, libraries, and tools within familiar programming paradigms (initially C/C++, later Fortran and Python). Over more than a decade, CUDA matured with relatively limited competition from viable, comprehensive alternatives. This extended period allowed the platform and its ecosystem to become deeply embedded in academic research, high-performance computing (HPC), and, most significantly, the burgeoning field of AI.

2.2 Deconstructing the "CUDA Moat": Ecosystem, Lock-in, and Performance

The term "CUDA moat" refers to the collection of sustainable competitive advantages that protect NVIDIA's dominant position in the AI compute market, primarily derived from its tightly integrated hardware and software ecosystem. This moat is multifaceted:

Ecosystem Breadth and Network Effects:

The CUDA ecosystem is vast, encompassing millions of developers worldwide, thousands of companies, and a rich collection of optimized libraries (e.g., cuDNN, cuBLAS, TensorRT), sophisticated development and profiling tools, extensive documentation, and strong community support.

CUDA is also heavily integrated into academic curricula, ensuring a steady stream of new talent proficient in NVIDIA's tools.

This widespread adoption creates powerful network effects: as more developers and applications utilize CUDA, more tools and resources are created for it, further increasing its value and reinforcing its position as the standard.
High Switching Costs and Developer Inertia:

Companies and research groups have invested heavily in developing, testing, and optimizing codebases built upon CUDA.

Migrating these complex workflows to alternative platforms like AMD's ROCm or Intel's oneAPI represents a daunting task. It often requires significant code rewriting, retraining developers on new tools and languages, and introduces substantial risks related to achieving comparable performance, stability, and correctness.

This "inherent inertia" within established software ecosystems creates high switching costs, making organizations deeply reluctant to abandon their CUDA investments, even if alternatives offer potential benefits.
Performance Optimization and Hardware Integration:

CUDA provides developers with low-level access to NVIDIA GPU hardware, enabling fine-grained optimization to extract maximum performance.

This is critical in compute-intensive AI workloads. The tight integration between CUDA software and NVIDIA hardware features, such as Tensor Cores (specialized units for matrix multiplication), allows for significant acceleration.

Competitors often struggle to match this level of performance tuning due to the deep co-design of NVIDIA's hardware and software.

While programming Tensor Cores directly can involve "arcane knowledge" and dealing with undocumented behaviors, the availability of libraries like cuBLAS and CUTLASS abstracts some of this complexity.
Backward Compatibility:

NVIDIA has generally maintained backward compatibility for CUDA, allowing older code to run on newer GPU generations (though limitations exist, as newer CUDA versions require specific drivers and drop support for legacy hardware over time).

This perceived stability encourages long-term investment in the CUDA platform.
Vendor Lock-in:

The cumulative effect of this deep ecosystem, high switching costs, performance advantages on NVIDIA hardware, and established workflows results in significant vendor lock-in.

Developers and organizations become dependent on NVIDIA's proprietary platform, limiting hardware choices, potentially stifling competition, and giving NVIDIA considerable market power.

2.3 Industry Strategies for Diversification

Recognizing the challenges posed by the CUDA moat, various industry players are pursuing strategies to foster a more diverse and open AI compute ecosystem. These efforts span competitor platform development, the promotion of open standards and abstraction layers, and initiatives by large-scale users.

Competitor Platform Development:
- AMD ROCm (Radeon Open Compute):
  
  AMD's primary answer to CUDA is ROCm, an open-source software stack for GPU computing.
  
  Key to its strategy is the Heterogeneous-computing Interface for Portability (HIP), designed to be syntactically similar to CUDA, easing code migration.
  
  AMD provides the HIPIFY tool to automate the conversion of CUDA source code to HIP C++, although manual adjustments are often necessary.
  
  Despite progress, ROCm has faced significant challenges. Historically, it supported a limited range of AMD GPUs, suffered from stability issues and performance gaps compared to CUDA, and lagged in adopting new features and supporting the latest hardware.
  
  However, AMD continues to invest heavily in ROCm, improving framework support (e.g., native PyTorch integration), expanding hardware compatibility (including consumer GPUs, albeit sometimes unofficially or with delays), and achieving notable adoption for its Instinct MI300 series accelerators by major hyperscalers.
- Intel oneAPI:
  
  Intel promotes oneAPI as an open, unified, cross-architecture programming model based on industry standards, particularly SYCL (which Intel implements as Data Parallel C++, or DPC++).
  
  It aims to provide portability across diverse hardware types, including CPUs, GPUs (Intel integrated and discrete), FPGAs, and other accelerators, explicitly positioning itself as an alternative to CUDA lock-in.
  
  oneAPI is backed by the Unified Acceleration (UXL) Foundation, involving multiple companies.
  
  While offering a promising vision for heterogeneity, oneAPI is a relatively newer initiative compared to CUDA and faces the challenge of building a comparable ecosystem and achieving widespread adoption.
- Other Initiatives:
  
  OpenCL, an earlier open standard for heterogeneous computing, remains relevant, particularly in mobile and embedded systems, but has struggled to gain traction in high-performance AI due to fragmentation, slow evolution, and performance limitations compared to CUDA.
  
  Vulkan Compute, leveraging the Vulkan graphics API, offers low-level control and potential performance benefits but has a steeper learning curve and a less mature ecosystem for general-purpose compute.
  
  Newer entrants like Modular Inc.'s Mojo programming language and MAX platform aim to combine Python's usability with C/CUDA performance, targeting AI hardware programmability directly.
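The syntactic similarity between CUDA and HIP described above can be illustrated with a toy source-to-source renamer. This is a minimal sketch, not how AMD's HIPIFY tools work internally: the mapping table is a small hand-picked subset of runtime API names, the function name `toy_hipify` is hypothetical, and a regex cannot handle the cases that make manual adjustments necessary in practice.

```python
import re

# Small illustrative subset of CUDA -> HIP renames. The real HIPIFY tools
# cover far more of the API and handle cases a plain regex cannot.
CUDA_TO_HIP = {
    "cuda_runtime.h": "hip/hip_runtime.h",
    "cudaMalloc": "hipMalloc",
    "cudaMemcpy": "hipMemcpy",
    "cudaMemcpyHostToDevice": "hipMemcpyHostToDevice",
    "cudaMemcpyDeviceToHost": "hipMemcpyDeviceToHost",
    "cudaFree": "hipFree",
    "cudaDeviceSynchronize": "hipDeviceSynchronize",
}

# Longest names first, with word boundaries, so that
# cudaMemcpyHostToDevice is never matched as cudaMemcpy.
_NAMES = sorted(CUDA_TO_HIP, key=len, reverse=True)
_PATTERN = re.compile(r"\b(" + "|".join(re.escape(n) for n in _NAMES) + r")\b")


def toy_hipify(source: str) -> str:
    """Rename known CUDA identifiers to their HIP equivalents (toy version)."""
    return _PATTERN.sub(lambda m: CUDA_TO_HIP[m.group(1)], source)


# Hypothetical CUDA host-code fragment used only as input text.
cuda_src = """#include <cuda_runtime.h>
cudaMalloc(&d_x, n * sizeof(float));
cudaMemcpy(d_x, h_x, n * sizeof(float), cudaMemcpyHostToDevice);
kernel<<<blocks, threads>>>(d_x, n);
cudaDeviceSynchronize();
cudaFree(d_x);
"""

hip_src = toy_hipify(cuda_src)
print(hip_src)
```

The kernel launch line passes through unchanged because this sketch only renames identifiers in its table. Anything it does not recognize is left as-is, which is one reason automated conversion often needs a manual follow-up pass.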
Open Standards and Abstraction Layers:
- A significant trend involves leveraging higher-level AI frameworks like PyTorch, TensorFlow, and JAX, which can potentially abstract away underlying hardware specifics.
  
  If a model is written in PyTorch, the ideal scenario is that it can run efficiently on NVIDIA, AMD, or Intel hardware simply by targeting the appropriate backend (CUDA, ROCm, oneAPI/SYCL).
- The development of PyTorch 2.0, featuring TorchDynamo for graph capture and TorchInductor as a compiler backend, represents a move towards greater flexibility.
  
  TorchInductor can generate code for different backends, including using OpenAI Triton for GPUs or OpenMP/C++ for CPUs, potentially reducing direct dependence on CUDA libraries for certain operations.
- OpenAI Triton itself is positioned as a Python-like language and compiler for writing high-performance custom GPU kernels, aiming to achieve performance comparable to CUDA C++ but with significantly improved developer productivity.
  
  While currently focused on NVIDIA GPUs, its open-source nature holds potential for broader hardware support.
- OpenXLA (Accelerated Linear Algebra), originating from Google's XLA compiler used in TensorFlow and JAX, is another initiative focused on creating a compiler ecosystem that can target diverse hardware backends.
- However, these abstraction layers are not a panacea. The abstraction is often imperfect ("leaky"), many essential libraries within the framework ecosystems are still optimized primarily for CUDA or lack robust support for alternatives, performance parity is not guaranteed, and NVIDIA exerts considerable influence on the development roadmap of frameworks like PyTorch, potentially steering them in ways that favor CUDA.
  
  Achieving true first-class support for alternative backends within these dominant frameworks remains a critical challenge.
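The idea of "targeting the appropriate backend" can be sketched as a small device-selection helper. This is an illustrative sketch under stated assumptions: the helper name `describe_backend` and the stand-in `fake_torch` objects are hypothetical, and the checks rely on PyTorch conventions as commonly documented (ROCm builds expose the GPU through the `cuda` device name and set `torch.version.hip`; recent releases expose Intel GPUs through `torch.xpu`). The helper takes the module as an argument so it can be exercised without PyTorch installed.

```python
from types import SimpleNamespace


def describe_backend(torch_mod):
    """Return (device_string, backend_label) for a torch-like module."""
    cuda = getattr(torch_mod, "cuda", None)
    if cuda is not None and cuda.is_available():
        # ROCm builds of PyTorch reuse the "cuda" device name; they are
        # distinguished by a non-empty torch.version.hip.
        hip = getattr(getattr(torch_mod, "version", None), "hip", None)
        return ("cuda", "AMD ROCm (HIP)" if hip else "NVIDIA CUDA")
    xpu = getattr(torch_mod, "xpu", None)
    if xpu is not None and xpu.is_available():
        return ("xpu", "Intel XPU")
    return ("cpu", "CPU")


def fake_torch(cuda=False, hip=None, xpu=False):
    """Hypothetical stand-in for the torch module, for demonstration only."""
    return SimpleNamespace(
        cuda=SimpleNamespace(is_available=lambda: cuda),
        xpu=SimpleNamespace(is_available=lambda: xpu),
        version=SimpleNamespace(hip=hip),
    )


# "6.0" is a made-up version string; only its truthiness matters here.
for label, mod in [
    ("nvidia", fake_torch(cuda=True)),
    ("amd", fake_torch(cuda=True, hip="6.0")),
    ("intel", fake_torch(xpu=True)),
    ("none", fake_torch()),
]:
    print(label, describe_backend(mod))
```

With PyTorch installed, the same helper could be called as `describe_backend(torch)`. Model code would then move tensors to the returned device string, which is the sense in which the framework hides the vendor, while the "leaky" parts discussed above sit in the libraries and kernels underneath.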
Hyperscaler Initiatives: The largest consumers of AI hardware – cloud hyperscalers like Google (TPUs), AWS (Trainium, Inferentia), Meta, and Microsoft – have the resources and motivation to develop their own custom AI silicon and potentially accompanying software stacks. This strategy allows them to optimize hardware for their specific workloads, control their supply chain, reduce costs, and crucially, avoid long-term dependence on NVIDIA. Their decisions to adopt competitor hardware (like AMD MI300X) or build in-house solutions represent perhaps the most significant direct threat to the CUDA moat's long-term durability.
Direct Low-Level Programming (PTX): For organizations seeking maximum performance and control, dropping below CUDA C++ and its standard libraries to program directly in NVIDIA's assembly-like Parallel Thread Execution (PTX) language is an option, as demonstrated by DeepSeek AI. PTX acts as an intermediate representation between high-level CUDA code and the GPU's machine code. While this allows for fine-grained optimization potentially exceeding standard CUDA libraries, PTX gains architecture-specific instructions with each GPU generation and is even more tightly locked to NVIDIA hardware, making it a highly complex and specialized approach unsuitable for most developers.

2.4 Implications of the Competitive Landscape

The analysis of CUDA's dominance and the strategies to counter it reveals several key points about the competitive dynamics. Firstly, the resilience of NVIDIA's market position stems less from insurmountable technical superiority in every aspect and more from the profound inertia within the software ecosystem. The vast investment in CUDA codebases, developer skills, and tooling creates significant friction against adopting alternatives. This suggests that successful competitors need not only technically competent hardware but also a superior developer experience, seamless migration paths, robust framework integration, and compelling value propositions (e.g., cost, specific features) to overcome this inertia.

Secondly, abstraction layers like PyTorch and compilers like Triton present a complex scenario. While they hold the promise of hardware agnosticism, potentially weakening the direct CUDA lock-in, NVIDIA's deep integration and influence within these ecosystems mean they can also inadvertently reinforce the moat. The best-supported, highest-performing path often remains via CUDA. The ultimate impact of these layers depends critically on whether alternative hardware vendors can achieve true first-class citizenship and performance parity within them.

Thirdly, the "Beyond CUDA" movement suffers from fragmentation. The existence of multiple competing alternatives (ROCm, oneAPI, OpenCL, Vulkan Compute, Mojo, etc.) risks diluting development efforts and hindering the ability of any single alternative to achieve the critical mass needed to effectively challenge the unified CUDA front. This mirrors the historical challenges faced by OpenCL due to vendor fragmentation and lack of unified direction. Overcoming this may require market consolidation or the emergence of clear winners for specific niches.

Finally, the hyperscale cloud providers represent a powerful disruptive force. Their immense scale, financial resources, and strategic imperative to avoid vendor lock-in position them uniquely to alter the market dynamics. Their adoption of alternative hardware or the development of proprietary silicon and software stacks could create viable alternative ecosystems much faster than traditional hardware competitors acting alone.

Table 2.1: CUDA Moat Components and Counter-Strategies

| Moat Component | NVIDIA's Advantage | Competitor Strategies | Key Challenges for Competitors |
|---|---|---|---|
| Ecosystem Size | Millions of developers, vast community, academic integration | Build communities around ROCm/oneAPI/Mojo; Leverage open-source framework communities (PyTorch, TF) | Reaching critical mass; Overcoming established network effects; Competing with NVIDIA's resources |
| Library Maturity | Highly optimized, extensive libraries (cuDNN, cuBLAS, TensorRT) | Develop competing libraries (ROCm libraries, oneAPI libraries); Contribute to framework-level ops | Achieving feature/performance parity; Ensuring stability and robustness; Breadth of domain coverage |
| Developer Familiarity | Decades of use, established workflows, available talent pool | Simplify APIs (e.g., HIP similarity to CUDA); Provide migration tools (HIPIFY, SYCLomatic); Focus on usability | Overcoming learning curves; Convincing developers of stability/benefits; Retraining workforce |
| Performance Optimization | Tight hardware-software co-design; Low-level access; Tensor Core integration | Optimize ROCm/oneAPI compilers; Improve framework backend performance; Develop specialized hardware | Matching NVIDIA's optimization level; Accessing/optimizing specialized hardware features (like Tensor Cores) |
| Switching Costs | High cost/risk of rewriting code, retraining, validating | Provide automated porting tools; Ensure framework compatibility; Offer significant cost/performance benefits | Imperfect porting tools; Ensuring functional equivalence and performance; Justifying the migration effort |
| Framework Integration | Deep integration & influence in PyTorch/TF; Optimized paths | Achieve native, high-performance support in frameworks; Leverage open-source contributions | Competing with NVIDIA's influence; Ensuring timely support for new framework features; Library dependencies |
| Hyperscaler Dependence | Major cloud providers are largest customers, rely on CUDA | Hyperscalers adopt AMD/Intel; Develop custom silicon/software; Promote open standards | Hyperscalers' internal efforts may not benefit broader market; Competing for hyperscaler design wins |

3. Pre-training Beyond CUDA

3.1 Challenges in Pre-training

The pre-training phase for state-of-the-art AI models, particularly large language models (LLMs) and foundation models, involves computations at an immense scale. This process demands not only massive parallel processing capabilities but also exceptional stability and reliability over extended periods, often weeks or months. Historically, the maturity, performance, and robustness of NVIDIA's hardware coupled with the CUDA ecosystem made it the overwhelmingly preferred choice for these demanding tasks, establishing a high bar for any potential alternatives.

3.2 Alternative Hardware Accelerators

Despite NVIDIA's dominance, several alternative hardware platforms are being positioned and increasingly adopted for large-scale AI pre-training:

AMD Instinct Series (MI200, MI300X/MI325):

AMD's Instinct line, particularly the MI300 series, directly targets NVIDIA's high-end data center GPUs like the A100 and H100.

These accelerators offer competitive specifications, particularly in areas like memory capacity and bandwidth, which are critical for large models. They have gained traction with major hyperscalers, including Microsoft Azure, Oracle Cloud, and Meta, who see them as a viable alternative to reduce reliance on NVIDIA and potentially lower costs.

Cloud platforms like TensorWave are also building services based on AMD Instinct hardware.

AMD emphasizes a strategy centered around open standards and cost-effectiveness compared to NVIDIA's offerings.
Intel Gaudi Accelerators (Gaudi 2, Gaudi 3):

Intel's Gaudi family represents dedicated ASICs designed specifically for AI training and inference workloads.

Intel markets Gaudi accelerators, such as the recent Gaudi 3, as a significantly more cost-effective alternative to NVIDIA's flagship GPUs, aiming to capture a segment of the market prioritizing value.

Gaudi accelerators feature integrated high-speed networking (Ethernet), facilitating the construction of large training clusters.

It's noteworthy that deploying models on Gaudi often relies on Intel's specific SynapseAI software stack, which may differ from the broader oneAPI initiative in some aspects.
Google TPUs (Tensor Processing Units):

Developed in-house by Google, TPUs are custom ASICs highly optimized for TensorFlow and JAX workloads.

They have been instrumental in training many of Google's largest models and are available through Google Cloud Platform. TPUs demonstrate the potential of domain-specific architectures tailored explicitly for machine learning computations.
Other Emerging Architectures:

The landscape is further diversifying with other players. Amazon Web Services (AWS) offers its Trainium chips for training.

Reports suggest OpenAI and Microsoft may be developing their own custom AI accelerators.

Startups like Cerebras Systems (with wafer-scale engines) and Groq (focused on low-latency inference, but indicative of architectural innovation) are exploring novel designs.

Huawei also competes with its Ascend AI chips, particularly in the Chinese market, based on its Da Vinci architecture.

This proliferation of hardware underscores the intense interest and investment in finding alternatives or complements to NVIDIA's GPUs.

3.3 Software Stacks for Large-Scale Training

Hardware alone is insufficient; robust software stacks are essential to harness these accelerators for pre-training:

ROCm Ecosystem:

Training on AMD Instinct GPUs primarily relies on the ROCm software stack, particularly its integration with major AI frameworks like PyTorch and TensorFlow.

While functional and improving, the ROCm ecosystem's maturity, ease of use, breadth of library support, and performance consistency have historically been points of concern compared to the highly refined CUDA ecosystem.

Success hinges on continued improvements in ROCm's stability and performance within these critical frameworks.
oneAPI and Supporting Libraries:

Intel's oneAPI aims to provide the software foundation for training on its diverse hardware portfolio (CPUs, GPUs, Gaudi accelerators).

It utilizes DPC++ (based on SYCL) as the core language and includes libraries optimized for deep learning tasks, integrating with frameworks like PyTorch and TensorFlow.

The goal is a unified programming experience across different Intel architectures, simplifying development for heterogeneous environments.
Leveraging PyTorch/JAX/TensorFlow with Alternative Backends:

Regardless of the underlying hardware (AMD, Intel, Google TPU), the primary interface for most researchers and developers conducting large-scale pre-training remains high-level frameworks like PyTorch, JAX, or TensorFlow.

The viability of non-NVIDIA hardware for pre-training is therefore heavily dependent on the quality, performance, and completeness of the respective framework backends (e.g., PyTorch on ROCm, JAX on TPU, TensorFlow on oneAPI).
The Role of Compilers (Triton, XLA):

Compilers play a crucial role in bridging the gap between high-level framework code and low-level hardware execution. OpenAI Triton, used as a backend component within PyTorch 2.0's Inductor, translates Python-based operations into efficient GPU code (currently PTX for NVIDIA, but potentially adaptable).

Similarly, XLA optimizes and compiles TensorFlow and JAX graphs for various targets, including TPUs and GPUs.

The efficiency and target-awareness of these compilers are critical for achieving high performance on diverse hardware backends.
Emerging Languages/Platforms (Mojo):

New programming paradigms like Mojo are being developed with the explicit goal of providing a high-performance, Python-syntax-compatible language for programming heterogeneous AI hardware, including GPUs and accelerators from various vendors.

If successful, Mojo could offer a fundamentally different approach to AI software development, potentially bypassing some limitations of existing C++-based alternatives or framework-specific backends.
Direct PTX Programming (DeepSeek Example):

The case of DeepSeek AI utilizing PTX directly on NVIDIA H800 GPUs to achieve highly efficient training for their 671B parameter MoE model demonstrates an extreme optimization strategy.

By bypassing standard CUDA libraries and writing closer to the hardware's instruction set, they reportedly achieved significant efficiency gains.

This highlights that even within the NVIDIA ecosystem, CUDA itself may not represent the absolute performance ceiling for sophisticated users willing to tackle extreme complexity, though it remains far beyond the reach of typical development workflows.

3.4 Implications for Pre-training Beyond CUDA

The pre-training landscape, while still dominated by NVIDIA, is showing signs of diversification, driven by cost pressures and strategic initiatives from competitors and hyperscalers. However, several factors shape the trajectory. Firstly, the sheer computational scale of pre-training necessitates high-end, specialized hardware. This means the battleground for pre-training beyond CUDA is primarily contested among major silicon vendors (NVIDIA, AMD, Intel, Google) and potentially large hyperscalers with custom chip programs, rather than being open to a wide array of lower-end hardware.

Secondly, software maturity remains the most significant bottleneck for alternative hardware platforms in the pre-training domain. While hardware like AMD Instinct and Intel Gaudi offer compelling specifications and cost advantages, their corresponding software stacks (ROCm, oneAPI/SynapseAI) are consistently perceived as less mature, stable, or easy to deploy at scale compared to the battle-hardened CUDA ecosystem. For expensive, long-duration pre-training runs where failures can be catastrophic, the proven reliability of CUDA often outweighs the potential benefits of alternatives, hindering faster adoption.

Thirdly, the reliance on high-level frameworks like PyTorch and JAX makes robust backend integration paramount. Developers interact primarily through these frameworks, meaning the success of non-NVIDIA hardware hinges less on the intricacies of ROCm or SYCL syntax itself, and more on the seamlessness, performance, and feature completeness of the framework's support for that hardware. This elevates the strategic importance of compiler technologies like Triton and XLA, which are responsible for translating framework operations into efficient machine code for diverse targets. Vendors must ensure their hardware is a first-class citizen within these framework ecosystems to compete effectively in pre-training.

4. Post-training Beyond CUDA: Optimization and Fine-tuning

4.1 Importance of Post-training

Once a large AI model has been pre-trained, further steps are typically required before it can be effectively deployed in real-world applications. These post-training processes include optimization – techniques to reduce the model's size, decrease inference latency, and improve computational efficiency – and fine-tuning – adapting the general-purpose pre-trained model to perform well on specific downstream tasks or datasets. These stages often have different computational profiles and requirements compared to the massive scale of pre-training, potentially opening the door to a broader range of hardware and software solutions.

4.2 Techniques and Tools Outside the CUDA Ecosystem

Several techniques and toolkits facilitate post-training optimization and fine-tuning on non-NVIDIA hardware:

Model Quantization: Quantization is a widely used optimization technique that reduces the numerical precision of model weights and activations (e.g., from 32-bit floating-point (FP32) to 8-bit integer (INT8) or even lower). This significantly shrinks the model's memory footprint and often accelerates inference speed, particularly on hardware with specialized support for lower-precision arithmetic.
- OpenVINO NNCF:
  
  Intel's OpenVINO toolkit includes the Neural Network Compression Framework (NNCF), a Python package offering various optimization algorithms.
  
  NNCF supports post-training quantization (PTQ), which optimizes a model after training without requiring retraining, making it relatively easy to apply but potentially causing some accuracy degradation.
  
  It also supports quantization-aware training (QAT), which incorporates quantization into the training or fine-tuning process itself, typically yielding better accuracy than PTQ at the cost of requiring additional training data and computation.
  
  NNCF can process models from various formats (OpenVINO IR, PyTorch, ONNX, TensorFlow) and targets deployment on Intel hardware (CPUs, integrated GPUs, discrete GPUs, VPUs) via the OpenVINO runtime.
- Other Approaches:
  
  While less explicitly detailed for ROCm or oneAPI in the provided materials, quantization capabilities are often integrated within AI frameworks themselves or through supporting libraries. The BitsandBytes library, known for enabling quantization techniques like QLoRA, recently added experimental multi-backend support, potentially enabling its use on AMD and Intel GPUs beyond CUDA.
  
  Frameworks running on ROCm or oneAPI backends might leverage underlying hardware support for lower precisions.
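To make the FP32-to-INT8 idea concrete, the following is a minimal sketch of symmetric, per-tensor post-training quantization in plain Python. It is a textbook scheme chosen for illustration, not the algorithm NNCF or any other toolkit mentioned above actually uses; the function names and the sample weights are hypothetical.

```python
def quantize_int8(values):
    """Symmetric per-tensor quantization of floats to the range [-127, 127]."""
    max_abs = max(abs(v) for v in values)
    # One scale for the whole tensor; fall back to 1.0 for an all-zero tensor.
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(v / scale))) for v in values]
    return quantized, scale


def dequantize(quantized, scale):
    """Map INT8 codes back to approximate float values."""
    return [q * scale for q in quantized]


# Hypothetical weights standing in for one small tensor of a trained model.
weights = [0.5, -1.27, 0.003, 1.0]

q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))

print("int8 codes:", q)
print("scale:", scale)
print("max abs error:", max_error)
```

Each weight is stored in one byte instead of four, and the rounding error per weight is bounded by half the scale. The small weight 0.003 rounds to zero here, which hints at why post-training quantization can cost some accuracy and why quantization-aware training is sometimes preferred.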
Pruning and Compression: Model pruning involves removing redundant weights or connections within the neural network to reduce its size and computational cost. NNCF also provides methods for structured and unstructured pruning, which can be applied during training or fine-tuning.
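Unstructured pruning can be sketched with the common magnitude-based baseline: zero out the weights with the smallest absolute values until a target sparsity is reached. This is an illustrative sketch in plain Python under that assumption, not a description of NNCF's pruning algorithms; `magnitude_prune` and the sample weights are hypothetical.

```python
def magnitude_prune(weights, sparsity):
    """Zero out the smallest-magnitude fraction `sparsity` of the weights."""
    if not 0.0 <= sparsity <= 1.0:
        raise ValueError("sparsity must be between 0 and 1")
    n_prune = int(len(weights) * sparsity)
    # Indices ordered by absolute value, smallest first.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    to_zero = set(order[:n_prune])
    return [0.0 if i in to_zero else w for i, w in enumerate(weights)]


# Hypothetical weights standing in for one layer of a trained model.
weights = [0.8, -0.05, 0.3, 0.01, -0.6, 0.02]
pruned = magnitude_prune(weights, sparsity=0.5)

print(pruned)
```

Structured pruning differs in that it removes whole rows, channels, or heads rather than individual weights, which tends to map more directly onto hardware speedups.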
Fine-tuning Frameworks on ROCm/oneAPI: Fine-tuning typically utilizes the same high-level AI frameworks employed during pre-training, such as PyTorch, TensorFlow, or JAX, along with libraries like Hugging Face Transformers and PEFT (Parameter-Efficient Fine-Tuning).
- ROCm Example:
  
  The process of fine-tuning LLMs using techniques like LoRA (Low-Rank Adaptation) on AMD GPUs via ROCm is documented.
  
  Examples demonstrate using PyTorch, the Hugging Face `transformers` library, and `peft` with the `SFTTrainer` on ROCm-supported hardware, indicating that standard parameter-efficient fine-tuning workflows can be executed within the ROCm ecosystem.
- Intel Platforms:
  
  Fine-tuning can also be performed on Intel hardware, such as Gaudi accelerators or potentially GPUs supported by oneAPI, leveraging the respective framework integrations.
  
  The choice of hardware depends on the scale of the fine-tuning task.
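Part of why LoRA-style fine-tuning fits on a wider range of hardware is simple arithmetic. LoRA freezes a weight matrix W of shape d_out × d_in and trains only a low-rank update B·A, where B is d_out × r and A is r × d_in. The sketch below counts trainable parameters for one such matrix; the dimensions and rank are hypothetical values chosen for illustration, not figures from the ROCm examples.

```python
def full_params(d_out, d_in):
    """Trainable parameters when the whole weight matrix is fine-tuned."""
    return d_out * d_in


def lora_params(d_out, d_in, rank):
    """Trainable parameters for a LoRA update B (d_out x r) times A (r x d_in)."""
    return rank * (d_out + d_in)


# Hypothetical sizes: one 4096 x 4096 projection matrix, LoRA rank 8.
d_out = d_in = 4096
rank = 8

full = full_params(d_out, d_in)
lora = lora_params(d_out, d_in, rank)
fraction = lora / full

print(f"full fine-tuning: {full:,} trainable parameters")
print(f"LoRA (r={rank}):    {lora:,} trainable parameters")
print(f"fraction trained: {fraction:.4%}")
```

Under these assumptions the adapter trains well under 1% of the matrix's parameters, which shrinks gradient and optimizer-state memory accordingly. The frozen base weights still have to fit on the device.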
Role of Hugging Face Optimum: Libraries like Hugging Face Optimum, particularly Optimum Intel, play a crucial role in simplifying the post-training workflow. Optimum Intel integrates OpenVINO and NNCF capabilities directly into the Hugging Face ecosystem, allowing users to easily optimize and quantize models from the Hugging Face Hub for deployment on Intel hardware. This integration streamlines the process for developers already working within the popular Hugging Face environment.

4.3 Hardware Considerations for Efficient Post-training

Unlike pre-training, which often necessitates clusters of the most powerful and expensive accelerators, fine-tuning and optimization tasks can sometimes be accomplished effectively on a wider range of hardware. Depending on the size of the model being fine-tuned and the specific task, single high-end GPUs (including professional or even consumer-grade NVIDIA or AMD cards), Intel Gaudi accelerators, or potentially even powerful multi-core CPUs might suffice. This broader hardware compatibility increases the potential applicability of non-NVIDIA solutions in the post-training phase.
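Whether a single device "might suffice" depends first on whether the model's weights fit in its memory. A rough back-of-the-envelope estimate is sketched below. It counts weights only, ignoring activations, gradients, optimizer state, and runtime overhead, and the 7-billion-parameter model size is a hypothetical example.

```python
# Bytes needed to store one parameter at each precision.
BYTES_PER_PARAM = {"fp32": 4.0, "fp16": 2.0, "int8": 1.0, "int4": 0.5}


def weight_memory_gb(num_params, dtype):
    """Approximate memory for the weights alone, in decimal gigabytes."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


# Hypothetical model size used only for illustration.
num_params = 7e9

estimates = {dtype: weight_memory_gb(num_params, dtype) for dtype in BYTES_PER_PARAM}
for dtype, gb in estimates.items():
    print(f"{dtype}: {gb:.1f} GB")
```

Real fine-tuning needs noticeably more than these figures, but the ratios show how lower precision and parameter-efficient methods move a workload from a multi-accelerator cluster toward a single card.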

4.4 Implications for Post-training Beyond CUDA

The post-training stage presents distinct opportunities and challenges for CUDA alternatives. A key observation is the apparent strength of Intel's OpenVINO ecosystem in the optimization domain. The detailed documentation and tooling around NNCF for quantization and pruning provide a relatively mature pathway for optimizing models specifically for Intel's diverse hardware portfolio (CPU, iGPU, dGPU, VPU). This specialized toolkit gives Intel a potential advantage over AMD in this specific phase, as ROCm's dedicated optimization tooling appears less emphasized in the provided research beyond its core framework support.

Furthermore, the success of fine-tuning on alternative platforms like ROCm hinges critically on the robustness and feature completeness of the framework backends. As demonstrated by the LoRA example on ROCm, fine-tuning workflows rely directly on the stability and capabilities of the PyTorch (or other framework) implementation for that specific hardware. Any deficiencies in the ROCm or oneAPI backends will directly impede efficient fine-tuning, reinforcing the idea that mature software support is as crucial as raw hardware power.

Finally, there is a clear trend towards integrating optimization techniques directly into higher-level frameworks and libraries, exemplified by Hugging Face Optimum Intel. This suggests that developers may increasingly prefer using these integrated tools within their familiar framework environments rather than engaging with standalone, vendor-specific optimization toolkits. This trend further underscores the strategic importance for hardware vendors to ensure seamless and performant integration of their platforms and optimization capabilities within the dominant AI frameworks.

Table 4.1: Non-CUDA Model Optimization & Fine-tuning Tools

| Tool/Platform | Key Techniques | Target Hardware | Supported Frameworks/Formats | Ease of Use/Maturity (Qualitative) |
|---|---|---|---|---|
| OpenVINO NNCF | PTQ, QAT, Pruning (Structured/Unstructured) | Intel CPU, iGPU, dGPU, VPU | OpenVINO IR, PyTorch, TF, ONNX | Relatively mature and well-documented for Intel ecosystem; Integrated with HF Optimum Intel |
| ROCm + PyTorch/PEFT | Fine-tuning (e.g., LoRA, Full FT) | AMD GPUs (Instinct, Radeon) | PyTorch, HF Transformers | Relies on ROCm backend maturity for PyTorch; Examples exist, but ecosystem maturity concerns remain |
| oneAPI Libraries | Likely includes optimization/quantization libraries (details limited in snippets) | Intel CPU, GPU, Gaudi | PyTorch, TF (via framework integration) | Aims for unified model, but specific optimization tool maturity less clear from snippets compared to NNCF |
| BitsandBytes (Multi-backend) | Quantization (e.g., for QLoRA) | NVIDIA, AMD, Intel (Experimental) | PyTorch | Experimental support for non-NVIDIA; Requires specific installation/compilation |
| Intel Gaudi + SynapseAI | Fine-tuning | Intel Gaudi Accelerators | PyTorch, TF (via SynapseAI) | Specific stack for Gaudi; Maturity relative to CUDA less established |

5. Inference Beyond CUDA

5.1 The Inference Landscape: Diversity and Optimization

The inference stage, where trained and optimized models are deployed to make predictions on new data, presents a significantly different set of requirements compared to training. While training often prioritizes raw throughput and the ability to handle massive datasets and models, inference deployment frequently emphasizes low latency, high throughput for concurrent requests, cost-effectiveness, and power efficiency. This diverse set of optimization goals leads to a wider variety of hardware platforms and software solutions being employed for inference, creating more opportunities for non-NVIDIA technologies.

5.2 Diverse Hardware for Deployment

The hardware landscape for AI inference is notably heterogeneous:

CPUs & Integrated GPUs (Intel):

Standard CPUs and the integrated GPUs found in many systems (particularly from Intel) are common inference targets, especially when cost and accessibility are key factors. Toolkits like Intel's OpenVINO are specifically designed to optimize model execution on this widely available hardware.
Dedicated Inference Chips (ASICs):

Application-Specific Integrated Circuits (ASICs) designed explicitly for inference offer high performance and power efficiency for specific types of neural network operations. Examples include AWS Inferentia and Google TPUs (which are also used for inference).
FPGAs (Field-Programmable Gate Arrays):

FPGAs offer hardware reprogrammability, providing flexibility and potentially very low latency for certain inference tasks. They can be adapted to specific model architectures and evolving requirements.
Edge Devices & NPUs:

The proliferation of AI at the edge (in devices like smartphones, cameras, vehicles, and IoT sensors) drives demand for efficient inference on resource-constrained hardware. This often involves specialized Neural Processing Units (NPUs) or optimized software running on low-power CPUs or GPUs. Intel's Movidius Vision Processing Units (VPUs), accessible via OpenVINO, are an example of such edge-focused hardware.
AMD/Intel Data Center & Consumer GPUs:

Data center GPUs from AMD (Instinct series) and Intel (Data Center GPU Max Series), as well as consumer-grade GPUs (AMD Radeon, Intel Arc), are also viable platforms for inference workloads. Software support comes via ROCm, oneAPI, or cross-platform runtimes like OpenVINO and ONNX Runtime.

5.3 Software Frameworks and Inference Servers

Deploying models efficiently requires specialized software frameworks and servers:

OpenVINO Toolkit & Model Server:

Intel's OpenVINO plays a significant role in the non-CUDA inference space. It provides tools (like NNCF) to optimize models trained in various frameworks and a runtime engine to execute these optimized models efficiently across Intel's hardware portfolio (CPU, iGPU, dGPU, VPU). OpenVINO also integrates with ONNX Runtime as an execution provider and offers its own OpenVINO Model Server for deployment. While some commentary questions its popularity relative to alternatives like Triton, it provides a clear path for inference on Intel hardware.
ROCm Inference Libraries (MIGraphX):

AMD provides inference optimization libraries within the ROCm ecosystem, such as MIGraphX. These likely function as compilation targets or backends for higher-level frameworks or standardized runtimes like ONNX Runtime when deploying on AMD GPUs.
ONNX Runtime:

The Open Neural Network Exchange (ONNX) format and its corresponding ONNX Runtime engine are crucial enablers of cross-platform inference. ONNX Runtime acts as an abstraction layer, allowing models trained in frameworks like PyTorch or TensorFlow and exported to the ONNX format to be executed on a wide variety of hardware backends through its Execution Provider (EP) interface. Supported EPs include CUDA, TensorRT (NVIDIA), OpenVINO (Intel), ROCm (AMD), DirectML (Windows), CPU, and others. This significantly enhances model portability beyond the confines of a single vendor's ecosystem.
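In practice, the Execution Provider mechanism amounts to an ordered preference list with the CPU as the fallback. The sketch below shows that selection logic in plain Python. The helper `pick_providers` and the preference order are assumptions made for illustration; the strings are the provider identifiers ONNX Runtime uses. In a real deployment the available list would typically come from `onnxruntime.get_available_providers()` and the result would be passed as `providers=` when creating an `onnxruntime.InferenceSession`.

```python
# Preference order for a non-CUDA deployment (an illustrative choice, not a recommendation).
PREFERRED = [
    "ROCMExecutionProvider",      # AMD GPUs
    "OpenVINOExecutionProvider",  # Intel CPU / iGPU / dGPU
    "DmlExecutionProvider",       # DirectML on Windows
    "CPUExecutionProvider",       # universal fallback
]


def pick_providers(available, preferred=PREFERRED):
    """Keep the preferred providers that are available, in order, ending with CPU.

    `pick_providers` is a hypothetical helper, not part of ONNX Runtime.
    """
    chosen = [p for p in preferred if p in available]
    if "CPUExecutionProvider" not in chosen:
        chosen.append("CPUExecutionProvider")
    return chosen


if __name__ == "__main__":
    # Example: a machine whose ONNX Runtime build only has OpenVINO and CPU support.
    print(pick_providers(["CPUExecutionProvider", "OpenVINOExecutionProvider"]))
```

Keeping the CPU provider at the end means the same deployment script still runs, more slowly, on a machine without any accelerator backend.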
NVIDIA Triton Inference Server:

While developed by NVIDIA, Triton is an open-source inference server designed for flexibility. It supports multiple model formats (TensorRT, TensorFlow, PyTorch, ONNX) and backends (including OpenVINO, Python custom backends, ONNX Runtime). This architecture theoretically allows Triton to serve models using non-CUDA backends if appropriately configured. There is active discussion and development work on enabling backends like ROCm (via ONNX Runtime) for Triton, which could further position it as a more hardware-agnostic serving solution. However, its primary adoption and optimization focus remain heavily associated with NVIDIA GPUs.
Alternatives/Complements to NVIDIA Triton:

The inference serving landscape includes several other solutions. vLLM has emerged as a highly optimized library specifically for LLM inference, utilizing techniques like PagedAttention and Continuous Batching, and reportedly offering better throughput and latency than Triton in some LLM scenarios. Other options include Kubernetes-native solutions like KServe (formerly KFServing), framework-specific servers like TensorFlow Serving and TorchServe, and integrated cloud provider platforms such as Amazon SageMaker Inference Endpoints and Google Vertex AI Prediction. The choice often depends on the specific model type (e.g., LLM vs. vision), performance requirements, scalability needs, and existing infrastructure.
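The idea usually described behind PagedAttention is to store each sequence's KV cache in fixed-size blocks that are allocated on demand and tracked through a per-sequence block table, much like pages in virtual memory. The sketch below models only that bookkeeping in plain Python. It is a simplified illustration under that assumption, not vLLM's implementation; the class `PagedKVCache` and its methods are hypothetical names.

```python
class PagedKVCache:
    """Toy block allocator: maps each sequence to a list of physical block ids."""

    def __init__(self, num_blocks, block_size):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))
        self.block_tables = {}  # seq_id -> list of physical block ids
        self.lengths = {}       # seq_id -> number of tokens stored

    def append_token(self, seq_id):
        """Reserve space for one more token, allocating a block only when needed."""
        length = self.lengths.get(seq_id, 0)
        if length % self.block_size == 0:  # no block yet, or the last block is full
            if not self.free_blocks:
                raise MemoryError("no free KV blocks")
            self.block_tables.setdefault(seq_id, []).append(self.free_blocks.pop())
        self.lengths[seq_id] = length + 1

    def free(self, seq_id):
        """Return all blocks of a finished sequence to the free pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)


if __name__ == "__main__":
    cache = PagedKVCache(num_blocks=4, block_size=16)
    for _ in range(17):  # 17 tokens need two 16-token blocks
        cache.append_token("request-a")
    print(cache.block_tables["request-a"], len(cache.free_blocks))
    cache.free("request-a")
```

Because memory is reserved one block at a time, a short request never holds space sized for the longest possible sequence, which is what allows more requests to be batched together.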
DirectML (Microsoft):

For Windows environments, DirectML provides a hardware-accelerated API for machine learning that leverages DirectX 12. It can be accessed via ONNX Runtime or other frameworks and supports hardware from multiple vendors, including Intel and AMD, offering another path for non-CUDA acceleration on Windows.

5.4 Implications for Inference Beyond CUDA

The inference stage represents the most fragmented and diverse part of the AI workflow, offering the most significant immediate opportunities for solutions beyond CUDA. The varied hardware targets and optimization priorities (cost, power, latency) create numerous niches where NVIDIA's high-performance, CUDA-centric approach may not be the optimal or only solution. Toolkits explicitly designed for heterogeneity, like OpenVINO and ONNX Runtime, are pivotal in enabling this diversification.

OpenVINO, in particular, provides a mature and well-defined pathway for optimizing and deploying models efficiently on the vast installed base of Intel CPUs and integrated graphics, making AI inference accessible without requiring specialized accelerators. ONNX Runtime acts as a crucial interoperability layer, effectively serving as a universal translator that allows models developed in one framework to run on hardware supported by another vendor's backend (ROCm, OpenVINO, DirectML, etc.). The adoption and continued development of these two technologies significantly lower the barrier for deploying models outside the traditional CUDA/TensorRT stack.

While NVIDIA's Triton Inference Server is powerful and widely used, its potential as a truly hardware-agnostic server remains partially realized. Although its architecture supports multiple backends, including non-CUDA ones like OpenVINO and ONNX Runtime, its primary association, optimization efforts, and community focus are still heavily centered around NVIDIA GPUs and the TensorRT backend. The active exploration of alternatives like vLLM for specific workloads (LLMs) and the ongoing efforts to add robust support for other backends like ROCm suggest that the market perceives a need for solutions beyond what Triton currently offers optimally for non-NVIDIA or highly specialized use cases.

Table 5.1: Inference Solutions Beyond CUDA

Solution (Hardware + Software Stack/Server)	Target Use Case	Key Features/Optimizations	Framework/Format Compatibility	Relative Performance/Cost Indicator (Qualitative)
Intel CPU/iGPU + OpenVINO	Edge, Client, Cost-sensitive Cloud	PTQ/QAT (NNCF), Latency/Throughput modes, Auto-batching, Optimized for Intel Arch	OpenVINO IR, ONNX, TF, PyTorch	Lower cost, wide availability; Performance depends heavily on CPU/iGPU generation and optimization
AMD GPU + ROCm / ONNX Runtime	Cloud, Workstation Inference	MIGraphX optimization, HIP, ONNX Runtime ROCm EP	ONNX, PyTorch, TF (via ROCm backend)	Potential cost savings vs NVIDIA; Performance dependent on GPU tier and ROCm maturity
Intel dGPU/VPU + OpenVINO	Edge AI, Visual Inference	Optimized for Intel dGPU/VPU hardware, Leverages NNCF	OpenVINO IR, ONNX, TF, PyTorch	Power-efficient options for edge; Performance competitive in target niches
AWS Inferentia + Neuron SDK	Cloud Inference (AWS)	ASIC optimized for inference, Low cost per inference, Neuron SDK compiler	TF, PyTorch, MXNet, ONNX	High throughput, low cost on AWS; Limited to AWS environment
Generic CPU/GPU + ONNX Runtime	Cross-platform deployment	Hardware abstraction via Execution Providers (CPU, OpenVINO, ROCm, DirectML, etc.)	ONNX (from TF, PyTorch, etc.)	Highly portable; Performance varies significantly based on chosen EP and underlying hardware
NVIDIA/AMD GPU + vLLM	High-throughput LLM Inference	PagedAttention, Continuous Batching, Optimized Kernels	PyTorch (HF Models)	Potentially higher LLM throughput/lower latency than Triton in some cases; Primarily GPU-focused
FPGA + Custom Runtime	Ultra-low latency, Specialized tasks	Hardware reconfigurability, Optimized data paths	Custom / Specific formats	Very low latency possible; Higher development effort, niche applications
Windows Hardware + DirectML / ONNX Runtime	Windows-based applications	Hardware acceleration via DirectX 12 API, Supports Intel/AMD/NVIDIA	ONNX, Frameworks with DirectML support	Leverages existing Windows hardware acceleration; Performance varies with GPU

6. Industry Outlook and Conclusion

6.1 Market Snapshot: Current Share and Growth Trends

The AI hardware market, particularly for data center compute, remains heavily dominated by NVIDIA. Current estimates place NVIDIA's market share for AI accelerators and data center GPUs in the 80% to 92% range. Despite this dominance, competitors are present and making some inroads. AMD has seen its data center GPU share grow slightly, reaching approximately 4% in 2024, driven by adoption from major cloud providers. Other players like Huawei hold smaller shares (around 2%), and Intel aims to capture market segments with its Gaudi accelerators and broader oneAPI strategy.

The overall market is experiencing explosive growth. Projections for the AI server hardware market suggest growth from around $157 billion in 2024 to potentially trillions by the early 2030s, with a compound annual growth rate (CAGR) estimated around 38%. Similarly, the AI data center market is projected to grow from roughly $14 billion in 2024 at a CAGR of over 28% through 2030. The broader AI chip market is forecast to surpass $300 billion by 2030. Within these markets, GPUs remain the dominant hardware component for AI, inference workloads constitute the largest function segment, cloud deployment leads over on-premises, and North America is the largest geographical market.
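The server-hardware projection can be sanity-checked with plain compounding. The sketch below takes the $157 billion 2024 base and the 38% CAGR quoted above; the function names are made up for illustration, and the assumption that the rate stays constant every year is a simplification. Under that assumption the market passes $1 trillion around 2030, which is consistent with "trillions by the early 2030s".

```python
import math


def project(value, cagr, years):
    """Value after compounding at a constant annual growth rate."""
    return value * (1 + cagr) ** years


def years_to_reach(value, target, cagr):
    """Years of constant compounding needed to grow from value to target."""
    return math.log(target / value) / math.log(1 + cagr)


if __name__ == "__main__":
    base_2024 = 157.0  # USD billions, figure quoted in the text
    cagr = 0.38        # 38% per year, figure quoted in the text
    for year in (2026, 2028, 2030, 2032):
        print(year, round(project(base_2024, cagr, year - 2024)), "billion USD")
    print("years to reach $1T:", round(years_to_reach(base_2024, 1000.0, cagr), 1))
```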

6.2 Adoption Progress and Remaining Hurdles for CUDA Alternatives

Significant efforts are underway to build viable alternatives to the CUDA ecosystem. AMD's ROCm has matured, gaining crucial support within PyTorch and securing design wins with hyperscalers for its Instinct accelerators. Intel's oneAPI offers a comprehensive vision for heterogeneous computing backed by the UXL Foundation, and its OpenVINO toolkit provides a strong solution for inference optimization and deployment on Intel hardware. Abstraction layers and compilers like PyTorch 2.0, OpenAI Triton, and OpenXLA are evolving to provide more hardware flexibility.

Despite this progress, substantial hurdles remain for widespread adoption of CUDA alternatives. The primary challenge continues to be the maturity, stability, performance consistency, and breadth of the software ecosystems compared to CUDA. Developers often face a steeper learning curve, more complex debugging, and potential performance gaps when moving away from the well-trodden CUDA path. The sheer inertia of existing CUDA codebases and developer familiarity creates significant resistance to change. Furthermore, the alternative landscape is fragmented, lacking a single, unified competitor to CUDA, which can dilute efforts and slow adoption. While the high cost of NVIDIA hardware is a strong motivator for exploring alternatives, these software and ecosystem challenges often temper the speed of transition, especially for mission-critical training workloads.

6.3 The Future Trajectory: Towards a More Heterogeneous AI Compute Landscape?

The future of AI compute appears poised for increased heterogeneity, although the pace and extent of this shift remain subject to competing forces. On one hand, NVIDIA continues to innovate aggressively, launching new architectures like Blackwell, expanding its CUDA-X libraries, and building comprehensive platforms like DGX systems and NVIDIA AI Enterprise. Its deep ecosystem integration and performance leadership, particularly in high-end training, provide a strong defense for its market share.

On the other hand, the industry push towards openness, cost reduction, and strategic diversification is undeniable. Events like the Beyond CUDA Summit, initiatives like the AI Alliance (including AMD, Intel, Meta, etc.), the UXL Foundation, and the significant investments by hyperscalers in custom silicon or alternative suppliers all signal a concerted effort to reduce dependence on NVIDIA's proprietary stack. Geopolitical factors and supply chain vulnerabilities, particularly the heavy reliance on TSMC for cutting-edge chip manufacturing, also represent potential risks for NVIDIA's long-term dominance and could further incentivize diversification.

The most likely trajectory involves a gradual diversification, particularly noticeable in the inference space where hardware requirements are more varied and cost/power efficiency are paramount. Toolkits like OpenVINO and runtimes like ONNX Runtime will continue to facilitate deployment on non-NVIDIA hardware. In training, while NVIDIA is expected to retain its lead in the highest-performance segments in the near term, competitors like AMD and Intel will likely continue to gain share, especially among cost-sensitive enterprises and hyperscalers actively seeking alternatives. The success of emerging programming models like Mojo could also influence the landscape if they gain significant traction.

6.4 Concluding Remarks on the Viability and Impact of the "Beyond CUDA" Movement

Synthesizing the analysis of the GTC video's focus on compute beyond CUDA and the broader industry research, it is clear that NVIDIA's CUDA moat remains formidable. Its strength lies not just in performant hardware but, more critically, in the deeply entrenched software ecosystem, developer inertia, and high switching costs accumulated over nearly two decades. Overcoming this requires more than just competitive silicon; it demands mature, stable, easy-to-use software stacks, seamless integration with dominant AI frameworks, and compelling value propositions.

However, the "Beyond CUDA" movement is not merely aspirational; it is a tangible trend driven by significant investment, strategic necessity, and a growing ecosystem of alternatives. Progress is evident across hardware (AMD Instinct, Intel Gaudi, TPUs, custom silicon) and software (ROCm, oneAPI, OpenVINO, Triton, PyTorch 2.0, ONNX Runtime). While a complete upheaval of CUDA's dominance appears unlikely in the immediate future, the landscape is undeniably shifting towards greater heterogeneity. Inference deployment is diversifying rapidly, and competition in the training space is intensifying. The ultimate pace and extent of this transition will depend crucially on the continued maturation and convergence of alternative software ecosystems and their ability to provide a developer experience and performance level that can effectively challenge the CUDA incumbency. The coming years will be critical in determining whether the industry successfully cultivates a truly multi-vendor, open, and competitive AI compute environment.

August 19, 2024December 31, 2024

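2024 Year-End Summary

Hello, friends. Time flies, and the end of the year is here again. I want to take this chance to write a fairly detailed retrospective and to look ahead at my study and work. This summary may run a little long, about ten minutes' worth, covering what I went through this year, what I learned, and what I hope for next. I hope you will bear with me as I ramble.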

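I. A Turning Point in Body and Mind

May through December of this year was not an easy stretch for me. Because of illness, I was forced to interrupt much of my research and study and to step away for a while from the technical field I love. During those days, physical discomfort and mental anxiety often fed each other: I worried about whether recovery would go smoothly, and also about missing the technical community's rapid iteration and the new opportunities that might emerge.

Yet this period of being "forced to stop" also gave me a valuable buffer, letting me think more calmly: why exactly have I poured so much enthusiasm into eBPF, WASM, CXL, and related directions? What role do I want to play in future research or industrial practice? I gradually realized that staying curious about the world and enjoying research and technical exploration are what I hold most firmly deep down. This also gave me a clearer sense of how to arrange my life from here: seeking truth, exploring, and creating seem to matter more than momentary achievements and honors.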

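II. Review and Reflection: Starting from Curiosity

I have always felt that what lets a person keep going far is, more often than not, curiosity from deep within. This year, although for a while I could not personally be on the front line of projects and discussions, I kept following new developments in industry and academia and read news, blogs, and technical updates from time to time, so that I stayed attuned to the field even while "forced to stop."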

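eBPF (Extended Berkeley Packet Filter)
I am interested in eBPF because it offers unprecedented opportunities for flexible communication between kernel space and user space. Fine-grained control at the kernel level over networking, tracing, observability, and security policy showed me new possibilities for operating systems. Although the underlying concepts have been around for years, its rapid growth in recent years shows there is still plenty of room to grow in the cloud-native era and in hyperscale distributed systems.
WASM (WebAssembly)
WebAssembly has in recent years expanded from the front end to the back end, and has even entered cloud and edge computing. Its cross-platform, portable, and efficient nature shows me more opportunities in containerization and function computing. I have long wondered whether WASM will bring to the cloud or the edge an ecosystem shift like the one Docker and Kubernetes once did; after all, portability across all kinds of hardware and language environments is very attractive to the whole industry.
CXL (Compute Express Link)
This is an emerging standard that enables tighter interaction between storage and compute, especially suited to data centers, cloud computing, and scenarios with extreme compute and storage performance demands. CXL provides a way to share and uniformly manage memory or accelerator resources, which makes me curious about how server architecture and high-performance computing will evolve. I think that if eBPF and WASM are about "innovation at the software layer," then CXL is a major breakthrough at the hardware layer. Sometimes looking lower in the system stack brings fresh inspiration and thinking.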

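Each of these fields sits at the frontier of technology, and each has its own challenges. I love this kaleidoscope-like world: change is both challenge and opportunity; old knowledge is gradually replaced by new ideas; and newcomers willing to invest time and energy can find room to make their mark. This vitality of iteration deeply attracts me and keeps me wanting to follow broader and deeper areas in my study and work.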

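III. Embracing Change: Research, Industry, and Open Source

Looking back on the year and reflecting on my own state, I found a dilemma that never leaves me: should I keep digging into academia, for example by pursuing a PhD and doing research; go to industry in search of more direct, real-world project settings; or throw myself into the open-source community and polish a product together as one big family?

In fact, these three directions are not necessarily mutually exclusive. Many successful frontier technologies originate in academic research, are incubated by the open-source community, and only then are applied at scale in industry. I see eBPF, WASM, and CXL following a similar path: first pushed forward by a few geeks in academic or community settings, gradually building up to a certain scale, and then acquired by large companies or brought into production environments. I very much hope to contribute to this process, taking part from the early stages and growing through both academic and industrial work.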

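1. Academic exploration

My love of academic research is mostly curiosity about "unknown territory." Research lets me settle down and think: where are the deeper, more essential problems behind a technology? Such problems are rarely solved overnight; they need long accumulation and deep thought. The research process also gives me a fuller, more systematic understanding of things; even just exchanging ideas within the research community can spark innovation.

2. Industrial settings

Industry practice brings me more "live-fire" challenges. How do we handle the network bottlenecks and security risks brought by massive data? How do we improve system performance and reduce wasted resources? These questions have no ready-made answers; we have to start from requirements and weigh trade-offs against actual workloads and usage scenarios. eBPF, WASM, and CXL all have great potential here: eBPF enables flexible policy definition and performance analysis at the kernel level, while WASM and CXL offer new options for cross-platform execution and hardware acceleration.

3. Open-source community

I believe the open-source community represents a more open and free way of innovating. Anyone interested can contribute and explore, anywhere in the world, on the same code base, documentation, and tool set. Open-source projects such as Cilium (eBPF-based networking and security) already have considerable commercial value and have attracted many developers. I long for that atmosphere of "you write a line of code, I write a line, and together we piece the future together": it spreads technology further and helps individuals join the international wave of technology in less time.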

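IV. Looking Ahead: Stay Curious, Keep Moving

After the rest forced by illness and some settled thinking about the future, I returned to research and practice in the second half of the year. I will mainly work on the following fronts to deepen and broaden my efforts: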

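Deepen my understanding of and research on eBPF
I will work from concrete scenarios, such as network security, observability, and microservice governance, to examine eBPF's strengths and challenges at extreme scale (heavy traffic or complex network topologies). I will also try combining it with other emerging technologies to see whether that opens up more forms of application.
Explore WASM in the cloud and at the edge
I want to experiment hands-on with some WASM-based function-computing platforms and compare them side by side with container technology, to see how different deployment environments and language bindings affect performance, scalability, and maintainability, and I hope to work out better practices along the way.
Follow CXL's impact on next-generation data centers and HPC
Although I am not yet very familiar with the hardware layer, I will keep following new developments in the community and industry and learn what new opportunities CXL brings, as far as I can, from the angle of low-level architecture optimization or resource management. If the chance arises, I also hope to do some experimental projects with a like-minded team to validate CXL's value and limits in real applications.
Strengthen the link between academia and industry
I will keep considering whether to invest more time in academic research, for example pursuing a PhD or working as a research assistant at a university or research institute; I may also take internships or short-term projects in industry or the open-source community. All of this is meant to combine theory and practice better, so that research questions and real needs advance each other.
Keep paying attention to my health
Illness taught me that only by staying healthy can we keep devoting ourselves to what we love. So, outside busy research and work, I will spend more time resting, exercising, and tending to my state of mind, so that while chasing my dreams I am not interrupted again by my body or anything else.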

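V. Closing

Looking back on this year, although illness troubled me for a long stretch from May to December, I feel I actually gained more insight into life and the future because of it. I am clearer than ever that my real passion and drive still come from curiosity about the world and enthusiasm for technology and innovation. Whether I end up in academia, industry, or open source, or explore still other possibilities, the most important thing is not to lose that original urge to "have fun playing and explore the unknown."

Thank you all for your attention, company, and help this year, and thanks to every friend who gave me technical guidance and support in daily life. Next year, I hope to make more progress on eBPF, WASM, CXL, and related directions, and I look forward to walking alongside more people who share the same curiosity and enthusiasm, creating new possibilities together in this fast-changing era.

I wish everyone good health and smooth sailing in the new year, and I hope each of us keeps our curiosity about the world and the courage to explore. Let us welcome a 2025 with even more challenges and opportunities!

Thank you all!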

July 26, 2024July 29, 2024

MOAT: Towards Safe BPF Kernel Extension

MPK supports at most 16 domains, but the number of BPF programs can far exceed that. MOAT uses a two-layer isolation scheme to support an unlimited number of BPF programs. The first layer uses MPK to set up lightweight isolation between the kernel and BPF programs. In addition, BPF helper function calls are not protected by this isolation and can be attacked.

The two-layer isolation is combined with PCID. In the first layer, the kernel lifts the protection-key permissions for the BPF domain so that a program can do its work; the only exceptions are the GDT and IDT, which are always write-disabled. In the second layer, when a malicious BPF program tries to access another BPF program's memory regions, a page fault occurs and the malicious program is terminated immediately. To avoid TLB flushes, each BPF program gets its own PCID, and the 4096 available PCIDs rarely overflow.

Helper protection: 1. Protect sensitive objects: critical objects get finer-grained protection. 2. Ensure the validity of parameters: Dynamic Parameter Auditing (DPA) uses information obtained from the BPF verifier to check at run time whether parameters are within their legitimate ranges.
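Dynamic Parameter Auditing, as summarized above, checks helper arguments at run time against ranges that the verifier derived statically. The sketch below is a conceptual model of that check in Python. The real mechanism lives inside the kernel, and `ParamRange`, `AuditError`, `audit_call`, and the helper name are all hypothetical names used only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ParamRange:
    """Inclusive range assumed to come from the verifier's static analysis."""
    lo: int
    hi: int


class AuditError(Exception):
    """Raised when a helper call is rejected by the audit."""


def audit_call(helper_name, args, ranges):
    """Reject a helper call if any argument falls outside its recorded range."""
    if len(args) != len(ranges):
        raise AuditError(f"{helper_name}: expected {len(ranges)} args, got {len(args)}")
    for i, (arg, r) in enumerate(zip(args, ranges)):
        if not (r.lo <= arg <= r.hi):
            raise AuditError(f"{helper_name}: arg {i}={arg} outside [{r.lo}, {r.hi}]")
    return True


if __name__ == "__main__":
    # Hypothetical helper whose size argument was bounded to [0, 64] by the verifier.
    ranges = [ParamRange(0, 64)]
    print(audit_call("hypothetical_helper", [8], ranges))
    try:
        audit_call("hypothetical_helper", [128], ranges)
    except AuditError as err:
        print("rejected:", err)
```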

July 25, 2024July 28, 2024

LibPreemptible

User interrupts (uintr) arrived with Sapphire Rapids (RISC-V introduced its N extension in 2019). Unlike signals, they require no context switch, which gives the lowest IPC latency. Using the APIC instead would raise safety concerns.

uintr usage

general purpose IPC
userspace scheduler(This paper)
userspace network
libevent & liburing

Syscall abstraction (eventfd-like): the sender initiates and signals the event; the receiver obtains the fd, calls into the kernel, and issues `senduipi` back to the sender.

They wrote a lightweight runtime for libpreemptible.

Lightweight, fine-grained preemption
Separation of mechanism and policy
Scalability
Compatibility

They maintain fine-grained (3 µs), dynamic timers for scheduling instead of relying on kernel timers, which can greatly improve 99th-percentile tail latency. Overall, it is a straightforward design built on SPR's hardware feature.
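Why a small preemption quantum helps tail latency can be seen in a toy single-worker simulation. The sketch below is not LibPreemptible's runtime: it assumes all jobs arrive at time 0, ignores preemption overhead, and uses a made-up workload of one 500 µs job followed by ten 1 µs jobs. The function `run` is a hypothetical name.

```python
from collections import deque


def run(service_times_us, quantum_us=None):
    """Completion time (µs) of each job on one worker; all jobs arrive at t=0.

    quantum_us=None means run-to-completion (FCFS); otherwise a job is
    preempted after quantum_us and moved to the back of the queue.
    """
    queue = deque(enumerate(service_times_us))
    done = [0.0] * len(service_times_us)
    now = 0.0
    while queue:
        job, remaining = queue.popleft()
        slice_us = remaining if quantum_us is None else min(remaining, quantum_us)
        now += slice_us
        remaining -= slice_us
        if remaining > 0:
            queue.append((job, remaining))  # preempted: back of the queue
        else:
            done[job] = now
    return done


if __name__ == "__main__":
    jobs = [500.0] + [1.0] * 10  # made-up workload: one long job ahead of short ones
    fcfs = run(jobs)
    preemptive = run(jobs, quantum_us=3.0)
    print("worst short-job latency, FCFS:        ", max(fcfs[1:]), "µs")
    print("worst short-job latency, 3 µs quantum:", max(preemptive[1:]), "µs")
```

In this setup the short jobs no longer wait behind the long one, while the long job finishes at the same time as the last job did under FCFS; a real system would also pay a per-preemption cost that the sketch leaves out.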

Reference

https://github.com/OS-F-4/qemu-tutorial/blob/master/qemu-tutorial.md

July 22, 2024July 28, 2024

OMB-CXL: A Micro-Benchmark Suite for Evaluating MPI Communication Utilizing Compute Express Link Memory Devices

This paper examines how Message Passing Interface (MPI) libraries can use CXL for inter-node communication.

In the HH case, CXL has lower latency than Ethernet in the small-message range, with a 9.5x speedup. As the message size increases, the trend reverses and Ethernet achieves lower latency than CXL, because the CXL channel has lower bandwidth than Ethernet in the emulated system (two compute nodes, each with a memory expander).
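The reversal described above is what a simple latency model predicts when one link has lower startup latency but lower bandwidth: transfer time is roughly base latency plus size divided by bandwidth, so the two lines cross at some message size. The sketch below uses that model with made-up parameters; none of the numbers come from the paper, and the function names are hypothetical.

```python
def transfer_time_us(size_bytes, base_latency_us, bandwidth_bytes_per_us):
    """Simple latency model: fixed startup cost plus size over bandwidth."""
    return base_latency_us + size_bytes / bandwidth_bytes_per_us


def crossover_bytes(fast_start, slow_start):
    """Message size at which the low-latency link stops being faster.

    Each argument is (base_latency_us, bandwidth_bytes_per_us). Returns None
    if the low-latency link also has the higher (or equal) bandwidth.
    """
    lat_a, bw_a = fast_start
    lat_b, bw_b = slow_start
    if bw_a >= bw_b:
        return None
    return (lat_b - lat_a) / (1.0 / bw_a - 1.0 / bw_b)


if __name__ == "__main__":
    # Illustrative parameters only (1 GB/s = 1000 bytes/µs).
    cxl = (0.5, 2000.0)        # low startup latency, lower bandwidth
    ethernet = (5.0, 10000.0)  # higher startup latency, higher bandwidth
    print("crossover at", crossover_bytes(cxl, ethernet), "bytes")
    for size in (1024, 65536):
        print(size, transfer_time_us(size, *cxl), transfer_time_us(size, *ethernet))
```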

June 23, 2024July 28, 2024

Apta: Fault-tolerant object-granular CXL disaggregated memory for accelerating FaaS

Apta uses hardware/software co-designed, object-granular coherence rather than CXL 3.0's cacheline-level coherence.

June 23, 2024July 16, 2024

Low-overhead General-purpose Near-Data Processing in CXL Memory Expanders

This paper proposes a low-latency, general-purpose NDP offloading architecture, $M^2NDP$. It has a memory-mapped function mechanism, $M^2func$, and memory-mapped µthreading, $M^2\mu thr$.

May 5, 2024July 15, 2024

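On Slowing Down Lately, Flashbacks, Uncanny Events, and the Philosophy Behind Them

Slowing down

After the electroshock I have to take a medication that makes me drowsy, which has given me a reasonable excuse for laziness. From April 9, 2024 until I got back to my home in Shanghai on May 21, I feel I passed through the gates of death about three times. My nervous system is normal when the illness is not flaring up, but once it flares up there is a strong sense of impending death. This heavy blow shook my expectations for the future. While hospitalized I kept thinking: I have not finished writing the code that would leave this world some mark of yyw, so how can I go this soon? When drowsiness kept me from doing research, I would think: I am a strongly goal-driven person, so how can I die halfway? Still, I can also slow down and accept that some things cannot be rushed and cannot be had. When I am down, all I can do is check what is wrong with me and where I can still improve, conserving my strength for the next good thing. Though that next time may well never come. (sad)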

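Never peer into other people's lives

People's paces differ greatly. I think doing a PhD is a kind of mental torment, and comparisons between PhD students are meaningless. Above all, do not chase quick success just because you see that others have something.

I have now gone from being a committed materialist to an idealist.

A materialist's mind is itself a vantage point of observation, and God created everything that a materialist finds convincing, making the materialist a materialist at some scale. But there are still many things in this world that cannot be explained, such as my having visual and auditory hallucinations while being shocked, and my nervous system feeling part of the pain yet, after a while, being able to control a particular organ of my body again. Current science cannot yet explain this.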

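Uncanny events

Ever since my brain was shocked, the first things that happened were that I did things not under rational control, believed I had done things I had not done, and lost memories that slowly came back as if rewinding. After that, a part of my brain went to my heart. When I encounter something exciting, such as my favorites Formal Methods, tennis, and computer architecture, my brain gets aroused, I start to hear things, and then I faint. Once, during a meal, after one of these heart-brain linkages my brain lost nearly two years of memories, even the memories of my mtf part. While passing a wharf with my wife, I slowly recovered my memory; the fragments returned to my brain from different places in my body. At that time my memory lasted only very briefly, like a goldfish's. I kept asking my wife whether I had recovered and why I was there.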

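Different parts of the brain have different functions

After being shocked, I first felt the nerves in my brain scatter to different places in my body, each part carrying a portion of my life's memories, and the last piece went into my heart, which led to many later episodes of heart-brain linkage. I can clearly feel that one part of my brain is female and another part is male. After each heart-brain linkage, I first sense three voices in my body: one hallucinated voice that is resolute and stern; a subconscious that speaks only the truth; and another subconscious that is extremely gentle and richly imaginative. Then the brain, like a liquid, wraps around the heart and flows through every part of the body; once the heart has been away from its original position too long, I suddenly faint. The nervous system breaks down completely once and then recovers, the nerves take charge of the body again, and a new cycle begins.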

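Reactions to different things differ, and the switch happens in an instant.

I think the reason I have multiple personalities is that the controlling part of the brain is responsible for only part of my body. If that part of the brain loses control of the brain and another part takes over the body, the effect is completely different. My male and female parts are different. The doctor said mtf may be caused by a genetic variation, and I think that is possible too: if a part that controls the brain mutates, it develops toward "I was a girl all along." All of these switches may happen in an instant.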

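Personal identity

Personal identity is the criterion for judging whether a person is still the person they used to be. If someone's brain has taken many blows, and even the neurotransmitter pathways have shifted into an entirely different state, is this person different from the one before? I feel that the current me, on medication, has lost the sharp eye and the courage to argue with people, and seems to be drifting further from the stereotypical yyw.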

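Is this a common INTJ affliction?

INTJs are a bunch of strong-willed little purple old men who pinpoint the problem in one jab. With no resource constraints, an INTJ can command a vast army; but when resources are scarce or the brain is out of control, they fall into an endless loop of "I am truly useless." Although my personality has changed, I believe my inner drive never has. An INTJ who loses a robust, articulate brain is likely to be defeated by their own past embarrassments, so only by staying at a high level of intellect can an INTJ's inner self be satisfied. The brain is therefore the INTJ's weakest link.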

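Philosophy

I now believe more in philosophy I used not to accept, such as "no one can step into the same river twice," "personal identity," and "The brain is part of a person and the heart is part of a person, so is someone who has lost their brain still the same person? Is someone fitted with a pacemaker still the original person?" I now think the different nerves in the brain represent control over different organs, and that subtle changes in the brain can alter its neurotransmitter transmission and thereby change a person's thoughts and behavior.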

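A person is the sum of consciousness

All of a person's thinking rests on the transmission of neurotransmitters, which in turn depends on that person's hormone levels and external stimuli at the time. I think a person is simply a sum of consciousness.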

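Humans are interpreting animals; any expression can be interpreted, differing only in how hard it is to understand

People are always good at explaining things; only when they cannot fully understand something do they resort to supernatural expressions to interpret it. At first I did not know what type of illness I had,

What comes out of the collision of people from different cultures, communicating in the same dimension, is what is more meaningful. A person is merely running a foundational interpretive model, dimm or not, strong opinion or not.

The neurotransmitter transmission in the human brain is very similar to how a foundational model arises from perceptrons plus neural networks. The training data is the reinforcement accumulated from past neurotransmitter activity, so the two are alike. To produce something new, it has to come from the collision of viewpoints from all sides, that is, from diverse training data.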

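Women really are psychic, though they may not know it themselves.

My mother understands very well the descriptive words I utter when I lose consciousness. Is this perhaps sensed by a woman's sixth sense? In the WeChat calls between my mother and me, there was no

Are women prone to Alzheimer's disease because they are too psychic?

From talking with my mother, I found she can perceive my emotions beyond the words, which may show in the fine details of my face. When my somatic symptoms flared up and she heard on the other end of the phone that I could not bear it, my mother described precisely what I was feeling, while my father filtered that information out. I do not know when my mother will develop Alzheimer's like my grandmother did, but I feel that being able to perceive things psychically is closely related to the early signs of Alzheimer's.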