Project👁 - vickieGPT’s blog

May 21, 2025May 21, 2025

PTX-Level GPU State Dump for Live Migration

Overview of PTX vs SASS State

Live-migrating a running GPU computation requires capturing the execution state – including each thread’s registers, relevant memory regions, and program counters. NVIDIA GPUs execute SASS (Streaming Assembler) machine code, whereas PTX is a higher-level virtual ISA used at compile-time. There is not a one-to-one correspondence between PTX instructions and SASS instructions in optimized code. A single PTX instruction may be compiled into multiple SASS instructions or even eliminated or fused with others during optimization. This gap makes it challenging to directly derive PTX-level state (e.g. PTX register values and the exact PTX program counter) from the low-level SASS state. In other words, the GPU’s actual state exists in terms of SASS (hardware registers, SASS PC, etc.), and recovering the abstract PTX state requires careful mapping and tooling.

Challenges in Mapping SASS Back to PTX

NVIDIA’s PTX is an virtual assembly that gets translated by the ptxas compiler into SASS (target-specific machine code). During this translation, numerous optimizations and transformations occur. For example, an arithmetic operation in PTX might be strength-reduced or combined with another, so it no longer appears as a distinct instruction in SASS. Some PTX instructions expand into an emulation sequence of several SASS instructions (e.g. 64-bit operations might become multiple 32-bit SASS ops). Conversely, some SASS instructions (like address calculations or predicate moves) have no direct PTX source equivalent – they might be introduced by the compiler. As a result, the correspondence between a PTX instruction and the runtime SASS state is often many-to-many or non-existent in complex optimized code. This means we cannot naively take a SASS program counter or register and assume it maps to a single PTX line or variable. In general, reliably reconstructing PTX-level program state from an arbitrary optimized SASS state is difficult. It may only be feasible in limited scenarios – for example, code compiled with minimal optimization or with debug information that preserves a closer mapping between PTX and SASS.

Tools for Disassembly and Code Mapping

To bridge the gap between SASS and PTX, several tools can help inspect binaries and provide mapping information:

cuobjdump: NVIDIA’s CUDA Binary Utility for examining compiled binaries. It can extract and disassemble SASS, and also dump embedded PTX if the binary contains it (e.g. fatbins often include PTX as JIT fallback). This is useful for static analysis: you can see the machine code instructions and resource usage per kernel. However, cuobjdump does not align SASS with source or PTX lines – it simply lists instructions. (In fact, an attempt to get interleaved CUDA C++ source with SASS using cuobjdump is not possible via command-line.)
nvdisasm: A more advanced disassembler that operates on cubin files. Unlike cuobjdump, nvdisasm has richer output options. Critically, if the binary was compiled with debug symbols or line info (using -G or -lineinfo flags), nvdisasm can report the source line correspondence for each SASS instruction. Using nvdisasm -g will annotate the disassembly with references to source lines. If the source was CUDA C++, those will be C++ lines; if the source was PTX (e.g. compiled from a .ptx file with line info), those can be PTX line numbers. This feature helps map a SASS program counter (address) to the nearest PTX instruction. Note: the output shows line references (file and line number) rather than the actual source text, but it provides the needed correlation. By compiling the code with -lineinfo (and ensuring PTX is embedded for the target architecture), one can later use nvdisasm to find which PTX instruction corresponds to a given SASS address.
Nsight Compute (GUI): NVIDIA’s profiling tool can display source, PTX, and SASS side-by-side in its Source page. This requires the binary to include PTX (for the target or as intermediate) and line mapping info. While this is a GUI tool (not automated via CLI), it confirms that the compiler can produce the mapping when requested. In practice, for our purposes, we would use the underlying mechanisms (DWARF debug info in the cubin) via nvdisasm or other CLI parsing rather than the GUI. Nsight Compute essentially leverages the same line info to link SASS to PTX or high-level source.
CUDA Compiler Flags: To facilitate mapping, it is recommended to compile kernels with debug or line information. Using -G (device debug) or -lineinfo will embed debug symbols that map PTX or source lines to SASS. There is also -src-in-ptx, which interleaves source code as comments in the PTX output for human reference (helpful for debugging). In some cases, one might compile the CUDA C++ to PTX (-keep) and then use ptxas with -g to get a cubin where the PTX file’s line numbers are referenced in SASS. This way, the PTX itself is treated as “source” for debugging purposes. In summary, ensuring your binary carries PTX and line mappings is crucial for any PTX↔SASS correlation.

Table 1: Key Tools and Methods for Inspecting GPU Code and State

Tool / Method	Purpose	Notes and Usage
cuobjdump (binary utility)	Disassemble and extract contents of CUDA binaries (fatbins). Shows SASS, resource usage, embedded PTX.	Works on executables or cubin files. Does not auto-map SASS to PTX lines (manual correlation needed).
nvdisasm (disassembler)	Detailed disassembly of cubin with control-flow info and source line references.	Use `-g`/`-gi` with a cubin compiled with `-lineinfo` to get SASS with file:line annotations. Enables mapping PC addresses to PTX or source.
cuda-gdb (debugger)	Can attach to running kernels (in device debug mode) to inspect state at runtime.	Works at SASS level; allows querying registers and warps. No direct PTX source view (PTX must be mapped manually). Useful for concept of reading registers if kernel is halted.
NVBit (dynamic instrumentation)	Inject custom monitoring code into GPU binaries at runtime. Can intercept and modify SASS instructions on the fly.	Allows inspecting ISA-visible state (registers, predicates, etc.) and logging or modifying it during execution. Useful to capture register values or program counters by instrumenting the kernel binary.
CUPTI / Profiling	Profiling interface (PC sampling, counters).	Can sample the program counter over time to statistically map hotspots to source, but not designed for exact state dump. Does not provide direct register dumps.
CUDA Checkpoint API (CUDA 12)	Driver-level API to checkpoint GPU memory state for a process.	In CUDA 12, functions like `cuCheckpointProcessLock/Unlock/Checkpoint/Restore` can save all GPU memory to host. However, this excludes live registers/PC mid-kernel – kernels must complete or be preempted. Mainly handles memory and resets GPU context.

Capturing GPU Memory State

Memory state is the easiest part of the GPU context to capture, and it’s largely independent of PTX vs SASS issues. A GPU program can use multiple memory spaces (global, shared, local, constant, etc.), all of which need saving:

Global Memory (Device DRAM): All allocated global memory buffers (e.g. via cudaMalloc) must be copied out. This can be done with standard API calls (cudaMemcpyDeviceToHost for each buffer) if the kernel is paused or not actively writing. NVIDIA’s official checkpoint utility automates this: when you call the driver API cuCheckpointProcessCheckpoint, “the GPU memory contents will be brought into host memory” (and GPU pointers are invalidated). Thus, the bulk of data (tensors, etc.) can be saved straightforwardly. Constant memory and texture memory (if used) are typically initialized from host and read-only during kernel execution, so they can be reinitialized on the target side rather than dynamically dumped.
Shared Memory (SMEM): Shared memory is on-chip and private to each thread block. If a kernel is mid-execution, each resident block has its own shared memory contents that are not directly accessible from the host. There is no built-in API to read shared memory of a running block. To capture it, one approach is to instrument the kernel: e.g., insert SASS instructions at a checkpoint to copy the shared memory region into global memory (so it can then be copied to host). Another approach is to rely on hardware preemption: on GPUs that support thread-block preemption, the scheduler could save the state of SMEM to DRAM as part of context save – but such mechanisms are internal and not exposed publicly. In practice, application-level checkpointing schemes avoid needing to snapshot arbitrary shared memory by ensuring kernels reach a synchronization point or end-of-kernel where shared memory is naturally either not in use or can be retrieved via normal means (since the kernel ended). For a truly live mid-kernel migration, one would have to emulate what the GPU does during a context switch: allocate a buffer for each block’s SMEM and copy it out. This likely requires custom SASS instrumentation (since no API gives SMEM content directly).
Local Memory: Local memory is memory private to each thread, used for register spills or large local arrays. It is actually allocated in global memory space (per-thread heap) and automatically indexed by the GPU. If a thread has values in local memory (spill slots), those are part of the overall device memory and would be included in the global memory snapshot if the entire GPU memory is checkpointed. In an application-managed approach, one could include known local arrays, but spills are harder to identify externally. However, by capturing the entire device memory (or at least the CUDA context’s allocations), we cover local memory. The challenge is knowing which portions correspond to which thread’s locals – that mapping is inside the SASS code and ABI. If needed, one could parse the cubin for the kernel’s stack frame size to know how much local memory per thread is used. For PTX-level restart, it might be acceptable to simply restore all local memory contents as they were.

In summary, memory state capture is well-supported: NVIDIA’s checkpoint tools lock the GPU, wait for work to quiesce, and then snapshot all memory to host. For live migration, you’d transfer these memory dumps to the destination GPU and restore them (e.g. via cuCheckpointProcessRestore which reverses the process, or manually cudaMemcpy them to the new context). The hardest part is ensuring consistency if a kernel is mid-run – hence a need to pause the kernel.

Capturing Registers and Program Counter State

Capturing the register state and program counters of threads on the GPU is the most challenging aspect. Unlike memory, there is no public API to dump the registers of all threads in a running kernel. GPU registers are distributed across streaming multiprocessors and are heavily optimized by the hardware. Key considerations:

Per-Thread vs Per-Warp Execution State: On pre-Volta architectures, threads in a warp share an execution PC (program counter) – they execute in lock-step (with divergence managed by predication and reconvergence). Starting with Volta (Compute Capability 7.0), NVIDIA introduced independent thread scheduling, which gives each thread its own PC and enables finer-grained scheduling. Internally, Volta GPUs assign two hardware register slots per thread to store the thread’s program counter. This means the PC has become part of the thread’s architectural state (just like registers) rather than an implicit warp-level pointer. For checkpointing, this is good news: it is possible to capture each thread’s PC value if we can read those special registers. The bad news is that user-level code cannot directly read these PC registers under normal operation.
No Direct Register Access: GPU kernels cannot read or write arbitrary registers directly; they only manipulate registers through instructions the compiler emits. NVIDIA does not expose a mechanism like x86’s context save/restore to user programs. Even the CUDA driver’s checkpointing API avoids dealing with registers explicitly – it requires that the process be locked (i.e. no active kernel running) before checkpoint, thus sidestepping the issue of mid-kernel register capture. Essentially, NVIDIA’s official approach is to finish or pause the kernel, then save memory, assuming you’ll restart the kernel from the beginning on restore. Truly capturing a snapshot in the middle of kernel execution requires either special hardware support or clever software instrumentation.
Hardware Preemption: Newer GPUs (Pascal and Volta onward) introduced limited preemption ability. In Volta, thread blocks can be suspended and resumed, which implies the GPU hardware/driver can save the register state of all threads and reload it later. However, this functionality is used internally by the scheduler and not exposed through CUDA APIs for general checkpointing. For example, on time-out or higher-priority preemption, the GPU context switch mechanism will spill out registers to memory. As developers, we don’t have direct hooks into this pipeline. If one were implementing a hypervisor or low-level driver for virtualization (e.g. NVIDIA’s SR-IOV vGPU for data center), there are driver-level methods to snapshot a virtual GPU’s state (which includes registers, engines, etc.). Those are not accessible from user space. Therefore, our options at user-level reduce to either using the CUDA debugger or modifying the code.
Using a Debugger (CUDA-GDB): In a debugging session (with cuda-gdb and a kernel compiled for debug), one can break execution of a kernel. This effectively pauses the SMs, and the debugger can read registers, shared memory, etc., for the current point. cuda-gdb works at the SASS level and provides commands like info registers for the current warp. It is conceivable to leverage this in a custom way – e.g., attach to the process, programmatically interrupt the kernel, and scrape the register values and PCs. However, this approach would be quite complex and is not an intended use of the API. As NVIDIA’s experts note, the debugger is “always working at the SASS level” and doesn’t present a PTX-source view, so one would still need to correlate back to PTX manually. This could be a last-resort method to get a one-time snapshot of registers for a few warps, but it’s not a general solution for live migration (especially at scale, with potentially thousands of warps).
Dynamic Instrumentation (NVBit): A more promising approach is to use binary instrumentation to insert code that dumps register state at runtime. NVIDIA’s NVBit framework allows you to inject custom instructions into the GPU binary as it’s loaded, without needing driver source or special privileges. With NVBit, one can intercept each SASS instruction or each basic block and execute additional code. For instance, you could instrument the kernel to periodically check a “should checkpoint” flag in global memory and, if set, have each thread store its register values to a designated memory buffer. NVBit makes this easier by providing an API to query an instruction’s operands and to insert calls to user-defined device functions. The NVBit paper notes that it “allows inspection of all ISA-visible state” and even modification of register state at runtime. Practically, you might implement a tool that, at a particular PC (say a label in PTX or a known synchronization point), injects a sequence that writes out all general-purpose registers (and predicate registers, condition codes, etc. as needed) to global memory. The output could be structured per thread (e.g. an array indexed by thread ID). Then you would signal the kernel to execute this dump (perhaps by setting a flag via host). After the dump, you’d have a memory region on device containing a full register snapshot for each thread, which you can copy to host. This is complex but achievable with instrumentation – albeit with non-trivial overhead and engineering effort.
SASS-Level Checkpoint Stubs: Prior research has attempted to modify SASS or use assembly stubs for checkpointing. For example, the Cricket GPU virtualization project directly injected SASS code to capture device execution state. They essentially reverse-engineered how to save and restore GPU state by manipulating the binary. Another project (Singularity) used cuobjdump to get information about kernel parameters and PTX, then intercepted the JIT compiler to extract signatures, in order to help reconstruct state. Both approaches were reported to be error-prone and high-overhead, due to the complexity of reverse-engineering and the differences across GPU generations. These attempts underscore how challenging manual state capture is – they worked, but at a significant complexity cost. Modern alternatives like NVBit provide a more structured way to inject code without fully manual binary editing.
Program Counter Mapping: To resume a kernel at the correct point, we need the program counter per thread (or per warp on older GPUs). If using instrumentation, one could capture the value of the PC at the moment of checkpoint. NVBit provides the address of each instruction being executed to the instrumentation callback, so you can record the last seen PC for each thread. Alternatively, you can insert an explicit marker in the code: for example, a dummy instruction or a particular unused register value that indicates a checkpoint location, then search for it. But the simpler way is to leverage the debug line mapping: given a SASS address we stopped at, use nvdisasm -g or DWARF info to find which PTX line that corresponds to. That tells us the logical point in the PTX program. Essentially, we want to resume at that PTX instruction. One caveat: if the checkpoint is in the middle of what was originally one PTX instruction’s operation (e.g., between the IADD.CC and IADD.X that implement a 64-bit add), we might prefer to roll back to the start of the PTX instruction for a clean resume. This might mean the checkpoint routine should detect such cases or simply always resume at a known safe point (like just before a barrier or a particular label).
PTX Register Reconstruction: After dumping the SASS registers, we have their values but we need to translate them into PTX-level variables if we want an architecture-independent state. In a straightforward scenario (no compiler optimizations), each PTX virtual register maps to a physical SASS register. For example, PTX %r5 might have ended up as SASS R10. If we know that mapping, we can label the saved values accordingly. However, with optimization, the situation is murkier. Some SASS registers have no direct PTX equivalent (they might hold temporary results, loop counters, addresses, etc. introduced by optimization). Some PTX registers might not have any live SASS register if they were optimized out or stored in memory. The best case is if we compiled with -G (no optimization) – then ptxas tries to preserve a close correspondence. Under -G, the compiler also emits symbols for local variables and possibly PTX register names which could be accessible via DWARF info. In practice, to get a PTX-level view, one strategy is: (a) Compile the kernel with line info and minimal optimization. (b) Use the DWARF debugging information in the cubin to map each SASS register to a source variable or PTX value where possible. For example, the DWARF may tell us that at a given PC, a certain high-level variable or PTX register is located in SASS register Rn. This is analogous to how CPU debuggers map machine registers to variables. If such info is available, we can construct a mapping table. If not, a manual static analysis comparing PTX and SASS code might be needed.

In summary, capturing the live register and PC state is possible but requires stepping outside normal user APIs. Using a combination of debug modes and instrumentation is the practical approach. For instance, you might run the kernel in a special mode (with debug info, possibly launched under a tool like NVBit or cuda-gdb) to pause it and extract state. The state extraction would involve writing register values to memory (either via injected code or via the debugger’s queries). Once you have the SASS-level snapshot, you then translate it to PTX-level terms using the compilation metadata.

Reconstructing or Resuming at PTX Level

The end goal of a PTX-level state dump is to enable resuming the computation on possibly a different GPU architecture by using PTX as a portable format. After obtaining the state (memory + registers + PC), how do we resume execution on the target GPU at the PTX level?

Reloading Memory: All the saved memory regions (global, local, etc.) would be copied into the new GPU’s memory. This is straightforward with CUDA APIs (or using cuCheckpointProcessRestore which does it in one go). We must ensure allocation of memory in the new context mirrors the original (same addresses for pointers if the code uses absolute addressing, though CUDA generally uses virtual addresses so that should be fine as long as we reload the module).
Reloading Registers (PTX level): There is no direct way to set GPU registers from host. The solution is to start a new kernel on the target GPU that will consume the saved state. One approach is to have modified PTX code for restart: for example, generate a version of the kernel PTX that begins with code to load the saved register values from memory into PTX virtual registers, then jump to the point in the code where it left off. Since PTX is an assembly-like language, we could imagine inserting a label in the PTX at the resume point and using a bra (branch) or function call to enter at that point with the reconstructed values. The PTX JIT (at runtime) would compile this to the target SASS. This requires careful editing of PTX: essentially splicing the original PTX to create an entry at an arbitrary instruction. It’s tricky but conceptually similar to what a debugger does when it patches code to jump to a breakpoint handler and back. Another concept is to wrap the kernel in a big if(resumeFlag) { goto ResumeLabel; } else { ... normal start ... } and use the saved state to set that flag and initial values. However, PTX and the GPU programming model don’t allow starting in the middle of a kernel easily – so more likely we’d launch a continuation kernel that is semantically equivalent.
Using PTX as an IR: Since PTX is portable, the saved PTX state (registers/variables and PC) should in theory be usable to resume on a different GPU architecture by re-running the PTX through the PTX-to-SASS JIT for that GPU. The question is whether the new architecture will execute it exactly the same way. PTX is designed to be forward-compatible, so a PTX program compiled for a newer SM should behave identically (barring some numeric edge cases). Thus, if we have the exact PTX instruction and the values of all its input registers as of checkpoint time, we can feed those into a new execution. The reliability comes down to whether we captured all relevant state. Besides general-purpose registers, we must consider predicate registers, condition codes (carry flags from add, etc.), and special registers (e.g., SR_TID, SR_CTAID – though those are deterministic from thread indices which we know). We would need to capture those as well. For example, if a warp was in the middle of a divergent if in PTX, the execution mask (which threads are active) is part of the state. On Volta+, per-thread PC implicitly handles this (threads that exited the if will have a different PC than those in the if). On older GPUs, the reconvergence stack state would matter. Capturing that is extremely hard without hardware support. This is a limitation: certain control flow state (SIMT reconvergence information) may not be fully reconstructable from just registers and PC, unless the GPU provides it. The PTX model doesn’t explicitly expose the reconvergence stack. This suggests that our PTX-level state dump is more feasible on Volta and newer (with independent thread scheduling) than on older GPUs.
Reliability of PTX-State Reconstruction: In practice, achieving a fully reliable reconstruction is not guaranteed. If the code was compiled with optimizations that, say, split a PTX operation into several SASS operations, our checkpoint might catch the system in the middle of that operation. The PTX abstraction would consider that one instruction not yet completed. If we resume at the beginning of that PTX instruction, we must ensure the state is as it was before it started – which might mean rolling back a partial result. For instance, a 64-bit addition in PTX compiles to two 32-bit adds (with carry). If a checkpoint occurs after the first 32-bit add has executed (setting a carry flag), and we then resume at the PTX level, we would recompute the 64-bit add entirely. This might actually be fine (you’d just redo the first half), but we must be careful that no side-effects have occurred (memory writes, etc. – usually these complex instructions don’t have partial side-effects visible to memory, so PTX atomicity holds). Another example: PTX bar.sync (barrier) might be implemented via multiple SASS instructions and some internal state in hardware. Checkpointing in the middle of a barrier is complex – better to only checkpoint when all threads have reached a safe point.

Given these complexities, a conservative strategy is to only allow checkpoint at well-defined synchronization points in the kernel (or at least at PTX instruction boundaries which we can enforce by using the mapping info). By syncing all threads (e.g., at a block-wide barrier or after a certain loop iteration) and then checkpointing, we can treat the state as consistent at a higher level. The PTX dump would then be reliable. This is in line with how some application-level checkpointing works (inserting checkpoints in code at safe places). If truly arbitrary preemption is needed, one must account for these edge cases in the resume logic.

Conclusion and Practical Recommendations

Implementing a PTX-level state dump for NVIDIA GPUs during live migration is highly non-trivial, but with the right tools and constraints it can be approached. Key recommendations and findings:

Leverage NVIDIA’s binary utilities and compiler flags to get insight into the PTX↔SASS mapping. Compile kernels with -lineinfo (and include PTX in the fatbinary) so that tools like nvdisasm can map SASS addresses to PTX source lines. This will help identify the PTX program counter corresponding to a given SASS PC during a snapshot.
Use dynamic instrumentation frameworks like NVBit to capture live state. NVBit can insert code to record register values and other architectural state at runtime. This is likely the most flexible method to dump register content and PC for all threads without needing driver modifications. Expect significant overhead during the dumping phase – essentially you momentarily stop useful work to copy out state – but for migration this is acceptable.
Be mindful of optimizations: for more reliable mapping, consider compiling kernels with lower optimization (e.g., -O0 or -G for debugging). This will sacrifice some performance but will simplify state capture by keeping PTX and SASS closer in structure and avoiding elusive compiler transformations. As an NVIDIA forum expert noted, correlating PTX to SASS in highly optimized code is very difficult. In a live migration scenario, one could even use a special migration build of the kernel that is optimized for state extractability.
Memory state should be captured using official or stable methods: either the new CUDA 12 checkpoint API (which will safely copy device memory to host), or manual cudaMemcpy for each allocation. Ensure to also save the contents of any GPU caches or buffers that might not be globally visible (in most cases, global memory and relevant states are coherent or can be rederived, so the main concern is shared memory and registers which we addressed separately).
Understand that complete PTX-level resume portability has limits. In theory, with a full snapshot of all threads’ registers and PC, one could relaunch the computation on a different GPU by feeding the state into a PTX re-execution. In practice, differences in GPU architectures and unseen state (e.g., warp-level execution masks on pre-Volta GPUs, or certain special registers) can make this complicated. If migrating between identical GPU architectures (say two Turing GPUs), it may be simpler to capture the SASS state and resume at SASS level (which NVIDIA’s driver likely does internally for virtualization). For cross-architecture migration, PTX-level recompile is the goal – but expect to handle the nuances discussed (control flow, partial instruction sequences, etc.).
Prioritize official documentation and research: NVIDIA’s documentation on the new checkpoint features and developer blogs should be consulted to understand what the driver guarantees. Research papers like “Checkpoint/Restart for CUDA Kernels” and tools like Cricket and CRIUgpu provide insights into pitfalls others encountered (e.g., the need to intercept and replay CUDA API calls versus leveraging driver state save). These suggest that using the driver’s capabilities (when available) is far more robust than DIY hacks, if you can get access to them.

In conclusion, implementing a PTX-level state dump involves a hybrid of tool-assisted static analysis (to map between PTX and SASS and identify state variables) and dynamic instrumentation or debugging (to actually capture the live register/PC state). While it is possible to approximate PTX state from SASS, it is not fully reliable without careful control: you must account for compiler optimizations and hardware specifics. Whenever possible, force a correspondence (via debug mode or manual checkpoints in code) so that the PTX state is well-defined at the moment of capture. With these precautions, one can extract the needed information (registers, memory, program counters) and attempt a PTX-level resume on a target GPU. But one should remain aware that this is bleeding-edge – even NVIDIA’s own tooling only recently introduced basic checkpoint support (and that focuses on memory state). Thus, a PTX-centric migration strategy will require extensive validation. It’s wise to test on simple kernels (with known PTX-SASS mappings) and incrementally tackle more complex scenarios, using the aforementioned tools to verify that the state extracted and re-injected indeed reproduces the correct execution on the new device.

December 31, 2024February 24, 2025

CXLMemUring Update

1. 背景：CXL 与内存体系的演进

随着 CXL（Compute Express Link）对外设带宽和内存一致性需求的支持不断升级，越来越多的场景（如高性能计算、数据中心、AI 加速等）开始采用 CXL 的缓存一致性特性（CXL.cache）来让外部设备更灵活地访问主机内存。然而，这也带来了新的挑战：

设备的热插拔与故障风险

在传统的 PCIe/CXL.io 模式下，如果设备被热拔或因软硬件错误停机，通常只影响到 I/O 操作；但在 CXL.cache 模式下，外设有可能持有“Modified”状态的缓存行，一旦设备意外退出，则无法安全地将这些缓存行同步回主内存，可能导致数据丢失或更严重的系统级错误（如内核崩溃）。
安全性与侧信道风险

CXL.cache 在带来跨设备缓存一致性的同时，也暴露了更细粒度的访问通道。以往仅在 CPU 内部可能出现的缓存侧信道，如今可能演变为跨设备的安全隐患。
传统的防护方法（如粗粒度的地址访问控制）难以及时而精细地阻断潜在的恶意访问或侧信道攻击。
同步性能与延迟

随着更多设备加入一致性域，诸如共享队列、分布式归约（reduce）操作等场景的内存同步开销会显著上升。硬件需要更多手段来降低同步延迟。

Von Neumann Memory Wall 计算和存储之间的带宽与延迟矛盾依旧是系统瓶颈。即便有了 CXL.cache，对外设的访问在延迟和带宽方面仍难与 CPU 内部缓存层级相媲美。
基于这些问题，业界和学术界都在探索新的硬件/软件协作方案，以便在保证安全性的同时提升可靠性和性能。

2. 现有 SoTA（State-of-The-Art）方案及不足

2.1 现有典型方案
单向或异步加载/存储

通过在外设或内存侧增加多级缓冲或流水线机制，以期减少 CPU 对外设频繁通信时的阻塞。
在高性能计算或数据中心场景下，也有通过大页（Huge Page）、NUMA 优化等方法，减轻 CPU-设备间的通讯负担。
数据流加速器（Data Streaming Accelerator）

在 GPU、FPGA 或定制加速器中，往往引入类似 SPDK、DSA（Intel Data Streaming Accelerator）这样的技术，以批处理方式提升大吞吐数据操作的效率，减少内核干预。
大页或段级别的防护/隔离

在内存管理层面，通过页表标记等方式提供一定程度的访问隔离，把外设可访问的地址范围限制在特定分区/页中，避免无关的数据泄露。
2.2 现有方案的局限
Cache 污染与侧信道仍难彻底解决

即使采用大页或统一缓存一致性，外设在与 CPU 共享部分 LLC（Last-Level Cache）时，仍可能带来新的共享缓存侧信道。
粗粒度的防护手段难以阻止细粒度的攻击。
设备断开时的数据一致性问题

CXL.cache 模式下，如果关键数据一直处于外设的缓存行中（Modified 状态），当设备意外下线，将导致数据无法回写或者系统出现一致性错误。
目前缺乏一种硬件级的快速回退（rollback）或事务式保护机制。
同步操作的开销高

随着分布式、并行化的计算模式增多，共享数据结构（如环形队列、归约缓冲区等）的访问需要更多内存屏障或锁操作；跨设备的锁与屏障机制往往造成显著的延迟堆积。
编程与调试难度大

现有体系需要在设备驱动、内核、应用层多处改动；开发者若想在总线或内存级别做更灵活、更细粒度的管理，需要投入大量复杂的硬件/软件配合。

3. 新思路：在内存总线增加 eBPF-like 指令支持

为同时兼顾可靠性、安全性和性能，我们提出在主机与 CXL/root complex 之间增设一个“微型 eBPF-like 指令处理单元（tiny eBPF CPU）”的方案，核心思想如下：

总线级事务式内存操作（Transactional Memory on Bus）

允许外设将一组对关键控制面的内存写操作打包（类似 eBPF 程序），由总线侧的小型处理器执行并保证事务性。
如果事务执行中途外设意外断连或故障，整个事务可以回滚，避免出现“Modified”缓存行无人回写的情况。
这样一来，外设本身不再直接持有关键数据的独占/修改状态，也降低了热插拔带来的数据丢失风险。
可编程的细粒度安全检查

通过在总线侧维护一个可编程匹配表（matching table），结合 eBPF-like 指令，可以根据地址、访问模式、数据模式等进行快速检查和过滤；
实现类似 Linux seccomp 的“白名单”或“黑名单”机制，在更细的缓存行或地址范围上阻断可疑访问，减轻潜在侧信道攻击的风险。
降低同步延迟

由于部分跨设备的原子操作、锁或屏障可在总线侧以事务性执行，某些情况下比传统 DMA 往返更高效；
减少了软件层对外设状态的一致性管理负担，整体提升系统的并发性能。
灵活的可扩展性

eBPF-like 指令可以进一步应用于性能分析或诊断场景，比如在总线层监听和统计特定访问模式，以帮助开发者或系统管理员了解数据流分布，定位瓶颈或安全风险。

4. 实现策略

4.1 硬件结构与可能路径
CXL 专用实现

在 CXL Root Complex 侧增加一个微处理器核，用于执行 eBPF-like 指令包；
提供专门的 side-band 通道给外设发送这些指令包；
需在 CXL 规范中扩展或定义新协议字段，以实现对 CXL.cache 设备的安全事务化管理。
复用现有 NoC/总线协议

在诸如 TileLink、AMBA-AXI/ACE/CHI 等片上网络或总线协议中，加入 side-band 通道来传输 eBPF-like 指令；
好处是可以借助开源 IP 或既有总线生态，易于原型验证；
挑战在于，这些总线方案主要针对片上或极短距离的高带宽互连，如何说服审稿人或业界接受其“跨芯片”或外设场景，需要进一步探讨。
RISCV 扩展与软件评估

在 RISC-V 中扩展 LR/SC 指令序列，让 CPU 或总线能够将这些指令序列包装成 eBPF-like 程序并交给硬件执行；
有利于快速在软件层验证概念，可构建模拟器或 FPGA 测试平台来评估性能增益和安全增强效果。
4.2 关键设计点
事务内存 vs. eBPF

事务内存主要提供原子性和一致性保障；
eBPF-like 部分提供灵活的编程接口，用于地址匹配、安全策略执行、数据过滤等。
二者结合使得硬件具备高可编程性与安全回退的能力。
安全防护与可控回退

可在总线侧设置专门的访问策略：
对关键地址范围强制使用事务模式；
若外设故障，回退该事务并拒绝后续访问；
由此杜绝“Modified”缓存行无法回写的情况。
性能与兼容性

需要平衡新硬件引入的额外时延（如指令解析、表项匹配等）与带来的安全/可控好处；
对于普通访问，可走传统路径，不额外产生开销；
对需要安全/事务保障的访问，才使用 eBPF-like 处理器，减少对整体系统吞吐的影响。

5. 评估与展望

安全与Profiling演示

在小规模实验或原型系统中，通过 eBPF-like 指令完成地址过滤、访问统计，以及在外设异常时触发回退和阻断；
以此证明该方案能有效应对细粒度的安全风险。
事务内存的可行性

对比传统软件锁或 TSX（Intel Transactional Synchronization Extensions）等方式，验证在跨设备场景下，硬件级回退如何避免外设故障带来的数据丢失；
探讨如何防御 TSX 异步中止（Asynchronous Abort）等可能的攻击手段。
对现行 CXL Type 1/Type 2/Type 3 设备的兼容

如果只面向 CXL Type 2（带缓存一致性的设备），则需要在设计上确保对传统纯内存类设备（Type 3）或无缓存一致性的设备也能部分兼容；
在不同类型设备上分别测试性能和安全收益，形成全面的评估报告。

6. 总结

面对未来异构计算与内存层次日益复杂的趋势，“在内存总线增加 eBPF-like 指令支持”是一种兼顾安全性、可靠性及性能的前瞻性方案。它通过在总线侧对关键地址访问进行事务化处理并融合可编程逻辑，实现了：

避免设备热插拔和故障所引发的关键数据丢失；
在更精细的粒度上进行安全访问控制，减轻侧信道风险；
减少软件同步开销，提高分布式或并行场景下的整体效率。
尽管实现过程中需要对现行 CXL 或其他总线协议做出一定扩展，但从概念验证（PoC）到实际硬件落地，都具备可行的技术路线与潜在价值。后续仍需在事务内存模型、eBPF-like 指令的语义及安全性加固策略上持续探索与完善。

参考文献

Fan Yao, et al. “CACo: A Comprehensive Approach to Cache Covert Channel Mitigation”, HPCA, 2018.

June 6, 2024June 23, 2024

ClickHouse IOUring

iouring fs插桩bpf uring context，到最底下的nvme层dispatch一段batching读的代码。iouring sock xdp一个请求接一个请求dispatch。穿透两层，一个是iouring层，另外是xdp和xrp插桩的地方层。
MergeTree到ReplicatedMergeTree，想用iouring batch socket接read的一些请求。

March 7, 2023January 13, 2024

Qemu CXL type 1 emulation proposal

Introduction to CXL Type 1

Guided Usecase

[1] and [2] are just Qemu's implementation of dm-crypto for LUKS; every device mapper over a physical block device will require a key and a crypto accelerator or software crypto implementation to decrypt to get the data. We implement a crypto accelerator with CXL type 1 semantics over a framework of virtio-crypto-pci. We want to emulate the mal state or unplug the crypto device; the kernel will get ATS bit DMAed data and resume CPU software crypto implementation.

Device emulation

DMA and access memory

Create a CacheMemRegion that maps a specific SPP region for one mapping of a bunch of CXL.cache caches on a CXL device.

Crypto operations

When calling crypto operations in the kernel, we actually offload the encrypt/decrypt operations to the type 1 accelerator through CXL.io, which tells the device cast on operation on arbitrary SPP. The accelerator will first take ownership of arbitrary SPP in the CacheMemRegion and notify the host. Eventually, the host will get the shared state of the SPP's cacheline.

Cache coherency emulation

struct D2HDataReq = {
    D2H DataHeader 24b;
    opcode 4b;
    CXL.cache Channel Crediting;
}
struct CXLCache = { 
    64Byte Data; 
    MESI 4 bit;
    ATS 64 bit;
    [D2HDataReq;n] remaining bit;
}

Use metadata with Intel SPP writes protection support and mark the access to an arbitrary cacheline in the SPP. We need to perform all the transaction descriptions and queuing in the 64-byte residue data in the SPP. The arbitrary operation, according to the queue, will have the effect of MESI bit change and update writes protection for the subpage and root complex or other side effects like switches changing.

The host's and device's requests are not scheduled FIFO, but the host's seeing the data will have better priority. So, the H2D requirement will be consumed first and done in D2H FIFO. All the operations follow interface operations to CXLCache.

Taking exclusiveness-able

We mark the Transportation ATS bit as exclusiveness-able. and copy the cacheline in another map; once emulated and unplugged, the cacheline is copied back for further operation of the kernel to resume software crypto calculation.

How to emulate the eviction

We have two proposals

pebs to watch the cache eviction of the physical address of an SPP
Use sub-page pin page_get_fast to pin to a physical address within the last-level cache. [7]

The code is currently developed at https://github.com/SlugLab/Drywall/ and pitfalls refer to https://asplos.dev/wordpress/2023/11/27/intel-sub-page-write-protection-cai-keng/

Alternative implementation

Maybe we should limit the operation of the address from CXL.cache, write a memory santinization assisted JIT compiler and use LAM instead.

Reference

December 8, 2022November 18, 2023

[CSE231 Project] Multi-V Virtual Machine - A binary offloading paradigm onto the phone

October 9, 2022January 8, 2023

Project Proposal for CSE231

August 9, 2022January 8, 2023

Final Report of Program Analysis

March 28, 2022November 18, 2023

[Computer Architecture] Paper Reading

March 3, 2022March 25, 2022

[Computer Architecture] Sniper lab2

The code is at http://victoryang00.xyz:5012/victoryang/sniper_test/tree/lab2.

The lab was to implement the Hash perceptron according to the paper "Dynamic-Branch-Prediction-with-Perceptrons".

The cache implementation is in common/performance_model/branch_predictor.cc

#include "perceptron_branch_predictor.h"
...
else if (type == "perceptron")
{
   return new PerceptronBranchPredictor("branch_predictor", core_id);
}

The modification in the cfg

[perf_model/branch_predictor]
type = perceptron    # add Perceptron
mispredict_penalty=8 # Reflects just the front-end portion (approx) of the penalty for Interval Simulation

The constructor is passed to the perceptron_branch_predictor.h, we have to maintain the member variable as below:

struct PHT {
        bool taken;     //jump or not
        IntPtr target;  //64 history target address. 
        PHT(bool bt, IntPtr it) : taken(bt), target(it) {}
    };//The pattern history table, set like a round linked list to avoid the memory copy

//To store the history result.
		SInt32 history_sum;
    UInt32 history_bias_index;
    std::vector<UInt32> history_indexes;
    std::vector<PHT> history_path;
    UInt32 history_path_index;

    std::vector<PHT> path;
    UInt32 path_index;

    UInt32 path_length;
    UInt32 num_entries;
    UInt32 num_bias_entries;//1024
    UInt32 weight_bits;
    std::vector<std::vector<SInt32>> weights;
    std::vector<SInt32> perceptron;           //Update the perceptron table to 1024 entries
    UInt32 theta;															//Threshold, to determine the train is good
    SInt64 count;														  //To store the count
    UInt32 block_size;											  //block sides initialize to perseption

    // initialize the branch history register and perceptron table to 0
PerceptronBranchPredictor::PerceptronBranchPredictor(String name, core_id_t core_id)
    : BranchPredictor(name, core_id), history_sum(0), history_bias_index(0), history_indexes(64 / 8, 0), history_path(64, PHT(false, 0)), history_path_index(0), path(64, PHT(false, 0)), path_index(0), path_length(64), num_entries(256), num_bias_entries(1024), weight_bits(7), weights(256, std::vector<SInt32>(64, 0)), perceptron(1024, 0), theta(296.64), count(0), block_size(8), coefficients(64, 0)

The prediction method

The Base neral predictor is based on the equation $y=w_{0}+\sum_{i=1}^{n} x_{i} w_{i}$, We have that

// if the prediction is wrong or the weight value is smaller than the threshold THETA, 
// then update the weight  to train the perceptron
weights = floor(1.93 * blocksize + 14)

Hashed Perceptron Multiple

set of branches assigned to same weights
More than one set of information can be used to hash into the weight table
Diferent hashing function can be used such as : XOR, Concatenation etc.
Diferent indices can be assigned to diferent weights

Algorithm

The input is ip, and first come to a computation of (ip >> 4) % num_bias_entries;

This give the seed to the weight table lines:

table of perception will get update to te coefficient*weight and

and weight table will update to weights[index[i]][i*8...(i+1)*8]*coefficients[i*8...(i+1)*8]*h

if the result >0 the prediction is right then jump, or not jump.

Then update the history and insert the round linked list.

bool PerceptronBranchPredictor::predict(IntPtr ip, IntPtr target)
{
    double sum = 0;
  //hash method
    UInt32 bias_index = (ip >> 4) % num_bias_entries;

    sum += bias_coefficient * perceptron[bias_index];
    history_bias_index = bias_index;
  //update the weight
    for (UInt32 i = 0; i < path_length; i += block_size)
    {
        IntPtr z = ip >> 4;
        for (UInt32 j = 0; j < block_size; j++)
        {
            z ^= (path[(path_index - i - j + path_length - 1) % path_length].target >> 4);
        }
        UInt32 index = z % num_entries;
        history_indexes[i / block_size] = index;
        for (UInt32 j = 0; j < block_size; j++)
        {
            SInt32 h = path[(path_index - i - j + path_length - 1) % path_length].taken ? 1 : -1;
            sum += h * weights[index][i + j] * coefficients[i + j];
        }
    }
    bool result = ((SInt32)sum) >= 0;
    history_sum = (SInt32)sum;

    history_path.assign(path.begin(), path.end());
    history_path_index = path_index;
    path[path_index] = PHT(result, target);
    path_index = (path_index + 1) % path_length;

    return result;
}

The train process

Algorithm

Input parameters: predicted, actual, ip
auxiliary parameters: history_sum(predicted value), history_bias_index, history_indexes(index). history_path(history), history_path_index(history subscript)

for each bit in parallel
	if t=xi then
  		wi=wi+1
    else
    	wi=wi-1
    end if

Realization

for (UInt32 i = 0; i < path_length; i += block_size)
        {
            index = history_indexes[i / block_size];
            for (UInt32 j = 0; j < block_size; j++)
            {
                bool taken = history_path[(history_path_index - i - j + path_length - 1) % path_length].taken;
                if (taken == actual)
                {
                    if (weights[index][i + j] < (1 << weight_bits) - 1)
                        weights[index][i + j]++;
                }
                else
                {
                    if (weights[index][i + j] > -(1 << weight_bits))
                        weights[index][i + j]--;
                }
            }
        }

The result

FFT with Blocking Transpose
   1024 Complex Doubles
   1 Processors
   65536 Cache lines
   16 Byte line size
   4096 Bytes per page

	pentium_m	perceptron
num correct	63112	65612
num incorrect	4342	2112
IPC	1.36	1.37
mpki	2.71	1.39
misprediction rate	6.436%	3.118%
Elapsed time	4.46s	5.23s
cycles	1.2M	1.2M
Instructions	1.6M	1.6M

Reference

https://github.com/ChrsMark/BranchPredictors

March 3, 2022March 3, 2022

[Computer Architecture] Sniper lab3

The code can be viewed on http://victoryang00.xyz:5012/victoryang/sniper_test/tree/lab3

false sharing

For the false sharing test cases. We've given the lab3 cfg file that the cache line is 64B. So that we just need to set the false sharing variable under that cache size.

In the false_sharing_bad.c, we open 2 thread to store global variable results with first part th1 visits and the second part th2 visits.

void* myfunc(void *args){
    int i;
    MY_ARGS* my_args=(MY_ARGS*)args;
    int first = my_args->first;
    int last = my_args->last;
    int id = my_args->id;
    // int s=0;
    for (i=first;i<last;i++){
        results[id]=results[id]+arr[i];
    }

    return NULL;
}

The result of this in the sniper.

[SNIPER] Warning: Unable to use physical addresses for shared memory simulation.
[SNIPER] Start
[SNIPER] --------------------------------------------------------------------------------
[SNIPER] Sniper using SIFT/trace-driven frontend
[SNIPER] Running full application in DETAILED mode
[SNIPER] --------------------------------------------------------------------------------
[SNIPER] Enabling performance models
[SNIPER] Setting instrumentation mode to DETAILED
[RECORD-TRACE] Using the Pin frontend (sift/recorder)
[TRACE:1] -- DONE --
[TRACE:2] -- DONE --
s1 = 5003015
s2= 5005373
s1+s2= 10008388
[TRACE:0] -- DONE --
[SNIPER] Disabling performance models
[SNIPER] Leaving ROI after 627.88 seconds
[SNIPER] Simulated 445.0M instructions, 192.9M cycles, 2.31 IPC
[SNIPER] Simulation speed 708.8 KIPS (177.2 KIPS / target core - 5643.3ns/instr)
[SNIPER] Setting instrumentation mode to FAST_FORWARD
[SNIPER] End
[SNIPER] Elapsed time: 627.79 seconds

In the false_sharing.c, we open 2 thread to store different local variable s with th1 visits and th2 visits.

void* myfunc(void *args){
    int i;
    MY_ARGS* my_args=(MY_ARGS*)args;
    int first = my_args->first;
    int last = my_args->last;
    // int id = my_args->id;
    int s=0;
    for (i=first;i<last;i++){
        s=s+arr[i];
    }

    my_args->result=s;
    return NULL;
}

The result of this in the sniper.

[SNIPER] Warning: Unable to use physical addresses for shared memory simulation.
[SNIPER] Start
[SNIPER] --------------------------------------------------------------------------------
[SNIPER] Sniper using SIFT/trace-driven frontend
[SNIPER] Running full application in DETAILED mode
[SNIPER] --------------------------------------------------------------------------------
[SNIPER] Enabling performance models
[SNIPER] Setting instrumentation mode to DETAILED
[RECORD-TRACE] Using the Pin frontend (sift/recorder)
[TRACE:2] -- DONE --
[TRACE:1] -- DONE --
s1 = 5003015
s2= 5003015
s1+s2= 10006030
[TRACE:0] -- DONE --
[SNIPER] Disabling performance models
[SNIPER] Leaving ROI after 533.95 seconds
[SNIPER] Simulated 415.1M instructions, 182.1M cycles, 2.28 IPC
[SNIPER] Simulation speed 777.3 KIPS (194.3 KIPS / target core - 5145.9ns/instr)
[SNIPER] Setting instrumentation mode to FAST_FORWARD
[SNIPER] End
[SNIPER] Elapsed time: 533.99 seconds

The reason of false sharing:

Every time the thread may let the results to get into the CPU cache.

The Cache may check whether the cache part and memory are the same or not, thus trigger the latency, which is false sharing.

The solution of the false sharing:

Just let the adjacent data's distance larger than the one cache line, say 64B. so set FALSE_ARR to 200000.

The result changed to:

[SNIPER] Warning: Unable to use physical addresses for shared memory simulation.
[SNIPER] Start
[SNIPER] --------------------------------------------------------------------------------
[SNIPER] Sniper using SIFT/trace-driven frontend
[SNIPER] Running full application in DETAILED mode
[SNIPER] --------------------------------------------------------------------------------
[SNIPER] Enabling performance models
[SNIPER] Setting instrumentation mode to DETAILED
[RECORD-TRACE] Using the Pin frontend (sift/recorder)
[TRACE:1] -- DONE --
[TRACE:2] -- DONE --
s1 = 5003015
s2= 5005373
s1+s2= 10008388
[TRACE:0] -- DONE --
[SNIPER] Disabling performance models
[SNIPER] Leaving ROI after 512.28 seconds
[SNIPER] Simulated 445.1M instructions, 158.1M cycles, 2.82 IPC
[SNIPER] Simulation speed 868.8 KIPS (217.2 KIPS / target core - 4604.2ns/instr)
[SNIPER] Setting instrumentation mode to FAST_FORWARD
[SNIPER] End
[SNIPER] Elapsed time: 512.22 seconds

multi-level cache

The non-inclusive cach is to remove the back-invalidation and fix one.

Then I found https://groups.google.com/g/snipersim/c/_NJu8DXCVVs/m/uL3Vo24OAAAJ. That the non-inclusive cache intends to directly write back to the memory. We have to fix some bugs for inclusive cache during the L1 eviction and L2 have to evict, but L2 may not be found, thus the L1 should WB to memory, in this time.

I add a new configuration in the cfg. to make the L1 non-inclusive or not and deployed different Protocol in cfg.

[caching_protocol]
type = parametric_dram_directory_msi
variant = mesif                           # msi, mesi or mesif

I didn't do a result with a small granularity but with lock_add lock_fill_bucket reader_writer and bfs, I got num of write backs and some WB test for inclusive and non inclusive caches.

Test

Lock Add: In this test all the threads try to add ‘1’ to a global counter using locks. We see lower number of memory writebacks in MOSI because of the presence of the owner state.

Lock Fill Bucket: This tests makes buckets for numbers, so that it can count how many times each number is present in the array. This is done using locks. The dragon protocol is performing much worse here compared to others. This is probably because updates to the buckets are not always needed by other processors, hence the updates on the writes do not help.

Result

Protocol

Protocol\IPC	lock_add	lock_fill_bucket	reader_writer	bfs
MSI	1.31	1.32	1.27	1.39
MESI	1.35	1.36	1.29	1.39
MESIF	1.35	1.36	1.30	1.39

The MESIF protocol enhances the performance for teh multicore systems. MESIF protocal enhances the performance for multicore system. It may be aware of the micro arch.

CPI stack

The CPI stack can quantify the cycle gone due to memory branch or sync. literally all of the protocal has similar graph as above. The MESIF shows the lowest time in mem and sync.

Protocol\L2 miss rate	lock_add	lock_fill_bucket	reader_writer	bfs
MSI	49.18	20.13	27.12	42.24
MSI (with non-inclusive cache)	47.98	20.01	29.12	42.13
MESI	46.21	21.13	31.29	41.31
MESI (with non-inclusive cache)	45.13	21.41	26.41	42.15
MESIF	45.71	20.12	25.14	41.39
MESIF (with non-inclusive cache)	46.35	23.14	24.14	41.13

The non-inclusive cache have a better score than inclusive ones in L2 miss rate and MESIF & MESI are better than the MSI.

Summary

Interesting Conclusions

Adding ‘E’ state:

To measure the performance differences caused by adding the Exclusive state to the protocols, we can look at the differences in metrics in MSI vs MESI and MESIF. The main benefit of the Exclusive state is in reducing the number of snooping bus transactions required. If we consider a parallel program where each thread works on a chunk of an array and updates only that chunk, or if we assume a sequential program that has a single thread, then in these cases, there will be a lot of cases where a program is reading a cacheline and updating it. In MSI, this would translate to first loading the cacheline using a BusRd moving to the S state, and then performing a BusRdX and moving to the M state. This requires two snooping bus transactions. In the case of MESI, this can be done in a single transaction. The cache would perform a BusRd moving to the E state and then since no other cache has the cacheline, there is no need of a BusRdX transaction to move to the M state. It can just silently change the status of the cacheline to Modified.
This gives a significant boost in programs which access and modify unique memory addresses.

Write-Invalidation vs Write-Update

Since we have implemented both write invalidation and write update protocols, our simulator can also tell whether for a given program or memory trace, write invalidation protocols will be better or write update.
For a write-invalidation protocol, when a processor writes to a memory location, other processor caches need to invalidate that cacheline. In a write-update protocol, instead of invalidating the cachelines, it sends the updated cacheline to the other caches. Therefore, in cases where the other processors will need to read those values in the future, write-update performs well, but if the other processors are not going to be needing those values, then the updates are not going to be of any use, and will just cause extra bus transactions. Therefore, the effects of the protocol would be completely dependent.
From our tests, we saw lesser number of bus transactions This would explain why updating rather than invalidating reduced the number of bus transactions.

MOESI (Unchecked)

I refer to the graph of

The code is in dir_moesi directory with one state added.

I just make it runnable and not yet tested.