How do you map the SIMT model onto a Tenstorrent device?

1. Overview of the SIMT Model

  • In SIMT, a group of “threads” (a warp) executes the same instruction in lock-step, with divergence handled via masking and predication. Each warp has a single program counter, and individual lanes (threads) are made inactive on divergent branches via a per-lane mask.
  • On GPUs (e.g., NVIDIA), programmers launch a kernel as a grid of thread blocks; each block contains multiple warps. The hardware scheduler assigns warps to Streaming Multiprocessors (SMs), and within each warp all lanes advance in lock-step until divergence occurs. A small sketch of this lock-step-plus-mask behavior follows below.
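  • Here is a minimal host-side C++ sketch (not GPU code) that emulates one 32-lane warp taking a divergent branch; the per-lane mask decides which lanes are active on each path. Everything in it is illustrative.

    #include <array>
    #include <cstdio>

    constexpr int kWarpSize = 32;  // one "warp" of lanes executing in lock-step

    int main() {
        std::array<int, kWarpSize> a{}, b{}, out{};
        for (int lane = 0; lane < kWarpSize; ++lane) { a[lane] = lane; b[lane] = 2 * lane; }

        // Divergent branch: if (lane < kWarpSize/2) out = a + b; else out = a - b;
        std::array<bool, kWarpSize> mask{};
        for (int lane = 0; lane < kWarpSize; ++lane)
            mask[lane] = (lane < kWarpSize / 2);          // per-lane predicate

        // "True" path: only lanes with mask == 1 are active; the others idle.
        for (int lane = 0; lane < kWarpSize; ++lane)
            if (mask[lane]) out[lane] = a[lane] + b[lane];

        // "False" path: the mask is inverted and the remaining lanes execute.
        for (int lane = 0; lane < kWarpSize; ++lane)
            if (!mask[lane]) out[lane] = a[lane] - b[lane];

        for (int lane = 0; lane < kWarpSize; ++lane) std::printf("%d ", out[lane]);
        std::printf("\n");
        return 0;
    }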

2. Tenstorrent’s Many-Core Architecture

  1. Tensix Cores and Vector Units
    • Each Tensix core contains its own RISC-V control processors, each with its own register file and scalar ALU, plus a matrix engine and a vector unit (referred to here as the Vector Processing Unit, or VPU). There is no built-in concept of “warps” across multiple cores: each core fetches its own instruction stream.
    • Grayskull (the first generation) featured a 64-lane VPU with 19-bit FP support; the Wormhole/Blackhole generations moved to a 32-lane VPU with 32-bit FP support. Because the RISC-V cores are not inherently “SIMT-aware,” programmers rely on explicit masking within the VPU to emulate lock-step execution at the lane level.
  2. Spatial Tiling and Core Grid
    • Tenstorrent devices are built as a 2D grid of Tensix cores connected via a high-bandwidth, low-latency mesh network. Each core has its own local (L1) SRAM; data is moved between off-chip DRAM and each core’s SRAM through explicit transfers over the mesh rather than through an implicit cache hierarchy.
    • Unlike CUDA’s implicit warp scheduling, Tenstorrent programs must explicitly partition work across cores (often via Metalium or MLIR). The compiler then places compute operators (e.g., matrix multiplies) onto specific cores according to a logical “core grid” (e.g., a 7×1 grid of cores for one operator). A sketch of this explicit partitioning appears below.
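    • To make “explicitly partition work across cores” concrete, the plain C++ below computes, per logical core, which tile of a 1024×1024 tensor that core owns. The 8×8 grid, tile sizes, and struct names are illustrative assumptions, not the actual Metalium or TT-MLIR API.

      #include <cstdio>

      struct Core { int x, y; };                       // logical grid coordinate (assumed)

      int main() {
          constexpr int kGridX = 8, kGridY = 8;        // assumed logical core grid
          constexpr int kRows = 1024, kCols = 1024;    // global problem size
          constexpr int kTileRows = kRows / kGridY;    // rows owned by each core
          constexpr int kTileCols = kCols / kGridX;    // cols owned by each core

          // The host (or compiler) decides, per core, which tile of the tensor it owns.
          for (int y = 0; y < kGridY; ++y) {
              for (int x = 0; x < kGridX; ++x) {
                  Core c{x, y};
                  int row0 = c.y * kTileRows, col0 = c.x * kTileCols;
                  std::printf("core(%d,%d) owns rows [%d,%d) cols [%d,%d)\n",
                              c.x, c.y, row0, row0 + kTileRows, col0, col0 + kTileCols);
              }
          }
          return 0;
      }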

3. Why SIMT → Tenstorrent Requires Rethinking

  • No Hardware Warp Scheduler
    There is no hardware unit that automatically groups threads into warps or handles divergence scheduling across multiple Tensix cores. Each Tensix core operates independently unless the program explicitly coordinates them.
  • Divergence Is Explicit
    In SIMT, divergence is hidden behind hardware masks; on Tenstorrent, one must set mask registers manually to disable lanes within the VPU when conditional branches occur. There is no implicit warp-level stack.
  • Memory Hierarchy Differences
    GPUs expose shared and global memory with well-understood semantics (e.g., __shared__ in CUDA). Tenstorrent’s cores each have local scratch (L1), with explicit direct memory access (DMA) units to fetch data from DRAM into local SRAM. One must orchestrate when and how each core reads/writes memory; a staging sketch follows below.
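  • A minimal host-side model of that orchestration: “DRAM” is a byte vector, “L1” is a small scratch buffer, and dma_read is just memcpy. The names and block sizes are assumptions used only to show the explicit staging pattern, not Tenstorrent’s DMA API.

    #include <cstdint>
    #include <cstdio>
    #include <cstring>
    #include <vector>

    // Stand-in for a DMA transfer from DRAM into a core's L1 scratch.
    static void dma_read(std::uint8_t* l1_dst, const std::uint8_t* dram_src, std::size_t bytes) {
        std::memcpy(l1_dst, dram_src, bytes);
    }

    int main() {
        constexpr std::size_t kCores = 4, kElems = 1024, kBlock = 64;
        std::vector<std::uint8_t> dram(kElems);
        for (std::size_t i = 0; i < kElems; ++i) dram[i] = static_cast<std::uint8_t>(i);

        // Each core owns a contiguous chunk and stages it block by block into L1.
        for (std::size_t core = 0; core < kCores; ++core) {
            std::uint8_t l1[kBlock];                     // per-core scratch buffer
            std::size_t chunk = kElems / kCores;         // bytes owned by this core
            std::size_t sum = 0;
            for (std::size_t off = 0; off < chunk; off += kBlock) {
                dma_read(l1, dram.data() + core * chunk + off, kBlock);
                for (std::size_t j = 0; j < kBlock; ++j) sum += l1[j];   // compute on staged data
            }
            std::printf("core %zu partial sum = %zu\n", core, sum);
        }
        return 0;
    }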

4. Strategy 1: Emulate Warp-Level Execution Inside a Single Tensix Core

  1. Lane-Level Parallelism via Vector Instructions

    • Map one GPU warp (e.g., 32 or 64 threads) directly onto the VPU lanes within a single Tensix core. Use vector registers (e.g., v0–v31) and mask registers (e.g., m0) to broadcast a single instruction to all lanes.
    • Whenever a SIMT thread block would have launched a 1D warp of N threads, rewrite that kernel loop so that each loop iteration populates all N lanes in a vector register (plus any required masking bits for divergence); see the loop-rewrite sketch below.
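    • A minimal sketch of that loop rewrite, using plain C++ loops to stand in for VPU vector registers and instructions (the 32-lane width is an assumption for this generation):

      #include <cstddef>
      #include <vector>

      constexpr std::size_t kLanes = 32;   // assumed VPU width

      // GPU kernel "out[i] = a[i] + b[i]" (one thread per element) becomes one
      // iteration per kLanes-wide vector; the inner loop models a single vector add.
      void vector_add(const std::vector<float>& a, const std::vector<float>& b,
                      std::vector<float>& out) {
          std::size_t n = out.size();
          std::size_t full = (n / kLanes) * kLanes;
          for (std::size_t base = 0; base < full; base += kLanes)
              for (std::size_t lane = 0; lane < kLanes; ++lane)
                  out[base + lane] = a[base + lane] + b[base + lane];
          // The tail corresponds to masking off the unused lanes of the last vector.
          for (std::size_t i = full; i < n; ++i) out[i] = a[i] + b[i];
      }
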
  2. Handling Divergence with Predication

    • Use the VPU’s mask registers (written here as vmask) to enable/disable specific lanes when an if condition is encountered. In CUDA this happens implicitly; with Metalium you must express the predication explicitly, for example (illustrative pseudo-instructions):

      vfcmpeq vmask, vcond, 0          // set mask where condition is false
      vmov vout[vmask] = vfalse_value   // masked move: only lanes where mask=1 are updated
      
    • On divergent branches, split execution paths into lane-active and lane-inactive sets, updating masks accordingly. This is manual predication rather than hardware-driven.

  3. Example: Vector Add with Divergence

    // Suppose GPU kernel: if (threadIdx.x < N/2) out[i] = A[i]+B[i];
    // else out[i] = A[i]-B[i];
    // Vector lanes = warp size (e.g., 32).
    load_vector vA, [baseA]            // loads 32 elements of A
    load_vector vB, [baseB]            // loads 32 elements of B
    set_vector_lane_idx vidx            // each lane holds its index 0..31
    vcmpoge vmask, vidx, (N/2)          // mask lanes where idx >= N/2
    vadd vtmp1, vA, vB [~vmask]         // add where mask=0 (idx < N/2)
    vsub vtmp2, vA, vB [vmask]          // sub where mask=1 (idx >= N/2)
    vor vout, vtmp1, vtmp2              // merge results (assumes inactive lanes were zeroed)
    store_vector [baseOut], vout
    

    In the above:

    • vcmpoge creates a mask with 1 for lanes where the condition is true.
    • vadd … [~vmask] executes add only on lanes where vmask=0.
    • vsub … [vmask] executes subtract only on lanes where vmask=1.
    • Finally, vor (bitwise OR) or an unconditional masked move merges the two vectors; the OR merge is only valid if the inactive lanes of vtmp1/vtmp2 start out as zero.
  4. Pros & Cons of Single-Core SIMT Emulation

    • Pros:
      • Reuses the VPU’s 32/64 lanes to mimic warp parallelism.
      • No need to coordinate across multiple cores for simple kernels.
    • Cons:
      • Divergence overhead is higher than real GPU SIMT because predication and merging must be explicit.
      • Performance varies as the VPU width changes across chip generations (Grayskull vs. Wormhole), so kernels must detect the width or be recompiled per target.

5. Strategy 2: Map “Warps” Across Multiple Tensix Cores

  1. Partition the Thread-Block Over Multiple Cores

    • Instead of mapping an entire warp into one core’s VPU, split the warp into sub-groups across adjacent cores in the 2D grid. For instance, treat a 32-thread warp as 8 lanes on each of 4 cores (or 4 lanes on each of 8), using only a subset of each core’s VPU lanes.
    • Each core executes the same instruction stream (SPMD style) but uses only its local lanes. To keep them in lock-step, broadcast the control flow decisions (the “program counter”) via explicit barriers or very fast mesh synchronization.
  2. Explicit Synchronization & Broadcast

    • In CUDA, warp divergence is handled by an internal stack and warp-level barrier. On Tenstorrent, you must insert an explicit core-level barrier after the divergence test. For example, each core computes its local boolean predicate; then all cores exchange one bit of information (e.g., via a reduction or broadcast instruction) to decide if any lane took the “true” path. Only after sharing this single bit can every core agree on which sub-vector to execute next.
    • Tenstorrent exposes low-latency barrier/semaphore primitives in Metalium that block until all designated cores reach the same synchronization point. Use these to synchronize control flow across cores; a host-side model of the agreement step is sketched below.
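    • A minimal host-side model of that agreement step: each “core” computes a local predicate, a reduction combines them, and every core then knows whether to run the true path, the false path, or both with masks. The sequential loop stands in for whatever mesh barrier/reduction primitive the device actually provides.

      #include <array>
      #include <cstdio>

      int main() {
          constexpr int kCores = 4;
          std::array<bool, kCores> local_pred{};
          for (int c = 0; c < kCores; ++c)
              local_pred[c] = (c % 2 == 0);            // some per-core branch condition

          // "Exchange one bit": models the barrier plus reduction across cores.
          bool any_true = false, all_true = true;
          for (int c = 0; c < kCores; ++c) { any_true |= local_pred[c]; all_true &= local_pred[c]; }

          // Every core now takes the same decision about which sub-vectors to run.
          if (all_true)
              std::printf("all cores take the true path only\n");
          else if (!any_true)
              std::printf("all cores take the false path only\n");
          else
              std::printf("divergent: each core runs both paths with its local mask\n");
          return 0;
      }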
  3. Memory Layout Across Cores

    • Distribute input arrays (A, B, …) such that each core reads a contiguous chunk from DRAM into its L1 scratch via DMA. This is analogous to CUDA’s coalesced loads, but done explicitly by computing each core’s tile coordinates in the global index space.
    • After computing, store each core’s partial outputs back to DRAM using DMA, again using explicit address calculations to avoid bank conflicts.
  4. Example: Vector Dot-Product Over Four Cores
    Imagine you want to compute a dot product of two length-1024 vectors using a warp of 32 “threads.” On Tenstorrent, choose 4 cores and use 8 VPU lanes on each (32 lanes in total).

    • Tile Assignment:

      • Core (0,0) processes lanes 0–7 (indices 0..7).
      • Core (0,1) processes lanes 8–15.
      • Core (0,2) processes lanes 16–23.
      • Core (0,3) processes lanes 24–31.
    • Kernel Sketch on Each Core:

      // Each core: baseIdx = core_col * (warpSize / numCores)
      baseIdx = core_col * 8
      // Loop over 1024/32 = 32 iterations to cover full vector:
      for (i = 0; i < 32; ++i) {
        globalIdx = baseIdx + i * 32   // Each iteration jumps by warp size
        dma_load vA, [A + globalIdx]   // Load 8 elements into vector register
        dma_load vB, [B + globalIdx]   // Load 8 elements 
        vfmadd vAcc, vA, vB, vAcc      // Accumulate partial dot product
      }
      // Perform tree-reduce within VPU lanes to get a single scalar in lane 0:
      vreduce_sum vSum, vAcc
      // Now do an asynchronous global barrier to share partial results:
      barrier_and_reduce_sum globalAcc, vSum
      // After the barrier every core holds the reduced result; only core (0,0) stores it.
      if (core_row == 0 && core_col == 0) {
        store [DotResult], globalAcc
      }
      
      • vfmadd: fused multiply-add across vector lanes.
      • vreduce_sum: tree-reduce across the 8 lanes within the VPU.
      • barrier_and_reduce_sum: a hypothetical Metalium primitive that sums vSum from all participating cores (0,0–0,3) and broadcasts the final 32-bit scalar to each core’s local register. A host-side check of the indexing and reduction appears below.
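    • Host-Side Check of the Indexing:
      The plain C++ below mirrors the sketch above (4 “cores,” 8 lanes, 32 iterations) and verifies that every element of the length-1024 vectors is visited exactly once and that the partial sums reduce to the true dot product. It is a reference model only, not device code.

      #include <cstdio>
      #include <vector>

      int main() {
          constexpr int kCores = 4, kLanes = 8, kIters = 32;
          constexpr int kN = kCores * kLanes * kIters;               // 1024 elements
          std::vector<float> A(kN), B(kN);
          for (int i = 0; i < kN; ++i) { A[i] = 1.0f; B[i] = static_cast<float>(i); }

          std::vector<int> visits(kN, 0);
          double global_acc = 0.0;                                   // models barrier_and_reduce_sum
          for (int core_col = 0; core_col < kCores; ++core_col) {
              double acc = 0.0;                                      // models vAcc + vreduce_sum
              int base_idx = core_col * kLanes;
              for (int i = 0; i < kIters; ++i) {
                  int global_idx = base_idx + i * kCores * kLanes;   // stride = warp size (32)
                  for (int lane = 0; lane < kLanes; ++lane) {
                      acc += A[global_idx + lane] * B[global_idx + lane];
                      visits[global_idx + lane]++;
                  }
              }
              global_acc += acc;
          }

          bool covered = true;
          for (int v : visits) covered = covered && (v == 1);
          std::printf("covered exactly once: %s, dot = %.1f (expected %.1f)\n",
                      covered ? "yes" : "no", global_acc, (kN - 1) * kN / 2.0);
          return 0;
      }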
  5. Pros & Cons of Multi-Core SIMT Mapping

    • Pros:
      • Spreads the warp across multiple VPUs to use all cores, not just one.
      • When divergence is rare, cores can keep running largely in lock-step with minimal barrier overhead.
    • Cons:
      • Each divergence requires a full inter-core barrier and mask recalculation—costlier than intra-warp predication on a real GPU.
      • Tightly coupling control flow across cores may reduce spatial reuse if one lane group diverges heavily.
      • Code complexity increases: one must explicitly orchestrate tile ownership, synchronization, and mask calculation for each branching point.

6. Strategy 3: Adopt Native Spatial Programming (Avoid SIMT Semantics Altogether)

  1. Reframe Problem as Dataflow Over Core Grid
    • Instead of forcing a SIMT mindset, decompose the algorithm into independent “operators” and pipeline them across cores. For example, for a convolution kernel, assign each 2D tile of the output to a small sub-grid of cores (e.g., a 2×2 block). Each core handles only its tile’s multiplications and reductions, communicating partial sums to neighbors via the mesh network.
    • Tenstorrent’s TT-MLIR dialect natively understands meshShape and tensor layouts, so rather than thinking “threads in a warp,” think “operators on a 2D grid.” A toy host-side pipeline in this style is sketched below.
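    • A toy illustration of the operator-pipeline mindset: a “producer core” scales tiles and a “consumer core” reduces them, with a std::queue standing in for the mesh link between the two. None of this is Tenstorrent API; it only shows the dataflow structure.

      #include <cstdio>
      #include <queue>
      #include <vector>

      int main() {
          std::queue<std::vector<float>> link;         // producer -> consumer channel

          // Producer core: emit 4 tiles of 8 scaled elements each.
          for (int t = 0; t < 4; ++t) {
              std::vector<float> tile(8);
              for (int i = 0; i < 8; ++i) tile[i] = 2.0f * (t * 8 + i);
              link.push(tile);
          }

          // Consumer core: pop tiles as they arrive and reduce them.
          float acc = 0.0f;
          while (!link.empty()) {
              for (float v : link.front()) acc += v;
              link.pop();
          }
          std::printf("pipeline result = %.1f\n", acc);  // 2 * (0 + 1 + ... + 31) = 992
          return 0;
      }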
  2. Use Metalium for Custom Kernels
    • Write kernels in Metalium that directly issue vector and scalar instructions to each core. There is no warp abstraction; each core runs its own code path. If occasional synchronization is needed, insert explicit barrier() calls.
    • Use BUDA’s performance analyzer to visualize how different operators tile onto the core grid, then apply compiler “placement overrides” to fix any hot spots or data-reblocking inefficiencies.
  3. Advantages of Native Spatial Mapping
    • No need to emulate warp-level divergence; each core can naturally follow its own control flow and only coordinate when data dependencies demand it.
    • Memory and data movement are explicit, so there is greater predictability in latency and bandwidth usage.
    • The architecture shines for workloads that are naturally expressible as 2D/3D tensor partitions (CNNs, transformers).
  4. When to Avoid SIMT-Style Emulation
    • If your kernel has highly irregular control flow (e.g., graph algorithms, tree traversals), forcing warp semantics will introduce excessive barriers and mask management. In such cases, write each core’s code to handle a subset of elements independently and only synchronize at the end of large phases.

7. Putting It All Together: A Practical Roadmap

  1. Choose Your Level of Abstraction
    • High-Level (TT-MLIR/TT-NN): Let the Tenstorrent compiler decide how to map tensor operations to the core grid. Simply describe your computation in MLIR or use TT-NN for standard layers (convs, matmuls).
    • Metalium (Bare-Metal): Write custom kernels when you need fine-grained control (e.g., specialized GEMM microkernels). Decide if you really need SIMT semantics or if a dataflow approach is better.
  2. If You Must Emulate SIMT
    • Within One Core (Strategy 1): Keep “warps” within a VPU. Use vector masking for divergence. Recompile kernels for each VPU width (32-lane vs. 64-lane).
    • Across Multiple Cores (Strategy 2): Partition warps across a row or column of cores. Add explicit barriers to synchronize control flow. Distribute memory tiles carefully to ensure coalesced DMA.
  3. Testing & Performance Tuning
    • Use BUDA’s “placement report” to verify that your operators are balanced (i.e., each core does similar work) and that reblocking between producers and consumers is minimized.
    • Profile to check for stalls caused by excessive inter-core barriers. If barriers dominate, consider switching to a fully dataflow pattern where each core is more autonomous.
    • Tune vector unrolling, tile sizes, and the shape of core grids to maximize local reuse and minimize mesh traffic.

8. Summary

  • Tenstorrent does not provide a native SIMT warp scheduler; instead, you must either:
    1. Emulate warp-level parallelism inside a single VPU (using vector mask registers), or
    2. Partition a warp across multiple cores and manage synchronization explicitly, or
    3. Re-architect your algorithm in spatial/dataflow terms so that each core is given a distinct tile of work with minimal branching.
  • Which approach to pick depends on your application’s control-flow characteristics and performance priorities. For highly regular kernels (dense linear algebra, convolution), vector-lane emulation (Strategy 1) or, better yet, pure spatial tiling (Strategy 3) typically yields the best performance. For irregular kernels where warp divergence would be severe, avoid SIMT emulation and instead write independent code per core.

By following this roadmap—carefully selecting between single-core vector masking or multi-core synchronization, or better yet, native spatial tiling—you can successfully map a SIMT-style kernel onto Tenstorrent hardware and exploit its massive 2D core mesh for high throughput.