`matmul_t`, `matmul_t_acc`, `matmul_at`, `matmul_at_acc`, `matmul_at_swiglu_a`, `matmul_at_acc_swiglu_a` — Backward Matrix Multiply

File: 03_linear.wgsl (backward kernels) Pipeline step: The backbone of the backward pass. About 35% of the backward step time is spent in these kernels.

6 kernels:

matmul_t — Y = X @ W^T (for the dX computation)
matmul_t_acc — Y += X @ W^T (for gradient accumulation)
matmul_at — Y = X^T @ B (for the dW computation)
matmul_at_acc — Y += X^T @ B
matmul_at_swiglu_a — Y = (silu(GATE) * UP)^T @ B (FFN SwiGLU dW_down weight gradient computation)
matmul_at_acc_swiglu_a — Y += (silu(GATE) * UP)^T @ B

The _w16 mixed-precision variants (e.g., matmul_t_w16) live in the same file.

Wait, what is this?

Picture an audio mixing board. The incoming signal (X) runs through a bunch of knobs (the W weights), each channel blends in by some amount, and out comes the result (Y). That's exactly the forward pass: Y = X @ W. Then you listen to the output and go "nope, it's off by this much" — that error signal is dY.

Now you have to answer two separate questions on the way back, and they really are separate. First: "which way do I turn the knobs?" That's dW = X^T @ dY. For each knob, you line up the signal that fed into it with the error at the output, and you get "would nudging this knob down lower the error or raise it?" Second: "what do I report back to the person before me?" Because your input is really some other layer's output. That's dX = dY @ W^T — you push the error back through the knobs, toward the source.

Here's the intuitive bit: in the forward direction the multiply is a one-way blend. Going backward you transpose a different matrix for each question — X for dW, W for dX — because the two questions ask you to look at it from two different angles. In code that's just swapping the indices of a plain matrix multiply — but on a GPU the memory layout (who reads what, in what order) drives everything, so each transpose pattern earns its own kernel.

The 6 kernels you'll see below are the concrete form of those two questions plus a couple of practical variants (gradient accumulation, SwiGLU fusion). They all do the same core math; what differs is which matrix gets flipped and how memory is read.

Why 6 Separate Kernels?

While a single kernel (matmul) suffices for the forward pass Y = X @ W, the backward pass requires two distinct gradient computations:

dX = dY @ W^T — to propagate the error gradient to the previous layer (transposed weight)
dW = X^T @ dY — to compute the weight updates (transposed input)

Since these operations involve different transpose patterns, they cannot be computed with the forward kernel without disrupting the memory layout:

matmul:    Y[M, N] = X[M, K] @ W[K, N]      ← W normal
matmul_t:  Y[M, N] = X[M, K] @ W^T[N, K]    ← W transposed
matmul_at: Y[K, N] = X^T[M, K]^T @ B[M, N]  ← X transposed

The _acc variants provide += semantics for gradient accumulation.

The _swiglu_a variants, on the other hand, are a perfect fit with the FFN SwiGLU forward fusion. Since the hidden activation matrix (silu(gate)*up) is never materialized in memory during the forward pass, when computing the dW_down gradient in the backward layer this kernel loads the gate and up inputs directly, deriving the silu(gate)*up combination on-the-fly (at compute time) to complete the matrix multiply. Writing the intermediate matrix to memory and reading it back is fully avoided.

Vectorized Loading and Conflict-Free Addressing (Vec4 & Coalesced Transpose Load)

Vec4 Pass: Inputs are bound as array<vec4<f32>> so that 4 floats are fetched in a single 16-byte instruction. The constraints K % 4 == 0 and N % 4 == 0 apply.
Coalesced Load Swap: In the transpose matrix multiply, to prevent consecutive threads from making strides (jumps) across memory, the load index mappings (aIm, aIk) are swapped:
- matmul_at: performs the inner reduction over X^T (along the M axis). Thanks to the swap, consecutive threads load consecutive M values at a fixed K (consecutive 4-wide row reads instead of stride-K jumps). This fills the hardware L1 cache line in a fully coalesced manner (2-4 cache line loads per warp instead of 32).

Bind Group ABI

`matmul_t` (4 bindings)

Binding	Type	Detail
0	`storage, read`	`X: array<vec4<f32>>` — `[M × K/4]`
1	`storage, read`	`W: array<vec4<f32>>` — `[N × K/4]` (transposed view)
2	`storage, read_write`	`Y: array<f32>` — `[M × N]`
3	`uniform`	`dims: vec4<u32>` — `(M, N, K, _)`

`matmul_at` (4 bindings)

Binding	Type	Detail
0	`storage, read`	`X: array<vec4<f32>>` — `[M × K/4]`
1	`storage, read`	`B: array<vec4<f32>>` — `[M × N/4]` (dY upstream gradient)
2	`storage, read_write`	`Y: array<f32>` — `[K × N]` (dW output)
3	`uniform`	`dims: vec4<u32>` — `(K, N, M, _)`

`matmul_at_swiglu_a` (5 bindings)

Binding	Type	Detail
0	`storage, read`	`GATE_mas: array<vec4<f32>>` — forward gate output
1	`storage, read`	`UP_mas: array<vec4<f32>>` — forward up output
2	`storage, read`	`B_mas: array<vec4<f32>>` — dY upstream gradient
3	`storage, read_write`	`Y_mas: array<f32>` — `dW_down` output
4	`uniform`	`dims_mas: vec4<u32>` — `(K, N, M, _)`

Dispatch Shape

workgroup_size: (16, 16, 1) → 256 threads
matmul_t / matmul_t_acc:   grid = (ceil(N/64), ceil(M/64), 1)
matmul_at / matmul_at_swiglu_a: grid = (ceil(N/64), ceil(K/64), 1)

`matmul_at` — The Most Expensive Backward Kernel

In profile snapshots, matmul_at is the heaviest kernel:

Cost: About 18.4% of the total backward step time is spent in this kernel group alone.
FFN Dimensions: It becomes a bottleneck especially in the backward gemm operations of the FFN weight matrices (768×3072). SwiGLU fusion (matmul_at_swiglu_a) dramatically reduces this cost by eliminating the memory transfers.

Line by Line — `matmul_at_swiglu_a` Coalescing and Computation

The inner reduction over M and the on-the-fly SwiGLU resolution in the matmul_at_swiglu_a kernel:

1) Index Swap in the A-Tile Coalesced Load (Swap)

wgsl

    {
        let aI0 = tid * 4u;
        let aIm = aI0 / TM; let aIk = aI0 % TM;
        let kG = block_row + aIk; let mG = aIm;
        let m_in = mG < M;
        // gate ve up verilerini okur
        let gv = GATE_mas[mG * K4 + kG / 4u];
        let uv = UP_mas  [mG * K4 + kG / 4u];
        // SwiGLU'yu hesaplayarak A tile'ına coalesced (ardışık) yazar
        tileA_db[(aIk + 0u) * TK_PAD + aIm] = select(0.0, silu(gv.x) * uv.x, m_in && (kG + 0u) < K);
        tileA_db[(aIk + 1u) * TK_PAD + aIm] = select(0.0, silu(gv.y) * uv.y, m_in && (kG + 1u) < K);
        tileA_db[(aIk + 2u) * TK_PAD + aIm] = select(0.0, silu(gv.z) * uv.z, m_in && (kG + 2u) < K);
        tileA_db[(aIk + 3u) * TK_PAD + aIm] = select(0.0, silu(gv.w) * uv.w, m_in && (kG + 3u) < K);

2) Inner Reduction and FMA Loop

The inner dimension of the loop is over M (number of rows):

wgsl

        for (var m: u32 = 0u; m < TK; m = m + 1u) {
            let a0 = tileA_db[cur_a_off + (4u * ty + 0u) * TK_PAD + m];
            let a1 = tileA_db[cur_a_off + (4u * ty + 1u) * TK_PAD + m];
            let a2 = tileA_db[cur_a_off + (4u * ty + 2u) * TK_PAD + m];
            let a3 = tileA_db[cur_a_off + (4u * ty + 3u) * TK_PAD + m];
            let b0 = tileB_db[cur_b_off + m * TN + (4u * tx + 0u)];
            let b1 = tileB_db[cur_b_off + m * TN + (4u * tx + 1u)];
            let b2 = tileB_db[cur_b_off + m * TN + (4u * tx + 2u)];
            let b3 = tileB_db[cur_b_off + m * TN + (4u * tx + 3u)];
            acc00 = fma(a0, b0, acc00); acc01 = fma(a0, b1, acc01); acc02 = fma(a0, b2, acc02); acc03 = fma(a0, b3, acc03);
            acc10 = fma(a1, b0, acc10); acc11 = fma(a1, b1, acc11); acc12 = fma(a1, b2, acc12); acc13 = fma(a1, b3, acc13);
            acc20 = fma(a2, b0, acc20); acc21 = fma(a2, b1, acc21); acc22 = fma(a2, b2, acc22); acc23 = fma(a2, b3, acc23);
            acc30 = fma(a3, b0, acc30); acc31 = fma(a3, b1, acc31); acc32 = fma(a3, b2, acc32); acc33 = fma(a3, b3, acc33);
        }

Code Review

Finding 1: Backward FP32 Precision Is Mandatory

Risk	Description
🟢 architecture	fp16 is not used in the backward step. If gradients were fp16, the Adam moments would accumulate noisily, breaking training stability. By design: while forward uses f16 storage, the backward and optimizer steps are pure fp32.

Quick Checklist

Test Scenario	Status
Are the `matmul_at` and `matmul_at_swiglu_a` output dimensions correct?	✅ `[K × N]` (dW output)
Does the coalesced swap work at the hardware level?	✅ Verified with Apple Instruments
Did the SwiGLU backward fusion remove the need for an intermediate `hidden`?	✅ Yes, resolved directly from the gate and up data

12_optimizer.md — AdamW. Multi-tensor mega-buffer and fused fp16 mirroring optimizations. The final step of the pipeline.

#matmul_t, matmul_t_acc, matmul_at, matmul_at_acc, matmul_at_swiglu_a, matmul_at_acc_swiglu_a — Backward Matrix Multiply

#Wait, what is this?

#Why 6 Separate Kernels?

#Vectorized Loading and Conflict-Free Addressing (Vec4 & Coalesced Transpose Load)

#Bind Group ABI

#matmul_t (4 bindings)

#matmul_at (4 bindings)

#matmul_at_swiglu_a (5 bindings)

#Dispatch Shape

#matmul_at — The Most Expensive Backward Kernel

#Line by Line — matmul_at_swiglu_a Coalescing and Computation

#1) Index Swap in the A-Tile Coalesced Load (Swap)

#2) Inner Reduction and FMA Loop

#Code Review

#Finding 1: Backward FP32 Precision Is Mandatory

#Quick Checklist

#Next