llm.istanbul — WGSL Kernel Studies
LLM training from scratch on WebGPU: a detailed analysis of every kernel written in WGSL.
One forward pass, one backward pass, one optimizer step — the entire training loop, line by line.
What is an LLM?
At its simplest, an LLM (Large Language Model) is a machine that predicts the next word. Think of the autocomplete on your phone keyboard — but on steroids. You type "the capital of France is" and it has learned to answer "Paris", because it has digested millions of sentences and soaked up the statistics of "after these words, this one most likely comes next". Chat, translation, writing code — all of it is built on top of that single skill: what comes next?
So how does it pull that off? First we chop the text into little pieces (tokens), then turn each token into a vector of numbers, as we see in 01 Embedding. Those vectors flow through a stack of transformer layers: in each layer the words look at one another and gather context (05 Attention), then each word "thinks" a little on its own (the FFN — 03 Linear, 06 Activation). At the very end the model produces, for every position, a probability distribution over "the next token might be this".
Here's the learning part: we compare the model's guess against the actual next word and measure how wrong it was with a single number (the loss — 07 Cross-Entropy). Then we push that error backward through the whole network (the backward pass) and, for every weight, work out "which way should I nudge you to lower the error?". The optimizer (12 AdamW) then moves each weight a hair in that direction. Repeat that millions of times and the model slowly "gets" language.
And the punchline: all of this grand machinery is really just a giant pile of matrix multiplies plus a handful of helper ops. No magic — just a lot of numbers being multiplied and summed in parallel on the GPU. These studies unpack exactly those ops — embedding lookup, normalization, matrix multiply, attention, activation, loss, gradients, optimizer — one at a time, kernel by kernel. So you'll find the answer to "how does an LLM actually work?" not in theory, but in every line of running code.
The ordering below follows the model's natural flow too: predicting a token (forward), measuring how wrong it was, and correcting (backward + optimizer).
This series is the WebGPU/WGSL counterpart of the MetalFoundry studies under docs/learning/. The algorithms are the same; the language and runtime differ:
| Dimension | Metal | WGSL |
|---|---|---|
| Runtime | Apple GPU (native) | browser (Chrome/Edge) |
| Language | Metal Shading Language | WebGPU Shading Language |
| Memory model | unified (CPU+GPU) | mostly-uniform |
| Threadgroup syntax | threadgroup | var<workgroup> |
| Barrier | threadgroup_barrier | workgroupBarrier() |
| Atomics | atomic_* (built-in) | atomic*<u32>, atomicAdd, etc. |
| f16 support | default | enable f16; extension |
| Subgroup | SIMD-group native | enable subgroups; extension |
How Should It Be Read?
Every study follows the same structure:
- What Does It Do? — The kernel's purpose, in a single sentence
- Mathematical Definition — Formulas
- Bind Group ABI — What lives in which binding slot (in WGSL,
@group(0) @binding(N)) - Dispatch Shape — Workgroup size and grid
- Line-by-Line Explanation — The code, piece by piece
- WGSL-Specific Notes — Differences from Metal, language features
- Code Review — Risk Analysis — Findings table (if any)
- Quick Checklist — Test coverage / sanity
The ordering follows the model's flow of "predicting" and "learning" a token.
0. Infrastructure — Preamble, Helpers, Constants
First of all: the contract and helper kernels that all kernels share.
| # | Study | File | Summary |
|---|---|---|---|
| 00a | Shared Preamble | 00_shared.wgsl | The engine injects this at the head of every kernel: WG size, reduction, NaN guard, char_class. |
| 00b | Infrastructure Kernels | 00_infrastructure.wgsl | Glue kernels: fill_zero, fill_const, scale, axpy, copy, clamp_inplace. |
I. Forward Pass — From Token to Prediction
Input: a sequence of token IDs. Output: the answer to the question "what could the next token be?"
| # | Study | Shader | Summary |
|---|---|---|---|
| 01 | Embedding Lookup | 01_embedding.wgsl | Token ID → vector. f32 and f16 (_w16) variants. |
| 02 | RMS Norm | 02_norm.wgsl | Normalize the vector by its L2 norm. Subgroup reduction. |
| 03 | Linear Forward | 03_linear.wgsl (forward kernels) | Y = X @ W — 64×64 double-buffered tile, vec4 loads, fused SwiGLU forward. |
| 04 | RoPE | 04_rope.wgsl | Rotate Q/K by position. Forward + backward (involution). |
| 05 | Attention Forward | 05_attention.wgsl | Online softmax + GQA + KV cache + decode + seg cross-document masking. |
| 06 | Activation | 06_activation.wgsl | GeLU, SwiGLU combine (stable branched SiLU). |
| 07 | Cross-Entropy Loss | 07_loss.wgsl | Log-softmax + NLL fused. The start of the backward pass. |
| 08 | F16 Cast | 08_cast.wgsl | f32 ↔ f16. Standalone mixed precision conversion (the cast has been retired). |
II. Backward Pass — From Prediction Error to Correction
Input: the gradient of the loss. Output: how much each weight needs to change.
| # | Study | Shader | Summary |
|---|---|---|---|
| 09 | Backward FFN | 09_backward_ffn.wgsl | GeLU and SwiGLU activation backward (stable sigmoid derivative). |
| 10 | Backward Attention | 10_backward_attention.wgsl | 3 variants: streaming, split-short, split-dKdV + seg cross-document masking. |
| 11 | Backward Linear | 03_linear.wgsl (backward kernels) | dW = X^T @ dY, dX = dY @ W^T. matmul_t, _at, fused SwiGLU backward. |
III. Optimizer — Weight Update
| # | Study | Shader | Summary |
|---|---|---|---|
| 12 | AdamW Optimizer | 12_optimizer.wgsl | Multi-tensor AdamW + fused fp16 mirroring (adamw_update_f16). fp32 + 8-bit variant. |
IV. BPE Tokenizer — Word Splitting and Compression
A standalone GPU tokenizer pipeline that converts the raw text data fed into the language model into token IDs, and decodes and merges tokens back during inference.
| # | Study | Shader | Summary |
|---|---|---|---|
| 13 | BPE Tokenizer Pipeline | bpe.wgsl | Word boundary detection, 2-level local hash count, subgroup-cooperative scan, and Trie-based greedy longest-match. |
Pipeline Flow Diagram
┌──── allocate (bir kere) ────┐
▼ │
tokens ──► 01 Embed ──► 02 Norm ──► 03 Linear ──► 04 RoPE ──┤
│
05 Attention
│
06 Activation (FFN)
│
┌────── 02 Norm ◄── residual ◄┘
▼
03 Linear (lm_head) ──► 07 Cross-Entropy
│ dLogits (fp32)
╔═══════════════════════════════════ │ ════════════════╗
║ BACKWARD PASS ║
║ 11 Linear ← 09 FFN ← 10 Attn (split) ← 02 Norm bwd ║
║ ← 01 Embed bwd ║
╚═══════════════════════════════════════════════════════╝
│
reduce_norm_sq → finalize_grad_stats
│
12 AdamW (fused weight mirroring) ──► LOOP ↑Totals
| Metric | Value |
|---|---|
| Shader files | 14 (00–13, merged with w16/8bit/BPE variants) |
| Study files | 15 + index |
| Pipeline coverage | 100% |
| Forward kernels | 11 unique |
| Backward kernels | 13 unique |
| Optimizer kernels | 5 unique (3 fp32, 1 8bit, 1 grad-stat) |
| Tokenizer kernels | 14 unique (training and inference) |
A Few Differences Between WGSL and Metal
Things you need to get used to in WGSL:
1. enable directives in the preamble
WGSL "enable" directives must be at the head of the file. The engine injects enable f16;\nenable subgroups; at the very start of 00_shared.wgsl, then appends the rest of the preamble and the kernel code.
2. Workgroup size in the kernel attribute
@compute @workgroup_size(256, 1, 1)
fn my_kernel(...) { ... }In Metal, [[max_total_threads_per_threadgroup(N)]] is merely an optimization hint; in WGSL it is mandatory in the kernel signature.
3. Bind group + binding (instead of slot)
@group(0) @binding(0) var<storage, read> X: array<f32>;
@group(0) @binding(1) var<storage, read_write> Y: array<f32>;In Metal, [[buffer(N)]]. In WGSL there are two levels: @group (similar to a descriptor set) + @binding (slot). In our code we only use @group(0).
4. Atomic type wrap
var<storage, read_write> counts: array<atomic<u32>>;
atomicAdd(&counts[i], 1u);An atomic value must be tagged at the type level. In Metal, atomic_* operations work directly on ints.
5. f16 usage is limited
Even if you enable f16;, f16 operations are more limited. In our code f16 is used only as a storage type; computation is still f32 (cast-load → compute → cast-store).
6. Kernel splitting (// --- KERNEL: name ---)
WGSL can host more than one @compute kernel in a single file, but our engine splits each file by kernel markers and builds a separate pipeline state for each kernel. The reason: enable directives and var<workgroup> declarations do not leak into another kernel.
7. Subgroup operations
With enable subgroups;, cross-thread reductions such as subgroupAdd() and subgroupBroadcast() become available — but on Apple GPUs they sit behind a Chrome flag and a wave size = 32 is assumed. This is a runtime feature flag and depends on adapter support.
Notes
- Bind group layout is not dynamic — each kernel's bind group layout is fixed at shader compile time; only the buffer bindings can be changed at dispatch time.
- Multi-tensor AdamW difference — the Metal version also does multi-tensor, but WGSL requires sub-range binding support via
entries[].resource.{offset, size}; this is a standard WebGPU feature. - No F16 backward — the backward pass is entirely fp32. Mixed precision is active only in the forward pass via the
_w16kernels. - Batch=1 guaranteed — because real batch support over WebGPU in the browser is hard, all kernels are designed for a single sequence; gradient accumulation is done host-side.
Source Code Location
bpe/src/llm/
├── engine.js ← WebGPU device + pipeline compilation + bind helpers
├── model.js ← Model class orchestrator (Object.assign wiring)
├── model/
│ ├── layout.js ← paramLayout, sub-range, bufOf/offsetOf, FNV
│ ├── alloc.js ← allocate (mega_w), allocActivations, allocBackward
│ ├── forward.js ← _encodeLayerForward, forwardLoss
│ ├── backward.js ← backward (split attn dispatch lives here)
│ ├── optim.js ← multi-tensor AdamW step
│ ├── decode.js ← allocKVCache, prefillCache, decodeOne, predictLogitsAt
│ └── checkpoint.js ← export/import + setStepCount + destroy
└── shaders/ ← The subject of these studiesNext stop: 00_shared.md →