llm.istanbul — WGSL Kernel Studies

LLM training from scratch on WebGPU: a detailed analysis of every kernel written in WGSL.

One forward pass, one backward pass, one optimizer step — the entire training loop, line by line.

What is an LLM?

At its simplest, an LLM (Large Language Model) is a machine that predicts the next word. Think of the autocomplete on your phone keyboard — but on steroids. You type "the capital of France is" and it has learned to answer "Paris", because it has digested millions of sentences and soaked up the statistics of "after these words, this one most likely comes next". Chat, translation, writing code — all of it is built on top of that single skill: what comes next?

So how does it pull that off? First we chop the text into little pieces (tokens), then turn each token into a vector of numbers, as we see in 01 Embedding. Those vectors flow through a stack of transformer layers: in each layer the words look at one another and gather context (05 Attention), then each word "thinks" a little on its own (the FFN — 03 Linear, 06 Activation). At the very end the model produces, for every position, a probability distribution over "the next token might be this".

Here's the learning part: we compare the model's guess against the actual next word and measure how wrong it was with a single number (the loss — 07 Cross-Entropy). Then we push that error backward through the whole network (the backward pass) and, for every weight, work out "which way should I nudge you to lower the error?". The optimizer (12 AdamW) then moves each weight a hair in that direction. Repeat that millions of times and the model slowly "gets" language.

And the punchline: all of this grand machinery is really just a giant pile of matrix multiplies plus a handful of helper ops. No magic — just a lot of numbers being multiplied and summed in parallel on the GPU. These studies unpack exactly those ops — embedding lookup, normalization, matrix multiply, attention, activation, loss, gradients, optimizer — one at a time, kernel by kernel. So you'll find the answer to "how does an LLM actually work?" not in theory, but in every line of running code.

The ordering below follows the model's natural flow too: predicting a token (forward), measuring how wrong it was, and correcting (backward + optimizer).

This series is the WebGPU/WGSL counterpart of the MetalFoundry studies under docs/learning/. The algorithms are the same; the language and runtime differ:

Dimension	Metal	WGSL
Runtime	Apple GPU (native)	browser (Chrome/Edge)
Language	Metal Shading Language	WebGPU Shading Language
Memory model	unified (CPU+GPU)	mostly-uniform
Threadgroup syntax	`threadgroup`	`var<workgroup>`
Barrier	`threadgroup_barrier`	`workgroupBarrier()`
Atomics	`atomic_*` (built-in)	`atomic*<u32>`, `atomicAdd`, etc.
f16 support	default	`enable f16;` extension
Subgroup	SIMD-group native	`enable subgroups;` extension

How Should It Be Read?

Every study follows the same structure:

What Does It Do? — The kernel's purpose, in a single sentence
Mathematical Definition — Formulas
Bind Group ABI — What lives in which binding slot (in WGSL, @group(0) @binding(N))
Dispatch Shape — Workgroup size and grid
Line-by-Line Explanation — The code, piece by piece
WGSL-Specific Notes — Differences from Metal, language features
Code Review — Risk Analysis — Findings table (if any)
Quick Checklist — Test coverage / sanity

The ordering follows the model's flow of "predicting" and "learning" a token.

0. Infrastructure — Preamble, Helpers, Constants

First of all: the contract and helper kernels that all kernels share.

#	Study	File	Summary
00a	Shared Preamble	00_shared.wgsl	The engine injects this at the head of every kernel: WG size, reduction, NaN guard, char_class.
00b	Infrastructure Kernels	00_infrastructure.wgsl	Glue kernels: `fill_zero`, `fill_const`, `scale`, `axpy`, `copy`, `clamp_inplace`.

I. Forward Pass — From Token to Prediction

Input: a sequence of token IDs. Output: the answer to the question "what could the next token be?"

#	Study	Shader	Summary
01	Embedding Lookup	01_embedding.wgsl	Token ID → vector. f32 and f16 (`_w16`) variants.
02	RMS Norm	02_norm.wgsl	Normalize the vector by its L2 norm. Subgroup reduction.
03	Linear Forward	03_linear.wgsl (forward kernels)	`Y = X @ W` — 64×64 double-buffered tile, vec4 loads, fused SwiGLU forward.
04	RoPE	04_rope.wgsl	Rotate Q/K by position. Forward + backward (involution).
05	Attention Forward	05_attention.wgsl	Online softmax + GQA + KV cache + decode + seg cross-document masking.
06	Activation	06_activation.wgsl	GeLU, SwiGLU combine (stable branched SiLU).
07	Cross-Entropy Loss	07_loss.wgsl	Log-softmax + NLL fused. The start of the backward pass.
08	F16 Cast	08_cast.wgsl	f32 ↔ f16. Standalone mixed precision conversion (the cast has been retired).

II. Backward Pass — From Prediction Error to Correction

Input: the gradient of the loss. Output: how much each weight needs to change.

#	Study	Shader	Summary
09	Backward FFN	09_backward_ffn.wgsl	GeLU and SwiGLU activation backward (stable sigmoid derivative).
10	Backward Attention	10_backward_attention.wgsl	3 variants: streaming, split-short, split-dKdV + seg cross-document masking.
11	Backward Linear	03_linear.wgsl (backward kernels)	`dW = X^T @ dY`, `dX = dY @ W^T`. matmul_t, _at, fused SwiGLU backward.

III. Optimizer — Weight Update

#	Study	Shader	Summary
12	AdamW Optimizer	12_optimizer.wgsl	Multi-tensor AdamW + fused fp16 mirroring (`adamw_update_f16`). fp32 + 8-bit variant.

IV. BPE Tokenizer — Word Splitting and Compression

A standalone GPU tokenizer pipeline that converts the raw text data fed into the language model into token IDs, and decodes and merges tokens back during inference.

#	Study	Shader	Summary
13	BPE Tokenizer Pipeline	bpe.wgsl	Word boundary detection, 2-level local hash count, subgroup-cooperative scan, and Trie-based greedy longest-match.

Pipeline Flow Diagram

                               ┌──── allocate (bir kere) ────┐
                               ▼                              │
tokens ──► 01 Embed ──► 02 Norm ──► 03 Linear ──► 04 RoPE ──┤
                                                              │
                                                   05 Attention
                                                              │
                                                   06 Activation (FFN)
                                                              │
                               ┌────── 02 Norm ◄── residual ◄┘
                               ▼
                      03 Linear (lm_head) ──► 07 Cross-Entropy
                                                   │  dLogits (fp32)
            ╔═══════════════════════════════════ │ ════════════════╗
            ║                BACKWARD PASS                          ║
            ║  11 Linear ← 09 FFN ← 10 Attn (split) ← 02 Norm bwd ║
            ║                                       ← 01 Embed bwd ║
            ╚═══════════════════════════════════════════════════════╝
                                  │
                       reduce_norm_sq → finalize_grad_stats
                                  │
                           12 AdamW (fused weight mirroring) ──► LOOP ↑

Totals

Metric	Value
Shader files	14 (00–13, merged with w16/8bit/BPE variants)
Study files	15 + index
Pipeline coverage	100%
Forward kernels	11 unique
Backward kernels	13 unique
Optimizer kernels	5 unique (3 fp32, 1 8bit, 1 grad-stat)
Tokenizer kernels	14 unique (training and inference)

A Few Differences Between WGSL and Metal

Things you need to get used to in WGSL:

1. `enable` directives in the preamble

WGSL "enable" directives must be at the head of the file. The engine injects enable f16;\nenable subgroups; at the very start of 00_shared.wgsl, then appends the rest of the preamble and the kernel code.

2. Workgroup size in the kernel attribute

wgsl

@compute @workgroup_size(256, 1, 1)
fn my_kernel(...) { ... }

In Metal, [[max_total_threads_per_threadgroup(N)]] is merely an optimization hint; in WGSL it is mandatory in the kernel signature.

3. Bind group + binding (instead of slot)

wgsl

@group(0) @binding(0) var<storage, read> X: array<f32>;
@group(0) @binding(1) var<storage, read_write> Y: array<f32>;

In Metal, [[buffer(N)]]. In WGSL there are two levels: @group (similar to a descriptor set) + @binding (slot). In our code we only use @group(0).

4. Atomic type wrap

wgsl

var<storage, read_write> counts: array<atomic<u32>>;
atomicAdd(&counts[i], 1u);

An atomic value must be tagged at the type level. In Metal, atomic_* operations work directly on ints.

5. f16 usage is limited

Even if you enable f16;, f16 operations are more limited. In our code f16 is used only as a storage type; computation is still f32 (cast-load → compute → cast-store).

6. Kernel splitting (`// --- KERNEL: name ---`)

WGSL can host more than one @compute kernel in a single file, but our engine splits each file by kernel markers and builds a separate pipeline state for each kernel. The reason: enable directives and var<workgroup> declarations do not leak into another kernel.

7. Subgroup operations

With enable subgroups;, cross-thread reductions such as subgroupAdd() and subgroupBroadcast() become available — but on Apple GPUs they sit behind a Chrome flag and a wave size = 32 is assumed. This is a runtime feature flag and depends on adapter support.

Notes

Bind group layout is not dynamic — each kernel's bind group layout is fixed at shader compile time; only the buffer bindings can be changed at dispatch time.
Multi-tensor AdamW difference — the Metal version also does multi-tensor, but WGSL requires sub-range binding support via entries[].resource.{offset, size}; this is a standard WebGPU feature.
No F16 backward — the backward pass is entirely fp32. Mixed precision is active only in the forward pass via the _w16 kernels.
Batch=1 guaranteed — because real batch support over WebGPU in the browser is hard, all kernels are designed for a single sequence; gradient accumulation is done host-side.

Source Code Location

bpe/src/llm/
├── engine.js              ← WebGPU device + pipeline compilation + bind helpers
├── model.js               ← Model class orchestrator (Object.assign wiring)
├── model/
│   ├── layout.js          ← paramLayout, sub-range, bufOf/offsetOf, FNV
│   ├── alloc.js           ← allocate (mega_w), allocActivations, allocBackward
│   ├── forward.js         ← _encodeLayerForward, forwardLoss
│   ├── backward.js        ← backward (split attn dispatch lives here)
│   ├── optim.js           ← multi-tensor AdamW step
│   ├── decode.js          ← allocKVCache, prefillCache, decodeOne, predictLogitsAt
│   └── checkpoint.js      ← export/import + setStepCount + destroy
└── shaders/               ← The subject of these studies

Next stop: 00_shared.md →

#llm.istanbul — WGSL Kernel Studies

#What is an LLM?

#How Should It Be Read?

#0. Infrastructure — Preamble, Helpers, Constants

#I. Forward Pass — From Token to Prediction

#II. Backward Pass — From Prediction Error to Correction

#III. Optimizer — Weight Update

#IV. BPE Tokenizer — Word Splitting and Compression

#Pipeline Flow Diagram

#Totals

#A Few Differences Between WGSL and Metal

#1. enable directives in the preamble

#2. Workgroup size in the kernel attribute

#3. Bind group + binding (instead of slot)

#4. Atomic type wrap

#5. f16 usage is limited

#6. Kernel splitting (// --- KERNEL: name ---)

#7. Subgroup operations

#Notes

#Source Code Location

llm.istanbul — WGSL Kernel Studies

What is an LLM?

How Should It Be Read?

0. Infrastructure — Preamble, Helpers, Constants

I. Forward Pass — From Token to Prediction

II. Backward Pass — From Prediction Error to Correction

III. Optimizer — Weight Update

IV. BPE Tokenizer — Word Splitting and Compression

Pipeline Flow Diagram

Totals

A Few Differences Between WGSL and Metal

1. `enable` directives in the preamble

2. Workgroup size in the kernel attribute

3. Bind group + binding (instead of slot)

4. Atomic type wrap

5. f16 usage is limited

6. Kernel splitting (`// --- KERNEL: name ---`)

7. Subgroup operations

Notes

Source Code Location