llm.istanbul·Study
TR EN
Workbench →

llm.istanbul — WGSL Kernel Studies

LLM training from scratch on WebGPU: a detailed analysis of every kernel written in WGSL.

One forward pass, one backward pass, one optimizer step — the entire training loop, line by line.

What is an LLM?

At its simplest, an LLM (Large Language Model) is a machine that predicts the next word. Think of the autocomplete on your phone keyboard — but on steroids. You type "the capital of France is" and it has learned to answer "Paris", because it has digested millions of sentences and soaked up the statistics of "after these words, this one most likely comes next". Chat, translation, writing code — all of it is built on top of that single skill: what comes next?

So how does it pull that off? First we chop the text into little pieces (tokens), then turn each token into a vector of numbers, as we see in 01 Embedding. Those vectors flow through a stack of transformer layers: in each layer the words look at one another and gather context (05 Attention), then each word "thinks" a little on its own (the FFN — 03 Linear, 06 Activation). At the very end the model produces, for every position, a probability distribution over "the next token might be this".

Here's the learning part: we compare the model's guess against the actual next word and measure how wrong it was with a single number (the loss07 Cross-Entropy). Then we push that error backward through the whole network (the backward pass) and, for every weight, work out "which way should I nudge you to lower the error?". The optimizer (12 AdamW) then moves each weight a hair in that direction. Repeat that millions of times and the model slowly "gets" language.

And the punchline: all of this grand machinery is really just a giant pile of matrix multiplies plus a handful of helper ops. No magic — just a lot of numbers being multiplied and summed in parallel on the GPU. These studies unpack exactly those ops — embedding lookup, normalization, matrix multiply, attention, activation, loss, gradients, optimizer — one at a time, kernel by kernel. So you'll find the answer to "how does an LLM actually work?" not in theory, but in every line of running code.

The ordering below follows the model's natural flow too: predicting a token (forward), measuring how wrong it was, and correcting (backward + optimizer).


This series is the WebGPU/WGSL counterpart of the MetalFoundry studies under docs/learning/. The algorithms are the same; the language and runtime differ:

DimensionMetalWGSL
RuntimeApple GPU (native)browser (Chrome/Edge)
LanguageMetal Shading LanguageWebGPU Shading Language
Memory modelunified (CPU+GPU)mostly-uniform
Threadgroup syntaxthreadgroupvar<workgroup>
Barrierthreadgroup_barrierworkgroupBarrier()
Atomicsatomic_* (built-in)atomic*<u32>, atomicAdd, etc.
f16 supportdefaultenable f16; extension
SubgroupSIMD-group nativeenable subgroups; extension

How Should It Be Read?

Every study follows the same structure:

  1. What Does It Do? — The kernel's purpose, in a single sentence
  2. Mathematical Definition — Formulas
  3. Bind Group ABI — What lives in which binding slot (in WGSL, @group(0) @binding(N))
  4. Dispatch Shape — Workgroup size and grid
  5. Line-by-Line Explanation — The code, piece by piece
  6. WGSL-Specific Notes — Differences from Metal, language features
  7. Code Review — Risk Analysis — Findings table (if any)
  8. Quick Checklist — Test coverage / sanity

The ordering follows the model's flow of "predicting" and "learning" a token.


0. Infrastructure — Preamble, Helpers, Constants

First of all: the contract and helper kernels that all kernels share.

#StudyFileSummary
00aShared Preamble00_shared.wgslThe engine injects this at the head of every kernel: WG size, reduction, NaN guard, char_class.
00bInfrastructure Kernels00_infrastructure.wgslGlue kernels: fill_zero, fill_const, scale, axpy, copy, clamp_inplace.

I. Forward Pass — From Token to Prediction

Input: a sequence of token IDs. Output: the answer to the question "what could the next token be?"

#StudyShaderSummary
01Embedding Lookup01_embedding.wgslToken ID → vector. f32 and f16 (_w16) variants.
02RMS Norm02_norm.wgslNormalize the vector by its L2 norm. Subgroup reduction.
03Linear Forward03_linear.wgsl (forward kernels)Y = X @ W — 64×64 double-buffered tile, vec4 loads, fused SwiGLU forward.
04RoPE04_rope.wgslRotate Q/K by position. Forward + backward (involution).
05Attention Forward05_attention.wgslOnline softmax + GQA + KV cache + decode + seg cross-document masking.
06Activation06_activation.wgslGeLU, SwiGLU combine (stable branched SiLU).
07Cross-Entropy Loss07_loss.wgslLog-softmax + NLL fused. The start of the backward pass.
08F16 Cast08_cast.wgslf32 ↔ f16. Standalone mixed precision conversion (the cast has been retired).

II. Backward Pass — From Prediction Error to Correction

Input: the gradient of the loss. Output: how much each weight needs to change.

#StudyShaderSummary
09Backward FFN09_backward_ffn.wgslGeLU and SwiGLU activation backward (stable sigmoid derivative).
10Backward Attention10_backward_attention.wgsl3 variants: streaming, split-short, split-dKdV + seg cross-document masking.
11Backward Linear03_linear.wgsl (backward kernels)dW = X^T @ dY, dX = dY @ W^T. matmul_t, _at, fused SwiGLU backward.

III. Optimizer — Weight Update

#StudyShaderSummary
12AdamW Optimizer12_optimizer.wgslMulti-tensor AdamW + fused fp16 mirroring (adamw_update_f16). fp32 + 8-bit variant.

IV. BPE Tokenizer — Word Splitting and Compression

A standalone GPU tokenizer pipeline that converts the raw text data fed into the language model into token IDs, and decodes and merges tokens back during inference.

#StudyShaderSummary
13BPE Tokenizer Pipelinebpe.wgslWord boundary detection, 2-level local hash count, subgroup-cooperative scan, and Trie-based greedy longest-match.

Pipeline Flow Diagram

                               ┌──── allocate (bir kere) ────┐
                               ▼                              │
tokens ──► 01 Embed ──► 02 Norm ──► 03 Linear ──► 04 RoPE ──┤
                                                              │
                                                   05 Attention
                                                              │
                                                   06 Activation (FFN)
                                                              │
                               ┌────── 02 Norm ◄── residual ◄┘
                               ▼
                      03 Linear (lm_head) ──► 07 Cross-Entropy
                                                   │  dLogits (fp32)
            ╔═══════════════════════════════════ │ ════════════════╗
            ║                BACKWARD PASS                          ║
            ║  11 Linear ← 09 FFN ← 10 Attn (split) ← 02 Norm bwd ║
            ║                                       ← 01 Embed bwd ║
            ╚═══════════════════════════════════════════════════════╝
                                  │
                       reduce_norm_sq → finalize_grad_stats
                                  │
                           12 AdamW (fused weight mirroring) ──► LOOP ↑

Totals

MetricValue
Shader files14 (00–13, merged with w16/8bit/BPE variants)
Study files15 + index
Pipeline coverage100%
Forward kernels11 unique
Backward kernels13 unique
Optimizer kernels5 unique (3 fp32, 1 8bit, 1 grad-stat)
Tokenizer kernels14 unique (training and inference)

A Few Differences Between WGSL and Metal

Things you need to get used to in WGSL:

1. enable directives in the preamble

WGSL "enable" directives must be at the head of the file. The engine injects enable f16;\nenable subgroups; at the very start of 00_shared.wgsl, then appends the rest of the preamble and the kernel code.

2. Workgroup size in the kernel attribute

wgsl
@compute @workgroup_size(256, 1, 1)
fn my_kernel(...) { ... }

In Metal, [[max_total_threads_per_threadgroup(N)]] is merely an optimization hint; in WGSL it is mandatory in the kernel signature.

3. Bind group + binding (instead of slot)

wgsl
@group(0) @binding(0) var<storage, read> X: array<f32>;
@group(0) @binding(1) var<storage, read_write> Y: array<f32>;

In Metal, [[buffer(N)]]. In WGSL there are two levels: @group (similar to a descriptor set) + @binding (slot). In our code we only use @group(0).

4. Atomic type wrap

wgsl
var<storage, read_write> counts: array<atomic<u32>>;
atomicAdd(&counts[i], 1u);

An atomic value must be tagged at the type level. In Metal, atomic_* operations work directly on ints.

5. f16 usage is limited

Even if you enable f16;, f16 operations are more limited. In our code f16 is used only as a storage type; computation is still f32 (cast-load → compute → cast-store).

6. Kernel splitting (// --- KERNEL: name ---)

WGSL can host more than one @compute kernel in a single file, but our engine splits each file by kernel markers and builds a separate pipeline state for each kernel. The reason: enable directives and var<workgroup> declarations do not leak into another kernel.

7. Subgroup operations

With enable subgroups;, cross-thread reductions such as subgroupAdd() and subgroupBroadcast() become available — but on Apple GPUs they sit behind a Chrome flag and a wave size = 32 is assumed. This is a runtime feature flag and depends on adapter support.


Notes

  • Bind group layout is not dynamic — each kernel's bind group layout is fixed at shader compile time; only the buffer bindings can be changed at dispatch time.
  • Multi-tensor AdamW difference — the Metal version also does multi-tensor, but WGSL requires sub-range binding support via entries[].resource.{offset, size}; this is a standard WebGPU feature.
  • No F16 backward — the backward pass is entirely fp32. Mixed precision is active only in the forward pass via the _w16 kernels.
  • Batch=1 guaranteed — because real batch support over WebGPU in the browser is hard, all kernels are designed for a single sequence; gradient accumulation is done host-side.

Source Code Location

bpe/src/llm/
├── engine.js              ← WebGPU device + pipeline compilation + bind helpers
├── model.js               ← Model class orchestrator (Object.assign wiring)
├── model/
│   ├── layout.js          ← paramLayout, sub-range, bufOf/offsetOf, FNV
│   ├── alloc.js           ← allocate (mega_w), allocActivations, allocBackward
│   ├── forward.js         ← _encodeLayerForward, forwardLoss
│   ├── backward.js        ← backward (split attn dispatch lives here)
│   ├── optim.js           ← multi-tensor AdamW step
│   ├── decode.js          ← allocKVCache, prefillCache, decodeOne, predictLogitsAt
│   └── checkpoint.js      ← export/import + setStepCount + destroy
└── shaders/               ← The subject of these studies

Next stop: 00_shared.md

WGSL kernel studies · an LLM from scratch on WebGPUBuilt in Istanbul by Uğur Toprakdeviren.