00_shared.wgsl — Shared Preamble (The Base of All Kernels)
File: 00_shared.wgsl Pipeline step: none — This is a preamble file, not a kernel. The engine injects it at the start of every kernel module.
Wait, what is this?
Picture a project with 30 separate files, each one starting with the exact same #defines, the same helper functions, the same constants. You wouldn't copy-paste all of it into every file by hand — you'd write a prelude.h and #include it at the top, done. This file is exactly that "prelude", with one catch: WGSL has no #include.
So the engine pulls a crude-but-effective trick: right before compiling each kernel, it pastes the contents of this file, as plain text, onto the front of the kernel code. So when you call wg_reduce_sum, you haven't imported or linked anything — that function has physically been spliced into your file. It behaves as if it were a language built-in, but it's really shared text copied into every kernel separately.
What's inside? A handful of constants (WG=256, NEG_INF, and friends), a couple of tiny helpers (flat_id collapses a 2D dispatch into a single linear index, nan_guard zeroes out blown-up numbers), and the real heavy lifting that every kernel leans on: subgroup-accelerated workgroup reductions (wg_reduce_sum, wg_reduce_max). These take "boil 256 threads' values down to one sum/max" and do it using the GPU's SIMD lanes, about 50% faster than a classic tree reduction.
Here's the interesting part: because this "paste it on top" approach is the only valid route, it spawns a bunch of subtle rules — the file must carry no kernel marker, enable directives must sit at the very top, subgroup functions must be called under "uniform control flow". Each of those gets unpacked below.
What Does It Do?
It defines the common preamble shared by all WGSL kernels. Before each kernel is compiled, the engine prepends the contents of 00_shared.wgsl to the very top. So everything defined here:
- Constants (
WG,NEG_INF,SUBGROUP_SIZE,NUM_SUBGROUPS) - Helper functions (
flat_id,is_finite,nan_guard,wg_reduce_sum,wg_reduce_max) - Shared workgroup memory (
sh_red)
…is visible inside every kernel.
Important rule: This file must not contain a kernel marker (
// --- KERNEL: name ---). If it does, the engine subjects this file to kernel splitting and the preamble property breaks.
Engine Side — How Is It Injected?
The init() function in engine.js:
let sharedPreamble = await loadShader(SHARED_PREAMBLE_PATH, import.meta.url);
if (/^\/\/ --- KERNEL:/m.test(sharedPreamble)) {
throw new Error(`${SHARED_PREAMBLE_PATH} must not contain kernel markers`);
}
sharedPreamble = 'enable f16;\nenable subgroups;\n' + sharedPreamble;So the final source of each kernel ends up like this:
enable f16;
enable subgroups;
// ─── 00_shared.wgsl içeriği ───
const WG: u32 = 256u;
fn wg_reduce_sum(tid: u32, val: f32) -> f32 { ... }
// ... (rest of preamble)
// ─── per-file preamble (e.g., constants from 03_linear.wgsl) ───
const TM: u32 = 64u;
var<workgroup> tileA: array<f32, 1088>;
// ─── kernel marker'dan sonraki kod ───
@compute @workgroup_size(...)
fn matmul(...) { ... }This is the only valid method in WGSL, which has no "include" system — string concatenation.
Contents
A. Constants
const WG: u32 = 256u;
const MAX_WG_DIM: u32 = 65535u;
const NEG_INF: f32 = -3.4028234e38;
const F32_MAX: f32 = 3.4028234e38;
const SUBGROUP_SIZE: u32 = 32u;
const NUM_SUBGROUPS: u32 = 8u; // WG / SUBGROUP_SIZE| Constant | Value | Description |
|---|---|---|
WG | 256 | Workgroup size (threads per WG). Used in all 1D dispatch kernels. |
MAX_WG_DIM | 65535 | The max workgroup count WebGPU allows in 1D. If exceeded, falls back to 2D (see flat_id). |
NEG_INF | -3.40e38 | For masking in softmax/attention. A large negative literal instead of ±Inf — WGSL has no f32::MIN literal. |
F32_MAX | +3.40e38 | Reference for is_finite. WGSL has no isInf, so we do our own finite-check. |
SUBGROUP_SIZE | 32 | SIMD lane count. Guaranteed 32 on Apple/NVIDIA/Intel/AMD-RDNA (Chrome subgroups feature). |
NUM_SUBGROUPS | 8 | WG / SUBGROUP_SIZE. How many subgroups are in a workgroup. |
Why is SUBGROUP_SIZE a constant? Even though WebGPU has the
subgroupSizeruntime built-in, our code assumes 32 at compile time. Correct for Apple/NVIDIA. AMD RDNA also reports 32 in Chrome (CDNA is 64, but CDNA isn't available in Chrome). If you ever run on a 64-lane GPU someday, the reduction count would be wrong — this is a deliberate hardware constraint.
B. flat_id — Linearization for 2D dispatch fallback
fn flat_id(gid: vec3<u32>, nwg: vec3<u32>) -> u32 {
return gid.x + gid.y * nwg.x * WG;
}Why does it exist? WebGPU's dispatchWorkgroups(x, y, z) gives a max of 65535 workgroups in 1D. On a 200M-token corpus (200M / 256 = 781K WG) this limit is exceeded. The solution: use a 2D grid.
engine.dispatch1D(...) makes this decision host-side:
- WG count ≤ 65535 →
(N, 1, 1)1D - Otherwise →
(65535, ceil(N/65535), 1)2D
The kernel must run without noticing either case. flat_id reduces each thread to a single global ID:
flat_id = gid.x + gid.y * (nwg.x * WG)
= thread_in_x_row + (row_index * threads_per_row)gid.x and nwg.x are the thread/workgroup coordinate in WG units, but the linear ID is wanted in thread units — the × WG multiplication performs this conversion.
Important: The kernel calling flat_id must produce a thread ID concatenated with local_invocation_index; not just the workgroup ID. Each kernel handles that detail internally.
C. NaN/Inf Guard
fn is_finite(x: f32) -> bool {
return (x == x) && (abs(x) < F32_MAX);
}
fn nan_guard(x: f32) -> f32 {
return select(0.0, x, is_finite(x));
}WGSL has no isInf/isNaN built-ins. Self-defined:
(x == x)→falsefor NaN,truefor everything else. NaN's one defining property: it is not equal to itself.abs(x) < F32_MAX→±Infis excluded, finite floats accepted.
nan_guard(x):
- returns x if x is finite
- returns 0 if NaN/Inf
Where it's used: At the edges of the backward pass — for example in the attention_backward output, so that if a blow-up happens during Q×K^T, we zero out the downstream gradient rather than contaminating the pipeline with NaN.
D. Subgroup-accelerated workgroup reduction
Shared scratch
var<workgroup> sh_red: array<f32, 256>;256-entry workgroup-shared memory. Only the first NUM_SUBGROUPS = 8 slots carry meaningful data, the rest is padding. Why 256? Because all kernels share this preamble; some may index sh_red[tid] and write up to 256 (e.g. a helper other than wg_reduce_*). A defensive size.
wg_reduce_sum
fn wg_reduce_sum(tid: u32, val: f32) -> f32 {
let sg_sum = subgroupAdd(val);
let sg_id = tid / SUBGROUP_SIZE;
let lane = tid % SUBGROUP_SIZE;
if (lane == 0u) { sh_red[sg_id] = sg_sum; }
workgroupBarrier();
var v: f32 = 0.0;
if (lane < NUM_SUBGROUPS) { v = sh_red[lane]; }
let result = subgroupAdd(v);
workgroupBarrier();
return result;
}Algorithm — 2-level subgroup reduction:
- Phase 1:
subgroupAddwithin each subgroup → 8 separate subgroup sums (the same value in every thread of each subgroup) - Lane 0 writes each subgroup's sum to shared memory (8 writes)
- Barrier — so shared memory is consistent
- Phase 2: Every subgroup loads the first 8 values and runs
subgroupAddwithin itself - Trailing Barrier:
sh_redis a common scratch area shared by allwg_reduce_*calls. Without a secondworkgroupBarrier();right after the final read, a fast subgroup could return early from the function and begin the next reduction step, overwriting thesh_red[sg_id]cell before a slower subgroup has finished itsv = sh_red[lane]read from shared memory. This barrier prevents consecutive reductions (e.g. thewg_reduce_max->wg_reduce_sumtransitions inside attention and cross_entropy) from racing with each other (a race condition). - Result: Every thread sees the same sum.
Why does every subgroup do redundant work in phase 2?
WGSL's uniform control flow rule: subgroupAdd and other subgroup operations must be called under uniform CF. That is:
if (sg_id == 0u) { // ← non-uniform!
let total = subgroupAdd(...); // ← compile error
}To do this, the sg_id == 0u condition depends on tid / SUBGROUP_SIZE, which depends on local_invocation_index, which is different for every thread → non-uniform, fail.
The solution: let all subgroups run phase 2, all loading the same 8 values and calling the same subgroupAdd. Redundant computation but uniform CF, which is acceptable.
The cost: 7× redundant subgroupAdd. But each one is very cheap (8 lanes × 1 op), and worth it compared to the barriers.
Classic alternative (the one done on Apple Metal):
8 iteration tree reduction with 8 barriers + log2(256) = 8 ALU opsThe subgroup version: 2 subgroup ops + 2 barriers = ~50% faster reduction.
wg_reduce_max
Same structure and trailing-barrier logic, subgroupAdd → subgroupMax, identity 0.0 → NEG_INF.
fn wg_reduce_max(tid: u32, val: f32) -> f32 {
let sg_max = subgroupMax(val);
let sg_id = tid / SUBGROUP_SIZE;
let lane = tid % SUBGROUP_SIZE;
if (lane == 0u) { sh_red[sg_id] = sg_max; }
workgroupBarrier();
var v: f32 = NEG_INF;
if (lane < NUM_SUBGROUPS) { v = sh_red[lane]; }
let result = subgroupMax(v);
workgroupBarrier();
return result;
}The NEG_INF identity matters — in phase 2, threads where lane >= NUM_SUBGROUPS must have v = NEG_INF; otherwise, if v = 0.0, it could wrongly pull the max-sum toward 0 (if all sums are negative).
Where the Engine Adds 00_shared.wgsl
The splitKernels function of engine.js:
function splitKernels(source, fileLabel, sharedPreamble = '') {
const marker = /^\/\/ --- KERNEL: (\S+) ---$/gm;
const matches = [...source.matchAll(marker)];
if (matches.length === 0) {
throw new Error(`No kernel markers in ${fileLabel}`);
}
const filePreamble = source.slice(0, matches[0].index);
const preamble = sharedPreamble + '\n' + filePreamble;
const kernels = {};
for (let i = 0; i < matches.length; i++) {
const name = matches[i][1];
const start = matches[i].index + matches[i][0].length;
const end = matches[i + 1]?.index ?? source.length;
kernels[name] = preamble + source.slice(start, end);
}
return kernels;
}So the final source for a kernel = enable f16;\nenable subgroups;\n + 00_shared.wgsl contents + \n + the file's portion before the kernel marker (file preamble) + kernel code.
Result: wg_reduce_sum, flat_id, nan_guard, etc. are callable in every kernel as if they were built-ins of the language.
WGSL-Specific Notes
1. enable directives must be at the start of the file
WGSL spec: enable directives must be the first non-whitespace token. That's why the engine injects enable f16; and enable subgroups; even AHEAD of the shared preamble:
enable f16; ← must be the first line
enable subgroups;
// ─── 00_shared.wgsl içeriği ───
const WG: u32 = 256u;
...If we had put enable f16; inside 00_shared.wgsl and the engine prepended something on top of this file, compilation would error out.
2. var<workgroup> limitations
sh_red: array<f32, 256> gets a separate copy for each kernel (per workgroup). On Apple GPUs the workgroup memory limit is ~32 KB; sh_red = 1 KB is small. But if a kernel already uses 30 KB of workgroup memory internally (e.g. matmul's tileA + tileB), sh_red's extra 1 KB can noticeably lower occupancy. WGSL does not warn about this at compile time; you need to know it.
3. Subgroup-uniform CF — silent compile error
This is exactly the rationale for the "redundant phase 2" inside our wg_reduce_sum code. If a newcomer wrote "subgroupAdd inside sg_id == 0u", the Tint compiler would throw a "uniformity analysis failed" error. This is the standard Metal/CUDA pattern — WGSL is stricter.
4. select(false_val, true_val, cond) — ternary
WGSL has no ternary operator cond ? a : b. select(b, a, cond) is used instead. The order is select(false_branch, true_branch, condition) — similar to Metal's select(false, true, cond), but watch out: it differs from C/C++.
Code Review
Finding 1: SUBGROUP_SIZE hardcoded — breaks on 64-lanes
| Risk | Description |
|---|---|
| 🟡 medium | SUBGROUP_SIZE = 32u is a compile-time constant. But AMD CDNA (data-center GPU) uses 64-lane waves. WebGPU's subgroups feature isn't officially supported on CDNA in Chrome, but it could be in the future. |
Mitigation: If a 64-lane GPU is added, wg_reduce_* would need to use the subgroupSize runtime built-in. No problem on the targets the code currently runs on (Apple, NVIDIA desktop, Intel).
Finding 2: sh_red could be 8 instead of 256
| Risk | Description |
|---|---|
| 🟢 none but worth reviewing | sh_red[NUM_SUBGROUPS] is enough; 256 entries are defensive sizing. A 1 KB workgroup memory waste (small within a total of 32 KB). But good to know. |
Finding 3: F32_MAX literal — compiler tolerance
| Risk | Description |
|---|---|
| 🟢 none | The literal 3.4028234e38f — some compilers may not be bit-by-bit identical to f32::MAX's exact value (3.4028234663852886e+38 exact). In practice the difference does not affect functional behavior. |
Quick Checklist
| Test Scenario | Status |
|---|---|
| Is the preamble injected into every kernel? | ✅ engine.js:206 check |
| No kernel marker present? | ✅ explicit check (engine.js:206) |
Does wg_reduce_sum produce the correct sum of 256 elements? | ⚠ no test |
nan_guard(NaN) → 0 verified? | ⚠ no unit test |
| Does subgroup uniform CF avoid a compile error? | ✅ seen at runtime |
Is flat_id(2D dispatch) indexing correct? | ✅ hot path in matmul kernels |
Next
01_embedding.md — token ID → vector conversion, the model's first forward step.