`00_shared.wgsl` — Shared Preamble (The Base of All Kernels)

File: 00_shared.wgsl Pipeline step: none — This is a preamble file, not a kernel. The engine injects it at the start of every kernel module.

Wait, what is this?

Picture a project with 30 separate files, each one starting with the exact same #defines, the same helper functions, the same constants. You wouldn't copy-paste all of it into every file by hand — you'd write a prelude.h and #include it at the top, done. This file is exactly that "prelude", with one catch: WGSL has no #include.

So the engine pulls a crude-but-effective trick: right before compiling each kernel, it pastes the contents of this file, as plain text, onto the front of the kernel code. So when you call wg_reduce_sum, you haven't imported or linked anything — that function has physically been spliced into your file. It behaves as if it were a language built-in, but it's really shared text copied into every kernel separately.

What's inside? A handful of constants (WG=256, NEG_INF, and friends), a couple of tiny helpers (flat_id collapses a 2D dispatch into a single linear index, nan_guard zeroes out blown-up numbers), and the real heavy lifting that every kernel leans on: subgroup-accelerated workgroup reductions (wg_reduce_sum, wg_reduce_max). These take "boil 256 threads' values down to one sum/max" and do it using the GPU's SIMD lanes, about 50% faster than a classic tree reduction.

Here's the interesting part: because this "paste it on top" approach is the only valid route, it spawns a bunch of subtle rules — the file must carry no kernel marker, enable directives must sit at the very top, subgroup functions must be called under "uniform control flow". Each of those gets unpacked below.

What Does It Do?

It defines the common preamble shared by all WGSL kernels. Before each kernel is compiled, the engine prepends the contents of 00_shared.wgsl to the very top. So everything defined here:

Constants (WG, NEG_INF, SUBGROUP_SIZE, NUM_SUBGROUPS)
Helper functions (flat_id, is_finite, nan_guard, wg_reduce_sum, wg_reduce_max)
Shared workgroup memory (sh_red)

…is visible inside every kernel.

Important rule: This file must not contain a kernel marker (// --- KERNEL: name ---). If it does, the engine subjects this file to kernel splitting and the preamble property breaks.

Engine Side — How Is It Injected?

The init() function in engine.js:

javascript

let sharedPreamble = await loadShader(SHARED_PREAMBLE_PATH, import.meta.url);
if (/^\/\/ --- KERNEL:/m.test(sharedPreamble)) {
    throw new Error(`${SHARED_PREAMBLE_PATH} must not contain kernel markers`);
}
sharedPreamble = 'enable f16;\nenable subgroups;\n' + sharedPreamble;

So the final source of each kernel ends up like this:

wgsl

enable f16;
enable subgroups;
// ─── 00_shared.wgsl içeriği ───
const WG: u32 = 256u;
fn wg_reduce_sum(tid: u32, val: f32) -> f32 { ... }
// ... (rest of preamble)
// ─── per-file preamble (e.g., constants from 03_linear.wgsl) ───
const TM: u32 = 64u;
var<workgroup> tileA: array<f32, 1088>;
// ─── kernel marker'dan sonraki kod ───
@compute @workgroup_size(...)
fn matmul(...) { ... }

This is the only valid method in WGSL, which has no "include" system — string concatenation.

A. Constants

wgsl

const WG: u32 = 256u;
const MAX_WG_DIM: u32 = 65535u;
const NEG_INF: f32 = -3.4028234e38;
const F32_MAX: f32 = 3.4028234e38;

const SUBGROUP_SIZE: u32 = 32u;
const NUM_SUBGROUPS: u32 = 8u;   // WG / SUBGROUP_SIZE

Constant	Value	Description
`WG`	256	Workgroup size (threads per WG). Used in all 1D dispatch kernels.
`MAX_WG_DIM`	65535	The max workgroup count WebGPU allows in 1D. If exceeded, falls back to 2D (see `flat_id`).
`NEG_INF`	-3.40e38	For masking in softmax/attention. A large negative literal instead of ±Inf — WGSL has no `f32::MIN` literal.
`F32_MAX`	+3.40e38	Reference for `is_finite`. WGSL has no `isInf`, so we do our own finite-check.
`SUBGROUP_SIZE`	32	SIMD lane count. Guaranteed 32 on Apple/NVIDIA/Intel/AMD-RDNA (Chrome `subgroups` feature).
`NUM_SUBGROUPS`	8	`WG / SUBGROUP_SIZE`. How many subgroups are in a workgroup.

Why is SUBGROUP_SIZE a constant? Even though WebGPU has the subgroupSize runtime built-in, our code assumes 32 at compile time. Correct for Apple/NVIDIA. AMD RDNA also reports 32 in Chrome (CDNA is 64, but CDNA isn't available in Chrome). If you ever run on a 64-lane GPU someday, the reduction count would be wrong — this is a deliberate hardware constraint.

B. `flat_id` — Linearization for 2D dispatch fallback

wgsl

fn flat_id(gid: vec3<u32>, nwg: vec3<u32>) -> u32 {
    return gid.x + gid.y * nwg.x * WG;
}

Why does it exist? WebGPU's dispatchWorkgroups(x, y, z) gives a max of 65535 workgroups in 1D. On a 200M-token corpus (200M / 256 = 781K WG) this limit is exceeded. The solution: use a 2D grid.

engine.dispatch1D(...) makes this decision host-side:

WG count ≤ 65535 → (N, 1, 1) 1D
Otherwise → (65535, ceil(N/65535), 1) 2D

The kernel must run without noticing either case. flat_id reduces each thread to a single global ID:

flat_id = gid.x + gid.y * (nwg.x * WG)
        = thread_in_x_row + (row_index * threads_per_row)

gid.x and nwg.x are the thread/workgroup coordinate in WG units, but the linear ID is wanted in thread units — the × WG multiplication performs this conversion.

Important: The kernel calling flat_id must produce a thread ID concatenated with local_invocation_index; not just the workgroup ID. Each kernel handles that detail internally.

C. NaN/Inf Guard

wgsl

fn is_finite(x: f32) -> bool {
    return (x == x) && (abs(x) < F32_MAX);
}

fn nan_guard(x: f32) -> f32 {
    return select(0.0, x, is_finite(x));
}

WGSL has no isInf/isNaN built-ins. Self-defined:

(x == x) → false for NaN, true for everything else. NaN's one defining property: it is not equal to itself.
abs(x) < F32_MAX → ±Inf is excluded, finite floats accepted.

nan_guard(x):

returns x if x is finite
returns 0 if NaN/Inf

Where it's used: At the edges of the backward pass — for example in the attention_backward output, so that if a blow-up happens during Q×K^T, we zero out the downstream gradient rather than contaminating the pipeline with NaN.

D. Subgroup-accelerated workgroup reduction

Shared scratch

wgsl

var<workgroup> sh_red: array<f32, 256>;

256-entry workgroup-shared memory. Only the first NUM_SUBGROUPS = 8 slots carry meaningful data, the rest is padding. Why 256? Because all kernels share this preamble; some may index sh_red[tid] and write up to 256 (e.g. a helper other than wg_reduce_*). A defensive size.

`wg_reduce_sum`

wgsl

fn wg_reduce_sum(tid: u32, val: f32) -> f32 {
    let sg_sum = subgroupAdd(val);
    let sg_id  = tid / SUBGROUP_SIZE;
    let lane   = tid % SUBGROUP_SIZE;

    if (lane == 0u) { sh_red[sg_id] = sg_sum; }
    workgroupBarrier();

    var v: f32 = 0.0;
    if (lane < NUM_SUBGROUPS) { v = sh_red[lane]; }
    let result = subgroupAdd(v);
    workgroupBarrier();
    return result;
}

Algorithm — 2-level subgroup reduction:

Phase 1: subgroupAdd within each subgroup → 8 separate subgroup sums (the same value in every thread of each subgroup)
Lane 0 writes each subgroup's sum to shared memory (8 writes)
Barrier — so shared memory is consistent
Phase 2: Every subgroup loads the first 8 values and runs subgroupAdd within itself
Trailing Barrier: sh_red is a common scratch area shared by all wg_reduce_* calls. Without a second workgroupBarrier(); right after the final read, a fast subgroup could return early from the function and begin the next reduction step, overwriting the sh_red[sg_id] cell before a slower subgroup has finished its v = sh_red[lane] read from shared memory. This barrier prevents consecutive reductions (e.g. the wg_reduce_max -> wg_reduce_sum transitions inside attention and cross_entropy) from racing with each other (a race condition).
Result: Every thread sees the same sum.

Why does every subgroup do redundant work in phase 2?

WGSL's uniform control flow rule: subgroupAdd and other subgroup operations must be called under uniform CF. That is:

wgsl

if (sg_id == 0u) {        // ← non-uniform!
    let total = subgroupAdd(...);  // ← compile error
}

To do this, the sg_id == 0u condition depends on tid / SUBGROUP_SIZE, which depends on local_invocation_index, which is different for every thread → non-uniform, fail.

The solution: let all subgroups run phase 2, all loading the same 8 values and calling the same subgroupAdd. Redundant computation but uniform CF, which is acceptable.

The cost: 7× redundant subgroupAdd. But each one is very cheap (8 lanes × 1 op), and worth it compared to the barriers.

Classic alternative (the one done on Apple Metal):

8 iteration tree reduction with 8 barriers + log2(256) = 8 ALU ops

The subgroup version: 2 subgroup ops + 2 barriers = ~50% faster reduction.

`wg_reduce_max`

Same structure and trailing-barrier logic, subgroupAdd → subgroupMax, identity 0.0 → NEG_INF.

wgsl

fn wg_reduce_max(tid: u32, val: f32) -> f32 {
    let sg_max = subgroupMax(val);
    let sg_id  = tid / SUBGROUP_SIZE;
    let lane   = tid % SUBGROUP_SIZE;

    if (lane == 0u) { sh_red[sg_id] = sg_max; }
    workgroupBarrier();

    var v: f32 = NEG_INF;
    if (lane < NUM_SUBGROUPS) { v = sh_red[lane]; }
    let result = subgroupMax(v);
    workgroupBarrier();
    return result;
}

The NEG_INF identity matters — in phase 2, threads where lane >= NUM_SUBGROUPS must have v = NEG_INF; otherwise, if v = 0.0, it could wrongly pull the max-sum toward 0 (if all sums are negative).

Where the Engine Adds `00_shared.wgsl`

The splitKernels function of engine.js:

javascript

function splitKernels(source, fileLabel, sharedPreamble = '') {
    const marker = /^\/\/ --- KERNEL: (\S+) ---$/gm;
    const matches = [...source.matchAll(marker)];
    if (matches.length === 0) {
        throw new Error(`No kernel markers in ${fileLabel}`);
    }
    const filePreamble = source.slice(0, matches[0].index);
    const preamble = sharedPreamble + '\n' + filePreamble;
    const kernels = {};
    for (let i = 0; i < matches.length; i++) {
        const name = matches[i][1];
        const start = matches[i].index + matches[i][0].length;
        const end = matches[i + 1]?.index ?? source.length;
        kernels[name] = preamble + source.slice(start, end);
    }
    return kernels;
}

So the final source for a kernel = enable f16;\nenable subgroups;\n + 00_shared.wgsl contents + \n + the file's portion before the kernel marker (file preamble) + kernel code.

Result: wg_reduce_sum, flat_id, nan_guard, etc. are callable in every kernel as if they were built-ins of the language.

WGSL-Specific Notes

1. `enable` directives must be at the start of the file

WGSL spec: enable directives must be the first non-whitespace token. That's why the engine injects enable f16; and enable subgroups; even AHEAD of the shared preamble:

wgsl

enable f16;            ← must be the first line
enable subgroups;
// ─── 00_shared.wgsl içeriği ───
const WG: u32 = 256u;
...

If we had put enable f16; inside 00_shared.wgsl and the engine prepended something on top of this file, compilation would error out.

2. `var<workgroup>` limitations

sh_red: array<f32, 256> gets a separate copy for each kernel (per workgroup). On Apple GPUs the workgroup memory limit is ~32 KB; sh_red = 1 KB is small. But if a kernel already uses 30 KB of workgroup memory internally (e.g. matmul's tileA + tileB), sh_red's extra 1 KB can noticeably lower occupancy. WGSL does not warn about this at compile time; you need to know it.

3. Subgroup-uniform CF — silent compile error

This is exactly the rationale for the "redundant phase 2" inside our wg_reduce_sum code. If a newcomer wrote "subgroupAdd inside sg_id == 0u", the Tint compiler would throw a "uniformity analysis failed" error. This is the standard Metal/CUDA pattern — WGSL is stricter.

4. `select(false_val, true_val, cond)` — ternary

WGSL has no ternary operator cond ? a : b. select(b, a, cond) is used instead. The order is select(false_branch, true_branch, condition) — similar to Metal's select(false, true, cond), but watch out: it differs from C/C++.

Code Review

Finding 1: SUBGROUP_SIZE hardcoded — breaks on 64-lanes

Risk	Description
🟡 medium	`SUBGROUP_SIZE = 32u` is a compile-time constant. But AMD CDNA (data-center GPU) uses 64-lane waves. WebGPU's `subgroups` feature isn't officially supported on CDNA in Chrome, but it could be in the future.

Mitigation: If a 64-lane GPU is added, wg_reduce_* would need to use the subgroupSize runtime built-in. No problem on the targets the code currently runs on (Apple, NVIDIA desktop, Intel).

Finding 2: `sh_red` could be 8 instead of 256

Risk	Description
🟢 none but worth reviewing	`sh_red[NUM_SUBGROUPS]` is enough; 256 entries are defensive sizing. A 1 KB workgroup memory waste (small within a total of 32 KB). But good to know.

Finding 3: F32_MAX literal — compiler tolerance

Risk	Description
🟢 none	The literal `3.4028234e38f` — some compilers may not be bit-by-bit identical to `f32::MAX`'s exact value (`3.4028234663852886e+38` exact). In practice the difference does not affect functional behavior.

Quick Checklist

Test Scenario	Status
Is the preamble injected into every kernel?	✅ engine.js:206 check
No kernel marker present?	✅ explicit check (engine.js:206)
Does `wg_reduce_sum` produce the correct sum of 256 elements?	⚠ no test
`nan_guard(NaN)` → 0 verified?	⚠ no unit test
Does subgroup uniform CF avoid a compile error?	✅ seen at runtime
Is `flat_id(2D dispatch)` indexing correct?	✅ hot path in matmul kernels

01_embedding.md — token ID → vector conversion, the model's first forward step.

`00_shared.wgsl` — Shared Preamble (The Base of All Kernels)

Wait, what is this?

What Does It Do?

Engine Side — How Is It Injected?

Contents

A. Constants

B. `flat_id` — Linearization for 2D dispatch fallback

C. NaN/Inf Guard

D. Subgroup-accelerated workgroup reduction

Shared scratch

`wg_reduce_sum`

`wg_reduce_max`

Where the Engine Adds `00_shared.wgsl`

WGSL-Specific Notes

1. `enable` directives must be at the start of the file

2. `var<workgroup>` limitations

3. Subgroup-uniform CF — silent compile error

4. `select(false_val, true_val, cond)` — ternary

Code Review

Finding 1: SUBGROUP_SIZE hardcoded — breaks on 64-lanes

Finding 2: `sh_red` could be 8 instead of 256

Finding 3: F32_MAX literal — compiler tolerance

Quick Checklist

Next

#00_shared.wgsl — Shared Preamble (The Base of All Kernels)

#Wait, what is this?

#What Does It Do?

#Engine Side — How Is It Injected?

#Contents

#A. Constants

#B. flat_id — Linearization for 2D dispatch fallback

#C. NaN/Inf Guard

#D. Subgroup-accelerated workgroup reduction

#Shared scratch

#wg_reduce_sum

#wg_reduce_max

#Where the Engine Adds 00_shared.wgsl

#WGSL-Specific Notes

#1. enable directives must be at the start of the file

#2. var<workgroup> limitations

#3. Subgroup-uniform CF — silent compile error

#4. select(false_val, true_val, cond) — ternary

#Code Review

#Finding 1: SUBGROUP_SIZE hardcoded — breaks on 64-lanes

#Finding 2: sh_red could be 8 instead of 256

#Finding 3: F32_MAX literal — compiler tolerance

#Quick Checklist

#Next

`00_shared.wgsl` — Shared Preamble (The Base of All Kernels)

Wait, what is this?

What Does It Do?

Engine Side — How Is It Injected?

Contents

A. Constants

B. `flat_id` — Linearization for 2D dispatch fallback

C. NaN/Inf Guard

D. Subgroup-accelerated workgroup reduction

Shared scratch

`wg_reduce_sum`

`wg_reduce_max`

Where the Engine Adds `00_shared.wgsl`

WGSL-Specific Notes

1. `enable` directives must be at the start of the file

2. `var<workgroup>` limitations

3. Subgroup-uniform CF — silent compile error

4. `select(false_val, true_val, cond)` — ternary

Code Review

Finding 1: SUBGROUP_SIZE hardcoded — breaks on 64-lanes

Finding 2: `sh_red` could be 8 instead of 256

Finding 3: F32_MAX literal — compiler tolerance

Quick Checklist

Next