llm.istanbul·Study
TR EN
Workbench →

How to Use — Driving the Workbench End to End

From raw text to a talking model: set up the tokenizer, train the model, generate output — and what every setting actually does.

The rest of the studies open up the kernels line by line. This page is the practical side: in the llm.istanbul workbench, which button does what, which parameter changes what. An end-to-end tour for training and running a small Turkish model from scratch.

The big picture — three tabs, two files

The workbench is three tabs; two kinds of files travel between them:

[raw text .txt]
   │  Tokenizer tab    →  BPE vocab + tokenization
   ▼
[.bin]   token sequence + embedded vocab
   │  Train tab        →  forward / backward / AdamW
   ▼
[.llm]   model weights + embedded vocab
   │  Generate tab     →  continue the prompt
   ▼
[generated text]
  • Tokenizer → splits text into tokens, builds the BPE vocabulary, emits a .bin.
  • Train → takes the .bin, trains the transformer, saves a .llm checkpoint.
  • Generate → takes the .llm, continues the prompt you write.

An important detail: both .bin and .llm carry the vocab embedded inside. So even if you share a .llm file on its own, the vocabulary comes along with it; you don't need to ship a separate vocab file.


Step 1 — Tokenizer (set up the vocabulary)

The model sees tokens, not letters — word pieces (why and how: 01 Embedding, 13 BPE). The first job is to set up the vocabulary that splits your text into tokens.

  1. Drop in the text files (.txt, .md, …). There can be several; "browse folder" scans the folder recursively.
  2. Pick a vocab size — the number of tokens in the vocabulary.
VocabWhen
8K–16KSmall corpus, narrow domain
32KGood balance for Turkish ⭐
64K (default)Wide corpus; the embedding table grows too
Tip

Turkish is agglutinative — "ev-ler-imiz-den". A very small vocab shreds the suffixes apart; a very large vocab wastes parameters on rare tokens. For a medium-to-large corpus, 32K is the sweet spot; for toy experiments, 8K–16K is enough.

  1. "Train BPE & Export .bin" — runs the merges on the GPU, then tokenizes the whole text and downloads a .bin. If you want, you can also save the vocabulary separately as .json, but the .bin already carries the vocab inside.
Important

Pretrain and fine-tune must be done with the same vocab. If the vocabulary changes, the embedding table won't match the old checkpoint and the model breaks. Set up the vocabulary once for a corpus, then always use that one.


Step 2 — Train (train the model)

The heart of the work. First you pick the model's shape (how many parameters it will have), then how it will learn (the training settings).

The easiest path: start with a preset

Instead of tweaking dozens of settings one by one, pick a preset, then fine-tune. The three presets stamp out the following:

PresetWhat forlrstepsgrad_accumwarmupmin_lrweight_decaygrad_clip
Fresh pretrainTraining from scratch3e-450k45%10%0.11.0
Continue pretrainResume from a checkpoint1e-430k410%10%0.10.5
Fine-tuneAdapt a ready model to a task5e-510k21%30%0.010.5
Note

The tab's default chips (d=128, 4 heads, 4 layers, seq=128, 500 steps) are deliberately toy/sanity sized — to see "does it work?" in a few seconds. When training a real model, switch to the "Ready recipe" below or to a preset.

Model size — how many parameters?

d_model — hidden size

The width of the model's "thinking space"; each token turns into a vector of this size (01 Embedding).

ValueUse
64–256Toy / sanity
384–512Small production model ⭐
768+Deep semantics, slow

The effect is squared: all matrices grow with d_model² — d=512 is ~1.8× the compute/memory of d=384.

n_heads — number of attention heads

Parallel "perspectives"; each head attends to a different relationship in the text (05 Attention). Rule: head_dim = d_model / n_heads, and 32–128 is a good range. d=512 → n_heads=8 → head_dim=64 ✓.

n_kv_heads — GQA grouping

Reduces the number of Key/Value heads to save memory. = n_heads (MHA, the highest quality but memory hungry), 2/4 (GQA — 4–5× KV memory savings, negligible quality loss ⭐), 1 (MQA, aggressive). Llama, Mistral, and Gemma all use GQA.

n_layers — depth

The number of transformer blocks; the model's "thinking steps". Linear effect: 8 layers = 2× the memory/time of 4 layers. For a small model, 6–8 is a good balance.

d_ff mult — FFN expansion ratio

The FFN layer is d_model × ratio wide (d=512 × 4 = 2048). is the modern standard (06 Activation).

activation

GeLU (classic, 2 matrices) or SwiGLU (modern, Llama-style, 3 matrices — gate/up/down, a bit pricier but higher quality ⭐). Details: 06 Activation.

Training settings — how should it learn?

seq_len — context window

The token length of one example. Since attention is O(seq²), 1024 is ~4× as pricey as 512 (05 Attention). For Q&A and dialogue, 512 is more than enough; for a toy, 128–256.

lr — learning rate

The optimizer's step size; with a cosine schedule it climbs to peak and falls. 3e-4 is the "magic number" for the Adam family (pretrain ⭐). 1e-4 is safe/slow (fine-tune). 1e-3+ risks blowing up. In fine-tune, use 1/3–1/10 of the pretrain lr.

steps — number of steps

The total number of optimizer steps. Sanity 100, quick experiment 2k, serious pretrain 50k–100k. Fine-tune is far fewer (1k–3k) — on a small corpus, too many steps = memorization.

grad_accum and the "batch" question

Important

This engine processes a single sequence in the browser — that is, batch = 1, there is no real batch. The "virtual batch" comes from grad_accum: it accumulates the gradients of N micro-steps (11 Backward Linear) and does a single optimizer update.

So tokens per step = seq_len × grad_accum. (The "batch=32" assumption that shows up in some general notes does not apply to this engine.)

tokens_per_step = seq_len × grad_accum
epoch_steps     = corpus_tokens / tokens_per_step
epochs_seen     = steps × tokens_per_step / corpus_tokens

Example: a 50M-token corpus, seq_len=512, grad_accum=4 → 2048 tokens per step → 50k steps ≈ 102M tokens seen ≈ ~2 epochs. If you aren't hitting memory errors, you can leave grad_accum=1; under tight memory, bump it to 4–8.

optimizer

AdamW (fp32) is classic, the highest quality ⭐ (12 AdamW). AdamW (8-bit) keeps the moments (m, v) in 8-bit → memory savings, quality ~the same. Under 50M, fp32 is enough; for 100M+, 8-bit may be needed.

precision

fp32 is everything 32-bit (slow but sure). mixed (f16 W) mirrors the weights in 16-bit, while gradients stay 32-bit (08 F16 Cast). On Apple Silicon, mixed is ~1.5–2× faster with no quality loss — in practice the right choice.

label_smooth (α)

Instead of "be 100% right", it says "95% is enough, spread out the rest"; it breaks excessive overconfidence (07 Cross-Entropy). 0.05 is mild ⭐, 0.1 is aggressive. The loss rises a bit, generalization improves.

z_loss (β)

A stabilizer that keeps the log-sum-exp of the output logits small; it prevents huge logits from forming and is compatible with mixed precision. 1e-4 is mild ⭐.

warmup — warmup ratio

The phase that slowly climbs the lr from the very start up to the peak (as a percentage of total steps). A big step from a cold start can blow up the gradient; warmup prevents this. Typically 5% in pretrain, 1% in continue/fine-tune. If the model is already warmed up, keep it short.

min_lr — cosine floor ratio

How far down the cosine decay goes (as a percentage of peak lr). 10% is standard; in fine-tune use 30% (don't drop the lr too far, to guard against an early plateau). Dropping to 0 freezes the model in the final steps.

weight_decay — L2 regularization

A penalty that gently pulls the weights toward zero; it reins in overfitting. 0.1 is the pretrain standard, 0.01 in fine-tune (less intervention).

grad_clip — gradient clipping

If the gradient norm exceeds this threshold, it scales it down and caps it — preventing a sudden blow-up (NaN) from wrecking the whole training run. 1.0 is typical; 0.5 for more delicate/continue training.

During training

  • Validation: the last 5% of the corpus is held out; val loss is measured every 100 steps. If the val loss starts rising while the train loss is falling → an overfit signal.
  • The loss curve is plotted live. Around ln(vocab) = random start; the lower it drops, the better.
  • Save Model (.llm): saves the current weights (and the optimizer state if you want).
  • Save Best: separately saves a snapshot of the lowest val moment across the run — so even if overfitting begins, you don't miss the best point.
Note

The .llm format carries the vocab embedded. If you try to resume a checkpoint on a .bin produced with a different vocab, the workbench warns you — the embedding tables won't match.


Step 3 — Generate (generate output)

Load a .llm, write a prompt, and let the tokens flow. The sampling settings determine "how the next token is picked" (sampling from the probability distribution in 07 Cross-Entropy):

SettingDefaultWhat it does
temperature0.8Sharpens/softens the distribution. Low = safe/repetitive, high = creative/scattered. Near 0 ≈ always the most likely token.
top-k40Pick only from the top k tokens; discard the rest.
top-p0.95Pick from the smallest set whose probability sum exceeds p (nucleus).
min-p0Discard candidates below this fraction of the most likely token's p.
rep (repetition penalty)1.1Trims the probability of previously seen tokens to keep it from looping.
tokens64The maximum number of tokens to generate.

Order of application: first temperature, then the repetition penalty, then the top-k → min-p → top-p filters, and sampling at the very end. A solid starting point: temp 0.7–0.9, top-p 0.9–0.95, rep 1.1.


Ready recipe — small Turkish model (~40M)

A verified, working starting point:

Model:
  d_model       512
  n_heads       8
  n_kv_heads    2  (GQA)
  n_layers      8
  d_ff mult     4×
  activation    SwiGLU
  → ~40M parameters

Pretrain:
  seq_len       512
  lr            3e-4         (cosine; min_lr 10%)
  steps         50k–100k
  grad_accum    4
  warmup        5%
  optimizer     AdamW (fp32) (8-bit if 100M+)
  precision     mixed (f16 W)
  label_smooth  0.05
  z_loss        1e-4
  weight_decay  0.1
  grad_clip     1.0

A Q&A fine-tune on top:

Same model params — DO NOT CHANGE
seq_len       512
lr            5e-5     ← drop it sharply
steps         1k–3k    ← keep it short
grad_accum    2
warmup        1%
min_lr        30%
weight_decay  0.01

Which model costs how much? (roughly, M3 Pro)

d_modeln_layers~Parameters~MemorySpeed
2566~7M200 MB~12 step/s
3846~19M500 MB~7 step/s
5128~39M930 MB~3 step/s
76812~110M2.5 GB~1 step/s

(assuming vocab=16K, seq=512; the real number changes with vocab and d_ff.)


Habits that pay off

  1. Sanity first: before the real run, confirm the pipeline works with a small config + 100 steps.
  2. Trust Save Best — it holds the best point even if overfitting begins, and the file is small.
  3. On the pretrain → fine-tune transition, always lower the lr (1/3–1/10).
  4. Same tokenizer — pretrain and fine-tune must share the same vocab.
  5. Don't let the computer sleep — on a long run, sleep drops the WebGPU context and the run breaks.
  6. Watch the val loss, not just the train loss — divergence means overfitting.

Common mistakes

  • ❌ Keeping the pretrain lr in fine-tune → catastrophic forgetting (the model forgets what it knew).
  • ❌ New vocab + old checkpoint → the embedding won't match, the model breaks.
  • ❌ Too high an lr (3e-3) → the loss blows up (NaN); grad_clip is the last line of defense.
  • ❌ A 50k-step fine-tune on a small corpus → letter-by-letter memorization + babbling.
  • ❌ Too large a seq_len on a small corpus → the windows are half empty, efficiency drops.
  • ❌ Trying to teach new knowledge with SFT/fine-tune → hallucination. Knowledge goes in during pretrain; fine-tune only teaches "how to convey it".

Next

Curious what spins behind the buttons? Overview opens up the whole pipeline, then each kernel opens its own study — from embedding to AdamW, line by line.

WGSL kernel studies · an LLM from scratch on WebGPUBuilt in Istanbul by Uğur Toprakdeviren.