How to Use — Driving the Workbench End to End
From raw text to a talking model: set up the tokenizer, train the model, generate output — and what every setting actually does.
The rest of the studies open up the kernels line by line. This page is the practical side: in the llm.istanbul workbench, which button does what, which parameter changes what. An end-to-end tour for training and running a small Turkish model from scratch.
The big picture — three tabs, two files
The workbench is three tabs; two kinds of files travel between them:
[raw text .txt]
│ Tokenizer tab → BPE vocab + tokenization
▼
[.bin] token sequence + embedded vocab
│ Train tab → forward / backward / AdamW
▼
[.llm] model weights + embedded vocab
│ Generate tab → continue the prompt
▼
[generated text]- Tokenizer → splits text into tokens, builds the BPE vocabulary, emits a
.bin. - Train → takes the
.bin, trains the transformer, saves a.llmcheckpoint. - Generate → takes the
.llm, continues the prompt you write.
An important detail: both .bin and .llm carry the vocab embedded inside. So even if you share a .llm file on its own, the vocabulary comes along with it; you don't need to ship a separate vocab file.
Step 1 — Tokenizer (set up the vocabulary)
The model sees tokens, not letters — word pieces (why and how: 01 Embedding, 13 BPE). The first job is to set up the vocabulary that splits your text into tokens.
- Drop in the text files (
.txt,.md, …). There can be several; "browse folder" scans the folder recursively. - Pick a vocab size — the number of tokens in the vocabulary.
| Vocab | When |
|---|---|
| 8K–16K | Small corpus, narrow domain |
| 32K | Good balance for Turkish ⭐ |
| 64K (default) | Wide corpus; the embedding table grows too |
Turkish is agglutinative — "ev-ler-imiz-den". A very small vocab shreds the suffixes apart; a very large vocab wastes parameters on rare tokens. For a medium-to-large corpus, 32K is the sweet spot; for toy experiments, 8K–16K is enough.
- "Train BPE & Export .bin" — runs the merges on the GPU, then tokenizes the whole text and downloads a
.bin. If you want, you can also save the vocabulary separately as.json, but the.binalready carries the vocab inside.
Pretrain and fine-tune must be done with the same vocab. If the vocabulary changes, the embedding table won't match the old checkpoint and the model breaks. Set up the vocabulary once for a corpus, then always use that one.
Step 2 — Train (train the model)
The heart of the work. First you pick the model's shape (how many parameters it will have), then how it will learn (the training settings).
The easiest path: start with a preset
Instead of tweaking dozens of settings one by one, pick a preset, then fine-tune. The three presets stamp out the following:
| Preset | What for | lr | steps | grad_accum | warmup | min_lr | weight_decay | grad_clip |
|---|---|---|---|---|---|---|---|---|
| Fresh pretrain | Training from scratch | 3e-4 | 50k | 4 | 5% | 10% | 0.1 | 1.0 |
| Continue pretrain | Resume from a checkpoint | 1e-4 | 30k | 4 | 10% | 10% | 0.1 | 0.5 |
| Fine-tune | Adapt a ready model to a task | 5e-5 | 10k | 2 | 1% | 30% | 0.01 | 0.5 |
The tab's default chips (d=128, 4 heads, 4 layers, seq=128, 500 steps) are deliberately toy/sanity sized — to see "does it work?" in a few seconds. When training a real model, switch to the "Ready recipe" below or to a preset.
Model size — how many parameters?
d_model — hidden size
The width of the model's "thinking space"; each token turns into a vector of this size (01 Embedding).
| Value | Use |
|---|---|
| 64–256 | Toy / sanity |
| 384–512 | Small production model ⭐ |
| 768+ | Deep semantics, slow |
The effect is squared: all matrices grow with d_model² — d=512 is ~1.8× the compute/memory of d=384.
n_heads — number of attention heads
Parallel "perspectives"; each head attends to a different relationship in the text (05 Attention). Rule: head_dim = d_model / n_heads, and 32–128 is a good range. d=512 → n_heads=8 → head_dim=64 ✓.
n_kv_heads — GQA grouping
Reduces the number of Key/Value heads to save memory. = n_heads (MHA, the highest quality but memory hungry), 2/4 (GQA — 4–5× KV memory savings, negligible quality loss ⭐), 1 (MQA, aggressive). Llama, Mistral, and Gemma all use GQA.
n_layers — depth
The number of transformer blocks; the model's "thinking steps". Linear effect: 8 layers = 2× the memory/time of 4 layers. For a small model, 6–8 is a good balance.
d_ff mult — FFN expansion ratio
The FFN layer is d_model × ratio wide (d=512 × 4 = 2048). 4× is the modern standard (06 Activation).
activation
GeLU (classic, 2 matrices) or SwiGLU (modern, Llama-style, 3 matrices — gate/up/down, a bit pricier but higher quality ⭐). Details: 06 Activation.
Training settings — how should it learn?
seq_len — context window
The token length of one example. Since attention is O(seq²), 1024 is ~4× as pricey as 512 (05 Attention). For Q&A and dialogue, 512 is more than enough; for a toy, 128–256.
lr — learning rate
The optimizer's step size; with a cosine schedule it climbs to peak and falls. 3e-4 is the "magic number" for the Adam family (pretrain ⭐). 1e-4 is safe/slow (fine-tune). 1e-3+ risks blowing up. In fine-tune, use 1/3–1/10 of the pretrain lr.
steps — number of steps
The total number of optimizer steps. Sanity 100, quick experiment 2k, serious pretrain 50k–100k. Fine-tune is far fewer (1k–3k) — on a small corpus, too many steps = memorization.
grad_accum and the "batch" question
This engine processes a single sequence in the browser — that is, batch = 1, there is no real batch. The "virtual batch" comes from grad_accum: it accumulates the gradients of N micro-steps (11 Backward Linear) and does a single optimizer update.
So tokens per step = seq_len × grad_accum. (The "batch=32" assumption that shows up in some general notes does not apply to this engine.)
tokens_per_step = seq_len × grad_accum
epoch_steps = corpus_tokens / tokens_per_step
epochs_seen = steps × tokens_per_step / corpus_tokensExample: a 50M-token corpus, seq_len=512, grad_accum=4 → 2048 tokens per step → 50k steps ≈ 102M tokens seen ≈ ~2 epochs. If you aren't hitting memory errors, you can leave grad_accum=1; under tight memory, bump it to 4–8.
optimizer
AdamW (fp32) is classic, the highest quality ⭐ (12 AdamW). AdamW (8-bit) keeps the moments (m, v) in 8-bit → memory savings, quality ~the same. Under 50M, fp32 is enough; for 100M+, 8-bit may be needed.
precision
fp32 is everything 32-bit (slow but sure). mixed (f16 W) mirrors the weights in 16-bit, while gradients stay 32-bit (08 F16 Cast). On Apple Silicon, mixed is ~1.5–2× faster with no quality loss — in practice the right choice.
label_smooth (α)
Instead of "be 100% right", it says "95% is enough, spread out the rest"; it breaks excessive overconfidence (07 Cross-Entropy). 0.05 is mild ⭐, 0.1 is aggressive. The loss rises a bit, generalization improves.
z_loss (β)
A stabilizer that keeps the log-sum-exp of the output logits small; it prevents huge logits from forming and is compatible with mixed precision. 1e-4 is mild ⭐.
warmup — warmup ratio
The phase that slowly climbs the lr from the very start up to the peak (as a percentage of total steps). A big step from a cold start can blow up the gradient; warmup prevents this. Typically 5% in pretrain, 1% in continue/fine-tune. If the model is already warmed up, keep it short.
min_lr — cosine floor ratio
How far down the cosine decay goes (as a percentage of peak lr). 10% is standard; in fine-tune use 30% (don't drop the lr too far, to guard against an early plateau). Dropping to 0 freezes the model in the final steps.
weight_decay — L2 regularization
A penalty that gently pulls the weights toward zero; it reins in overfitting. 0.1 is the pretrain standard, 0.01 in fine-tune (less intervention).
grad_clip — gradient clipping
If the gradient norm exceeds this threshold, it scales it down and caps it — preventing a sudden blow-up (NaN) from wrecking the whole training run. 1.0 is typical; 0.5 for more delicate/continue training.
During training
- Validation: the last 5% of the corpus is held out; val loss is measured every 100 steps. If the val loss starts rising while the train loss is falling → an overfit signal.
- The loss curve is plotted live. Around
ln(vocab)= random start; the lower it drops, the better. - Save Model (.llm): saves the current weights (and the optimizer state if you want).
- Save Best: separately saves a snapshot of the lowest val moment across the run — so even if overfitting begins, you don't miss the best point.
The .llm format carries the vocab embedded. If you try to resume a checkpoint on a .bin produced with a different vocab, the workbench warns you — the embedding tables won't match.
Step 3 — Generate (generate output)
Load a .llm, write a prompt, and let the tokens flow. The sampling settings determine "how the next token is picked" (sampling from the probability distribution in 07 Cross-Entropy):
| Setting | Default | What it does |
|---|---|---|
| temperature | 0.8 | Sharpens/softens the distribution. Low = safe/repetitive, high = creative/scattered. Near 0 ≈ always the most likely token. |
| top-k | 40 | Pick only from the top k tokens; discard the rest. |
| top-p | 0.95 | Pick from the smallest set whose probability sum exceeds p (nucleus). |
| min-p | 0 | Discard candidates below this fraction of the most likely token's p. |
| rep (repetition penalty) | 1.1 | Trims the probability of previously seen tokens to keep it from looping. |
| tokens | 64 | The maximum number of tokens to generate. |
Order of application: first temperature, then the repetition penalty, then the top-k → min-p → top-p filters, and sampling at the very end. A solid starting point: temp 0.7–0.9, top-p 0.9–0.95, rep 1.1.
Ready recipe — small Turkish model (~40M)
A verified, working starting point:
Model:
d_model 512
n_heads 8
n_kv_heads 2 (GQA)
n_layers 8
d_ff mult 4×
activation SwiGLU
→ ~40M parameters
Pretrain:
seq_len 512
lr 3e-4 (cosine; min_lr 10%)
steps 50k–100k
grad_accum 4
warmup 5%
optimizer AdamW (fp32) (8-bit if 100M+)
precision mixed (f16 W)
label_smooth 0.05
z_loss 1e-4
weight_decay 0.1
grad_clip 1.0A Q&A fine-tune on top:
Same model params — DO NOT CHANGE
seq_len 512
lr 5e-5 ← drop it sharply
steps 1k–3k ← keep it short
grad_accum 2
warmup 1%
min_lr 30%
weight_decay 0.01Which model costs how much? (roughly, M3 Pro)
| d_model | n_layers | ~Parameters | ~Memory | Speed |
|---|---|---|---|---|
| 256 | 6 | ~7M | 200 MB | ~12 step/s |
| 384 | 6 | ~19M | 500 MB | ~7 step/s |
| 512 | 8 | ~39M | 930 MB | ~3 step/s |
| 768 | 12 | ~110M | 2.5 GB | ~1 step/s |
(assuming vocab=16K, seq=512; the real number changes with vocab and d_ff.)
Habits that pay off
- Sanity first: before the real run, confirm the pipeline works with a small config + 100 steps.
- Trust Save Best — it holds the best point even if overfitting begins, and the file is small.
- On the pretrain → fine-tune transition, always lower the lr (1/3–1/10).
- Same tokenizer — pretrain and fine-tune must share the same vocab.
- Don't let the computer sleep — on a long run, sleep drops the WebGPU context and the run breaks.
- Watch the val loss, not just the train loss — divergence means overfitting.
Common mistakes
- ❌ Keeping the pretrain lr in fine-tune → catastrophic forgetting (the model forgets what it knew).
- ❌ New vocab + old checkpoint → the embedding won't match, the model breaks.
- ❌ Too high an lr (3e-3) → the loss blows up (NaN);
grad_clipis the last line of defense. - ❌ A 50k-step fine-tune on a small corpus → letter-by-letter memorization + babbling.
- ❌ Too large a seq_len on a small corpus → the windows are half empty, efficiency drops.
- ❌ Trying to teach new knowledge with SFT/fine-tune → hallucination. Knowledge goes in during pretrain; fine-tune only teaches "how to convey it".
Next
Curious what spins behind the buttons? Overview opens up the whole pipeline, then each kernel opens its own study — from embedding to AdamW, line by line.