docs(edu): outline simple LLM chapter and create section tickets [edu-u2w7]

Adds llm-from-scratch.md stub with 14 sections (GPT-1 style: character
tokenisation, self-attention, transformer block, candle model, training
loop, sampling). Creates beans edu-32xl through edu-9sb7 for each section.
main
Elijah Voigt 3 months ago
parent 818444962c
commit 05ac10f5e3

@ -0,0 +1,11 @@
---
# edu-32xl
title: 'Write §1: What is a language model?'
status: todo
type: task
created_at: 2026-03-13T22:01:47Z
updated_at: 2026-03-13T22:01:47Z
parent: edu-u2w7
---
Next-token prediction as the core task. Intuitive framing: a model that guesses what comes next, trained on raw text. GPT-1 context. No code.

@ -0,0 +1,11 @@
---
# edu-7do4
title: 'Write §2: Character-level tokenisation'
status: todo
type: task
created_at: 2026-03-13T22:01:48Z
updated_at: 2026-03-13T22:01:48Z
parent: edu-u2w7
---
Explain BPE vs byte-level vs character-level. Motivate character-level as the simplest choice for a from-scratch exercise. Show vocabulary construction.

@ -0,0 +1,11 @@
---
# edu-9cnd
title: 'Write §6: The Transformer block'
status: todo
type: task
created_at: 2026-03-13T22:01:55Z
updated_at: 2026-03-13T22:01:55Z
parent: edu-u2w7
---
Attention sublayer + 2-layer feed-forward network + residual connections + layer norm. Describe the GPT-1 block layout. Diagrams encouraged.
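A rough sketch of the "residual connection + layer norm" wrapper in plain Rust, assuming the post-norm ordering used by GPT-1, `LayerNorm(x + sublayer(x))`. The names `layer_norm` and `post_ln_sublayer` are hypothetical, and the learned gain and bias of a real layer norm are left out for brevity.

```rust
/// Normalise one vector to zero mean and (approximately) unit variance.
/// A real layer norm also has a learned gain and bias, omitted in this sketch.
pub fn layer_norm(x: &[f32], eps: f32) -> Vec<f32> {
    let n = x.len() as f32;
    let mean = x.iter().sum::<f32>() / n;
    let var = x.iter().map(|v| (v - mean).powi(2)).sum::<f32>() / n;
    let denom = (var + eps).sqrt();
    x.iter().map(|v| (v - mean) / denom).collect()
}

/// Post-norm residual wrapper: LayerNorm(x + sublayer(x)).
/// `sublayer` stands in for either the attention or the feed-forward sublayer.
pub fn post_ln_sublayer<F>(x: &[f32], sublayer: F, eps: f32) -> Vec<f32>
where
    F: Fn(&[f32]) -> Vec<f32>,
{
    let y = sublayer(x);
    // Residual connection: add the sublayer output back onto its input.
    let summed: Vec<f32> = x.iter().zip(&y).map(|(a, b)| a + b).collect();
    layer_norm(&summed, eps)
}
```

Under this assumption a block is two such wrappers in sequence, one around attention and one around the feed-forward network.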

@ -0,0 +1,11 @@
---
# edu-9sb7
title: 'Write §14: Further reading'
status: todo
type: task
created_at: 2026-03-13T22:02:08Z
updated_at: 2026-03-13T22:02:08Z
parent: edu-u2w7
---
Curated pointers: the "Attention Is All You Need" paper, the GPT-1 paper, Karpathy's nanoGPT, the candle docs, and The Illustrated Transformer blog post.

@ -0,0 +1,11 @@
---
# edu-abdu
title: 'Write §10: Cross-entropy loss and the training loop'
status: todo
type: task
created_at: 2026-03-13T22:02:02Z
updated_at: 2026-03-13T22:02:02Z
parent: edu-u2w7
---
Next-token prediction loss: cross-entropy over the vocab. Adam optimiser. Training loop structure: batch → forward → loss → backward → step. No bells and whistles.
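A minimal sketch of the loss in plain Rust, assuming logits arrive as one `Vec<f32>` per position; the function names are illustrative and not taken from candle. It uses the log-sum-exp form so that large logits do not overflow.

```rust
/// Cross-entropy for one position: -log(softmax(logits)[target]),
/// computed as log_sum_exp(logits) - logits[target].
pub fn cross_entropy(logits: &[f32], target: usize) -> f32 {
    // Subtracting the maximum keeps exp() from overflowing.
    let max = logits.iter().copied().fold(f32::NEG_INFINITY, f32::max);
    let sum_exp: f32 = logits.iter().map(|&x| (x - max).exp()).sum::<f32>();
    let log_sum_exp = max + sum_exp.ln();
    log_sum_exp - logits[target]
}

/// Mean cross-entropy over a batch of positions.
pub fn mean_cross_entropy(logits: &[Vec<f32>], targets: &[usize]) -> f32 {
    assert_eq!(logits.len(), targets.len());
    let total: f32 = logits
        .iter()
        .zip(targets)
        .map(|(l, &t)| cross_entropy(l, t))
        .sum::<f32>();
    total / targets.len() as f32
}
```

A useful sanity check when training starts: a model that predicts uniformly over the vocabulary has a loss of ln(vocab_size).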

@ -0,0 +1,11 @@
---
# edu-cw9v
title: 'Write §4: Embeddings and positional encoding'
status: todo
type: task
created_at: 2026-03-13T22:01:52Z
updated_at: 2026-03-13T22:01:52Z
parent: edu-u2w7
---
Token embedding table (vocab_size × d_model). Learned positional embeddings (GPT-1 style). Explain why position matters for attention.
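A small sketch of the lookup in plain Rust, assuming both tables are stored as nested `Vec<f32>` rows; the `Embeddings` struct and its field names are hypothetical. The input to the first block is the token row plus the position row, so the same character at two positions yields two different vectors.

```rust
/// Two learned tables, here as plain nested vectors.
pub struct Embeddings {
    /// vocab_size x d_model
    pub token: Vec<Vec<f32>>,
    /// max_context_len x d_model
    pub position: Vec<Vec<f32>>,
}

impl Embeddings {
    /// Row `pos` of the result is token[ids[pos]] + position[pos].
    /// Panics if an id or position is out of range (acceptable for a sketch).
    pub fn forward(&self, ids: &[usize]) -> Vec<Vec<f32>> {
        ids.iter()
            .enumerate()
            .map(|(pos, &id)| {
                self.token[id]
                    .iter()
                    .zip(&self.position[pos])
                    .map(|(t, p)| t + p)
                    .collect::<Vec<f32>>()
            })
            .collect()
    }
}
```

Without the position table, attention would see the sequence as an unordered set, because dot products between token vectors carry no information about where each token sits.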

@ -0,0 +1,11 @@
---
# edu-hufe
title: 'Write §7: Exercise 2 — implement self-attention in Rust'
status: todo
type: task
created_at: 2026-03-13T22:01:56Z
updated_at: 2026-03-13T22:01:56Z
parent: edu-u2w7
---
Implement scaled dot-product attention using candle tensors. Single head, causal mask, softmax, output projection. Reader writes the core attention function.
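For orientation, a dependency-free sketch of the core computation on nested `Vec<f32>` rather than candle tensors; the exercise itself targets candle, so treat this only as a reference for the arithmetic. The Q/K/V and output projections are assumed to have been applied elsewhere, and `causal_attention` is a hypothetical name. Instead of adding negative infinity to masked scores, it only scores keys at positions `0..=i`, which gives the same result.

```rust
/// Numerically stable softmax over one row, in place.
fn softmax_in_place(row: &mut [f32]) {
    let max = row.iter().copied().fold(f32::NEG_INFINITY, f32::max);
    let mut sum = 0.0f32;
    for x in row.iter_mut() {
        *x = (*x - max).exp();
        sum += *x;
    }
    for x in row.iter_mut() {
        *x /= sum;
    }
}

/// Single-head causal attention. `q` and `k` are seq_len x d_k, `v` is seq_len x d_v.
pub fn causal_attention(q: &[Vec<f32>], k: &[Vec<f32>], v: &[Vec<f32>]) -> Vec<Vec<f32>> {
    if q.is_empty() {
        return Vec::new();
    }
    let scale = 1.0 / (q[0].len() as f32).sqrt();
    let mut out = Vec::with_capacity(q.len());
    for (i, qi) in q.iter().enumerate() {
        // Causal mask: position i only scores keys 0..=i.
        let mut weights: Vec<f32> = k[..=i]
            .iter()
            .map(|kj| qi.iter().zip(kj).map(|(a, b)| a * b).sum::<f32>() * scale)
            .collect();
        softmax_in_place(&mut weights);
        // Output row = weighted sum of the visible value rows.
        let mut row = vec![0.0f32; v[0].len()];
        for (w, vj) in weights.iter().zip(v) {
            for (o, x) in row.iter_mut().zip(vj) {
                *o += w * x;
            }
        }
        out.push(row);
    }
    out
}
```

One property worth testing in the exercise: changing a later token's value must leave every earlier output row unchanged.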

@ -0,0 +1,11 @@
---
# edu-i76z
title: 'Write §12: Exercise 5 — sample from the model'
status: todo
type: task
created_at: 2026-03-13T22:02:05Z
updated_at: 2026-03-13T22:02:05Z
parent: edu-u2w7
---
Temperature sampling and greedy decoding. Prompt the trained model and decode character-by-character. Compare output at different training checkpoints.
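A sketch of both decoding strategies in plain Rust. To stay free of external crates it assumes the caller supplies a uniform random number `u` in [0, 1); how that number is produced is left open, and both function names are illustrative.

```rust
/// Greedy decoding: the index of the largest logit (first one wins on ties).
pub fn greedy(logits: &[f32]) -> usize {
    assert!(!logits.is_empty());
    let mut best = 0;
    for (i, &x) in logits.iter().enumerate() {
        if x > logits[best] {
            best = i;
        }
    }
    best
}

/// Temperature sampling: softmax(logits / temperature), then pick the first
/// index whose cumulative probability exceeds `u`.
pub fn sample_with_temperature(logits: &[f32], temperature: f32, u: f32) -> usize {
    assert!(!logits.is_empty());
    assert!(temperature > 0.0);
    let scaled: Vec<f32> = logits.iter().map(|&x| x / temperature).collect();
    let max = scaled.iter().copied().fold(f32::NEG_INFINITY, f32::max);
    let exps: Vec<f32> = scaled.iter().map(|&x| (x - max).exp()).collect();
    let total: f32 = exps.iter().sum::<f32>();
    let mut cumulative = 0.0f32;
    for (i, e) in exps.iter().enumerate() {
        cumulative += e / total;
        if u < cumulative {
            return i;
        }
    }
    // Rounding can leave the cumulative sum slightly below 1.0.
    exps.len() - 1
}
```

As the temperature approaches zero the distribution sharpens toward the greedy choice; higher temperatures flatten it and make the output more varied.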

@ -0,0 +1,11 @@
---
# edu-jybf
title: 'Write §11: Exercise 4 — train on a small text corpus'
status: todo
type: task
created_at: 2026-03-13T22:02:04Z
updated_at: 2026-03-13T22:02:04Z
parent: edu-u2w7
---
Use a small public-domain text (e.g. Shakespeare's sonnets or a children's book). Show data loading, batching with random windows, training loop, loss curve. Reader runs training and watches loss fall.
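One possible shape for "batching with random windows", sketched in plain Rust. Each example is a window of `context_len` token ids, and its target is the same window shifted one position to the right. The tiny `XorShift64` generator is only a stand-in so the sketch needs no external crates (seed it with a non-zero value); every name here is hypothetical.

```rust
/// Input/target pair starting at `start`: the target is the input shifted by one.
/// Returns None if the window (plus one extra target token) does not fit.
pub fn window(tokens: &[usize], start: usize, context_len: usize) -> Option<(&[usize], &[usize])> {
    let end = start.checked_add(context_len)?;
    if end >= tokens.len() {
        return None;
    }
    Some((&tokens[start..end], &tokens[start + 1..end + 1]))
}

/// Minimal xorshift generator, a stand-in for a real RNG. The seed must be non-zero.
pub struct XorShift64(pub u64);

impl XorShift64 {
    pub fn next_u64(&mut self) -> u64 {
        let mut x = self.0;
        x ^= x << 13;
        x ^= x >> 7;
        x ^= x << 17;
        self.0 = x;
        x
    }
}

/// Draw `batch_size` windows at random start offsets.
pub fn random_batch<'a>(
    tokens: &'a [usize],
    batch_size: usize,
    context_len: usize,
    rng: &mut XorShift64,
) -> Vec<(&'a [usize], &'a [usize])> {
    assert!(tokens.len() > context_len);
    // Valid starts are 0..n_starts; the modulo bias is negligible for a sketch.
    let n_starts = (tokens.len() - context_len) as u64;
    (0..batch_size)
        .map(|_| {
            let start = (rng.next_u64() % n_starts) as usize;
            window(tokens, start, context_len).expect("start is in range by construction")
        })
        .collect()
}
```

Because every position in a window has its own next-token target, one window of length `context_len` contributes `context_len` training examples.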

@ -0,0 +1,11 @@
---
# edu-kkjc
title: 'Write §13: What limits this model?'
status: todo
type: task
created_at: 2026-03-13T22:02:07Z
updated_at: 2026-03-13T22:02:07Z
parent: edu-u2w7
---
Honest assessment: context length, data size, model capacity, compute. Explain why GPT-1 was a big deal in 2018 and what GPT-2/3/4 changed. No code.

@ -0,0 +1,11 @@
---
# edu-s6mr
title: 'Write §5: Self-attention — queries, keys, and values'
status: todo
type: task
created_at: 2026-03-13T22:01:53Z
updated_at: 2026-03-13T22:01:53Z
parent: edu-u2w7
---
Derive the scaled dot-product attention formula from first principles. Single-head attention only, for simplicity (GPT-1 itself used multi-head attention). Causal masking explained here.
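For reference, the formula the derivation is expected to arrive at, in the notation of the "Attention Is All You Need" paper. The additive mask M is one common way to write causal masking and is an assumption here, not something this ticket prescribes; X is the block input and W_Q, W_K, W_V are learned projections.

```latex
Q = X W_Q, \qquad K = X W_K, \qquad V = X W_V

\mathrm{Attention}(Q, K, V)
  = \mathrm{softmax}\!\left( \frac{Q K^{\top}}{\sqrt{d_k}} + M \right) V,
\qquad
M_{ij} =
\begin{cases}
  0       & j \le i \\
  -\infty & j > i
\end{cases}
```

The 1/sqrt(d_k) factor exists because, if the components of a query and a key are independent with zero mean and unit variance, their dot product has variance d_k; dividing by sqrt(d_k) brings the scores back to unit variance so the softmax does not saturate.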

@ -0,0 +1,11 @@
---
# edu-tufd
title: 'Write §3: Exercise 1 — build a character-level tokeniser in Rust'
status: todo
type: task
created_at: 2026-03-13T22:01:50Z
updated_at: 2026-03-13T22:01:50Z
parent: edu-u2w7
---
Implement encode/decode over a fixed character vocabulary. Read a text file, build vocab, encode to integers, decode back. No external crates.
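A possible starting point using only the standard library. The `CharTokeniser` type and its method names are illustrative, and the choice to return `None` for unknown characters or ids is an assumption; the exercise may prefer a different error strategy.

```rust
use std::collections::{BTreeSet, HashMap};

/// Character-level tokeniser: each distinct `char` in the corpus gets one integer id.
pub struct CharTokeniser {
    id_to_char: Vec<char>,
    char_to_id: HashMap<char, usize>,
}

impl CharTokeniser {
    /// Build the vocabulary from a corpus. The BTreeSet keeps characters sorted,
    /// so the ids are deterministic for a given text.
    pub fn from_text(text: &str) -> Self {
        let id_to_char: Vec<char> = text
            .chars()
            .collect::<BTreeSet<char>>()
            .into_iter()
            .collect();
        let char_to_id: HashMap<char, usize> = id_to_char
            .iter()
            .enumerate()
            .map(|(i, &c)| (c, i))
            .collect();
        Self { id_to_char, char_to_id }
    }

    pub fn vocab_size(&self) -> usize {
        self.id_to_char.len()
    }

    /// Returns None if the text contains a character outside the vocabulary.
    pub fn encode(&self, text: &str) -> Option<Vec<usize>> {
        text.chars()
            .map(|c| self.char_to_id.get(&c).copied())
            .collect()
    }

    /// Returns None if any id is out of range.
    pub fn decode(&self, ids: &[usize]) -> Option<String> {
        ids.iter()
            .map(|&i| self.id_to_char.get(i).copied())
            .collect()
    }
}
```

To build the vocabulary from a file, the corpus can be loaded with `std::fs::read_to_string` and passed to `from_text`.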

@ -1,11 +1,11 @@
---
# edu-u2w7
title: 'edu: write chapter on creating and training a simple LLM'
-status: todo
+status: in-progress
-type: task
+type: feature
priority: low
created_at: 2026-03-10T23:30:00Z
-updated_at: 2026-03-10T23:30:00Z
+updated_at: 2026-03-13T22:01:43Z
---
## Background

@ -0,0 +1,11 @@
---
# edu-ujs5
title: 'Write §9: Exercise 3 — define the GPT-1-style model in candle'
status: todo
type: task
created_at: 2026-03-13T22:02:00Z
updated_at: 2026-03-13T22:02:00Z
parent: edu-u2w7
---
Full model struct in candle: embedding, N transformer blocks, layer norm, unembedding. Hyperparameters for a scaled-down, GPT-1-style "mini" model (e.g. 4 layers, d_model=128; GPT-1 itself used 12 layers and d_model=768). Reader assembles the forward pass.

@ -0,0 +1,11 @@
---
# edu-vqxk
title: 'Write §8: A decoder-only LM — stacking blocks and the causal mask'
status: todo
type: task
created_at: 2026-03-13T22:01:58Z
updated_at: 2026-03-13T22:01:58Z
parent: edu-u2w7
---
Explain how N transformer blocks are stacked. The causal mask ensures each position attends only to itself and earlier positions. Tie the token-embedding weights to the unembedding matrix (GPT-1 style). A final linear layer produces the logits; softmax turns them into next-token probabilities.
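A sketch of what weight tying means at the output, in plain Rust with hypothetical names. The logit for vocabulary entry `v` is the dot product of the final hidden vector with row `v` of the same token-embedding table used at the input, so no separate unembedding matrix is stored.

```rust
/// Weight-tied unembedding: logits[v] = hidden . embedding[v],
/// where `embedding` is the vocab_size x d_model token-embedding table.
pub fn tied_logits(hidden: &[f32], embedding: &[Vec<f32>]) -> Vec<f32> {
    embedding
        .iter()
        .map(|row| row.iter().zip(hidden).map(|(e, h)| e * h).sum::<f32>())
        .collect()
}

/// Softmax turns logits into next-token probabilities.
pub fn softmax(logits: &[f32]) -> Vec<f32> {
    // Subtracting the maximum keeps exp() from overflowing.
    let max = logits.iter().copied().fold(f32::NEG_INFINITY, f32::max);
    let exps: Vec<f32> = logits.iter().map(|&x| (x - max).exp()).collect();
    let total: f32 = exps.iter().sum::<f32>();
    exps.iter().map(|e| e / total).collect()
}
```

During training the softmax is usually folded into the cross-entropy loss, so the model's forward pass only needs to return logits.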

@ -25,3 +25,4 @@
# Machine Learning
- [Training a Game AI Through Self-Play](ml-self-play.md)
- [Building a Simple LLM from Scratch](llm-from-scratch.md)

@ -0,0 +1,79 @@
# Building a Simple LLM from Scratch
A hands-on course building a small GPT-1-style language model in Rust — from raw text to a trained, sampling transformer.
---
## Part 1 — Language Modeling Basics
### §1 What is a Language Model?
🚧 *To be written — see [edu-32xl]*
### §2 Character-Level Tokenisation
🚧 *To be written — see [edu-7do4]*
### §3 Exercise 1: Build a Character-Level Tokeniser in Rust
🚧 *To be written — see [edu-tufd]*
---
## Part 2 — The Transformer Architecture
### §4 Embeddings and Positional Encoding
🚧 *To be written — see [edu-cw9v]*
### §5 Self-Attention: Queries, Keys, and Values
🚧 *To be written — see [edu-s6mr]*
### §6 The Transformer Block
🚧 *To be written — see [edu-9cnd]*
### §7 Exercise 2: Implement Self-Attention in Rust
🚧 *To be written — see [edu-hufe]*
---
## Part 3 — Assembling the Model
### §8 A Decoder-Only LM: Stacking Blocks and the Causal Mask
🚧 *To be written — see [edu-vqxk]*
### §9 Exercise 3: Define the GPT-1-Style Model in `candle`
🚧 *To be written — see [edu-ujs5]*
---
## Part 4 — Training
### §10 Cross-Entropy Loss and the Training Loop
🚧 *To be written — see [edu-abdu]*
### §11 Exercise 4: Train on a Small Text Corpus
🚧 *To be written — see [edu-jybf]*
### §12 Exercise 5: Sample from the Model
🚧 *To be written — see [edu-i76z]*
---
## Part 5 — Reflection
### §13 What Limits This Model?
🚧 *To be written — see [edu-kkjc]*
### §14 Further Reading
🚧 *To be written — see [edu-9sb7]*