diff --git a/edu/.beans/edu-32xl--write-1-what-is-a-language-model.md b/edu/.beans/edu-32xl--write-1-what-is-a-language-model.md
new file mode 100644
index 0000000..214286d
--- /dev/null
+++ b/edu/.beans/edu-32xl--write-1-what-is-a-language-model.md
@@ -0,0 +1,11 @@
+---
+# edu-32xl
+title: 'Write §1: What is a language model?'
+status: todo
+type: task
+created_at: 2026-03-13T22:01:47Z
+updated_at: 2026-03-13T22:01:47Z
+parent: edu-u2w7
+---
+
+Next-token prediction as the core task. Intuitive framing: a model that guesses what comes next, trained on raw text. GPT-1 context. No code.
diff --git a/edu/.beans/edu-7do4--write-2-character-level-tokenisation.md b/edu/.beans/edu-7do4--write-2-character-level-tokenisation.md
new file mode 100644
index 0000000..60ccbfa
--- /dev/null
+++ b/edu/.beans/edu-7do4--write-2-character-level-tokenisation.md
@@ -0,0 +1,11 @@
+---
+# edu-7do4
+title: 'Write §2: Character-level tokenisation'
+status: todo
+type: task
+created_at: 2026-03-13T22:01:48Z
+updated_at: 2026-03-13T22:01:48Z
+parent: edu-u2w7
+---
+
+Explain BPE vs byte-level vs character-level. Motivate character-level as the simplest choice for a from-scratch exercise. Show vocabulary construction.
diff --git a/edu/.beans/edu-9cnd--write-6-the-transformer-block.md b/edu/.beans/edu-9cnd--write-6-the-transformer-block.md
new file mode 100644
index 0000000..3792288
--- /dev/null
+++ b/edu/.beans/edu-9cnd--write-6-the-transformer-block.md
@@ -0,0 +1,11 @@
+---
+# edu-9cnd
+title: 'Write §6: The Transformer block'
+status: todo
+type: task
+created_at: 2026-03-13T22:01:55Z
+updated_at: 2026-03-13T22:01:55Z
+parent: edu-u2w7
+---
+
+Attention sublayer + 2-layer feed-forward network + residual connections + layer norm. Describe the GPT-1 block layout. Diagrams encouraged.
diff --git a/edu/.beans/edu-9sb7--write-14-further-reading.md b/edu/.beans/edu-9sb7--write-14-further-reading.md
new file mode 100644
index 0000000..98369ee
--- /dev/null
+++ b/edu/.beans/edu-9sb7--write-14-further-reading.md
@@ -0,0 +1,11 @@
+---
+# edu-9sb7
+title: 'Write §14: Further reading'
+status: todo
+type: task
+created_at: 2026-03-13T22:02:08Z
+updated_at: 2026-03-13T22:02:08Z
+parent: edu-u2w7
+---
+
+Curated pointers: Attention is All You Need paper, GPT-1 paper, Karpathy's nanoGPT, candle docs, The Illustrated Transformer blog post.
diff --git a/edu/.beans/edu-abdu--write-10-cross-entropy-loss-and-the-training-loop.md b/edu/.beans/edu-abdu--write-10-cross-entropy-loss-and-the-training-loop.md
new file mode 100644
index 0000000..1d7f58a
--- /dev/null
+++ b/edu/.beans/edu-abdu--write-10-cross-entropy-loss-and-the-training-loop.md
@@ -0,0 +1,11 @@
+---
+# edu-abdu
+title: 'Write §10: Cross-entropy loss and the training loop'
+status: todo
+type: task
+created_at: 2026-03-13T22:02:02Z
+updated_at: 2026-03-13T22:02:02Z
+parent: edu-u2w7
+---
+
+Next-token prediction loss: cross-entropy over the vocab. Adam optimiser. Training loop structure: batch → forward → loss → backward → step. No bells and whistles.
diff --git a/edu/.beans/edu-cw9v--write-4-embeddings-and-positional-encoding.md b/edu/.beans/edu-cw9v--write-4-embeddings-and-positional-encoding.md
new file mode 100644
index 0000000..926d407
--- /dev/null
+++ b/edu/.beans/edu-cw9v--write-4-embeddings-and-positional-encoding.md
@@ -0,0 +1,11 @@
+---
+# edu-cw9v
+title: 'Write §4: Embeddings and positional encoding'
+status: todo
+type: task
+created_at: 2026-03-13T22:01:52Z
+updated_at: 2026-03-13T22:01:52Z
+parent: edu-u2w7
+---
+
+Token embedding table (vocab_size × d_model). Learned positional embeddings (GPT-1 style). Explain why position matters for attention.
diff --git a/edu/.beans/edu-hufe--write-7-exercise-2-implement-self-attention-in-rus.md b/edu/.beans/edu-hufe--write-7-exercise-2-implement-self-attention-in-rus.md
new file mode 100644
index 0000000..78fc711
--- /dev/null
+++ b/edu/.beans/edu-hufe--write-7-exercise-2-implement-self-attention-in-rus.md
@@ -0,0 +1,11 @@
+---
+# edu-hufe
+title: 'Write §7: Exercise 2 — implement self-attention in Rust'
+status: todo
+type: task
+created_at: 2026-03-13T22:01:56Z
+updated_at: 2026-03-13T22:01:56Z
+parent: edu-u2w7
+---
+
+Implement scaled dot-product attention using candle tensors. Single head, causal mask, softmax, output projection. Reader writes the core attention function.
diff --git a/edu/.beans/edu-i76z--write-12-exercise-5-sample-from-the-model.md b/edu/.beans/edu-i76z--write-12-exercise-5-sample-from-the-model.md
new file mode 100644
index 0000000..75c23ee
--- /dev/null
+++ b/edu/.beans/edu-i76z--write-12-exercise-5-sample-from-the-model.md
@@ -0,0 +1,11 @@
+---
+# edu-i76z
+title: 'Write §12: Exercise 5 — sample from the model'
+status: todo
+type: task
+created_at: 2026-03-13T22:02:05Z
+updated_at: 2026-03-13T22:02:05Z
+parent: edu-u2w7
+---
+
+Temperature sampling and greedy decoding. Prompt the trained model and decode character-by-character. Compare output at different training checkpoints.
diff --git a/edu/.beans/edu-jybf--write-11-exercise-4-train-on-a-small-text-corpus.md b/edu/.beans/edu-jybf--write-11-exercise-4-train-on-a-small-text-corpus.md
new file mode 100644
index 0000000..c08a0ed
--- /dev/null
+++ b/edu/.beans/edu-jybf--write-11-exercise-4-train-on-a-small-text-corpus.md
@@ -0,0 +1,11 @@
+---
+# edu-jybf
+title: 'Write §11: Exercise 4 — train on a small text corpus'
+status: todo
+type: task
+created_at: 2026-03-13T22:02:04Z
+updated_at: 2026-03-13T22:02:04Z
+parent: edu-u2w7
+---
+
+Use a small public-domain text (e.g. Shakespeare's sonnets or a children's book). Show data loading, batching with random windows, training loop, loss curve. Reader runs training and watches loss fall.
diff --git a/edu/.beans/edu-kkjc--write-13-what-limits-this-model.md b/edu/.beans/edu-kkjc--write-13-what-limits-this-model.md
new file mode 100644
index 0000000..e1e536a
--- /dev/null
+++ b/edu/.beans/edu-kkjc--write-13-what-limits-this-model.md
@@ -0,0 +1,11 @@
+---
+# edu-kkjc
+title: 'Write §13: What limits this model?'
+status: todo
+type: task
+created_at: 2026-03-13T22:02:07Z
+updated_at: 2026-03-13T22:02:07Z
+parent: edu-u2w7
+---
+
+Honest assessment: context length, data size, model capacity, compute. Explain why GPT-1 was a big deal in 2018 and what GPT-2/3/4 changed. No code.
diff --git a/edu/.beans/edu-s6mr--write-5-self-attention-queries-keys-and-values.md b/edu/.beans/edu-s6mr--write-5-self-attention-queries-keys-and-values.md
new file mode 100644
index 0000000..4d83e43
--- /dev/null
+++ b/edu/.beans/edu-s6mr--write-5-self-attention-queries-keys-and-values.md
@@ -0,0 +1,11 @@
+---
+# edu-s6mr
+title: 'Write §5: Self-attention — queries, keys, and values'
+status: todo
+type: task
+created_at: 2026-03-13T22:01:53Z
+updated_at: 2026-03-13T22:01:53Z
+parent: edu-u2w7
+---
+
+Derive the scaled dot-product attention formula from first principles. Single-head attention only (GPT-1 simplicity). Causal masking explained here.
diff --git a/edu/.beans/edu-tufd--write-3-exercise-1-build-a-character-level-tokenis.md b/edu/.beans/edu-tufd--write-3-exercise-1-build-a-character-level-tokenis.md
new file mode 100644
index 0000000..d804941
--- /dev/null
+++ b/edu/.beans/edu-tufd--write-3-exercise-1-build-a-character-level-tokenis.md
@@ -0,0 +1,11 @@
+---
+# edu-tufd
+title: 'Write §3: Exercise 1 — build a character-level tokeniser in Rust'
+status: todo
+type: task
+created_at: 2026-03-13T22:01:50Z
+updated_at: 2026-03-13T22:01:50Z
+parent: edu-u2w7
+---
+
+Implement encode/decode over a fixed character vocabulary. Read a text file, build vocab, encode to integers, decode back. No external crates.
diff --git a/edu/.beans/edu-u2w7--edu-write-chapter-on-creating-and-training-a-simpl.md b/edu/.beans/edu-u2w7--edu-write-chapter-on-creating-and-training-a-simpl.md
index 3d90c9e..91160bc 100644
--- a/edu/.beans/edu-u2w7--edu-write-chapter-on-creating-and-training-a-simpl.md
+++ b/edu/.beans/edu-u2w7--edu-write-chapter-on-creating-and-training-a-simpl.md
@@ -1,11 +1,11 @@
 ---
 # edu-u2w7
 title: 'edu: write chapter on creating and training a simple LLM'
-status: todo
-type: task
+status: in-progress
+type: feature
 priority: low
 created_at: 2026-03-10T23:30:00Z
-updated_at: 2026-03-10T23:30:00Z
+updated_at: 2026-03-13T22:01:43Z
 ---
 
 ## Background
diff --git a/edu/.beans/edu-ujs5--write-9-exercise-3-define-the-gpt-1-style-model-in.md b/edu/.beans/edu-ujs5--write-9-exercise-3-define-the-gpt-1-style-model-in.md
new file mode 100644
index 0000000..26b5a71
--- /dev/null
+++ b/edu/.beans/edu-ujs5--write-9-exercise-3-define-the-gpt-1-style-model-in.md
@@ -0,0 +1,11 @@
+---
+# edu-ujs5
+title: 'Write §9: Exercise 3 — define the GPT-1-style model in candle'
+status: todo
+type: task
+created_at: 2026-03-13T22:02:00Z
+updated_at: 2026-03-13T22:02:00Z
+parent: edu-u2w7
+---
+
+Full model struct in candle: embedding, N transformer blocks, layer norm, unembedding. Hyperparams close to GPT-1 mini (e.g. 2–4 layers, d_model=128). Reader assembles the forward pass.
diff --git a/edu/.beans/edu-vqxk--write-8-a-decoder-only-lm-stacking-blocks-and-the.md b/edu/.beans/edu-vqxk--write-8-a-decoder-only-lm-stacking-blocks-and-the.md
new file mode 100644
index 0000000..24a8b2a
--- /dev/null
+++ b/edu/.beans/edu-vqxk--write-8-a-decoder-only-lm-stacking-blocks-and-the.md
@@ -0,0 +1,11 @@
+---
+# edu-vqxk
+title: 'Write §8: A decoder-only LM — stacking blocks and the causal mask'
+status: todo
+type: task
+created_at: 2026-03-13T22:01:58Z
+updated_at: 2026-03-13T22:01:58Z
+parent: edu-u2w7
+---
+
+Explain how N transformer blocks are stacked. Causal mask ensures each position only attends to past tokens. Tie weights to the unembedding matrix (GPT-1 style). Final linear + softmax for logits.
diff --git a/edu/src/SUMMARY.md b/edu/src/SUMMARY.md
index 5137c70..5d5512c 100644
--- a/edu/src/SUMMARY.md
+++ b/edu/src/SUMMARY.md
@@ -25,3 +25,4 @@
 # Machine Learning
 
 - [Training a Game AI Through Self-Play](ml-self-play.md)
+- [Building a Simple LLM from Scratch](llm-from-scratch.md)
diff --git a/edu/src/llm-from-scratch.md b/edu/src/llm-from-scratch.md
new file mode 100644
index 0000000..4e97377
--- /dev/null
+++ b/edu/src/llm-from-scratch.md
@@ -0,0 +1,79 @@
+# Building a Simple LLM from Scratch
+
+A hands-on course building a small GPT-1-style language model in Rust — from raw text to a trained, sampling transformer.
+
+---
+
+## Part 1 — Language Modeling Basics
+
+### §1 What is a Language Model?
+
+🚧 *To be written — see [edu-32xl]*
+
+### §2 Character-Level Tokenisation
+
+🚧 *To be written — see [edu-7do4]*
+
+### §3 Exercise 1: Build a Character-Level Tokeniser in Rust
+
+🚧 *To be written — see [edu-tufd]*
+
+---
+
+## Part 2 — The Transformer Architecture
+
+### §4 Embeddings and Positional Encoding
+
+🚧 *To be written — see [edu-cw9v]*
+
+### §5 Self-Attention: Queries, Keys, and Values
+
+🚧 *To be written — see [edu-s6mr]*
+
+### §6 The Transformer Block
+
+🚧 *To be written — see [edu-9cnd]*
+
+### §7 Exercise 2: Implement Self-Attention in Rust
+
+🚧 *To be written — see [edu-hufe]*
+
+---
+
+## Part 3 — Assembling the Model
+
+### §8 A Decoder-Only LM: Stacking Blocks and the Causal Mask
+
+🚧 *To be written — see [edu-vqxk]*
+
+### §9 Exercise 3: Define the GPT-1-Style Model in `candle`
+
+🚧 *To be written — see [edu-ujs5]*
+
+---
+
+## Part 4 — Training
+
+### §10 Cross-Entropy Loss and the Training Loop
+
+🚧 *To be written — see [edu-abdu]*
+
+### §11 Exercise 4: Train on a Small Text Corpus
+
+🚧 *To be written — see [edu-jybf]*
+
+### §12 Exercise 5: Sample from the Model
+
+🚧 *To be written — see [edu-i76z]*
+
+---
+
+## Part 5 — Reflection
+
+### §13 What Limits This Model?
+
+🚧 *To be written — see [edu-kkjc]*
+
+### §14 Further Reading
+
+🚧 *To be written — see [edu-9sb7]*