docs(edu): outline simple LLM chapter and create section tickets [edu-u2w7]
Adds llm-from-scratch.md stub with 14 sections (GPT-1 style: character tokenisation, self-attention, transformer block, candle model, training loop, sampling). Creates beans edu-32xl through edu-9sb7, one for each section.
main
parent 818444962c
commit 05ac10f5e3
@@ -0,0 +1,11 @@
---
# edu-32xl
title: 'Write §1: What is a language model?'
status: todo
type: task
created_at: 2026-03-13T22:01:47Z
updated_at: 2026-03-13T22:01:47Z
parent: edu-u2w7
---

Next-token prediction as the core task. Intuitive framing: a model that guesses what comes next, trained on raw text. GPT-1 context. No code.
@@ -0,0 +1,11 @@
---
# edu-7do4
title: 'Write §2: Character-level tokenisation'
status: todo
type: task
created_at: 2026-03-13T22:01:48Z
updated_at: 2026-03-13T22:01:48Z
parent: edu-u2w7
---

Explain BPE vs byte-level vs character-level. Motivate character-level as the simplest choice for a from-scratch exercise. Show vocabulary construction.
@@ -0,0 +1,11 @@
---
# edu-9cnd
title: 'Write §6: The Transformer block'
status: todo
type: task
created_at: 2026-03-13T22:01:55Z
updated_at: 2026-03-13T22:01:55Z
parent: edu-u2w7
---

Attention sublayer + 2-layer feed-forward network + residual connections + layer norm. Describe the GPT-1 block layout. Diagrams encouraged.
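For reference while drafting this section, the block layout can be written as two residual sublayers. This is a sketch that assumes the post-layer-norm arrangement GPT-1 inherited from the original Transformer decoder and a GELU activation in the feed-forward network; the symbols $x$, $h$, $y$, $W_1$, $W_2$, $b_1$, $b_2$ are notation introduced here, not taken from the ticket.

```latex
\begin{aligned}
h &= \mathrm{LayerNorm}\bigl(x + \mathrm{Attention}(x)\bigr) \\
y &= \mathrm{LayerNorm}\bigl(h + \mathrm{FFN}(h)\bigr) \\
\mathrm{FFN}(h) &= W_2\,\mathrm{GELU}(W_1 h + b_1) + b_2
\end{aligned}
```

Each sublayer adds its output back onto its input before normalising, which is the residual connection the description refers to.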
@@ -0,0 +1,11 @@
---
# edu-9sb7
title: 'Write §14: Further reading'
status: todo
type: task
created_at: 2026-03-13T22:02:08Z
updated_at: 2026-03-13T22:02:08Z
parent: edu-u2w7
---

Curated pointers: the "Attention Is All You Need" paper, the GPT-1 paper, Karpathy's nanoGPT, the candle docs, and The Illustrated Transformer blog post.
@@ -0,0 +1,11 @@
---
# edu-abdu
title: 'Write §10: Cross-entropy loss and the training loop'
status: todo
type: task
created_at: 2026-03-13T22:02:02Z
updated_at: 2026-03-13T22:02:02Z
parent: edu-u2w7
---

Next-token prediction loss: cross-entropy over the vocab. Adam optimiser. Training loop structure: batch → forward → loss → backward → step. No bells and whistles.
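As a starting point for the loss part of this section, here is a minimal plain-Rust sketch of cross-entropy over the vocabulary. It works on `f32` slices rather than candle tensors, and the function names `cross_entropy` and `mean_cross_entropy` are illustrative, not part of any library. The backward pass and optimiser step are not shown.

```rust
/// Cross-entropy for one position: -log softmax(logits)[target].
/// Subtracting the max logit first (log-sum-exp trick) avoids overflow in exp.
pub fn cross_entropy(logits: &[f32], target: usize) -> f32 {
    let max = logits.iter().copied().fold(f32::NEG_INFINITY, f32::max);
    let sum_exp: f32 = logits.iter().map(|&z| (z - max).exp()).sum();
    let log_sum_exp = max + sum_exp.ln();
    log_sum_exp - logits[target]
}

/// Mean loss over a batch of positions, one logit row and one target id each.
pub fn mean_cross_entropy(logits: &[Vec<f32>], targets: &[usize]) -> f32 {
    assert_eq!(logits.len(), targets.len());
    let total: f32 = logits
        .iter()
        .zip(targets)
        .map(|(row, &t)| cross_entropy(row, t))
        .sum();
    total / targets.len() as f32
}
```

A useful sanity check for the chapter: with uniform logits over a vocabulary of size V the loss is ln V, which is roughly what an untrained model should report at step zero.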
@@ -0,0 +1,11 @@
---
# edu-hufe
title: 'Write §7: Exercise 2 — implement self-attention in Rust'
status: todo
type: task
created_at: 2026-03-13T22:01:56Z
updated_at: 2026-03-13T22:01:56Z
parent: edu-u2w7
---

Implement scaled dot-product attention using candle tensors. Single head, causal mask, softmax, output projection. Reader writes the core attention function.
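A reference implementation on plain vectors may help when checking the reader's candle version. The sketch below deliberately avoids candle: it uses `Vec<Vec<f32>>` with one row per position, omits the output projection, and the name `causal_attention` is hypothetical.

```rust
/// Single-head causal scaled dot-product attention on plain vectors.
/// `q`, `k`, `v` hold one row per position; position t attends to 0..=t.
pub fn causal_attention(q: &[Vec<f32>], k: &[Vec<f32>], v: &[Vec<f32>]) -> Vec<Vec<f32>> {
    assert!(!q.is_empty() && q.len() == k.len() && k.len() == v.len());
    let scale = 1.0 / (q[0].len() as f32).sqrt();
    let mut out = Vec::with_capacity(q.len());
    for t in 0..q.len() {
        // Only score positions 0..=t: skipping the future is the causal mask.
        let scores: Vec<f32> = (0..=t)
            .map(|s| q[t].iter().zip(&k[s]).map(|(a, b)| a * b).sum::<f32>() * scale)
            .collect();
        // Numerically stable softmax over the visible scores.
        let max = scores.iter().copied().fold(f32::NEG_INFINITY, f32::max);
        let exps: Vec<f32> = scores.iter().map(|&x| (x - max).exp()).collect();
        let sum: f32 = exps.iter().sum();
        // Output row is the attention-weighted sum of value rows.
        let mut row = vec![0.0f32; v[0].len()];
        for (s, &e) in exps.iter().enumerate() {
            for (o, &x) in row.iter_mut().zip(&v[s]) {
                *o += (e / sum) * x;
            }
        }
        out.push(row);
    }
    out
}
```

Looping over `0..=t` is equivalent to adding negative infinity to future scores before the softmax; a tensor version would normally build an explicit mask instead.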
@@ -0,0 +1,11 @@
---
# edu-i76z
title: 'Write §12: Exercise 5 — sample from the model'
status: todo
type: task
created_at: 2026-03-13T22:02:05Z
updated_at: 2026-03-13T22:02:05Z
parent: edu-u2w7
---

Temperature sampling and greedy decoding. Prompt the trained model and decode character-by-character. Compare output at different training checkpoints.
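The two decoding strategies named here can be sketched without any tensor library. The functions below are illustrative (names are hypothetical); to stay free of external crates, the sampler takes a uniform draw `u` in [0, 1) as an argument instead of owning a random number generator.

```rust
/// Softmax with temperature: values below 1.0 sharpen the distribution,
/// values above 1.0 flatten it.
pub fn softmax_with_temperature(logits: &[f32], temperature: f32) -> Vec<f32> {
    assert!(temperature > 0.0, "temperature must be positive");
    let scaled: Vec<f32> = logits.iter().map(|&z| z / temperature).collect();
    let max = scaled.iter().copied().fold(f32::NEG_INFINITY, f32::max);
    let exps: Vec<f32> = scaled.iter().map(|&z| (z - max).exp()).collect();
    let sum: f32 = exps.iter().sum();
    exps.iter().map(|&e| e / sum).collect()
}

/// Greedy decoding: index of the largest logit (the first wins on ties).
pub fn greedy(logits: &[f32]) -> usize {
    let mut best = 0;
    for (i, &z) in logits.iter().enumerate() {
        if z > logits[best] {
            best = i;
        }
    }
    best
}

/// Inverse-CDF sampling: walk the cumulative distribution until it passes `u`.
pub fn sample_with_uniform(probs: &[f32], u: f32) -> usize {
    let mut cumulative = 0.0f32;
    for (i, &p) in probs.iter().enumerate() {
        cumulative += p;
        if u < cumulative {
            return i;
        }
    }
    // Rounding can leave the total just below 1.0; fall back to the last index.
    probs.len() - 1
}
```

In the exercise, `u` would come from whatever random source the chapter settles on, and the chosen index would be decoded back to a character and appended to the prompt before the next forward pass.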
@@ -0,0 +1,11 @@
---
# edu-jybf
title: 'Write §11: Exercise 4 — train on a small text corpus'
status: todo
type: task
created_at: 2026-03-13T22:02:04Z
updated_at: 2026-03-13T22:02:04Z
parent: edu-u2w7
---

Use a small public-domain text (e.g. Shakespeare's sonnets or a children's book). Show data loading, batching with random windows, training loop, loss curve. Reader runs training and watches loss fall.
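Batching with random windows is the one step here that is easy to get off by one, so a small sketch may be worth including. It assumes the corpus is already encoded as token ids. `XorShift64` is a tiny stand-in generator used only to keep the example free of external crates (its seed must be non-zero), and all names are hypothetical.

```rust
/// Minimal xorshift64 generator; a stand-in for a real RNG crate.
pub struct XorShift64(pub u64);

impl XorShift64 {
    pub fn next_u64(&mut self) -> u64 {
        let mut x = self.0;
        x ^= x << 13;
        x ^= x >> 7;
        x ^= x << 17;
        self.0 = x;
        x
    }
}

/// One training example: `block_len` input tokens plus the same window
/// shifted right by one, so targets[i] is the token that follows inputs[i].
pub fn window_at(data: &[usize], start: usize, block_len: usize) -> (Vec<usize>, Vec<usize>) {
    let inputs = data[start..start + block_len].to_vec();
    let targets = data[start + 1..start + block_len + 1].to_vec();
    (inputs, targets)
}

/// Draw `batch_size` windows at random start offsets.
pub fn random_batch(
    data: &[usize],
    block_len: usize,
    batch_size: usize,
    rng: &mut XorShift64,
) -> Vec<(Vec<usize>, Vec<usize>)> {
    assert!(data.len() > block_len, "corpus must be longer than one window");
    // Valid starts are 0..num_starts so the shifted target window stays in bounds.
    let num_starts = (data.len() - block_len) as u64;
    (0..batch_size)
        .map(|_| window_at(data, (rng.next_u64() % num_starts) as usize, block_len))
        .collect()
}
```

Sampling windows at random offsets, rather than walking the text in order, means every batch mixes different parts of the corpus, which suits a small text that will be seen many times.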
@@ -0,0 +1,11 @@
---
# edu-kkjc
title: 'Write §13: What limits this model?'
status: todo
type: task
created_at: 2026-03-13T22:02:07Z
updated_at: 2026-03-13T22:02:07Z
parent: edu-u2w7
---

Honest assessment: context length, data size, model capacity, compute. Explain why GPT-1 was a big deal in 2018 and what GPT-2/3/4 changed. No code.
@@ -0,0 +1,11 @@
---
# edu-s6mr
title: 'Write §5: Self-attention — queries, keys, and values'
status: todo
type: task
created_at: 2026-03-13T22:01:53Z
updated_at: 2026-03-13T22:01:53Z
parent: edu-u2w7
---

Derive the scaled dot-product attention formula from first principles. Single-head attention only, to keep the exposition simple (GPT-1 itself uses multi-head attention). Causal masking explained here.
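The target of the derivation, written out for reference. The mask matrix $M$ and the key dimension $d_k$ are notation introduced here; rows index the query position $i$ and columns the key position $j$.

```latex
\mathrm{Attention}(Q, K, V)
  = \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}} + M\right) V,
\qquad
M_{ij} =
\begin{cases}
0 & \text{if } j \le i \\
-\infty & \text{if } j > i
\end{cases}
```

The softmax is taken over each row, so the $-\infty$ entries give future positions zero weight, and dividing by $\sqrt{d_k}$ keeps the scores from growing with the key dimension.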
@@ -0,0 +1,11 @@
---
# edu-tufd
title: 'Write §3: Exercise 1 — build a character-level tokeniser in Rust'
status: todo
type: task
created_at: 2026-03-13T22:01:50Z
updated_at: 2026-03-13T22:01:50Z
parent: edu-u2w7
---

Implement encode/decode over a fixed character vocabulary. Read a text file, build vocab, encode to integers, decode back. No external crates.
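One possible shape for the exercise solution, using only the standard library. The type name `CharTokenizer` and its methods are hypothetical, and returning `Option` for unknown characters is one design choice among several (a reserved unknown-token id would be another).

```rust
use std::collections::{BTreeSet, HashMap};

/// Character-level tokeniser over a fixed vocabulary.
pub struct CharTokenizer {
    id_to_char: Vec<char>,
    char_to_id: HashMap<char, usize>,
}

impl CharTokenizer {
    /// Build the vocabulary from every distinct character in `text`.
    /// A BTreeSet keeps characters sorted, so ids are deterministic.
    pub fn from_text(text: &str) -> Self {
        let unique: BTreeSet<char> = text.chars().collect();
        let id_to_char: Vec<char> = unique.into_iter().collect();
        let char_to_id = id_to_char.iter().enumerate().map(|(i, &c)| (c, i)).collect();
        Self { id_to_char, char_to_id }
    }

    pub fn vocab_size(&self) -> usize {
        self.id_to_char.len()
    }

    /// Returns None if `text` contains a character outside the vocabulary.
    pub fn encode(&self, text: &str) -> Option<Vec<usize>> {
        text.chars().map(|c| self.char_to_id.get(&c).copied()).collect()
    }

    /// Returns None if any id is out of range.
    pub fn decode(&self, ids: &[usize]) -> Option<String> {
        ids.iter().map(|&i| self.id_to_char.get(i).copied()).collect()
    }
}
```

For the file-reading step, `std::fs::read_to_string` returns the corpus as a `String` that can be passed straight to `from_text`.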
@@ -0,0 +1,11 @@
---
# edu-vqxk
title: 'Write §8: A decoder-only LM — stacking blocks and the causal mask'
status: todo
type: task
created_at: 2026-03-13T22:01:58Z
updated_at: 2026-03-13T22:01:58Z
parent: edu-u2w7
---

Explain how N transformer blocks are stacked. The causal mask ensures each position attends only to itself and earlier tokens. Tie the output (unembedding) matrix to the token embedding weights, GPT-1 style. A final linear layer produces the logits; softmax turns them into next-token probabilities.
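Weight tying is compact enough to show directly. In this plain-Rust sketch the output projection reuses the token embedding matrix, with one row per vocabulary entry; the function name `tied_logits` and the slice-based layout are assumptions made for illustration.

```rust
/// Logits under weight tying: logit j is the dot product of the final
/// hidden state with row j of the token embedding matrix, so no separate
/// output weight matrix is stored.
pub fn tied_logits(hidden: &[f32], embedding: &[Vec<f32>]) -> Vec<f32> {
    embedding
        .iter()
        .map(|row| row.iter().zip(hidden).map(|(e, h)| e * h).sum::<f32>())
        .collect()
}
```

For a character vocabulary the parameter saving is small, but tying keeps the sketch faithful to the GPT-1 output layer.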
@@ -0,0 +1,79 @@
# Building a Simple LLM from Scratch

A hands-on course building a small GPT-1-style language model in Rust — from raw text to a trained, sampling transformer.

---

## Part 1 — Language Modeling Basics

### §1 What is a Language Model?

🚧 *To be written — see [edu-32xl]*

### §2 Character-Level Tokenisation

🚧 *To be written — see [edu-7do4]*

### §3 Exercise 1: Build a Character-Level Tokeniser in Rust

🚧 *To be written — see [edu-tufd]*

---

## Part 2 — The Transformer Architecture

### §4 Embeddings and Positional Encoding

🚧 *To be written — see [edu-cw9v]*

### §5 Self-Attention: Queries, Keys, and Values

🚧 *To be written — see [edu-s6mr]*

### §6 The Transformer Block

🚧 *To be written — see [edu-9cnd]*

### §7 Exercise 2: Implement Self-Attention in Rust

🚧 *To be written — see [edu-hufe]*

---

## Part 3 — Assembling the Model

### §8 A Decoder-Only LM: Stacking Blocks and the Causal Mask

🚧 *To be written — see [edu-vqxk]*

### §9 Exercise 3: Define the GPT-1-Style Model in `candle`

🚧 *To be written — see [edu-ujs5]*

---

## Part 4 — Training

### §10 Cross-Entropy Loss and the Training Loop

🚧 *To be written — see [edu-abdu]*

### §11 Exercise 4: Train on a Small Text Corpus

🚧 *To be written — see [edu-jybf]*

### §12 Exercise 5: Sample from the Model

🚧 *To be written — see [edu-i76z]*

---

## Part 5 — Reflection

### §13 What Limits This Model?

🚧 *To be written — see [edu-kkjc]*

### §14 Further Reading

🚧 *To be written — see [edu-9sb7]*