diff --git a/edu/src/llm-from-scratch.md b/edu/src/llm-from-scratch.md
index 4e97377..47a80d1 100644
--- a/edu/src/llm-from-scratch.md
+++ b/edu/src/llm-from-scratch.md
@@ -8,15 +8,341 @@ A hands-on course building a small GPT-1-style language model in Rust - from r
 
 ### §1 What is a Language Model?
 
-🚧 *To be written - see [edu-32xl]*
+A **language model** is a system that assigns probabilities to sequences of tokens. Given some context (a sequence of words, characters, or subwords that have appeared so far), a language model answers the question:
+
+> *"What token is most likely to come next?"*
+
+Formally, a language model estimates the conditional probability distribution:
+
+\\[ P(t_{n+1} \mid t_1, t_2, \ldots, t_n) \\]
+
+where \\( t_1, t_2, \ldots, t_n \\) is the context (the tokens seen so far) and \\( t_{n+1} \\) is the next token.
+
+#### A concrete example
+
+Suppose we are building a character-level language model trained on English text. Given the context:
+
+```
+The cat sat on the m
+```
+
+Our model might produce a probability distribution like:
+
+| Next character | Probability |
+|----------------|-------------|
+| `a` | 0.55 |
+| `o` | 0.15 |
+| `e` | 0.10 |
+| `i` | 0.05 |
+| (other) | 0.15 |
+
+The model thinks `a` is most likely (leading to "mat", "map", "man", etc.), followed by `o` ("moon", "mop", etc.). It learned these patterns from the statistics of its training data; it has never been told English grammar rules.
+
+#### Autoregressive generation
+
+Language models generate text **autoregressively**, one token at a time, feeding each generated token back in as context for the next prediction:
+
+```
+Step 1: "The cat "   → predict 's'
+Step 2: "The cat s"  → predict 'a'
+Step 3: "The cat sa" → predict 't'
+...and so on
+```
+
+This loop of predict-then-append is the core mechanism behind every text-generating AI, from simple bigram models to GPT-4.
+
+#### Why is language modeling useful?
+
+Language modeling sounds like a narrow statistical task, but it turns out to be remarkably powerful:
+
+1. **Text generation.** Chatbots, story writers, and code assistants all generate text by sampling from a language model.
+2. **Representation learning.** Training a model to predict the next token forces it to learn deep representations of syntax, semantics, and even factual knowledge.
+3. **Foundation for downstream tasks.** Models pre-trained on language modeling can be fine-tuned for translation, summarisation, question answering, and more.
+4. **Compression perspective.** A good language model is a good compressor: it can represent text efficiently by spending few bits on predictable tokens and more bits on the "surprising" ones. This connection to information theory is why language modeling is such a fundamental problem.
+
+#### The spectrum of language models
+
+Language models range from the trivially simple to the extraordinarily complex:
+
+```
+Simple                                                     Complex
+  |                                                           |
+  Bigram ──── N-gram ──── RNN ──── LSTM ──── Transformer ── GPT-4
+  counts      tables      neural   gated     attention      massive
+                          net      RNN       mechanism      scale
+```
+
+In this course, we will build a small **Transformer-based** language model, the same architecture family that powers GPT, Claude, and other modern LLMs. Ours will be tiny (a few thousand parameters), but it contains every essential component of its larger cousins.
+
+> **Key takeaway:** A language model predicts the next token given context. Despite sounding simple, this task, when scaled up, gives rise to the capabilities we see in modern AI systems.
 
 ### §2 Character-Level Tokenisation
 
-🚧 *To be written - see [edu-7do4]*
+Before a language model can process text, we need to convert raw text into numbers. This process is called **tokenisation**: splitting text into discrete units (tokens) and mapping each to a numerical ID.
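To make "text into numbers" concrete before comparing real schemes, here is a minimal sketch. It is an illustration only: the helper names `to_ids` and `from_ids` are hypothetical, and the sketch uses each character's Unicode scalar value as a stand-in ID, which is not how the tokeniser built later in this course assigns IDs.

```rust
/// Map each character to a number. As a stand-in for a real vocabulary,
/// this sketch uses the character's Unicode scalar value as its ID.
fn to_ids(text: &str) -> Vec<u32> {
    text.chars().map(|c| c as u32).collect()
}

/// Map the numbers back to characters. IDs that are not valid Unicode
/// scalar values are replaced with U+FFFD (the replacement character).
fn from_ids(ids: &[u32]) -> String {
    ids.iter()
        .map(|&id| char::from_u32(id).unwrap_or(char::REPLACEMENT_CHARACTER))
        .collect()
}
```

Calling `to_ids("cat")` yields `[99, 97, 116]`, and `from_ids` turns those numbers back into `"cat"`. A vocabulary built from the corpus, as described in the rest of this section, gives much smaller and denser IDs than raw code points.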
+
+#### Three approaches to tokenisation
+
+There are three main strategies, each with different tradeoffs:
+
+| Strategy | Token unit | Vocabulary size | Example: "cats" |
+|----------|-----------|----------------|-----------------|
+| **Character-level** | Single characters | ~100 | `['c', 'a', 't', 's']` |
+| **Subword** (BPE, etc.) | Character groups | ~30,000-100,000 | `['cat', 's']` |
+| **Word-level** | Whole words | ~100,000+ | `['cats']` |
+
+**Word-level** tokenisation is simple but struggles with misspellings, rare words, and morphological variation ("run", "running", "ran" are all separate tokens).
+
+**Subword** methods like Byte-Pair Encoding (BPE), used by GPT models, strike a balance: common words get their own token, while rare words are split into pieces. The word "unhappiness" might become `["un", "happiness"]`.
+
+**Character-level** tokenisation is the simplest: every character in the text is its own token. The vocabulary is tiny, just the set of unique characters in the training data.
+
+#### Why character-level for this course?
+
+We use character-level tokenisation because:
+
+1. **Simplicity.** No external tokeniser library is needed. We can build it from scratch in a few lines of Rust.
+2. **Small vocabulary.** A typical English text corpus has fewer than 100 unique characters, which means smaller embedding tables and faster training.
+3. **No unknown words.** Any word can be represented as long as its characters appeared in the training corpus, so there is no word-level "out of vocabulary" problem.
+4. **Educational clarity.** It is easy to inspect what the model is learning when each token is a single visible character.
+
+The downsides are that sequences become long (the word "language" is 8 tokens instead of 1) and the model must learn to spell words from individual characters. For our small educational model, these tradeoffs are acceptable.
+
+#### Building a vocabulary
+
+Given a training corpus, we construct a vocabulary by:
+
+1. Collecting all unique characters in the text.
+2. Sorting them (for deterministic ordering).
+3. Assigning each character a unique integer ID.
+
+For example, given the text `"hello world"`:
+
+```
+Unique characters (sorted): [' ', 'd', 'e', 'h', 'l', 'o', 'r', 'w']
+
+Character → ID mapping:
+  ' ' → 0
+  'd' → 1
+  'e' → 2
+  'h' → 3
+  'l' → 4
+  'o' → 5
+  'r' → 6
+  'w' → 7
+```
+
+#### Encoding and decoding
+
+**Encoding** converts a string into a sequence of token IDs:
+
+```
+"hello" → [3, 2, 4, 4, 5]
+```
+
+**Decoding** converts a sequence of token IDs back into a string:
+
+```
+[7, 5, 6, 4, 1] → "world"
+```
+
+These two operations must be perfect inverses (`decode(encode(text)) == text`), or we lose information.
+
+#### A note on special tokens
+
+Production tokenisers often include special tokens like `<PAD>` (padding), `<BOS>` (beginning of sequence), and `<EOS>` (end of sequence). For our simple character-level tokeniser, we will not use these; every token corresponds to a real character in the text.
+
+> **Key takeaway:** Character-level tokenisation maps each character to an integer. It is the simplest tokenisation scheme and ideal for learning, though real LLMs use subword methods for efficiency.
 
 ### §3 Exercise 1: Build a Character-Level Tokeniser in Rust
 
-🚧 *To be written - see [edu-tufd]*
+In this exercise, we build a `CharTokeniser` struct that can encode text into token IDs and decode them back.
+
+#### Project setup
+
+Create a new Rust project:
+
+```bash
+cargo new llm-from-scratch
+cd llm-from-scratch
+```
+
+Your `Cargo.toml` needs no dependencies for now; the standard library is enough. We will add `candle` in later exercises:
+
+```toml
+[package]
+name = "llm-from-scratch"
+version = "0.1.0"
+edition = "2021"
+
+[dependencies]
+# We will add candle-core and candle-nn later
+
+[profile.release]
+opt-level = "z"
+lto = true
+strip = true
+codegen-units = 1
+```
+
+#### The `CharTokeniser` struct
+
+Create `src/tokeniser.rs`:
+
+```rust
+use std::collections::{HashMap, HashSet};
+
+/// A character-level tokeniser that maps individual characters
+/// to integer IDs and back.
+pub struct CharTokeniser {
+    /// Maps each character to its token ID.
+    char_to_id: HashMap<char, u32>,
+    /// Maps each token ID back to its character.
+    id_to_char: Vec<char>,
+}
+
+impl CharTokeniser {
+    /// Build a tokeniser from a training corpus.
+    ///
+    /// Collects all unique characters, sorts them, and assigns
+    /// sequential IDs starting from 0.
+    ///
+    /// # Example
+    /// ```
+    /// let tok = CharTokeniser::from_corpus("hello world");
+    /// assert_eq!(tok.vocab_size(), 8); // ' ', 'd', 'e', 'h', 'l', 'o', 'r', 'w'
+    /// ```
+    pub fn from_corpus(text: &str) -> Self {
+        // Deduplicate via a HashSet, then sort for a deterministic ordering.
+        let mut chars: Vec<char> = text.chars().collect::<HashSet<char>>()
+            .into_iter()
+            .collect();
+        chars.sort();
+
+        let char_to_id: HashMap<char, u32> = chars
+            .iter()
+            .enumerate()
+            .map(|(i, &c)| (c, i as u32))
+            .collect();
+
+        CharTokeniser {
+            char_to_id,
+            id_to_char: chars,
+        }
+    }
+
+    /// Returns the number of unique tokens in the vocabulary.
+    pub fn vocab_size(&self) -> usize {
+        self.id_to_char.len()
+    }
+
+    /// Encode a string into a sequence of token IDs.
+    ///
+    /// # Panics
+    /// Panics if the string contains a character not in the vocabulary.
+    pub fn encode(&self, text: &str) -> Vec<u32> {
+        text.chars()
+            .map(|c| {
+                *self.char_to_id
+                    .get(&c)
+                    .unwrap_or_else(|| panic!("Character '{}' not in vocabulary", c))
+            })
+            .collect()
+    }
+
+    /// Decode a sequence of token IDs back into a string.
+    ///
+    /// # Panics
+    /// Panics if any token ID is out of range.
+    pub fn decode(&self, ids: &[u32]) -> String {
+        ids.iter()
+            .map(|&id| {
+                *self.id_to_char
+                    .get(id as usize)
+                    .unwrap_or_else(|| panic!("Token ID {} out of range", id))
+            })
+            .collect()
+    }
+
+    /// Print the vocabulary mapping for inspection.
+    pub fn print_vocab(&self) {
+        println!("Vocabulary ({} tokens):", self.vocab_size());
+        for (i, c) in self.id_to_char.iter().enumerate() {
+            // Show whitespace characters in a readable form.
+            let display = match c {
+                '\n' => "\\n".to_string(),
+                '\t' => "\\t".to_string(),
+                ' ' => "' '".to_string(),
+                _ => format!("'{}'", c),
+            };
+            println!("  {} → {}", display, i);
+        }
+    }
+}
+```
+
+#### Wiring it up in `main.rs`
+
+In `src/main.rs`:
+
+```rust
+mod tokeniser;
+
+use tokeniser::CharTokeniser;
+
+fn main() {
+    let corpus = "\
+To be, or not to be, that is the question:
+Whether 'tis nobler in the mind to suffer
+The slings and arrows of outrageous fortune,
+Or to take arms against a sea of troubles.";
+
+    let tok = CharTokeniser::from_corpus(corpus);
+    tok.print_vocab();
+
+    let sample = "to be";
+    let encoded = tok.encode(sample);
+    println!("\nEncoded \"{}\": {:?}", sample, encoded);
+
+    let decoded = tok.decode(&encoded);
+    println!("Decoded back: \"{}\"", decoded);
+
+    // Verify round-trip
+    assert_eq!(decoded, sample);
+    println!("\nRound-trip check passed.");
+}
+```
+
+#### Expected output
+
+When you run `cargo run`, you should see something like:
+
+```
+Vocabulary (28 tokens):
+  \n → 0
+  ' ' → 1
+  ''' → 2
+  ',' → 3
+  '.' → 4
+  ':' → 5
+  'O' → 6
+  'T' → 7
+  'W' → 8
+  'a' → 9
+  ...
+
+Encoded "to be": [25, 21, 1, 10, 12]
+Decoded back: "to be"
+
+Round-trip check passed.
+```
+
+(These IDs are specific to this corpus. A different corpus produces a different vocabulary and therefore different IDs.)
+
+#### Exercises to try
+
+1. **Extend the corpus.** Download a larger text (e.g., from [Project Gutenberg](https://www.gutenberg.org/)) and see how the vocabulary grows.
+2. **Handle unknown characters.** Modify `encode` to return an `Option<Vec<u32>>` or use a special `<UNK>` token instead of panicking.
+3. **Measure sequence length.** Encode a paragraph and compare the number of tokens to the number of words. How much longer is the character-level encoding?
+
+> **Key takeaway:** A character-level tokeniser is just two lookup tables: a hash map from characters to IDs and a vector from IDs back to characters. The `from_corpus` method automatically builds the vocabulary from whatever text you give it.
 
 ---
 
@@ -24,19 +350,526 @@ A hands-on course building a small GPT-1-style language model in Rust - from r
 
 ### §4 Embeddings and Positional Encoding
 
-🚧 *To be written - see [edu-cw9v]*
+Our tokeniser converts text into a sequence of integer IDs. But neural networks work with continuous vectors, not discrete integers. **Embeddings** bridge this gap by mapping each token ID to a dense vector of floating-point numbers.
+
+#### What is an embedding?
+
+An embedding is a **lookup table**: a matrix of shape `(vocab_size, embed_dim)` where each row is a learnable vector representing one token.
+
+```
+Embedding table (vocab_size=5, embed_dim=4):
+
+Token ID 0 → [ 0.12, -0.34,  0.56,  0.78]
+Token ID 1 → [-0.91,  0.23,  0.45, -0.67]
+Token ID 2 → [ 0.33,  0.11, -0.88,  0.54]
+Token ID 3 → [ 0.76, -0.55,  0.22,  0.13]
+Token ID 4 → [-0.42,  0.89, -0.11,  0.66]
+```
+
+Given the input token IDs `[2, 0, 3]`, the embedding layer simply looks up rows 2, 0, and 3:
+
+```
+Input:  [2, 0, 3]
+Output: [[ 0.33,  0.11, -0.88,  0.54],   ← row 2
+         [ 0.12, -0.34,  0.56,  0.78],   ← row 0
+         [ 0.76, -0.55,  0.22,  0.13]]   ← row 3
+```
+
+The result is a matrix of shape `(sequence_length, embed_dim)`. Initially, these vectors are random. During training, backpropagation adjusts them so that tokens that play similar roles end up with similar vectors.
+
+#### Why do we need embeddings?
+
+Integer IDs have no inherent structure: the fact that "a" is token 0 and "b" is token 1 does not mean they are "close" in any useful sense. Embeddings give the model a **continuous space** where it can represent relationships. After training, you might find that:
+
+- Vowel characters cluster together.
+- Uppercase and lowercase versions of the same letter are nearby.
+- Punctuation characters form their own cluster.
+
+#### The embedding dimension
+
+The **embedding dimension** (`embed_dim` or `d_model`) is a hyperparameter you choose. It controls how much information each token vector can carry:
+
+| Embed dim | Capacity | Training cost | Typical use |
+|-----------|----------|---------------|-------------|
+| 16-64 | Low | Very fast | Toy models (like ours) |
+| 128-512 | Medium | Moderate | Small-scale experiments |
+| 768-4096 | High | Expensive | Production LLMs |
+
+For our tiny model, we will use `embed_dim = 64`.
+
+#### The position problem
+
+Consider two sentences:
+
+```
+"The cat ate the fish"
+"The fish ate the cat"
+```
+
+These contain the exact same tokens but have very different meanings. If we only embed the tokens, the model has no way to tell the two sentences apart, because self-attention on its own treats its input as an unordered set and does not know the **order** of the tokens.
+
+#### Positional encoding
+
+To inject position information, we add a **positional encoding** vector to each token embedding. The original Transformer paper ("Attention Is All You Need") proposed using fixed sinusoidal functions:
+
+\\[ PE(pos, 2i) = \sin\left(\frac{pos}{10000^{2i/d_{model}}}\right) \\]
+\\[ PE(pos, 2i+1) = \cos\left(\frac{pos}{10000^{2i/d_{model}}}\right) \\]
+
+Where:
+- `pos` is the position in the sequence (0, 1, 2, ...).
+- `i` indexes the sine/cosine pair (0, 1, ..., d_model/2 - 1), so dimensions `2i` and `2i+1` share one frequency.
+- `d_model` is the embedding dimension.
+
+These functions produce a unique pattern for each position. Low-frequency sinusoids change slowly across positions (capturing coarse position), while high-frequency sinusoids change rapidly (capturing fine position).
+
+```
+Position 0: [sin(0), cos(0), sin(0),  cos(0),  ...] = [0.00,  1.00, 0.00, 1.00, ...]
+Position 1: [sin(1), cos(1), sin(ε),  cos(ε),  ...] = [0.84,  0.54, 0.01, 1.00, ...]
+Position 2: [sin(2), cos(2), sin(2ε), cos(2ε), ...] = [0.91, -0.42, 0.02, 1.00, ...]
+```
+
+(Here ε = 1/10000^(2/d_model). The numbers shown assume d_model = 4, which gives ε = 0.01.)
+
+#### Learned positional embeddings
+
+GPT-1 and GPT-2 use a simpler approach: a **second embedding table** of shape `(max_seq_len, embed_dim)` where each position gets its own learnable vector, just like tokens do. This is what we will implement:
+
+```
+Final input = token_embedding[token_id] + position_embedding[position]
+```
+
+Both the token embeddings and position embeddings are learned during training.
+
+#### Putting it all together
+
+The input pipeline for our model looks like this:
+
+```
+"hello" → [3, 2, 4, 4, 5]                 (tokenisation)
+        → [[0.12, ...], [-0.91, ...],      (token embedding lookup)
+           [0.33, ...], [0.33, ...],
+           [-0.42, ...]]
+        + [[0.05, ...], [0.11, ...],       (position embedding lookup)
+           [0.22, ...], [0.08, ...],
+           [0.17, ...]]
+        = [[0.17, ...], [-0.80, ...],      (element-wise addition)
+           [0.55, ...], [0.41, ...],
+           [-0.25, ...]]
+```
+
+The result is a `(seq_len, embed_dim)` matrix that carries both **what** each token is and **where** it appears. This matrix is what we feed into the Transformer blocks.
+
+> **Key takeaway:** Embeddings convert token IDs into learnable vectors. Positional encodings add position information so the model can distinguish "the cat ate the fish" from "the fish ate the cat". Together, they form the input to the Transformer.
 
 ### §5 Self-Attention: Queries, Keys, and Values
 
-🚧 *To be written - see [edu-s6mr]*
+Self-attention is the mechanism that makes Transformers work. It allows every token in a sequence to look at every other token and decide which ones are relevant. This is the single most important concept in this course.
+
+#### The intuition: a database lookup
+
+Think of self-attention as a **soft database lookup**:
+
+- Each token formulates a **query**: "What kind of information am I looking for?"
+- Each token advertises a **key**: "Here is what kind of information I have."
+- Each token holds a **value**: "Here is my actual information."
+
+To process a token, we compare its query against all keys. Where there is a strong match, we pull in the corresponding value. The result is a weighted combination of all values, where the weights reflect how relevant each token is.
+
+```
+Token: "sat"
+Query: "I'm a verb. Who is my subject?"
+
+Keys available:
+  "The" → "I'm a determiner"     (low match)
+  "cat" → "I'm a noun/subject"   (HIGH match)
+  "sat" → "I'm a verb"           (medium match)
+  "on"  → "I'm a preposition"    (low match)
+
+Result: mostly attend to "cat", somewhat to "sat", barely to others
+```
+
+Of course, the model does not literally think in words; these "queries" and "keys" are learned vector representations. But the analogy captures the mechanism.
+
+#### The math: scaled dot-product attention
+
+Given a sequence of `n` token embeddings (each of dimension `d_model`), self-attention works as follows:
+
+**Step 1: Project into Q, K, V.**
+
+We use three learned weight matrices \\( W_Q, W_K, W_V \\) (each of shape `d_model × d_k`) to produce:
+
+\\[ Q = X W_Q, \quad K = X W_K, \quad V = X W_V \\]
+
+Where \\( X \\) is the input matrix of shape `(n, d_model)` and \\( Q, K, V \\) are each of shape `(n, d_k)`.
+
+**Step 2: Compute attention scores.**
+
+We compute the dot product of each query with all keys:
+
+\\[ \text{scores} = Q K^T \\]
+
+This produces an `(n, n)` matrix where entry `(i, j)` measures how much token `i` should attend to token `j`.
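Step 2 can be sketched in plain Rust without a tensor library. The function name `attention_scores` and the nested `Vec<f64>` matrix representation are assumptions made for this illustration; the course's own implementation uses `candle` tensors instead.

```rust
/// Compute the raw attention scores Q·K^T.
/// `q` and `k` each hold one row per token, with every row of length d_k.
/// Entry (i, j) of the result is the dot product of query i with key j.
fn attention_scores(q: &[Vec<f64>], k: &[Vec<f64>]) -> Vec<Vec<f64>> {
    q.iter()
        .map(|qi| {
            k.iter()
                .map(|kj| qi.iter().zip(kj).map(|(a, b)| a * b).sum::<f64>())
                .collect()
        })
        .collect()
}
```

For three tokens whose queries and keys are both `[1, 0]`, `[0, 1]`, and `[1, 1]`, this returns `[[1, 0, 1], [0, 1, 1], [1, 1, 2]]`: the third token, whose key overlaps with every query, receives the largest scores.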
+
+**Step 3: Scale.**
+
+We divide by \\( \sqrt{d_k} \\) to prevent the dot products from becoming too large (which would push softmax into regions with tiny gradients):
+
+\\[ \text{scaled\_scores} = \frac{Q K^T}{\sqrt{d_k}} \\]
+
+**Step 4: Softmax.**
+
+We apply softmax row-wise so that each row sums to 1, giving us a proper probability distribution:
+
+\\[ \text{attention\_weights} = \text{softmax}\left(\frac{Q K^T}{\sqrt{d_k}}\right) \\]
+
+**Step 5: Weighted sum of values.**
+
+We multiply the attention weights by the values:
+
+\\[ \text{output} = \text{attention\_weights} \times V \\]
+
+The complete formula is:
+
+\\[ \text{Attention}(Q, K, V) = \text{softmax}\left(\frac{Q K^T}{\sqrt{d_k}}\right) V \\]
+
+#### A small numeric example
+
+Let us walk through attention with a tiny example. Suppose we have 3 tokens with `d_k = 2`:
+
+```
+Q (queries):    K (keys):       V (values):
+[1.0, 0.0]      [1.0, 0.0]      [1.0, 0.0]
+[0.0, 1.0]      [0.0, 1.0]      [0.0, 1.0]
+[1.0, 1.0]      [1.0, 1.0]      [0.5, 0.5]
+```
+
+**Step 2: QK^T**
+```
+ K^T = [1.0  0.0  1.0]
+       [0.0  1.0  1.0]
+
+QK^T = [1.0*1.0+0.0*0.0  1.0*0.0+0.0*1.0  1.0*1.0+0.0*1.0]   [1.0  0.0  1.0]
+       [0.0*1.0+1.0*0.0  0.0*0.0+1.0*1.0  0.0*1.0+1.0*1.0] = [0.0  1.0  1.0]
+       [1.0*1.0+1.0*0.0  1.0*0.0+1.0*1.0  1.0*1.0+1.0*1.0]   [1.0  1.0  2.0]
+```
+
+**Step 3: Scale by √d_k = √2 ≈ 1.414**
+```
+Scaled = [0.71  0.00  0.71]
+         [0.00  0.71  0.71]
+         [0.71  0.71  1.41]
+```
+
+**Step 4: Softmax (row-wise)**
+```
+Row 0: softmax([0.71, 0.00, 0.71]) ≈ [0.40, 0.20, 0.40]
+Row 1: softmax([0.00, 0.71, 0.71]) ≈ [0.20, 0.40, 0.40]
+Row 2: softmax([0.71, 0.71, 1.41]) ≈ [0.25, 0.25, 0.50]
+```
+
+**Step 5: Multiply by V**
+```
+Output row 0 = 0.40*[1,0] + 0.20*[0,1] + 0.40*[0.5,0.5] = [0.60, 0.40]
+Output row 1 = 0.20*[1,0] + 0.40*[0,1] + 0.40*[0.5,0.5] = [0.40, 0.60]
+Output row 2 = 0.25*[1,0] + 0.25*[0,1] + 0.50*[0.5,0.5] = [0.50, 0.50]
+```
+
+Token 2 (whose query was `[1,1]`, "I want everything") ends up with a balanced mixture `[0.50, 0.50]`. Token 0 (query `[1,0]`) ends up leaning toward the first dimension `[0.60, 0.40]`. The attention mechanism has routed information according to what each token asked for.
+
+#### Why self-attention works
+
+Self-attention has two properties that make it powerful:
+
+1. **Global context.** Every token can attend to every other token in a single step. In an RNN, information must flow through many sequential steps to get from one end of the sequence to the other.
+
+2. **Content-based routing.** Which tokens to attend to is determined by the content (via Q and K), not by fixed connectivity patterns. The model learns to route information dynamically.
+
+```
+RNN: information flows sequentially      Attention: direct connections
+  t1 → t2 → t3 → t4 → t5                   t1 ←→ t2 ←→ t3 ←→ t4 ←→ t5
+  (4 steps from t1 to t5)                  (1 step between any pair)
+```
+
+> **Key takeaway:** Self-attention computes a weighted sum of value vectors, where the weights are determined by the similarity between query and key vectors. The formula \\(\text{softmax}(QK^T / \sqrt{d_k}) V\\) is the mathematical heart of the Transformer.
 
 ### §6 The Transformer Block
 
-🚧 *To be written - see [edu-9cnd]*
+A single self-attention layer is powerful but not sufficient. The **Transformer block** wraps attention in a series of components that make training stable and learning more expressive.
+
+#### Components of a Transformer block
+
+A Transformer block contains four components, applied in order:
+
+1. **Multi-head self-attention**
+2. **Residual connection + Layer normalisation**
+3. **Feed-forward network (FFN)**
+4. **Residual connection + Layer normalisation**
+
+Here is the full block in ASCII art:
+
+```
+  ┌──────────────────────┐
+  │      Input (x)       │
+  └──────────┬───────────┘
+             │
+             ├──────────────────────┐
+             ▼                      │ (residual)
+  ┌──────────────────────┐          │
+  │     Multi-Head       │          │
+  │   Self-Attention     │          │
+  └──────────┬───────────┘          │
+             │                      │
+             ▼                      │
+            (+) ◄───────────────────┘
+             │
+             ▼
+  ┌──────────────────────┐
+  │     Layer Norm       │
+  └──────────┬───────────┘
+             │
+             ├──────────────────────┐
+             ▼                      │ (residual)
+  ┌──────────────────────┐          │
+  │    Feed-Forward      │          │
+  │    Network (FFN)     │          │
+  └──────────┬───────────┘          │
+             │                      │
+             ▼                      │
+            (+) ◄───────────────────┘
+             │
+             ▼
+  ┌──────────────────────┐
+  │     Layer Norm       │
+  └──────────┬───────────┘
+             │
+             ▼
+  ┌──────────────────────┐
+  │       Output         │
+  └──────────────────────┘
+```
+
+Let us look at each component.
+
+#### Multi-head self-attention
+
+Instead of running a single attention computation, multi-head attention runs **multiple attention heads in parallel**, each with its own \\( W_Q, W_K, W_V \\) matrices. Each head can learn to focus on different types of relationships:
+
+- Head 1 might learn syntactic relationships (subject-verb agreement).
+- Head 2 might learn positional proximity (nearby characters).
+- Head 3 might learn semantic similarity.
+
+If `d_model = 64` and we use 4 heads, each head operates on `d_k = d_model / num_heads = 16` dimensions. The outputs of all heads are concatenated and projected back to `d_model`:
+
+\\[ \text{MultiHead}(Q, K, V) = \text{Concat}(\text{head}_1, \ldots, \text{head}_h) W_O \\]
+
+where each \\( \text{head}_i = \text{Attention}(X W_Q^i, X W_K^i, X W_V^i) \\).
+
+#### Residual connections
+
+A **residual connection** (or skip connection) adds the input of a sublayer directly to its output:
+
+\\[ \text{output} = \text{sublayer}(x) + x \\]
+
+This seemingly simple trick is crucial for training deep networks. It lets gradients flow directly through the network during backpropagation, which mitigates the **vanishing gradient problem** that plagues deep architectures. Even if a sublayer contributes almost nothing, the input still passes through unchanged via the skip path.
+
+#### Layer normalisation
+
+**Layer normalisation** normalises the values across the feature dimension for each token independently:
+
+\\[ \text{LayerNorm}(x) = \frac{x - \mu}{\sqrt{\sigma^2 + \epsilon}} \cdot \gamma + \beta \\]
+
+Where \\( \mu \\) and \\( \sigma^2 \\) are the mean and variance computed across the feature dimension, and \\( \gamma, \beta \\) are learnable scale and shift parameters.
+
+Layer norm keeps activations in a reasonable range, which stabilises training. Without it, values can explode or collapse as they pass through many layers.
+
+#### Feed-forward network
+
+The FFN is a simple two-layer neural network applied independently to each token position:
+
+\\[ \text{FFN}(x) = \text{GELU}(x W_1 + b_1) W_2 + b_2 \\]
+
+The inner dimension is typically 4x the model dimension (e.g., `d_model = 64` → `d_ff = 256`). The GELU activation function is a smooth variant of ReLU used in GPT models.
+
+The FFN is where much of the model's "knowledge" is stored. While attention determines *what information to combine*, the FFN *transforms* that information.
+
+#### Pre-norm vs post-norm
+
+The original Transformer, and GPT-1 itself, used **post-norm** (normalise after the residual add); this is the arrangement shown in the diagram above. GPT-2 and most modern models use **pre-norm** (normalise before the sublayer). Pre-norm is more stable to train:
+
+```
+Post-norm (original):  LayerNorm(x + Sublayer(x))
+Pre-norm (GPT-2):      x + Sublayer(LayerNorm(x))
+```
+
+We will use pre-norm in our implementation.
+
+> **Key takeaway:** A Transformer block combines multi-head attention (for context mixing), a feed-forward network (for per-token transformation), residual connections (for gradient flow), and layer normalisation (for training stability). These four ingredients, repeated many times, give Transformers their power.
 
 ### §7 Exercise 2: Implement Self-Attention in Rust
 
-🚧 *To be written - see [edu-hufe]*
+In this exercise, we implement **single-head scaled dot-product self-attention** using the `candle` tensor library. This is the core computation from §5 translated into Rust.
+
+#### Adding candle to your project
+
+Update your `Cargo.toml`:
+
+```toml
+[package]
+name = "llm-from-scratch"
+version = "0.1.0"
+edition = "2021"
+
+[dependencies]
+candle-core = "0.8"
+candle-nn = "0.8"
+anyhow = "1"
+
+[profile.release]
+opt-level = "z"
+lto = true
+strip = true
+codegen-units = 1
+```
+
+#### Single-head self-attention
+
+Create `src/attention.rs`:
+
+```rust
+use candle_core::{Result, Tensor, D};
+
+/// Compute single-head scaled dot-product self-attention.
+///
+/// # Arguments
+/// * `q` - Query tensor of shape `(seq_len, d_k)`
+/// * `k` - Key tensor of shape `(seq_len, d_k)`
+/// * `v` - Value tensor of shape `(seq_len, d_k)`
+///
+/// # Returns
+/// Output tensor of shape `(seq_len, d_k)`: the attention-weighted
+/// combination of values.
+pub fn scaled_dot_product_attention(
+    q: &Tensor,
+    k: &Tensor,
+    v: &Tensor,
+) -> Result<Tensor> {
+    let d_k = q.dim(D::Minus1)? as f64;
+
+    // Step 1: QK^T, the attention scores
+    // q: (seq_len, d_k), k^T: (d_k, seq_len) → scores: (seq_len, seq_len)
+    let scores = q.matmul(&k.t()?)?;
+
+    // Step 2: Scale by sqrt(d_k)
+    let scaled = (scores / d_k.sqrt())?;
+
+    // Step 3: Softmax along the last dimension (row-wise)
+    let weights = candle_nn::ops::softmax(&scaled, D::Minus1)?;
+
+    // Step 4: Weighted sum of values
+    // weights: (seq_len, seq_len) × v: (seq_len, d_k) → (seq_len, d_k)
+    let output = weights.matmul(v)?;
+
+    Ok(output)
+}
+
+/// Project input through a linear layer (matrix multiply) to produce Q, K, or V.
+///
+/// # Arguments
+/// * `x` - Input tensor of shape `(seq_len, d_model)`
+/// * `w` - Weight matrix of shape `(d_model, d_k)`
+///
+/// # Returns
+/// Projected tensor of shape `(seq_len, d_k)`.
+pub fn project(x: &Tensor, w: &Tensor) -> Result<Tensor> {
+    x.matmul(w)
+}
+```
+
+#### Testing it in `main.rs`
+
+Add the module and a test function to `src/main.rs`:
+
+```rust
+mod attention;
+mod tokeniser;
+
+use anyhow::Result;
+use candle_core::{Device, Tensor};
+
+fn demo_attention() -> Result<()> {
+    let device = &Device::Cpu;
+
+    // Simulate 4 token embeddings, each of dimension 8
+    let seq_len = 4;
+    let d_model = 8;
+    let d_k = 8; // Same as d_model for single-head
+
+    // Random input "embeddings"
+    let x = Tensor::randn(0f32, 1.0, (seq_len, d_model), device)?;
+
+    // Random projection weights (in a real model, these are learned)
+    let w_q = Tensor::randn(0f32, 1.0, (d_model, d_k), device)?;
+    let w_k = Tensor::randn(0f32, 1.0, (d_model, d_k), device)?;
+    let w_v = Tensor::randn(0f32, 1.0, (d_model, d_k), device)?;
+
+    // Project input into Q, K, V
+    let q = attention::project(&x, &w_q)?;
+    let k = attention::project(&x, &w_k)?;
+    let v = attention::project(&x, &w_v)?;
+
+    println!("Input shape: {:?}", x.shape());
+    println!("Q shape: {:?}", q.shape());
+    println!("K shape: {:?}", k.shape());
+    println!("V shape: {:?}", v.shape());
+
+    // Compute attention
+    let output = attention::scaled_dot_product_attention(&q, &k, &v)?;
+    println!("Output shape: {:?}", output.shape());
+    println!("\nAttention output:\n{}", output);
+
+    Ok(())
+}
+
+fn main() {
+    if let Err(e) = demo_attention() {
+        eprintln!("Error: {}", e);
+    }
+}
+```
+
+#### Expected output
+
+```
+Input shape: [4, 8]
+Q shape: [4, 8]
+K shape: [4, 8]
+V shape: [4, 8]
+Output shape: [4, 8]
+
+Attention output:
+[[ 0.1234, -0.5678, ...],
+ [ 0.2345, -0.4567, ...],
+ [ 0.3456, -0.3456, ...],
+ [ 0.4567, -0.2345, ...]]
+```
+
+(Your exact numbers will differ because of random initialisation.)
+
+#### What to observe
+
+After running the code, notice:
+
+1. **Shape preservation.** The output has shape `(seq_len, d_k)`, which matches the input shape here because `d_k = d_model`. Each token position gets a new vector that is a weighted combination of all value vectors.
+2. **Row similarity.** The output rows tend to be more similar to each other than the rows of `V` are. This is because attention mixes information across all positions.
+3. **Softmax effect.** If you print the attention weights (the output of softmax), you will see that each row sums to 1.0 and typically has one or two dominant values.
+
+#### Exercises to try
+
+1. **Print the attention weights matrix.** After the softmax step, print the `(seq_len, seq_len)` weight matrix. Which tokens attend most strongly to which?
+2. **Add a causal mask.** Before softmax, set the upper-triangle entries of the score matrix to negative infinity. This prevents each position from attending to future positions. (Hint: in recent `candle` versions, `Tensor::tril2` builds a lower-triangular matrix of ones that can serve as the mask.)
+3. **Compare with and without scaling.** Remove the `/ d_k.sqrt()` and observe how the attention weights change: they should become much more "peaky" (concentrated on one token).
+
+> **Key takeaway:** Self-attention in code is just three matrix multiplications (to project Q, K, V), one more multiply (QK^T), a scale, a softmax, and a final multiply by V. The `candle` crate provides all the tensor operations we need.
 
 ---
 
@@ -44,11 +877,420 @@ A hands-on course building a small GPT-1-style language model in Rust - from r
 
 ### §8 A Decoder-Only LM: Stacking Blocks and the Causal Mask
 
-🚧 *To be written - see [edu-vqxk]*
+We now have all the pieces: embeddings, positional encoding, and Transformer blocks. It is time to assemble them into a complete language model. We will build a **decoder-only** Transformer, the architecture used by GPT-1, GPT-2, GPT-3, and many other LLMs.
+
+#### Why "decoder-only"?
+
+The original Transformer (2017) had two halves:
+
+- An **encoder** that processes an input sequence (e.g., a French sentence).
- A **decoder** that generates an output sequence (e.g., the English translation), attending to both itself and the encoder output.

GPT (2018) showed that you only need the **decoder** half. By training the decoder to predict the next token in a single sequence, you get a general-purpose language model. No encoder, no cross-attention, just self-attention with a **causal mask**.

```
Original Transformer:          Decoder-only (GPT):
┌──────────┐  ┌──────────┐     ┌──────────┐
│ Encoder  │→ │ Decoder  │     │ Decoder  │
│ (bidir.) │  │ (causal) │     │ (causal) │
└──────────┘  └──────────┘     └──────────┘
Used for: translation          Used for: generation
```

#### The causal mask

In a decoder-only model, each token can only attend to tokens **at or before** its position, never to future tokens. This is essential because during generation, future tokens do not exist yet.

We enforce this with a **causal mask** (also called a "look-ahead mask"): a lower-triangular matrix that blocks attention to future positions:

```
Causal mask for sequence length 5:

      t0  t1  t2  t3  t4
t0 [   1   0   0   0   0 ]  ← t0 can only see t0
t1 [   1   1   0   0   0 ]  ← t1 can see t0, t1
t2 [   1   1   1   0   0 ]  ← t2 can see t0, t1, t2
t3 [   1   1   1   1   0 ]  ← t3 can see t0, t1, t2, t3
t4 [   1   1   1   1   1 ]  ← t4 can see everything
```

In practice, we set the scores at the masked positions (the zeros above) to \\( -\infty \\) before the softmax step. Since \\( e^{-\infty} = 0 \\), those positions get zero attention weight after softmax.

\\[ \text{MaskedAttention}(Q, K, V) = \text{softmax}\left(\frac{Q K^T}{\sqrt{d_k}} + M\right) V \\]

where \\( M \\) has 0 for allowed positions and \\( -\infty \\) for blocked positions.
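The effect of the additive mask \\( M \\) can be checked by hand on a tiny score matrix. The following is a minimal sketch in plain Rust (standard library only, no `candle`); the helper names `causal_mask`, `softmax`, and `masked_attention_weights` are made up for this illustration and are not part of the course code.

```rust
/// Build an additive causal mask: 0.0 where attention is allowed
/// (column <= row), negative infinity where it is blocked.
fn causal_mask(seq_len: usize) -> Vec<Vec<f32>> {
    (0..seq_len)
        .map(|row| {
            (0..seq_len)
                .map(|col| if col <= row { 0.0 } else { f32::NEG_INFINITY })
                .collect()
        })
        .collect()
}

/// Numerically stable softmax over one row of scores.
fn softmax(row: &[f32]) -> Vec<f32> {
    let max = row.iter().cloned().fold(f32::NEG_INFINITY, f32::max);
    let exps: Vec<f32> = row.iter().map(|&s| (s - max).exp()).collect();
    let sum: f32 = exps.iter().sum();
    exps.iter().map(|&e| e / sum).collect()
}

/// Add the mask to the scores, then apply softmax row by row.
fn masked_attention_weights(scores: &[Vec<f32>]) -> Vec<Vec<f32>> {
    let mask = causal_mask(scores.len());
    scores
        .iter()
        .zip(mask.iter())
        .map(|(s_row, m_row)| {
            let masked: Vec<f32> = s_row.iter().zip(m_row).map(|(s, m)| s + m).collect();
            softmax(&masked)
        })
        .collect()
}

fn main() {
    // Hypothetical scores: all equal, so unmasked attention would be uniform.
    let scores = vec![vec![1.0_f32; 3]; 3];
    for row in masked_attention_weights(&scores) {
        println!("{:?}", row);
    }
}
```

With equal scores, the first row puts all of its weight on `t0`, the second row splits its weight evenly between `t0` and `t1`, and only the last row spreads weight over all three positions. Every row still sums to 1, because softmax renormalises over the positions that remain visible.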

#### The full model architecture

Our GPT-1-style model stacks all the components in this order:

```
┌─────────────────────────────────────┐
│  Input Token IDs                    │
│  [4, 2, 7, 1]                       │
└──────────────┬──────────────────────┘
               │
               ▼
┌─────────────────────────────────────┐
│  Token Embedding (lookup)           │
│  + Position Embedding               │
│  → (seq_len, d_model)               │
└──────────────┬──────────────────────┘
               │
               ▼
┌─────────────────────────────────────┐
│  Transformer Block 1                │
│  ┌─ Masked Multi-Head Attention ─┐  │
│  │ + Residual + LayerNorm        │  │
│  │ FFN + Residual + LayerNorm    │  │
│  └───────────────────────────────┘  │
└──────────────┬──────────────────────┘
               │
               ▼
┌─────────────────────────────────────┐
│  Transformer Block 2                │
│  (same structure as Block 1)        │
└──────────────┬──────────────────────┘
               │
               ▼
              ... (N blocks total)
               │
               ▼
┌─────────────────────────────────────┐
│  Final Layer Norm                   │
└──────────────┬──────────────────────┘
               │
               ▼
┌─────────────────────────────────────┐
│  Linear Projection (no bias)        │
│  (d_model → vocab_size)             │
│  Output: logits per token           │
│  → (seq_len, vocab_size)            │
└─────────────────────────────────────┘
```

The diagram lists the components of each block rather than their exact order. The implementation in §9 applies layer norm *before* each sub-layer (pre-norm) and adds a final layer norm, which is the GPT-2 convention; the original GPT-1 applied layer norm after each sub-layer.

The output **logits** are raw (unnormalised) scores for each token in the vocabulary, at each position in the sequence. To get probabilities, we apply softmax. To get a loss, we compare these logits against the actual next tokens using cross-entropy.

#### Hyperparameters for our model

We keep things small enough to train on a CPU in seconds:

| Hyperparameter | Value | Description |
|----------------|-------|-------------|
| `vocab_size` | ~65 | Number of unique characters (depends on corpus) |
| `d_model` | 64 | Embedding dimension |
| `n_heads` | 4 | Number of attention heads |
| `n_layers` | 2 | Number of Transformer blocks |
| `d_ff` | 256 | FFN inner dimension (4 × d_model) |
| `max_seq_len` | 128 | Maximum sequence length |

This gives roughly 100K parameters (about 117K when counted from the §9 implementation with `vocab_size = 65`). That is tiny by modern standards, but sufficient to learn character-level patterns from a small corpus.

#### How generation works

Once trained, we generate text autoregressively:

1. Start with a **prompt** (e.g., "The ").
2. Encode it to token IDs.
3. Run the model to get logits for the next position.
4. **Sample** a token from the probability distribution (or take the argmax).
5. Append the sampled token to the sequence.
6. Repeat from step 3.
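The six steps above can be sketched as a plain loop. This is a toy illustration under stated assumptions, not the course's model: `toy_logits` is a hypothetical stand-in for the trained network that always favours "the next ID in the vocabulary", and step 4 uses the argmax option rather than sampling.

```rust
/// Hypothetical stand-in for the trained model: returns one logit per
/// vocabulary entry and favours the ID that follows the last token,
/// wrapping around at the end of the vocabulary.
fn toy_logits(context: &[usize], vocab_size: usize) -> Vec<f32> {
    let last = *context.last().expect("context must not be empty");
    (0..vocab_size)
        .map(|id| if id == (last + 1) % vocab_size { 1.0 } else { 0.0 })
        .collect()
}

/// Step 4 (argmax variant): index of the largest logit.
fn argmax(logits: &[f32]) -> usize {
    let mut best = 0;
    for (i, &v) in logits.iter().enumerate() {
        if v > logits[best] {
            best = i;
        }
    }
    best
}

/// Steps 3 to 6: predict, pick a token, append, repeat.
fn generate_greedy(prompt: &[usize], vocab_size: usize, new_tokens: usize) -> Vec<usize> {
    let mut tokens = prompt.to_vec();
    for _ in 0..new_tokens {
        let logits = toy_logits(&tokens, vocab_size); // step 3
        let next = argmax(&logits);                   // step 4
        tokens.push(next);                            // step 5
    }
    tokens
}

fn main() {
    // The prompt is already encoded as token IDs (steps 1 and 2).
    let out = generate_greedy(&[0, 1], 4, 5);
    println!("{:?}", out); // prints [0, 1, 2, 3, 0, 1, 2]
}
```

Replacing `argmax` with a random draw from the softmax of the logits gives the sampling variant of step 4. At the character level, each pass through the loop extends the text by one character.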

```
Step 1: "The "    → model → next token probabilities → sample 'c'
Step 2: "The c"   → model → next token probabilities → sample 'a'
Step 3: "The ca"  → model → next token probabilities → sample 't'
Step 4: "The cat" → model → next token probabilities → sample ' '
...
```

During generation we only need the logits at the **last** position to pick the next token. The model still processes the entire sequence at once, which pays off during training, where the logits at every position contribute to the loss.

> **Key takeaway:** A decoder-only language model is a stack of Transformer blocks with causal masking, sandwiched between an embedding layer and a linear output projection. The causal mask ensures each position can only attend to itself and earlier tokens, enabling autoregressive generation.

### §9 Exercise 3: Define the GPT-1-Style Model in `candle`

In this exercise, we define the full model architecture in Rust using `candle`. We will build the model struct by struct, from the bottom up.

#### The overall structure

We need these components:

1. **`CausalSelfAttention`**: multi-head attention with causal masking
2. **`FeedForward`**: the two-layer FFN
3. **`TransformerBlock`**: attention + FFN with residual connections and layer norm
4. **`Gpt1Model`**: the full model (embeddings, N blocks, final projection)

#### Configuration

First, define a config struct in `src/model.rs`:

```rust
use candle_core::{DType, Device, Result, Tensor, D};
use candle_nn::{
    embedding, layer_norm, linear, linear_no_bias, Embedding, LayerNorm,
    LayerNormConfig, Linear, Module, VarBuilder,
};

/// Configuration for our GPT-1-style model.
#[derive(Clone)]
pub struct GptConfig {
    pub vocab_size: usize,
    pub d_model: usize,
    pub n_heads: usize,
    pub n_layers: usize,
    pub d_ff: usize,
    pub max_seq_len: usize,
}

impl GptConfig {
    /// A tiny configuration suitable for CPU training.
    pub fn tiny(vocab_size: usize) -> Self {
        GptConfig {
            vocab_size,
            d_model: 64,
            n_heads: 4,
            n_layers: 2,
            d_ff: 256,
            max_seq_len: 128,
        }
    }
}
```

#### Causal self-attention

```rust
/// Multi-head causal self-attention.
pub struct CausalSelfAttention {
    qkv_proj: Linear,
    out_proj: Linear,
    n_heads: usize,
    d_k: usize,
}

impl CausalSelfAttention {
    pub fn new(cfg: &GptConfig, vb: VarBuilder) -> Result<Self> {
        let d_k = cfg.d_model / cfg.n_heads;
        // Project Q, K, V in a single linear layer for efficiency.
        let qkv_proj = linear(cfg.d_model, 3 * cfg.d_model, vb.pp("qkv_proj"))?;
        let out_proj = linear(cfg.d_model, cfg.d_model, vb.pp("out_proj"))?;
        Ok(Self { qkv_proj, out_proj, n_heads: cfg.n_heads, d_k })
    }

    pub fn forward(&self, x: &Tensor) -> Result<Tensor> {
        let (seq_len, d_model) = (x.dim(0)?, x.dim(1)?);

        // Project to Q, K, V in one operation, then split
        let qkv = self.qkv_proj.forward(x)?; // (seq_len, 3 * d_model)
        let q = qkv.narrow(1, 0, d_model)?;
        let k = qkv.narrow(1, d_model, d_model)?;
        let v = qkv.narrow(1, 2 * d_model, d_model)?;

        // Reshape for multi-head: (seq_len, n_heads, d_k), then transpose
        // to (n_heads, seq_len, d_k) for batched attention. `contiguous()`
        // is a defensive copy so that matmul sees a plain memory layout.
        let q = q.reshape((seq_len, self.n_heads, self.d_k))?
            .transpose(0, 1)?
            .contiguous()?;
        let k = k.reshape((seq_len, self.n_heads, self.d_k))?
            .transpose(0, 1)?
            .contiguous()?;
        let v = v.reshape((seq_len, self.n_heads, self.d_k))?
            .transpose(0, 1)?
            .contiguous()?;

        // Scaled dot-product attention: (n_heads, seq_len, seq_len)
        let scale = (self.d_k as f64).sqrt();
        let scores = q.matmul(&k.transpose(1, 2)?)?.affine(1.0 / scale, 0.0)?;

        // Additive causal mask. `tril2` gives 1 on and below the diagonal
        // (visible positions) and 0 above it (future positions).
        let visible = Tensor::tril2(seq_len, DType::F32, x.device())?;
        // Map 1 -> 0 and 0 -> -1e9; -1e9 acts as "minus infinity" in softmax.
        let additive_mask = visible.affine(1e9, -1e9)?;

        // The (seq_len, seq_len) mask broadcasts across all heads.
        let masked_scores = scores.broadcast_add(&additive_mask)?;
        let weights = candle_nn::ops::softmax(&masked_scores, D::Minus1)?;

        // Weighted sum of values
        let attn_out = weights.matmul(&v)?; // (n_heads, seq_len, d_k)

        // Reshape back: transpose → (seq_len, n_heads, d_k) → (seq_len, d_model)
        let attn_out = attn_out.transpose(0, 1)?
            .reshape((seq_len, d_model))?;

        // Output projection
        self.out_proj.forward(&attn_out)
    }
}
```

#### Feed-forward network

```rust
/// Position-wise feed-forward network with GELU activation.
pub struct FeedForward {
    up: Linear,
    down: Linear,
}

impl FeedForward {
    pub fn new(cfg: &GptConfig, vb: VarBuilder) -> Result<Self> {
        let up = linear(cfg.d_model, cfg.d_ff, vb.pp("up"))?;
        let down = linear(cfg.d_ff, cfg.d_model, vb.pp("down"))?;
        Ok(Self { up, down })
    }

    pub fn forward(&self, x: &Tensor) -> Result<Tensor> {
        let h = self.up.forward(x)?.gelu()?;
        self.down.forward(&h)
    }
}
```

#### Transformer block (pre-norm)

```rust
/// A single Transformer block with pre-norm residual connections.
pub struct TransformerBlock {
    attn: CausalSelfAttention,
    ffn: FeedForward,
    ln1: LayerNorm,
    ln2: LayerNorm,
}

impl TransformerBlock {
    pub fn new(cfg: &GptConfig, vb: VarBuilder) -> Result<Self> {
        let attn = CausalSelfAttention::new(cfg, vb.pp("attn"))?;
        let ffn = FeedForward::new(cfg, vb.pp("ffn"))?;
        // The config type is spelled out because `layer_norm` is generic
        // over its config argument.
        let ln1 = layer_norm(cfg.d_model, LayerNormConfig::default(), vb.pp("ln1"))?;
        let ln2 = layer_norm(cfg.d_model, LayerNormConfig::default(), vb.pp("ln2"))?;
        Ok(Self { attn, ffn, ln1, ln2 })
    }

    pub fn forward(&self, x: &Tensor) -> Result<Tensor> {
        // Pre-norm: x + Attn(LayerNorm(x))
        let residual = x;
        let h = self.ln1.forward(x)?;
        let h = self.attn.forward(&h)?;
        let x = (residual + h)?;

        // Pre-norm: x + FFN(LayerNorm(x))
        let residual = &x;
        let h = self.ln2.forward(&x)?;
        let h = self.ffn.forward(&h)?;
        residual + h
    }
}
```

#### The full GPT model

```rust
/// A small GPT-1-style language model.
pub struct Gpt1Model {
    token_emb: Embedding,
    pos_emb: Embedding,
    blocks: Vec<TransformerBlock>,
    ln_f: LayerNorm,
    lm_head: Linear,
}

impl Gpt1Model {
    pub fn new(cfg: &GptConfig, vb: VarBuilder) -> Result<Self> {
        let token_emb = embedding(cfg.vocab_size, cfg.d_model, vb.pp("token_emb"))?;
        let pos_emb = embedding(cfg.max_seq_len, cfg.d_model, vb.pp("pos_emb"))?;

        let mut blocks = Vec::with_capacity(cfg.n_layers);
        for i in 0..cfg.n_layers {
            blocks.push(TransformerBlock::new(cfg, vb.pp(format!("block_{}", i)))?);
        }

        let ln_f = layer_norm(cfg.d_model, LayerNormConfig::default(), vb.pp("ln_f"))?;
        let lm_head = linear_no_bias(cfg.d_model, cfg.vocab_size, vb.pp("lm_head"))?;

        Ok(Self { token_emb, pos_emb, blocks, ln_f, lm_head })
    }

    /// Forward pass: token IDs → logits.
    ///
    /// # Arguments
    /// * `token_ids` - 1D tensor of shape `(seq_len,)` with token IDs.
    ///   `seq_len` must not exceed `max_seq_len`.
    ///
    /// # Returns
    /// Logits tensor of shape `(seq_len, vocab_size)`.
    pub fn forward(&self, token_ids: &Tensor) -> Result<Tensor> {
        let seq_len = token_ids.dim(0)?;

        // Create position indices [0, 1, 2, ..., seq_len-1]
        let positions = Tensor::arange(0u32, seq_len as u32, token_ids.device())?;

        // Embed tokens and positions, then add
        let tok_emb = self.token_emb.forward(token_ids)?;
        let pos_emb = self.pos_emb.forward(&positions)?;
        let mut x = (tok_emb + pos_emb)?;

        // Pass through all Transformer blocks
        for block in &self.blocks {
            x = block.forward(&x)?;
        }

        // Final layer norm + projection to vocabulary
        let x = self.ln_f.forward(&x)?;
        self.lm_head.forward(&x)
    }
}
```

#### Testing the model

In `main.rs`:

```rust
mod model;

use candle_core::{DType, Device, Tensor};
use candle_nn::VarMap;

fn main() -> anyhow::Result<()> {
    let device = &Device::Cpu;
    let varmap = VarMap::new();
    let vb = candle_nn::VarBuilder::from_varmap(&varmap, DType::F32, device);

    let cfg = model::GptConfig::tiny(65); // 65 characters in typical Shakespeare
    let model = model::Gpt1Model::new(&cfg, vb)?;

    // Create a dummy input: 10 token IDs
    let input = Tensor::new(&[0u32, 1, 2, 3, 4, 5, 6, 7, 8, 9], device)?;
    let logits = model.forward(&input)?;

    println!("Input shape: {:?}", input.shape());
    println!("Output logits shape: {:?}", logits.shape());
    // Should be (10, 65): 10 positions, 65 vocabulary scores each

    Ok(())
}
```

#### What to observe

- The output shape should be `(seq_len, vocab_size)`: one set of logits per input position.
- With random weights, the logits will be meaningless noise. Training (next section) will make them meaningful.
- The model processes the entire sequence in parallel, which is a key advantage of Transformers over RNNs.

> **Key takeaway:** Our GPT model is built from composable structs: `CausalSelfAttention`, `FeedForward`, `TransformerBlock`, and `Gpt1Model`.
> Each handles one concern, and the `candle` crate provides the tensor operations and automatic differentiation we need for training.

---

### §10 Cross-Entropy Loss and the Training Loop

We have a model that takes token IDs and outputs logits. Now we need to teach it to output the *right* logits, the ones that predict the next token accurately. This is where **training** comes in.

#### The training objective

Recall that our model outputs logits of shape `(seq_len, vocab_size)`: a score for every token in the vocabulary, at every position. The training target is simple: at each position `i`, the correct prediction is the token at position `i + 1` of the text.

```
Input:    [T, h, e,  , c, a, t]
Target:   [h, e,  , c, a, t, .]
Position:  0  1  2  3  4  5  6
```

At position 0, where the input is "T", the model should predict "h". At position 1, it should predict "e", and so on. We shift the input by one to create the targets.

#### Cross-entropy loss

**Cross-entropy loss** measures how far the model's predicted probability distribution is from the true distribution (where all probability mass is on the correct token).

For a single position where the correct token has index \\( y \\):

\\[ \mathcal{L} = -\log P(y) = -\log \frac{e^{z_y}}{\sum_j e^{z_j}} \\]

where \\( z_j \\) are the logits and \\( \log \\) is the natural logarithm. Intuitively:

- If the model assigns high probability to the correct token, \\( -\log P(y) \\) is small (close to 0). Good.
- If the model assigns low probability, \\( -\log P(y) \\) is large. Bad.

We average this loss over all positions in the sequence and all sequences in the batch.

#### A concrete example

Suppose our vocabulary is `['a', 'b', 'c']` and the correct next token is `'b'` (index 1).
The model outputs logits:

```
Logits:        [2.0, 5.0, 1.0]
After softmax: [0.047, 0.936, 0.017]  (e^2 / sum, e^5 / sum, e^1 / sum)
Loss:          -log(0.936) ≈ 0.07     (low loss: model is confident and correct)
```

If the model were wrong:

```
Logits:        [5.0, 1.0, 2.0]
After softmax: [0.936, 0.017, 0.047]
Loss:          -log(0.017) ≈ 4.07     (high loss: model is confident but wrong)
```

#### Gradient descent

To minimise the loss, we use **gradient descent**. The idea is:

1. Compute the loss for a batch of data.
2. Compute the **gradient** of the loss with respect to every model parameter (backpropagation).
3. Update each parameter by subtracting a small multiple of its gradient:
   \\[ \theta \leftarrow \theta - \eta \nabla_\theta \mathcal{L} \\]
   where \\( \eta \\) is the **learning rate**.

The learning rate is a critical hyperparameter:
- Too high: training is unstable, loss oscillates or diverges.
- Too low: training is painfully slow.
- A typical starting value for small models: between `3e-4` and `1e-3`.

#### The training loop

The training loop repeats the following steps for many **epochs** (passes through the entire dataset):

```
for epoch in 1..=num_epochs:
    for batch in dataset:
        1. Forward pass:   logits = model(input_tokens)
        2. Compute loss:   loss = cross_entropy(logits, target_tokens)
        3. Backward pass:  compute gradients via backpropagation
        4. Update weights: optimizer.step()
        5. Zero gradients: optimizer.zero_grad()
    print epoch loss
```

#### Batching

For efficiency, we process multiple sequences at once in a **batch**. Instead of feeding one sequence at a time, we stack `batch_size` sequences into a matrix:

- Input shape: `(batch_size, seq_len)`
- Output logits: `(batch_size, seq_len, vocab_size)`

For our small model training on CPU, a batch size of 32-64 works well.

#### The AdamW optimiser

We will use **AdamW**, a variant of the Adam optimiser with decoupled weight decay.
Adam adapts the learning rate for each parameter based on the history of its gradients, which generally works much better than plain gradient descent. `candle` provides AdamW out of the box.

```
AdamW hyperparameters:
  learning_rate: 3e-4
  beta1:         0.9    (momentum)
  beta2:         0.999  (RMS of gradients)
  weight_decay:  0.1
```

> **Key takeaway:** Cross-entropy loss measures how well the model's predictions match the true next tokens. The training loop repeatedly computes this loss, computes gradients via backpropagation, and updates the model's parameters to reduce the loss.

### §11 Exercise 4: Train on a Small Text Corpus

In this exercise, we put everything together: load a text corpus, create training data, and train our model.

#### Preparing the data

For training data, we will use a small text corpus; a few kilobytes of Shakespeare works well. Create a file `data/input.txt` with some text, or use this approach to embed the data directly:

```rust
/// Load and prepare training data.
/// Returns (tokeniser, input_ids) where input_ids is the entire
/// corpus encoded as a vector of token IDs.
fn load_data() -> (CharTokeniser, Vec<u32>) {
    let text = "\
First Citizen:
Before we proceed any further, hear me speak.

All:
Speak, speak.

First Citizen:
You are all resolved rather to die than to famish?

All:
Resolved. resolved.

First Citizen:
First, you know Caius Marcius is chief enemy to the people.

All:
We know't, we know't.

First Citizen:
Let us kill him, and we'll have corn at our own price.
Is't a verdict?

All:
No more talking on't; let it be done: away, away!
";

    let tok = CharTokeniser::from_corpus(text);
    let ids = tok.encode(text);
    (tok, ids)
}
```

You can replace this with a longer text for better results. The more data, the more patterns the model can learn.
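If you prefer the `data/input.txt` route, one possible shape for the loading code is sketched below, using only the standard library. The helpers `load_corpus` and `vocabulary` are hypothetical names introduced for this illustration; they are not part of `CharTokeniser`, and the fallback string is just a placeholder.

```rust
use std::collections::BTreeSet;
use std::fs;

/// Read the corpus from `path`, falling back to `fallback` if the file
/// cannot be read (for example, because it does not exist yet).
fn load_corpus(path: &str, fallback: &str) -> String {
    fs::read_to_string(path).unwrap_or_else(|_| fallback.to_string())
}

/// Collect the distinct characters of the corpus in sorted order.
/// The length of this list is the character-level vocabulary size.
fn vocabulary(text: &str) -> Vec<char> {
    text.chars().collect::<BTreeSet<char>>().into_iter().collect()
}

fn main() {
    let text = load_corpus("data/input.txt", "Speak, speak.");
    let vocab = vocabulary(&text);
    println!("Corpus size: {} characters", text.chars().count());
    println!("Vocabulary size: {}", vocab.len());
}
```

The resulting `String` can be passed to `CharTokeniser::from_corpus` and `encode` exactly as the embedded text is in `load_data`. Printing the corpus and vocabulary sizes first is a quick sanity check before training.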

#### Creating batches

We need to extract fixed-length chunks from the corpus for training:

```rust
use candle_core::{DType, Device, Tensor};

/// Create a batch of (input, target) pairs from the corpus.
///
/// Each input is a sequence of `seq_len` tokens.
/// Each target is the same sequence shifted by one position.
fn create_batch(
    data: &[u32],
    batch_size: usize,
    seq_len: usize,
    device: &Device,
) -> anyhow::Result<(Tensor, Tensor)> {
    use rand::Rng;

    // Each sample reads seq_len + 1 tokens (inputs plus the final target).
    anyhow::ensure!(
        data.len() > seq_len,
        "corpus has {} tokens but seq_len + 1 = {} are needed",
        data.len(),
        seq_len + 1
    );
    let mut rng = rand::thread_rng();
    let max_start = data.len() - seq_len - 1; // last valid start index

    let mut inputs = Vec::with_capacity(batch_size * seq_len);
    let mut targets = Vec::with_capacity(batch_size * seq_len);

    for _ in 0..batch_size {
        let start = rng.gen_range(0..=max_start);
        for j in 0..seq_len {
            inputs.push(data[start + j]);
            targets.push(data[start + j + 1]);
        }
    }

    let inputs = Tensor::new(inputs.as_slice(), device)?
        .reshape((batch_size, seq_len))?;
    let targets = Tensor::new(targets.as_slice(), device)?
        .reshape((batch_size, seq_len))?;

    Ok((inputs, targets))
}
```

#### The training loop

Here is the complete training loop. The forward pass from Exercise 3 expects a 1D tensor of token IDs, so it cannot consume a `(batch_size, seq_len)` batch directly.
For simplicity, we process each sequence in the batch separately and stack the results:

```rust
use candle_nn::{AdamW, Optimizer, ParamsAdamW, VarBuilder, VarMap};

fn train() -> anyhow::Result<()> {
    let device = &Device::Cpu;
    let (tok, data) = load_data();

    println!("Corpus size: {} characters", data.len());
    println!("Vocabulary size: {}", tok.vocab_size());

    // Model setup
    let varmap = VarMap::new();
    let vb = VarBuilder::from_varmap(&varmap, DType::F32, device);
    let cfg = GptConfig::tiny(tok.vocab_size());
    let model = Gpt1Model::new(&cfg, vb)?;

    // Optimiser
    let params = ParamsAdamW {
        lr: 3e-4,
        weight_decay: 0.1,
        ..Default::default()
    };
    let mut opt = AdamW::new(varmap.all_vars(), params)?;

    // Training hyperparameters
    let batch_size = 16;
    let seq_len = 64;
    let num_steps = 1000;

    println!("\nTraining for {} steps...\n", num_steps);

    for step in 1..=num_steps {
        let (inputs, targets) = create_batch(&data, batch_size, seq_len, device)?;

        // Forward pass: process each sequence in the batch
        let mut all_logits = Vec::new();
        for b in 0..batch_size {
            let input_b = inputs.get(b)?; // (seq_len,)
            let logits_b = model.forward(&input_b)?; // (seq_len, vocab_size)
            all_logits.push(logits_b);
        }
        let logits = Tensor::stack(&all_logits, 0)?; // (batch, seq_len, vocab_size)

        // Reshape for cross-entropy: flatten batch and sequence dimensions
        let vocab_size = tok.vocab_size();
        let logits_flat = logits.reshape((batch_size * seq_len, vocab_size))?;
        let targets_flat = targets.reshape(batch_size * seq_len)?;

        // Cross-entropy loss. `cross_entropy` applies log-softmax itself, so
        // it takes raw logits and u32 class indices (not one-hot vectors).
        let loss = candle_nn::loss::cross_entropy(&logits_flat, &targets_flat)?;

        // Backward pass + optimiser step
        opt.backward_step(&loss)?;

        if step % 100 == 0 || step == 1 {
            let loss_val: f32 = loss.to_scalar()?;
            println!("Step {:>4} | Loss: {:.4}", step, loss_val);
        }
    }

    println!("\nTraining complete!");
    Ok(())
}
```

#### Add `rand` to your dependencies

Update `Cargo.toml`:

```toml
[dependencies]
candle-core = "0.8"
candle-nn = "0.8"
anyhow = "1"
rand = "0.8"
```

#### Expected output

```
Corpus size: 482 characters
Vocabulary size: 42

Training for 1000 steps...

Step    1 | Loss: 3.7376
Step  100 | Loss: 2.8412
Step  200 | Loss: 2.3567
Step  300 | Loss: 2.0134
Step  400 | Loss: 1.8223
Step  500 | Loss: 1.6891
Step  600 | Loss: 1.5744
Step  700 | Loss: 1.4832
Step  800 | Loss: 1.4102
Step  900 | Loss: 1.3523
Step 1000 | Loss: 1.2987

Training complete!
```

Your exact numbers will differ, but the loss should decrease steadily. A randomly initialised model starts with loss \\( \approx \ln(\text{vocab\_size}) \\), because it spreads probability roughly evenly over the vocabulary (for 42 tokens, that is \\( \ln(42) \approx 3.74 \\)). As the model trains, it learns character patterns and the loss drops.

#### Tips for better results

1. **Use more data.** Even a few pages of Shakespeare (50KB+) will dramatically improve generation quality.
2. **Train longer.** 1000 steps is a minimum; try 5000 or 10000 for better results.
3. **Adjust the learning rate.** If loss plateaus, try reducing the learning rate.
4. **Increase model size.** With more data, you can increase `d_model` to 128 and `n_layers` to 4.

> **Key takeaway:** The training loop repeatedly samples batches, computes the forward pass and cross-entropy loss, and updates weights via backpropagation. Watching the loss decrease is satisfying confirmation that the model is learning.

### §12 Exercise 5: Sample from the Model

Now for the payoff of all our work: generating text from the trained model. In this exercise, we implement temperature-based sampling and generate text character by character.

#### Temperature sampling

After the model produces logits for the next token, we convert them to probabilities using softmax.
The **temperature** parameter controls the randomness of sampling:

\\[ P(t_i) = \frac{e^{z_i / T}}{\sum_j e^{z_j / T}} \\]

where \\( T \\) is the temperature:

| Temperature | Effect |
|-------------|--------|
| T < 1.0 | **Sharper** distribution: the model picks high-probability tokens more often. More deterministic, less creative. |
| T = 1.0 | **Unmodified** distribution: sample directly from the learned probabilities. |
| T > 1.0 | **Flatter** distribution: lower-probability tokens get a bigger share. More random, more creative. |
| T → 0 | Equivalent to **argmax**: always pick the most likely token. |

```
Logits: [2.0, 5.0, 1.0]

T = 1.0 → P: [0.05, 0.94, 0.02]  (normal)
T = 0.5 → P: [0.00, 1.00, 0.00]  (very peaked)
T = 2.0 → P: [0.16, 0.74, 0.10]  (flattened)
```

#### Top-k sampling

**Top-k sampling** restricts the choice to the `k` most probable tokens, setting all other probabilities to zero. This prevents the model from choosing extremely unlikely tokens (which can produce gibberish):

```
Logits (sorted): [5.0,  3.0,  2.0,  0.5, -1.0, -3.0]
Top-k (k=3):     [5.0,  3.0,  2.0, -inf, -inf, -inf]
After softmax:   [0.84, 0.11, 0.04, 0.0,  0.0,  0.0]
```

Combining temperature and top-k is the standard approach in practice.

#### Implementation

Add a sampling function to your project:

```rust
use rand::distributions::{Distribution, WeightedIndex};

/// Sample a token ID from logits with temperature and optional top-k.
///
/// # Arguments
/// * `logits` - 1D tensor of shape `(vocab_size,)` with raw model output
/// * `temperature` - Controls randomness (lower = more deterministic); must be > 0
/// * `top_k` - If Some(k), only consider the top k most likely tokens
fn sample_token(
    logits: &Tensor,
    temperature: f64,
    top_k: Option<usize>,
) -> anyhow::Result<u32> {
    anyhow::ensure!(temperature > 0.0, "temperature must be positive");
    let vocab_size = logits.dim(0)?;

    // Apply temperature
    let scaled = if temperature != 1.0 {
        (logits / temperature)?
    } else {
        logits.clone()
    };

    // Convert to Vec for manipulation
    let mut logit_vec: Vec<f32> = scaled.to_vec1()?;

    // Apply top-k: set everything below the k-th largest logit to -inf.
    // Logits tied with the threshold are all kept.
    if let Some(k) = top_k {
        // Clamp so that k = 0 or k > vocab_size cannot index out of bounds.
        let k = k.clamp(1, vocab_size);
        let mut sorted = logit_vec.clone();
        sorted.sort_by(|a, b| b.total_cmp(a)); // descending
        let threshold = sorted[k - 1];
        for val in logit_vec.iter_mut() {
            if *val < threshold {
                *val = f32::NEG_INFINITY;
            }
        }
    }

    // Softmax to get probabilities (subtract the max for numerical stability)
    let max_val = logit_vec.iter().cloned().fold(f32::NEG_INFINITY, f32::max);
    let exps: Vec<f32> = logit_vec.iter().map(|&x| (x - max_val).exp()).collect();
    let sum: f32 = exps.iter().sum();
    let probs: Vec<f64> = exps.iter().map(|&x| (x / sum) as f64).collect();

    // Sample from the distribution
    let dist = WeightedIndex::new(&probs)?;
    let mut rng = rand::thread_rng();
    Ok(dist.sample(&mut rng) as u32)
}
```

#### The generation loop

```rust
/// Generate text from the model autoregressively.
///
/// # Arguments
/// * `model` - The trained GPT model
/// * `tok` - The tokeniser
/// * `prompt` - Starting text (must not be empty)
/// * `max_tokens` - Number of tokens to generate
/// * `temperature` - Sampling temperature
/// * `top_k` - Optional top-k filtering
fn generate(
    model: &Gpt1Model,
    tok: &CharTokeniser,
    prompt: &str,
    max_tokens: usize,
    temperature: f64,
    top_k: Option<usize>,
) -> anyhow::Result<String> {
    let device = &Device::Cpu;

    // Encode the prompt
    let mut token_ids = tok.encode(prompt);
    anyhow::ensure!(!token_ids.is_empty(), "prompt must contain at least one token");
    let max_seq_len = 128; // Must match model config

    for _ in 0..max_tokens {
        // Truncate to max_seq_len if needed (keep most recent tokens)
        let start = if token_ids.len() > max_seq_len {
            token_ids.len() - max_seq_len
        } else {
            0
        };
        let context = &token_ids[start..];

        // Forward pass
        let input = Tensor::new(context, device)?;
        let logits = model.forward(&input)?;

        // Get logits for the last position
        let last_logits = logits.get(context.len() - 1)?; // (vocab_size,)

        // Sample next token
        let next_token = sample_token(&last_logits, temperature, top_k)?;
        token_ids.push(next_token);
    }

    Ok(tok.decode(&token_ids))
}
```

#### Putting it together

After training, add generation:

```rust
fn main() -> anyhow::Result<()> {
    // ... training code from Exercise 4 ...
    // This assumes the trained `model` and `tok` are in scope here, for
    // example because the training code was inlined or returns them.

    println!("\n--- Generating text ---\n");

    // Try different temperatures
    for temp in [0.5, 0.8, 1.0, 1.5] {
        let text = generate(&model, &tok, "First", 200, temp, Some(10))?;
        println!("Temperature {:.1}:", temp);
        println!("{}\n", text);
    }

    Ok(())
}
```

#### Expected output

With a well-trained model on Shakespeare text, the output might look like this (temperature 0.8 omitted for brevity):

```
Temperature 0.5:
First Citizen:
Before we proceed any further, hear me speak.
All:
Speak, speak.

Temperature 1.0:
First Citizen:
Let us know the people are resolved to
the corn at our own kill him, and we
speak.

Temperature 1.5:
First Civkzl:
aNo moye, ws arl't; he proceiw
Le usdn ferktie corn at mork
```

At low temperature, the model reproduces memorised text. At high temperature, it becomes creative but error-prone. Temperature 0.8-1.0 is usually the sweet spot.

#### Exercises to try

1. **Experiment with top-k.** Try `k = 1` (greedy), `k = 5`, `k = 10`, and `None` (no filtering). How does it affect output quality?
2. **Implement top-p (nucleus) sampling.** Instead of a fixed k, include tokens until their cumulative probability exceeds a threshold p (e.g., 0.9).
3. **Try different prompts.** How does the model respond to prompts it has seen vs. novel prompts?
4. **Measure perplexity.** Compute \\( e^{\text{average loss}} \\) on a held-out test set to quantify model quality.

> **Key takeaway:** Text generation works by repeatedly running the model, sampling a token from the output distribution, and appending it. Temperature and top-k control the tradeoff between coherence and creativity.

---

### §13 What Limits This Model?

We have built a working language model: it can learn patterns in text and generate new text. But if you compare its output to ChatGPT or Claude, the gap is enormous. Let us understand why.

#### Context length

Our model has a **maximum context window of 128 tokens** (characters). It literally cannot "see" anything beyond the last 128 characters. Modern LLMs have context windows of 8K to 200K+ tokens (and those are subword tokens, each representing several characters). This means our model:

- Cannot maintain coherence over long passages.
- Cannot reason about information that appeared more than a couple of dozen words ago.
- Has no ability to follow instructions that exceed its window.

#### Model size

Our model has roughly **100,000 parameters**.
For comparison:
+
+| Model | Parameters | Ratio to ours |
+|-------|-----------|---------------|
+| Our model | ~100K | 1x |
+| GPT-1 (2018) | 117M | 1,170x |
+| GPT-2 (2019) | 1.5B | 15,000x |
+| GPT-3 (2020) | 175B | 1,750,000x |
+| GPT-4 (2023) | ~1.8T (rumoured) | ~18,000,000x |
+
+With only 100K parameters, our model can memorise short character patterns but cannot learn grammar, semantics, or world knowledge. Larger models have more capacity to store and compose information.
+
+#### Training data
+
+We trained on a few hundred characters. Real LLMs train on **trillions of tokens**, drawn from a large share of the public internet, books, code repositories, and more. The sheer volume and diversity of that data are what allow large models to:
+
+- Learn the structure of many languages.
+- Absorb factual knowledge.
+- Understand code, math, and reasoning patterns.
+
+#### Tokenisation
+
+Character-level tokenisation means our model sees one character per step. A 10-word sentence is ~50 tokens for us, but only ~10-15 tokens for a BPE tokeniser. This means:
+
+- Our model needs a longer context window for the same effective range.
+- Longer sequences are more expensive to process (attention is \\( O(n^2) \\) in sequence length).
+- The model must learn to spell, because it cannot take words as atomic units.
+
+Real LLMs use BPE (GPT) or SentencePiece (Llama) tokenisers with vocabularies of 32K-100K tokens.
+
+#### Training techniques we skipped
+
+Production LLMs use many techniques we did not cover:
+
+- **Learning rate scheduling.** A warm-up phase followed by cosine decay.
+- **Gradient clipping.** Preventing exploding gradients by capping their magnitude.
+- **Mixed precision training.** Using float16/bfloat16 for speed and memory efficiency.
+- **Data parallelism and model parallelism.** Distributing training across hundreds of GPUs.
+- **RLHF (Reinforcement Learning from Human Feedback).** Fine-tuning the model to follow instructions and be helpful, using human preference data. This is what makes ChatGPT and Claude *conversational*, rather than just completing text.
+- **Supervised fine-tuning (SFT).** Training on curated instruction-response pairs before RLHF.
+
+#### What our model CAN do
+
+Despite its limitations, our model demonstrates every fundamental component of a modern LLM:
+
+1. **Tokenisation**: converting text to numbers and back.
+2. **Embeddings**: learned vector representations of tokens and positions.
+3. **Self-attention with causal masking**: the core Transformer mechanism.
+4. **Stacked Transformer blocks**: depth through repeated application.
+5. **Cross-entropy training**: learning from next-token prediction.
+6. **Autoregressive generation**: producing text one token at a time.
+
+The jump from our model to GPT-4 is primarily one of **scale** (more parameters, more data, more compute), plus careful engineering and alignment techniques. The architecture is fundamentally the same.
+
+> **Key takeaway:** Our model is limited by context length, model size, training data, and tokenisation. But it contains every core component of production LLMs. The path from here to GPT-4 is primarily scaling and engineering, not architectural revolution.

 ### §14 Further Reading

-🚧 *To be written - see [edu-9sb7]*
+This chapter covered the foundations. Here are resources to go deeper, organised by topic.
+
+#### Foundational papers
+
+- **"Attention Is All You Need"** (Vaswani et al., 2017) - [arxiv.org/abs/1706.03762](https://arxiv.org/abs/1706.03762) - The paper that introduced the Transformer architecture. Essential reading.
+- **"Improving Language Understanding by Generative Pre-Training"** (Radford et al., 2018) - [CDN link](https://cdn.openai.com/research-covers/language-unsupervised/language_understanding_paper.pdf) - The GPT-1 paper.
Showed that a decoder-only Transformer pre-trained on language modeling can be fine-tuned for many tasks.
+- **"Language Models are Unsupervised Multitask Learners"** (Radford et al., 2019) - [CDN link](https://cdn.openai.com/better-language-models/language_models_are_unsupervised_multitask_learners.pdf) - The GPT-2 paper. Showed that a scaled-up GPT-1 can perform many tasks zero-shot, without task-specific fine-tuning.
+- **"Language Models are Few-Shot Learners"** (Brown et al., 2020) - [arxiv.org/abs/2005.14165](https://arxiv.org/abs/2005.14165) - The GPT-3 paper. Demonstrated that massive scale enables in-context (few-shot) learning.
+
+#### Tutorials and courses
+
+- **Andrej Karpathy's "Let's build GPT"** - [youtube.com/watch?v=kCc8FmEb1nY](https://www.youtube.com/watch?v=kCc8FmEb1nY) - A two-hour video building a GPT from scratch in Python/PyTorch. Excellent companion to this chapter, covering the same ideas in a different language.
+- **Andrej Karpathy's "makemore" series** - Builds character-level language models of increasing complexity, from bigrams to Transformers. Available on YouTube.
+- **3Blue1Brown "But what is a neural network?"** - [youtube.com/watch?v=aircAruvnKk](https://www.youtube.com/watch?v=aircAruvnKk) - The first video in a series of visual explanations covering the basics of neural networks, backpropagation, and gradient descent.
+- **"The Illustrated Transformer"** by Jay Alammar - [jalammar.github.io/illustrated-transformer/](https://jalammar.github.io/illustrated-transformer/) - A widely recommended visual guide to the Transformer architecture.
+
+#### Rust ML ecosystem
+
+- **Candle** - [github.com/huggingface/candle](https://github.com/huggingface/candle) - The tensor framework we used in this chapter. Supports CPU and GPU, with a PyTorch-like API.
+- **Candle documentation** - [docs.rs/candle-core](https://docs.rs/candle-core) - API reference for tensor operations.
+- **Burn** - [burn.dev](https://burn.dev/) - Another Rust deep learning framework, with a different design philosophy (backend-agnostic).
+- **tch-rs** - [github.com/LaurentMazare/tch-rs](https://github.com/LaurentMazare/tch-rs) - Rust bindings for PyTorch's C++ library (libtorch). More mature but requires a C++ dependency.
+
+#### Books
+
+- **"Deep Learning"** by Goodfellow, Bengio, and Courville - [deeplearningbook.org](https://www.deeplearningbook.org/) - The comprehensive textbook on deep learning fundamentals.
+- **"Dive into Deep Learning"** - [d2l.ai](https://d2l.ai/) - Interactive, code-first textbook with implementations in multiple frameworks.
+- **"Speech and Language Processing"** by Jurafsky and Martin - [web.stanford.edu/~jurafsky/slp3/](https://web.stanford.edu/~jurafsky/slp3/) - Covers NLP foundations including language modeling, with chapters on neural approaches.
+
+#### Topics to explore next
+
+Now that you understand the basics, here are natural next steps:
+
+1. **Subword tokenisation.** Implement BPE (Byte-Pair Encoding) to handle larger vocabularies efficiently. See the `tokenizers` crate by Hugging Face.
+2. **GPU training.** Switch from `Device::Cpu` to a CUDA device in candle (for example `Device::new_cuda(0)?`, with candle's `cuda` feature enabled) to train on a GPU. This makes much larger models and datasets practical.
+3. **Positional encodings.** Experiment with RoPE (Rotary Position Embeddings), which is used in Llama and many other modern models.
+4. **KV caching.** During generation, cache the key and value tensors from previous tokens to avoid redundant computation. This is essential for fast inference.
+5. **Fine-tuning a pre-trained model.** Load a pre-trained model (e.g., a small Llama) in candle and fine-tune it on your own data.
+6. **RLHF.** Study how reinforcement learning from human feedback transforms a language model into an assistant.
+
+> **Key takeaway:** The field of language modeling is vast and evolving rapidly.
The fundamentals you learned in this chapter โ€” tokenisation, embeddings, attention, training โ€” are the foundation everything else builds on. Pick a direction that interests you and keep building. diff --git a/edu/src/shaders.md b/edu/src/shaders.md index 52dadca..0e6fd56 100644 --- a/edu/src/shaders.md +++ b/edu/src/shaders.md @@ -1,6 +1,6 @@ # Shader Programming with wgpu and WGSL -This document is a self-guided course on GPU shader programming. It is organised into six parts: the GPU execution model, setting up with `wgpu`, vertex and fragment shaders, textures and samplers, compute shaders, and a look at where to go next. Each section is either a reading lesson or a hands-on Rust programming exercise. Sections marked ๐Ÿšง are stubs whose full content is tracked in a beans ticket โ€” follow the ticket ID to find the detailed learning objectives and instructions. +This document is a self-guided course on GPU shader programming. It is organised into six parts: the GPU execution model, setting up with `wgpu`, vertex and fragment shaders, textures and samplers, compute shaders, and a look at where to go next. Each section is either a reading lesson or a hands-on Rust programming exercise. --- @@ -48,19 +48,355 @@ This document is a self-guided course on GPU shader programming. It is organised ### 1. CPU vs GPU: parallel execution model -๐Ÿšง This section is a stub. Full content tracked in [edu-5g0l]. +To understand shader programming, you first need to understand *why GPUs exist* and how they differ from CPUs. The core difference comes down to a design trade-off: **latency vs throughput**. + +#### The CPU: a few powerful cores + +A modern CPU has a small number of cores โ€” typically 4 to 16 on a consumer chip. Each core is highly sophisticated: it has deep pipelines, branch predictors, out-of-order execution, and large caches. This design makes each individual core extremely fast at executing a single sequence of instructions. 
+ +```text +CPU (8 cores) +โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” +โ”‚ Core 0 โ”‚ โ”‚ Core 1 โ”‚ โ”‚ Core 2 โ”‚ โ”‚ Core 3 โ”‚ +โ”‚ (complex)โ”‚ โ”‚ (complex)โ”‚ โ”‚ (complex)โ”‚ โ”‚ (complex)โ”‚ +โ”‚ OoO exec โ”‚ โ”‚ OoO exec โ”‚ โ”‚ OoO exec โ”‚ โ”‚ OoO exec โ”‚ +โ”‚ Branch โ”‚ โ”‚ Branch โ”‚ โ”‚ Branch โ”‚ โ”‚ Branch โ”‚ +โ”‚ pred. โ”‚ โ”‚ pred. โ”‚ โ”‚ pred. โ”‚ โ”‚ pred. โ”‚ +โ”‚ L1/L2 โ”‚ โ”‚ L1/L2 โ”‚ โ”‚ L1/L2 โ”‚ โ”‚ L1/L2 โ”‚ +โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ +โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” +โ”‚ Core 4 โ”‚ โ”‚ Core 5 โ”‚ โ”‚ Core 6 โ”‚ โ”‚ Core 7 โ”‚ +โ”‚ (complex)โ”‚ โ”‚ (complex)โ”‚ โ”‚ (complex)โ”‚ โ”‚ (complex)โ”‚ +โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ +``` + +CPUs are optimised for **low latency** โ€” finishing any single task as quickly as possible. This makes them ideal for general-purpose programming: parsing JSON, running game logic, managing operating system tasks. + +#### The GPU: thousands of simple cores + +A GPU takes the opposite approach. It packs thousands of tiny, simple cores onto a single chip. Each individual core is much less powerful than a CPU core โ€” no branch prediction, no out-of-order execution, minimal cache. But there are *so many of them* that the total throughput is enormous. 
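
The latency versus throughput trade-off can be put into rough numbers. The sketch below is a toy cost model, not a benchmark: the function name `batch_time_ns` and every figure used with it (8 fast lanes at 1 ns per operation, 4096 slow lanes at 10 ns per operation) are assumptions chosen only to illustrate the shape of the trade-off.

```rust
/// Toy cost model: time to process `items` independent operations on a
/// processor with `cores` parallel lanes, each taking `ns_per_op` nanoseconds.
/// All numbers fed into this function in this chapter are hypothetical.
fn batch_time_ns(items: u64, cores: u64, ns_per_op: u64) -> u64 {
    // Work proceeds in "waves": each wave handles up to `cores` items at once,
    // so the number of waves is the ceiling of items / cores.
    let waves = items.div_ceil(cores);
    waves * ns_per_op
}

/// Compares a hypothetical 8-lane "CPU" (1 ns per op) against a hypothetical
/// 4096-lane "GPU" (10 ns per op) and returns (cpu_ns, gpu_ns).
fn compare(items: u64) -> (u64, u64) {
    let cpu = batch_time_ns(items, 8, 1);
    let gpu = batch_time_ns(items, 4096, 10);
    (cpu, gpu)
}
```

Under these made-up figures, `compare(8)` favours the few fast lanes, while `compare(1_000_000)` favours the many slow ones. Real hardware adds memory bandwidth, scheduling, and transfer costs that this model deliberately leaves out.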
+ +```text +GPU (thousands of cores) +โ”Œโ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ” +โ”‚ ยท โ”‚ ยท โ”‚ ยท โ”‚ ยท โ”‚ ยท โ”‚ ยท โ”‚ ยท โ”‚ ยท โ”‚ ยท โ”‚ ยท โ”‚ ยท โ”‚ ยท โ”‚ ยท โ”‚ ยท โ”‚ ยท โ”‚ ยท โ”‚ +โ”œโ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”ค +โ”‚ ยท โ”‚ ยท โ”‚ ยท โ”‚ ยท โ”‚ ยท โ”‚ ยท โ”‚ ยท โ”‚ ยท โ”‚ ยท โ”‚ ยท โ”‚ ยท โ”‚ ยท โ”‚ ยท โ”‚ ยท โ”‚ ยท โ”‚ ยท โ”‚ +โ”œโ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”ค +โ”‚ ยท โ”‚ ยท โ”‚ ยท โ”‚ ยท โ”‚ ยท โ”‚ ยท โ”‚ ยท โ”‚ ยท โ”‚ ยท โ”‚ ยท โ”‚ ยท โ”‚ ยท โ”‚ ยท โ”‚ ยท โ”‚ ยท โ”‚ ยท โ”‚ +โ”œโ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”ค +โ”‚ ยท โ”‚ ยท โ”‚ ยท โ”‚ ยท โ”‚ ยท โ”‚ ยท โ”‚ ยท โ”‚ ยท โ”‚ ยท โ”‚ ยท โ”‚ ยท โ”‚ ยท โ”‚ ยท โ”‚ ยท โ”‚ ยท โ”‚ ยท โ”‚ +โ”œโ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”ค +โ”‚ ยท โ”‚ ยท โ”‚ ยท โ”‚ ยท โ”‚ ยท โ”‚ ยท โ”‚ ยท โ”‚ ยท โ”‚ ยท โ”‚ ยท โ”‚ ยท โ”‚ ยท โ”‚ ยท โ”‚ ยท โ”‚ ยท โ”‚ ยท โ”‚ +โ”œโ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”ค +โ”‚ ยท โ”‚ ยท โ”‚ ยท โ”‚ ยท โ”‚ ยท โ”‚ ยท โ”‚ ยท โ”‚ ยท โ”‚ ยท โ”‚ ยท โ”‚ ยท โ”‚ ยท โ”‚ ยท โ”‚ ยท โ”‚ ยท โ”‚ ยท โ”‚ 
+└───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┘
+  Each · is a simple core. Thousands execute in parallel.
+```
+
+GPUs are optimised for **high throughput**: processing millions of similar operations per second. Each individual operation might be slower than on a CPU, but the sheer volume of parallel work makes up for it.
+
+#### SIMD vs SIMT
+
+You may have heard of **SIMD** (Single Instruction, Multiple Data) on CPUs: instructions like SSE or AVX that process 4 or 8 values at once in a single register. GPUs take this idea much further with **SIMT** (Single Instruction, Multiple Threads).
+
+In SIMT, groups of threads (called **warps** on NVIDIA or **wavefronts** on AMD) execute the *same instruction* at the *same time*, but each thread operates on *different data*. An NVIDIA warp is 32 threads wide; AMD wavefronts are 32 or 64 threads wide depending on the architecture.
+
+```text
+SIMT execution (one warp of 32 threads):
+
+  Instruction: multiply position by matrix
+
+  Thread 0:  vertex[0].pos  * matrix → result[0]
+  Thread 1:  vertex[1].pos  * matrix → result[1]
+  Thread 2:  vertex[2].pos  * matrix → result[2]
+  ...
+  Thread 31: vertex[31].pos * matrix → result[31]
+
+  All 32 threads execute the same multiply instruction
+  at the same clock cycle, on different vertex data.
+```
+
+This is why GPUs are so well suited to graphics: every pixel on screen needs the same computation (run the fragment shader), just with different input coordinates. The same applies to vertex transformations, physics simulations, and many other tasks.
+
+#### When does the GPU win?
+
+The GPU excels when your problem has these characteristics:
+
+- **Data parallelism**: the same operation is applied to many independent data elements
+- **Arithmetic intensity**: lots of math per memory access
+- **Predictable control flow**: minimal branching (if/else), because when threads in a warp take different branches the hardware runs each path in turn and the warp loses its parallelism
+
+Problems that are sequential, branch-heavy, or have complex data dependencies are better left on the CPU.
+
+> **Key takeaway**: CPUs are fast race cars, great at finishing one task quickly. GPUs are cargo ships: slower per trip, but they move enormous amounts of freight in parallel. Shader programming is the art of loading that cargo ship efficiently.

 ---

 ### 2. The programmable pipeline: vertex, fragment, compute shaders

-🚧 This section is a stub. Full content tracked in [edu-r52d].
+Modern GPUs run a **programmable graphics pipeline**: a fixed sequence of stages where some stages run programs you write (shaders) and others are handled automatically by the hardware. Understanding this pipeline is essential before writing any shader code.
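
To make the split between programmable and fixed-function stages concrete, here is a minimal CPU-side sketch of the same flow. It is a toy under stated assumptions, not how a GPU or `wgpu` implements anything: the names `Vec2`, `vertex_shader`, `rasterise`, `fragment_shader`, and `draw` are hypothetical, positions are already in pixel coordinates, and primitive assembly, clipping, and depth testing are left out.

```rust
/// A 2D point in pixel coordinates (a simplification for this toy example).
#[derive(Clone, Copy)]
struct Vec2 {
    x: f32,
    y: f32,
}

/// Programmable stage: runs once per vertex. Here it only scales positions.
fn vertex_shader(v: Vec2, scale: f32) -> Vec2 {
    Vec2 { x: v.x * scale, y: v.y * scale }
}

/// Edge function: its sign tells which side of the edge a -> b the point p is on.
fn edge(a: Vec2, b: Vec2, p: Vec2) -> f32 {
    (b.x - a.x) * (p.y - a.y) - (b.y - a.y) * (p.x - a.x)
}

/// Fixed-function stage: list the pixels whose centres a triangle covers.
fn rasterise(tri: [Vec2; 3], width: usize, height: usize) -> Vec<(usize, usize)> {
    let mut fragments = Vec::new();
    for y in 0..height {
        for x in 0..width {
            // Sample at the pixel centre.
            let p = Vec2 { x: x as f32 + 0.5, y: y as f32 + 0.5 };
            let w0 = edge(tri[0], tri[1], p);
            let w1 = edge(tri[1], tri[2], p);
            let w2 = edge(tri[2], tri[0], p);
            // Inside when all three tests agree in sign (either winding order).
            let inside = (w0 >= 0.0 && w1 >= 0.0 && w2 >= 0.0)
                || (w0 <= 0.0 && w1 <= 0.0 && w2 <= 0.0);
            if inside {
                fragments.push((x, y));
            }
        }
    }
    fragments
}

/// Programmable stage: runs once per fragment and returns an RGBA colour.
fn fragment_shader(_x: usize, _y: usize) -> [f32; 4] {
    [1.0, 0.0, 0.0, 1.0] // solid red
}

/// Runs the toy pipeline and returns an image indexed as image[y][x].
fn draw(tri: [Vec2; 3], scale: f32, width: usize, height: usize) -> Vec<Vec<[f32; 4]>> {
    let transformed = tri.map(|v| vertex_shader(v, scale));
    let mut image = vec![vec![[0.0; 4]; width]; height];
    for (x, y) in rasterise(transformed, width, height) {
        image[y][x] = fragment_shader(x, y);
    }
    image
}
```

Note how `vertex_shader` is called three times per triangle while `fragment_shader` is called once per covered pixel; on a real GPU both run in parallel across many threads rather than in loops.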
+ +#### The graphics pipeline + +When you ask the GPU to draw a triangle, your data flows through several stages: + +```text + The Graphics Pipeline + ===================== + + CPU (your Rust code) + โ”‚ + โ”‚ Vertex data + draw call + โ–ผ +โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” +โ”‚ VERTEX SHADER โ”‚ โ—„โ”€โ”€ Programmable (you write this) +โ”‚ Transforms each โ”‚ Runs once per vertex +โ”‚ vertex position โ”‚ +โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ + โ”‚ + โ–ผ +โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” +โ”‚ PRIMITIVE โ”‚ โ—„โ”€โ”€ Fixed-function (hardware) +โ”‚ ASSEMBLY โ”‚ Connects vertices into +โ”‚ โ”‚ triangles, lines, or points +โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ + โ”‚ + โ–ผ +โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” +โ”‚ RASTERISATION โ”‚ โ—„โ”€โ”€ Fixed-function (hardware) +โ”‚ Determines which โ”‚ Converts triangles into +โ”‚ pixels a triangle โ”‚ fragments (candidate pixels) +โ”‚ covers โ”‚ +โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ + โ”‚ + โ–ผ +โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” +โ”‚ FRAGMENT SHADER โ”‚ โ—„โ”€โ”€ Programmable (you write this) +โ”‚ Computes the โ”‚ Runs once per fragment +โ”‚ colour of each โ”‚ (potential pixel) +โ”‚ fragment โ”‚ +โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ + โ”‚ + โ–ผ +โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” +โ”‚ OUTPUT MERGER โ”‚ โ—„โ”€โ”€ Fixed-function (hardware) +โ”‚ Depth test, blend โ”‚ Combines fragments into +โ”‚ with framebuffer โ”‚ the final image +โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ +``` + +#### The vertex shader + +The **vertex shader** runs once for every vertex you submit. 
Its primary job is to transform vertex positions from *model space* (the coordinates you defined your mesh in) to *clip space* (the coordinate system the GPU uses to determine what is on screen). + +A vertex shader typically receives input data โ€” position, colour, texture coordinates โ€” and outputs a transformed position plus any data that should be passed to the fragment shader. + +For example, a vertex shader might: +- Multiply the vertex position by a model-view-projection matrix +- Pass the vertex colour through to the next stage +- Compute lighting values at each vertex + +#### Rasterisation + +After the vertex shader runs and the GPU assembles vertices into triangles, **rasterisation** determines which screen pixels each triangle covers. This is not programmable โ€” the hardware handles it automatically. + +For each pixel covered by a triangle, the rasteriser generates a **fragment**. A fragment is a candidate pixel โ€” it carries interpolated values from the triangle's vertices (we will explore interpolation in detail in section 8). + +#### The fragment shader + +The **fragment shader** runs once for every fragment produced by rasterisation. Its job is to determine the final colour of that pixel. This is where most of the visual magic happens: texturing, lighting, shadows, reflections, and special effects are all implemented in the fragment shader. + +The fragment shader receives interpolated data from the vertex shader (like colour or texture coordinates) and outputs a colour value, typically as an RGBA (red, green, blue, alpha) tuple. + +#### Compute shaders: a separate path + +**Compute shaders** do not participate in the graphics pipeline at all. They are general-purpose programs that run on the GPU, independent of any rendering. You dispatch them with explicit work-group sizes and they can read from and write to buffers and textures. 
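
As a rough illustration of what dispatching with explicit work-group sizes involves, the sketch below does the arithmetic for a one-dimensional dispatch on the CPU. The helper names and the work-group size of 64 used in the usage note are assumptions for illustration; none of these functions are part of `wgpu` or WGSL.

```rust
/// Number of work groups needed so that every one of `element_count` items
/// gets an invocation, given `workgroup_size` invocations per group.
fn workgroup_count(element_count: u32, workgroup_size: u32) -> u32 {
    // Ceiling division: a partially filled last group still has to be launched.
    element_count.div_ceil(workgroup_size)
}

/// Flat index of one invocation in a 1D dispatch, built from the index of its
/// work group and its index inside that group.
fn global_index(workgroup_id: u32, local_id: u32, workgroup_size: u32) -> u32 {
    workgroup_id * workgroup_size + local_id
}

/// Invocations launched beyond `element_count`. A shader normally guards
/// against these with a bounds check before touching its buffers.
fn idle_invocations(element_count: u32, workgroup_size: u32) -> u32 {
    workgroup_count(element_count, workgroup_size) * workgroup_size - element_count
}
```

For example, 1000 elements with a work-group size of 64 need 16 work groups, which launches 1024 invocations, so the last 24 have nothing to do and should return early.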
+ +```text + Compute Pipeline (independent of graphics) + ========================================== + + CPU (your Rust code) + โ”‚ + โ”‚ Dispatch (work group counts) + โ–ผ +โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” +โ”‚ COMPUTE SHADER โ”‚ โ—„โ”€โ”€ Programmable (you write this) +โ”‚ General-purpose โ”‚ Runs once per invocation +โ”‚ parallel work โ”‚ across work groups +โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ + โ”‚ + โ–ผ + Output buffers / textures +``` + +Compute shaders are used for physics simulations, image processing, machine learning inference, procedural generation, and any task that benefits from massive parallelism but does not need the rasterisation pipeline. + +> **Key takeaway**: The GPU has two paths for running your code. The **graphics pipeline** flows from vertex shader through rasterisation to fragment shader, producing pixels on screen. The **compute pipeline** is a separate, general-purpose path for parallel computation. You will write programs for all three shader types in this course. --- ### 3. What is WGSL? Syntax overview -๐Ÿšง This section is a stub. Full content tracked in [edu-2ak3]. +**WGSL** (WebGPU Shading Language) is the shader language used by the WebGPU API โ€” and by extension, by `wgpu`. If you have used GLSL or HLSL before, WGSL will feel familiar but with a more explicit, Rust-influenced syntax. If you are new to shader languages, this section covers everything you need to get started. + +#### Scalar types + +WGSL provides a small set of scalar types: + +| Type | Description | +|--------|--------------------------------| +| `f32` | 32-bit floating point | +| `f16` | 16-bit floating point (optional feature) | +| `i32` | 32-bit signed integer | +| `u32` | 32-bit unsigned integer | +| `bool` | Boolean | + +#### Vector types + +Vectors are fundamental in shader programming. 
WGSL supports vectors of 2, 3, or 4 components:
+
+```wgsl
+var a: vec2<f32> = vec2<f32>(1.0, 2.0);
+var b: vec3<f32> = vec3<f32>(1.0, 0.0, 0.0); // a red colour or a direction
+var c: vec4<f32> = vec4<f32>(0.2, 0.4, 0.8, 1.0); // RGBA colour
+
+// Shorthand aliases (the variable's type is inferred):
+var d = vec3f(1.0, 0.0, 0.0); // vec3<f32>
+var e = vec4f(0.0, 0.0, 0.0, 1.0);
+```
+
+You can access components with **swizzling**:
+
+```wgsl
+var color = vec4f(1.0, 0.5, 0.2, 1.0);
+var rgb = color.rgb; // vec3f(1.0, 0.5, 0.2)
+var rr = color.xx; // vec2f(1.0, 1.0)
+```
+
+Components can be accessed as `x/y/z/w` or `r/g/b/a`. The two sets are interchangeable aliases, although they cannot be mixed within a single swizzle.
+
+#### Matrix types
+
+Matrices are used for transformations (rotation, scaling, projection):
+
+```wgsl
+// A 4x4 matrix of f32 values (4 columns, 4 rows)
+var transform: mat4x4<f32>;
+
+// A 3x3 matrix
+var rotation: mat3x3<f32>;
+```
+
+Matrix-vector multiplication uses the `*` operator: `transform * vec4f(pos, 1.0)`.
+
+#### Variables: `let` vs `var`
+
+```wgsl
+// `let` declares an immutable binding (like Rust's `let`)
+let pi = 3.14159;
+
+// `var` declares a mutable variable
+var counter: u32 = 0u;
+counter = counter + 1u;
+```
+
+#### Structs
+
+Structs group related data, and they are used extensively for shader inputs and outputs:
+
+```wgsl
+struct VertexInput {
+    @location(0) position: vec3<f32>,
+    @location(1) color: vec3<f32>,
+}
+
+struct VertexOutput {
+    @builtin(position) clip_position: vec4<f32>,
+    @location(0) color: vec3<f32>,
+}
+```
+
+The `@location(n)` attribute links struct fields to specific slots in the vertex buffer layout or inter-stage communication. The `@builtin(position)` attribute tells the GPU this field is the clip-space position.
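
On the Rust side, the data feeding `@location(0)` and `@location(1)` usually comes from a plain struct whose memory layout you control. The sketch below shows one possible CPU-side mirror of `VertexInput` using only the standard library; the names `Vertex`, `STRIDE`, `COLOR_OFFSET`, and `vertex_bytes` are assumptions for illustration, and real projects commonly use a crate such as `bytemuck` for the byte conversion instead of doing it by hand.

```rust
use std::mem::size_of;

/// Hypothetical CPU-side mirror of the WGSL `VertexInput` struct.
/// `#[repr(C)]` keeps the fields in declaration order with predictable offsets.
#[repr(C)]
#[derive(Clone, Copy, Debug, PartialEq)]
struct Vertex {
    position: [f32; 3], // would feed @location(0)
    color: [f32; 3],    // would feed @location(1)
}

/// Byte distance between consecutive vertices in a vertex buffer.
const STRIDE: usize = size_of::<Vertex>();

/// Byte offset of each attribute inside one vertex.
const POSITION_OFFSET: usize = 0;
const COLOR_OFFSET: usize = size_of::<[f32; 3]>();

/// Flattens vertices into raw bytes in the machine's native byte order,
/// which is the form in which vertex data is uploaded to a GPU buffer.
fn vertex_bytes(vertices: &[Vertex]) -> Vec<u8> {
    let mut bytes = Vec::with_capacity(vertices.len() * STRIDE);
    for v in vertices {
        for component in v.position.iter().chain(v.color.iter()) {
            bytes.extend_from_slice(&component.to_ne_bytes());
        }
    }
    bytes
}
```

The stride and offsets computed here are the kind of values a vertex buffer layout needs so that each `@location(n)` reads from the right bytes.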
+
+#### Functions and entry points
+
+WGSL functions look like this:
+
+```wgsl
+fn add(a: f32, b: f32) -> f32 {
+    return a + b;
+}
+```
+
+**Entry points** are functions marked with a stage attribute:
+
+```wgsl
+@vertex
+fn vs_main(in: VertexInput) -> VertexOutput {
+    var out: VertexOutput;
+    out.clip_position = vec4f(in.position, 1.0);
+    out.color = in.color;
+    return out;
+}
+
+@fragment
+fn fs_main(in: VertexOutput) -> @location(0) vec4<f32> {
+    return vec4f(in.color, 1.0);
+}
+
+@compute @workgroup_size(64)
+fn cs_main(@builtin(global_invocation_id) id: vec3<u32>) {
+    // compute work here
+}
+```
+
+The `@location(0)` on the fragment shader return type means "write to the first colour attachment" (the render target).
+
+#### Built-in attributes
+
+Some commonly used built-in attributes:
+
+| Attribute | Stage | Meaning |
+|-----------|-------|---------|
+| `@builtin(position)` | Vertex out / Fragment in | Clip-space position / fragment coordinates |
+| `@builtin(vertex_index)` | Vertex | Index of the current vertex |
+| `@builtin(instance_index)` | Vertex | Index of the current instance |
+| `@builtin(global_invocation_id)` | Compute | 3D index of this thread in the dispatch |
+| `@builtin(local_invocation_id)` | Compute | 3D index within the work group |
+
+#### Binding resources
+
+Uniforms, storage buffers, textures, and samplers are declared at module scope with `@group` and `@binding` attributes:
+
+```wgsl
+@group(0) @binding(0)
+var<uniform> time: f32;
+
+@group(0) @binding(1)
+var texture: texture_2d<f32>;
+
+@group(0) @binding(2)
+var tex_sampler: sampler;
+```
+
+The `@group(n)` corresponds to a bind group index, and `@binding(n)` is the binding within that group. These must match the bind group layout you define on the Rust side.
+ +#### Control flow + +WGSL supports `if`/`else`, `for`, `while`, `loop`, `switch`, `break`, `continue`, and `return`: + +```wgsl +for (var i: u32 = 0u; i < 10u; i = i + 1u) { + if i == 5u { + continue; + } + // do work +} +``` + +> **Key takeaway**: WGSL's syntax is a blend of Rust and C-family languages. Types are explicit, entry points are marked with stage attributes (`@vertex`, `@fragment`, `@compute`), and data flows between stages via structs annotated with `@location` and `@builtin`. You will write WGSL for every exercise in this course. --- @@ -68,19 +404,438 @@ This document is a self-guided course on GPU shader programming. It is organised ### 4. What is wgpu? Cross-platform graphics API in Rust -๐Ÿšง This section is a stub. Full content tracked in [edu-j35d]. +**wgpu** is a Rust crate that implements the WebGPU API specification. It provides a safe, cross-platform interface for GPU programming that works on multiple backends: + +| Backend | Platform | +|---------|----------| +| Vulkan | Linux, Windows, Android | +| Metal | macOS, iOS | +| DX12 | Windows | +| WebGPU | Web browsers (via wasm) | +| OpenGL | Fallback for older systems | + +This means you write your GPU code once and it runs everywhere โ€” on desktop, on mobile, and in the browser. + +#### Why not raw Vulkan/Metal/DX12? + +Writing directly against a low-level graphics API like Vulkan requires thousands of lines of boilerplate before you can draw a single triangle. Vulkan's explicit nature gives you maximum control, but the complexity is enormous. wgpu provides a higher-level abstraction that handles the platform differences and much of the boilerplate while still being close enough to the metal for serious work. 
+ +#### Key types in wgpu + +Here are the core types you will interact with, in the order you typically create them: + +```text +Initialization Flow +==================== + + Instance + โ”‚ + โ”‚ enumerate adapters + โ–ผ + Adapter โ†โ”€โ”€ represents a physical GPU + โ”‚ + โ”‚ request device + โ–ผ + Device + Queue + โ”‚ โ”‚ + โ”‚ โ”‚ submit commands + โ”‚ โ–ผ + โ”‚ (GPU execution) + โ”‚ + โ”‚ create resources + โ–ผ + Buffers, Textures, Pipelines, Bind Groups, ... +``` + +- **`Instance`**: the entry point to wgpu. Created first, used to find adapters and create surfaces. +- **`Surface`**: a handle to a window's drawable area. Created from a window (provided by a windowing library like `winit`). +- **`Adapter`**: represents a physical GPU. You request one from the instance, optionally specifying preferences (power preference, compatibility with your surface). +- **`Device`**: a logical connection to the GPU. You create resources (buffers, textures, pipelines) through the device. Think of it as an open connection to the GPU. +- **`Queue`**: used to submit work (command buffers) to the GPU. You get a queue together with the device. +- **`CommandEncoder`**: records GPU commands (render passes, compute dispatches, buffer copies) into a command buffer. The command buffer is then submitted to the queue. +- **`RenderPipeline`**: describes the full configuration for rendering โ€” which shaders to use, vertex layout, blending mode, pixel format, etc. +- **`Buffer`**: a block of GPU-accessible memory. Used for vertex data, index data, uniforms, storage, etc. +- **`BindGroup`**: a collection of resources (buffers, textures, samplers) that are made available to shaders. Corresponds to `@group(n)` in WGSL. + +#### The initialisation sequence in code + +Here is a simplified view of wgpu initialisation (we will see the full code in Exercise 1): + +```rust +// 1. Create an instance +let instance = wgpu::Instance::new(&wgpu::InstanceDescriptor::default()); + +// 2. 
Create a surface from a window +let surface = instance.create_surface(&window)?; + +// 3. Request an adapter (physical GPU) +let adapter = instance + .request_adapter(&wgpu::RequestAdapterOptions { + power_preference: wgpu::PowerPreference::default(), + compatible_surface: Some(&surface), + force_fallback_adapter: false, + }) + .await + .unwrap(); + +// 4. Request a device and queue +let (device, queue) = adapter + .request_device(&wgpu::DeviceDescriptor::default(), None) + .await + .unwrap(); + +// 5. Configure the surface +let config = surface.get_default_config(&adapter, width, height).unwrap(); +surface.configure(&device, &config); +``` + +After this, you are ready to create pipelines, buffers, and start rendering. + +> **Key takeaway**: wgpu is a cross-platform GPU abstraction for Rust. You create an `Instance`, get an `Adapter` (physical GPU), open a `Device` + `Queue`, and then create resources and submit commands. This same code works on Vulkan, Metal, DX12, and WebGPU. --- ### 5. Exercise 1: create a window and clear it to a colour -๐Ÿšง This section is a stub. Full content tracked in [edu-6jjp]. +In this exercise you will create a window using `winit`, initialise `wgpu`, and fill the window with a solid colour (cornflower blue). This is the "hello world" of GPU programming. 
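
The clear colour used later in this exercise is written as floating-point channels rather than the familiar 0-255 values. As a small worked example, the sketch below converts an 8-bit RGB triple by dividing each channel by 255. The function names are hypothetical helpers, not part of `wgpu`, and the conversion is a plain division that ignores sRGB gamma, so how closely the on-screen colour matches depends on the surface format.

```rust
/// Converts one 8-bit colour channel (0-255) to the 0.0-1.0 range.
fn channel_to_unit(value: u8) -> f64 {
    f64::from(value) / 255.0
}

/// Converts an 8-bit RGB triple to [r, g, b, a] with a fully opaque alpha.
/// This is a plain linear rescale; no gamma correction is applied.
fn rgb8_to_unit(r: u8, g: u8, b: u8) -> [f64; 4] {
    [
        channel_to_unit(r),
        channel_to_unit(g),
        channel_to_unit(b),
        1.0,
    ]
}
```

Calling `rgb8_to_unit(100, 149, 237)` gives approximately `[0.392, 0.584, 0.929, 1.0]`, which are the cornflower blue values that appear in the render pass below.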
+
+#### Step 1: project setup
+
+Create a new Rust project and add the required dependencies to `Cargo.toml`:
+
+```toml
+[package]
+name = "shader-exercises"
+version = "0.1.0"
+edition = "2021"
+
+[dependencies]
+wgpu = "24"
+winit = "0.30"
+pollster = "0.4"
+log = "0.4"
+env_logger = "0.11"
+
+[profile.release]
+opt-level = "z"
+lto = true
+strip = true
+codegen-units = 1
+```
+
+- **`wgpu`**: the GPU abstraction layer
+- **`winit`**: cross-platform window creation and event handling
+- **`pollster`**: a minimal async executor to block on futures (wgpu uses async for initialisation)
+- **`env_logger`**: so wgpu can report errors and warnings
+
+#### Step 2: the complete code
+
+```rust
+use winit::{
+    application::ApplicationHandler,
+    event::WindowEvent,
+    event_loop::EventLoop,
+    window::{Window, WindowAttributes},
+};
+use std::sync::Arc;
+
+/// Holds all wgpu state needed for rendering.
+struct GpuState {
+    surface: wgpu::Surface<'static>,
+    device: wgpu::Device,
+    queue: wgpu::Queue,
+    config: wgpu::SurfaceConfiguration,
+}
+
+/// The main application struct.
+struct App {
+    window: Option<Arc<Window>>,
+    gpu: Option<GpuState>,
+}
+
+impl App {
+    fn new() -> Self {
+        Self {
+            window: None,
+            gpu: None,
+        }
+    }
+
+    /// Initialise wgpu with the given window.
+    fn init_gpu(&mut self, window: Arc<Window>) {
+        let size = window.inner_size();
+        let instance = wgpu::Instance::new(&wgpu::InstanceDescriptor::default());
+
+        let surface = instance.create_surface(window.clone()).unwrap();
+
+        let adapter = pollster::block_on(instance.request_adapter(
+            &wgpu::RequestAdapterOptions {
+                power_preference: wgpu::PowerPreference::default(),
+                compatible_surface: Some(&surface),
+                force_fallback_adapter: false,
+            },
+        ))
+        .expect("Failed to find a suitable GPU adapter");
+
+        let (device, queue) = pollster::block_on(adapter.request_device(
+            &wgpu::DeviceDescriptor::default(),
+            None,
+        ))
+        .expect("Failed to create device");
+
+        let config = surface
+            .get_default_config(&adapter, size.width.max(1), size.height.max(1))
+            .expect("Surface is not supported by the adapter");
+        surface.configure(&device, &config);
+
+        self.gpu = Some(GpuState {
+            surface,
+            device,
+            queue,
+            config,
+        });
+    }
+
+    /// Render a single frame: clear the screen to cornflower blue.
+    fn render(&self) {
+        let gpu = self.gpu.as_ref().unwrap();
+
+        // Get the next frame's texture to draw on
+        let output = gpu.surface.get_current_texture()
+            .expect("Failed to get surface texture");
+        let view = output.texture.create_view(&Default::default());
+
+        // Create a command encoder to record GPU commands
+        let mut encoder = gpu.device.create_command_encoder(
+            &wgpu::CommandEncoderDescriptor {
+                label: Some("Clear Encoder"),
+            },
+        );
+
+        // Begin a render pass that clears to cornflower blue
+        {
+            let _render_pass = encoder.begin_render_pass(
+                &wgpu::RenderPassDescriptor {
+                    label: Some("Clear Pass"),
+                    color_attachments: &[Some(
+                        wgpu::RenderPassColorAttachment {
+                            view: &view,
+                            resolve_target: None,
+                            ops: wgpu::Operations {
+                                load: wgpu::LoadOp::Clear(
+                                    wgpu::Color {
+                                        r: 0.392,
+                                        g: 0.584,
+                                        b: 0.929,
+                                        a: 1.0,
+                                    },
+                                ),
+                                store: wgpu::StoreOp::Store,
+                            },
+                        },
+                    )],
+                    depth_stencil_attachment: None,
+                    ..Default::default()
+                },
+            );
+            // The render pass is dropped
here, ending it + } + + // Submit the commands to the GPU + gpu.queue.submit(std::iter::once(encoder.finish())); + + // Present the frame on screen + output.present(); + } +} + +impl ApplicationHandler for App { + fn resumed(&mut self, event_loop: &winit::event_loop::ActiveEventLoop) { + if self.window.is_none() { + let attrs = WindowAttributes::default() + .with_title("Exercise 1: Cornflower Blue"); + let window = Arc::new( + event_loop.create_window(attrs).unwrap() + ); + self.init_gpu(window.clone()); + self.window = Some(window); + } + } + + fn window_event( + &mut self, + event_loop: &winit::event_loop::ActiveEventLoop, + _window_id: winit::window::WindowId, + event: WindowEvent, + ) { + match event { + WindowEvent::CloseRequested => { + event_loop.exit(); + } + WindowEvent::Resized(new_size) => { + if let Some(gpu) = &mut self.gpu { + gpu.config.width = new_size.width.max(1); + gpu.config.height = new_size.height.max(1); + gpu.surface.configure(&gpu.device, &gpu.config); + } + } + WindowEvent::RedrawRequested => { + self.render(); + if let Some(window) = &self.window { + window.request_redraw(); + } + } + _ => {} + } + } +} + +fn main() { + env_logger::init(); + let event_loop = EventLoop::new().unwrap(); + let mut app = App::new(); + event_loop.run_app(&mut app).unwrap(); +} +``` + +#### Step 3: run it + +```sh +cargo run +``` + +You should see a window filled with **cornflower blue** (a pleasant mid-blue, `rgb(100, 149, 237)`). The window responds to resizing and closes when you click the close button. + +#### What just happened? + +Let's break down the key parts: + +1. **Window creation**: `winit` creates a native window. We wrap it in `Arc` so wgpu can reference it. +2. **Surface**: created from the window โ€” this is where rendered frames go. +3. **Adapter + Device + Queue**: we find a GPU, open a logical device, and get a command queue. +4. **Surface configuration**: tells the surface what pixel format and size to use. +5. 
**Render loop**: every frame we create a `CommandEncoder`, begin a `RenderPass` with a clear colour, end the pass, submit commands, and present. + +The clear colour is specified as `wgpu::Color { r, g, b, a }` with values in the 0.0-1.0 range. + +**Try this**: change the colour to something else โ€” pure red `(1.0, 0.0, 0.0, 1.0)`, bright green, or your favourite colour. Rebuild and see the change. --- ### 6. The render loop: swap chains, frames, command encoders -๐Ÿšง This section is a stub. Full content tracked in [edu-hrfy]. +Now that you have a working window, let's dive deeper into what happens each frame. Understanding the render loop is crucial because every shader program you write will run inside this cycle. + +#### The frame lifecycle + +Every frame follows the same sequence. Here is what happens between one screen update and the next: + +```text +Frame Lifecycle +=============== + + Time โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ–บ + + โ”Œโ”€โ”€โ”€โ”€ Frame N โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” โ”Œโ”€โ”€ Frame N+1 โ”€โ”€ + โ”‚ โ”‚ โ”‚ + โ”‚ 1. Acquire 2. Record 3. Submit 4. Present + โ”‚ surface commands to to + โ”‚ texture (render queue screen + โ”‚ pass) + โ”‚ โ”‚ โ”‚ + โ”‚ CPU CPU โ”‚ GPU executes + โ”‚ side side โ”‚ asynchronously + โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€ +``` + +#### Step 1: acquire a surface texture + +```rust +let output = surface.get_current_texture()?; +let view = output.texture.create_view(&Default::default()); +``` + +The surface manages a small pool of textures (typically 2-3, called a **swap chain**). When you call `get_current_texture()`, you receive the next available texture to draw on. 
While you are drawing on texture A, the GPU may still be displaying the previous texture B on screen โ€” this is **double buffering**. + +```text +Double Buffering +================ + + โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” + โ”‚ Texture A โ”‚ โ”‚ Texture B โ”‚ + โ”‚ (drawing) โ”‚ โ”‚ (on screen)โ”‚ + โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ + โ–ฒ โ–ฒ + โ”‚ โ”‚ + You render Monitor + into this displays + one now this one +``` + +After you present texture A, the roles swap: A goes to the screen and B becomes available for the next frame. + +#### Step 2: record commands with a command encoder + +```rust +let mut encoder = device.create_command_encoder(&Default::default()); +``` + +The `CommandEncoder` is like a tape recorder for GPU commands. You do not execute anything immediately โ€” you record a list of operations, and then submit them all at once. This is called a **command buffer** model. + +Why not execute commands immediately? Because the GPU operates asynchronously. Batching commands into a buffer lets the GPU execute them efficiently without constant back-and-forth with the CPU. + +#### Step 3: begin a render pass + +```rust +let render_pass = encoder.begin_render_pass(&wgpu::RenderPassDescriptor { + color_attachments: &[Some(wgpu::RenderPassColorAttachment { + view: &view, + ops: wgpu::Operations { + load: wgpu::LoadOp::Clear(clear_color), + store: wgpu::StoreOp::Store, + }, + ..Default::default() + })], + ..Default::default() +}); +``` + +A **render pass** is a sequence of draw commands that all target the same set of attachments (colour textures, depth buffers). Within a render pass, you: +- Set the pipeline +- Bind vertex buffers and bind groups +- Issue draw calls + +The `load` operation specifies what happens to the attachment at the start of the pass. `LoadOp::Clear(color)` fills it with a solid colour. `LoadOp::Load` preserves the previous contents. 
+
+The `store` operation specifies what happens at the end. `StoreOp::Store` keeps the results; `StoreOp::Discard` throws them away (useful for depth buffers you do not need after the pass).
+
+#### Step 4: submit and present
+
+```rust
+// End the render pass (drop it)
+drop(render_pass);
+
+// Finish recording and get a command buffer
+let command_buffer = encoder.finish();
+
+// Submit the command buffer to the GPU
+queue.submit(std::iter::once(command_buffer));
+
+// Show the rendered frame on screen
+output.present();
+```
+
+`queue.submit()` sends the command buffer to the GPU for execution. The GPU processes it asynchronously, so your CPU code continues immediately. `output.present()` tells the surface to display this texture once the GPU finishes rendering to it.
+
+#### Multiple render passes
+
+You can have multiple render passes in a single frame. This is common for:
+- **Shadow mapping**: render the scene from a light's perspective (pass 1), then render the final image using the shadow map (pass 2)
+- **Post-processing**: render the scene to a texture (pass 1), then apply a blur filter to that texture and draw the result to the screen (pass 2)
+
+Each pass gets its own `begin_render_pass` / drop cycle within the same command encoder.
+
+> **Key takeaway**: each frame, you acquire a surface texture, record GPU commands into a command encoder (including one or more render passes), submit the commands to the queue, and present the result. The CPU and GPU work asynchronously: the CPU can record commands for the next frame while the GPU executes the current one.
 
 ---
 
@@ -88,25 +843,718 @@ This document is a self-guided course on GPU shader programming. It is organised
 
 ### 7. Vertices, buffers, and the vertex shader
 
-🚧 This section is a stub. Full content tracked in [edu-3l9h].
+To draw anything beyond a solid colour, you need to send **geometry** to the GPU. Geometry is made of vertices: points in space that define the corners of triangles. This section explains how vertex data flows from your Rust code to the vertex shader on the GPU.
+
+#### What is a vertex?
+
+A vertex is a point with associated data. At minimum, a vertex has a **position**, but it usually carries additional attributes:
+
+```text
+Vertex Data (per vertex)
+========================
+
+  ┌──────────────────────────────────────────────┐
+  │  position: vec3   (x, y, z)                  │
+  │  color:    vec3   (r, g, b)                  │
+  │  uv:       vec2   (texture coordinate)       │
+  │  normal:   vec3   (surface direction)        │
+  └──────────────────────────────────────────────┘
+```
+
+For a simple coloured triangle, you might have three vertices with position and colour:
+
+```rust
+#[repr(C)]
+#[derive(Copy, Clone, bytemuck::Pod, bytemuck::Zeroable)]
+struct Vertex {
+    position: [f32; 3],
+    color: [f32; 3],
+}
+
+const VERTICES: &[Vertex] = &[
+    Vertex { position: [ 0.0,  0.5, 0.0], color: [1.0, 0.0, 0.0] }, // top, red
+    Vertex { position: [-0.5, -0.5, 0.0], color: [0.0, 1.0, 0.0] }, // left, green
+    Vertex { position: [ 0.5, -0.5, 0.0], color: [0.0, 0.0, 1.0] }, // right, blue
+];
+```
+
+The `#[repr(C)]` attribute ensures the struct has a predictable memory layout matching what the GPU expects. The `bytemuck` derives let us safely cast the struct to raw bytes.
+
+#### Vertex buffers
+
+To get vertex data onto the GPU, you create a **vertex buffer**:
+
+```rust
+use wgpu::util::DeviceExt;
+
+let vertex_buffer = device.create_buffer_init(&wgpu::util::BufferInitDescriptor {
+    label: Some("Vertex Buffer"),
+    contents: bytemuck::cast_slice(VERTICES),
+    usage: wgpu::BufferUsages::VERTEX,
+});
+```
+
+This copies the vertex data from CPU memory into GPU memory. The `VERTEX` usage flag tells wgpu that this buffer will be used as a vertex buffer.
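
To make the byte layout concrete, here is a minimal std-only sketch of what the cast produces. It assumes the same `Vertex` field layout as above; `vertex_bytes` is a hypothetical helper for illustration only, not part of wgpu or bytemuck, and real code should keep using `bytemuck::cast_slice`.

```rust
// Same field layout as the course's Vertex, minus the bytemuck derives.
#[repr(C)]
#[derive(Copy, Clone)]
struct Vertex {
    position: [f32; 3],
    color: [f32; 3],
}

/// Hypothetical helper: flatten vertices into the byte stream that a
/// vertex buffer would hold (position floats first, then colour floats).
fn vertex_bytes(vertices: &[Vertex]) -> Vec<u8> {
    let mut bytes = Vec::with_capacity(vertices.len() * std::mem::size_of::<Vertex>());
    for v in vertices {
        for f in v.position.iter().chain(v.color.iter()) {
            // Native endianness, matching an in-memory cast.
            bytes.extend_from_slice(&f.to_ne_bytes());
        }
    }
    bytes
}
```

For the three-vertex triangle this gives 3 x 24 bytes, and the colour of the first vertex starts at byte 12. Those two numbers are exactly what a layout description has to report as stride and offset.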
+
+#### Vertex buffer layout
+
+The GPU does not know the structure of your vertex data, so you must describe it with a **vertex buffer layout**:
+
+```rust
+let vertex_layout = wgpu::VertexBufferLayout {
+    array_stride: std::mem::size_of::<Vertex>() as u64,
+    step_mode: wgpu::VertexStepMode::Vertex,
+    attributes: &[
+        // position: 3 floats at offset 0
+        wgpu::VertexAttribute {
+            format: wgpu::VertexFormat::Float32x3,
+            offset: 0,
+            shader_location: 0,
+        },
+        // color: 3 floats at offset 12 bytes (after 3 x f32)
+        wgpu::VertexAttribute {
+            format: wgpu::VertexFormat::Float32x3,
+            offset: 12,
+            shader_location: 1,
+        },
+    ],
+};
+```
+
+This tells the GPU: "each vertex is 24 bytes apart (`array_stride`), and within each vertex, location 0 is three floats starting at byte 0, and location 1 is three floats starting at byte 12."
+
+The `shader_location` values correspond to `@location(n)` in your WGSL shader.
+
+#### How data flows from CPU to vertex shader
+
+```text
+CPU Memory              GPU Memory               Vertex Shader
+==========              ==========               =============
+
+  Vertex array   copy   Vertex buffer   read     @location(0) position
+  [pos, color] ───────► [bytes...]    ─────────► @location(1) color
+  [pos, color]          [bytes...]
+  [pos, color]          [bytes...]
+
+  The layout descriptor tells the GPU how to
+  interpret the bytes into typed attributes.
+```
+
+#### The vertex shader's job
+
+The vertex shader runs once per vertex. It must output a `@builtin(position)` value in **clip space**. After the GPU divides by `w` (which is 1.0 in our examples), the visible region is:
+- `x` ranges from -1 (left) to +1 (right)
+- `y` ranges from -1 (bottom) to +1 (top)
+- `z` ranges from 0 (near) to 1 (far)
+
+Anything outside these ranges is clipped (not drawn).
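
The ranges above can be sketched on the CPU. This is only an illustration, assuming `w = 1.0` and a viewport that covers the whole window; both function names are hypothetical, and on the GPU the equivalent clipping and viewport mapping happen in fixed-function hardware.

```rust
/// Hypothetical helper: is a clip-space point (with w = 1) inside the
/// visible volume described above?
fn in_clip_volume(p: [f32; 3]) -> bool {
    (-1.0_f32..=1.0).contains(&p[0])
        && (-1.0_f32..=1.0).contains(&p[1])
        && (0.0_f32..=1.0).contains(&p[2])
}

/// Hypothetical helper: map clip-space x/y to pixel coordinates.
/// Pixel y grows downward, so the y axis is flipped.
fn clip_to_pixel(p: [f32; 3], width: f32, height: f32) -> (f32, f32) {
    ((p[0] + 1.0) * 0.5 * width, (1.0 - p[1]) * 0.5 * height)
}
```

With an 800x600 window, the top vertex of the triangle at `(0.0, 0.5, 0.0)` lands at pixel `(400, 150)`: horizontally centred and a quarter of the way down.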
+
+```wgsl
+struct VertexInput {
+    @location(0) position: vec3<f32>,
+    @location(1) color: vec3<f32>,
+}
+
+struct VertexOutput {
+    @builtin(position) clip_position: vec4<f32>,
+    @location(0) color: vec3<f32>,
+}
+
+@vertex
+fn vs_main(in: VertexInput) -> VertexOutput {
+    var out: VertexOutput;
+    out.clip_position = vec4f(in.position, 1.0);
+    out.color = in.color;
+    return out;
+}
+```
+
+In this simple shader, the position passes through unchanged (we are already working in clip space). In real applications, you would multiply by a model-view-projection matrix to transform from 3D world coordinates to clip space.
+
+> **Key takeaway**: vertices carry per-point data (position, colour, etc.) packed into a buffer. The vertex buffer layout tells the GPU how to decode the bytes. The vertex shader transforms each vertex's position into clip space and passes any additional data (like colour) to the next stage.
 
 ---
 
 ### 8. Interpolation and the fragment shader
 
-🚧 This section is a stub. Full content tracked in [edu-ga8p].
+After the vertex shader has transformed all vertices and the GPU has assembled them into triangles, **rasterisation** takes over. This section explains what happens between the vertex shader and the fragment shader: the critical concept of **interpolation**.
+
+#### From triangles to pixels
+
+Rasterisation determines which pixels on screen fall inside each triangle. For each pixel inside a triangle, the rasteriser generates a **fragment**. But what data does each fragment carry?
+
+Consider a triangle with a red vertex, a green vertex, and a blue vertex:
+
+```text
+            Red (1,0,0)
+               /\
+              /  \
+             /    \
+            / what \
+           / colour \
+          /  is this \
+         /   pixel?   \
+        /______________\
+   Green                Blue
+  (0,1,0)              (0,0,1)
+```
+
+A pixel near the red vertex should be mostly red. A pixel exactly in the centre should be an equal mix of red, green, and blue. The GPU computes this automatically using **barycentric interpolation**.
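
As a rough CPU-side illustration of that blend (the GPU does this in hardware, and `interpolate` is a hypothetical name used only here), each output channel is a weighted sum of the three vertex colours, with weights that add up to 1:

```rust
/// Blend three per-vertex colours with weights w0, w1, w2 (w0 + w1 + w2 = 1).
/// `colors[i]` is the RGB colour of vertex i; `w[i]` is its weight.
fn interpolate(colors: [[f32; 3]; 3], w: [f32; 3]) -> [f32; 3] {
    let mut out = [0.0_f32; 3];
    for channel in 0..3 {
        for vertex in 0..3 {
            out[channel] += w[vertex] * colors[vertex][channel];
        }
    }
    out
}
```

Weights of `[1, 0, 0]` reproduce the first vertex's colour exactly, while equal weights of one third each give roughly `(0.33, 0.33, 0.33)` for a red, green, and blue triangle.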
+
+#### Barycentric coordinates
+
+Every point inside a triangle can be expressed as a weighted combination of the three vertices. These weights are called **barycentric coordinates** (w0, w1, w2), where:
+- w0 + w1 + w2 = 1.0
+- All weights are between 0 and 1
+
+```text
+Barycentric Interpolation
+=========================
+
+  Point P inside triangle ABC:
+
+  P = w0 * A + w1 * B + w2 * C
+
+  At vertex A: w0=1, w1=0, w2=0  →  colour = A.color
+  At vertex B: w0=0, w1=1, w2=0  →  colour = B.color
+  At vertex C: w0=0, w1=0, w2=1  →  colour = C.color
+  At centre:   w0=⅓, w1=⅓, w2=⅓  →  colour = average
+```
+
+The GPU performs this interpolation automatically for **every field** in the `VertexOutput` struct (except `@builtin(position)`, which is used for rasterisation itself). This means colours, texture coordinates, normals, and any other per-vertex outputs are smoothly interpolated across the triangle surface.
+
+#### The fragment shader
+
+The fragment shader runs once for each fragment generated by rasterisation. It receives the interpolated values from the vertex shader and outputs a colour:
+
+```wgsl
+@fragment
+fn fs_main(in: VertexOutput) -> @location(0) vec4<f32> {
+    return vec4f(in.color, 1.0);
+}
+```
+
+In this simple example, the fragment shader just passes through the interpolated colour. But you can do much more:
+- Sample a texture using interpolated UV coordinates
+- Apply lighting calculations using interpolated normals
+- Compute procedural patterns based on the fragment position
+- Discard fragments to create transparency cutouts
+
+#### What the fragment receives
+
+The fragment shader's input looks like the vertex shader's output, but the values have been interpolated:
+
+```text
+Vertex Shader Output              Fragment Shader Input
+====================              =====================
+
+  Vertex 0: color=(1,0,0)
+  Vertex 1: color=(0,1,0)   ──►   Fragment at centre:
+  Vertex 2: color=(0,0,1)         color=(0.33, 0.33, 0.33)
+```
+
+The `@builtin(position)` field in the fragment shader's input contains the fragment's window-space coordinates: `(x, y)` in pixel coordinates. This can be useful for screen-space effects.
+
+#### Visual result
+
+When you draw our red-green-blue triangle, the interpolation produces a smooth colour gradient:
+
+```text
+  Expected visual output:
+  ┌───────────────────────┐
+  │                       │
+  │         ▲ Red         │
+  │        ╱ ╲            │
+  │       ╱   ╲           │
+  │      ╱ gra ╲          │
+  │     ╱ dient ╲         │
+  │    ╱─────────╲        │
+  │  Green      Blue      │
+  │                       │
+  └───────────────────────┘
+
+  Colours blend smoothly across the
+  triangle surface via interpolation.
+```
+
+> **Key takeaway**: the GPU automatically interpolates all vertex shader outputs across the triangle surface using barycentric coordinates. The fragment shader receives these smoothly interpolated values and uses them to compute the final pixel colour. This is why a triangle with three different vertex colours produces a smooth gradient.
 
 ---
 
 ### 9. Exercise 2: draw a coloured triangle
 
-🚧 This section is a stub. Full content tracked in [edu-10m1].
+Time to put theory into practice. In this exercise you will extend the Exercise 1 code to draw a coloured triangle with red, green, and blue vertices.
+
+#### What you will add
+
+- A WGSL shader with vertex and fragment entry points
+- A vertex buffer with three coloured vertices
+- A render pipeline that connects everything
+- A draw call inside the render pass
+
+#### Step 1: add bytemuck to Cargo.toml
+
+```toml
+[dependencies]
+wgpu = { version = "24", features = ["wgsl"] }
+winit = "0.30"
+pollster = "0.4"
+log = "0.4"
+env_logger = "0.11"
+bytemuck = { version = "1", features = ["derive"] }
+```
+
+#### Step 2: the WGSL shader
+
+Create a file called `shader.wgsl` in the same directory as `main.rs`. The Rust code in Step 3 embeds it into the binary at compile time with `include_str!`:
+
+```wgsl
+// Vertex input: position and colour from the vertex buffer
+struct VertexInput {
+    @location(0) position: vec3<f32>,
+    @location(1) color: vec3<f32>,
+}
+
+// Output from vertex shader, input to fragment shader
+struct VertexOutput {
+    @builtin(position) clip_position: vec4<f32>,
+    @location(0) color: vec3<f32>,
+}
+
+@vertex
+fn vs_main(in: VertexInput) -> VertexOutput {
+    var out: VertexOutput;
+    out.clip_position = vec4f(in.position, 1.0);
+    out.color = in.color;
+    return out;
+}
+
+@fragment
+fn fs_main(in: VertexOutput) -> @location(0) vec4<f32> {
+    return vec4f(in.color, 1.0);
+}
+```
+
+#### Step 3: the Rust code
+
+Below is the complete program. It extends Exercise 1 with a vertex buffer, pipeline, and draw call:
+
+```rust
+use winit::{
+    application::ApplicationHandler,
+    event::WindowEvent,
+    event_loop::EventLoop,
+    window::{Window, WindowAttributes},
+};
+use wgpu::util::DeviceExt;
+use std::sync::Arc;
+
+// Vertex data structure; must match the shader's VertexInput
+#[repr(C)]
+#[derive(Copy, Clone, bytemuck::Pod, bytemuck::Zeroable)]
+struct Vertex {
+    position: [f32; 3],
+    color: [f32; 3],
+}
+
+impl Vertex {
+    /// Describe the memory layout for the GPU.
+    fn layout() -> wgpu::VertexBufferLayout<'static> {
+        wgpu::VertexBufferLayout {
+            array_stride: std::mem::size_of::<Vertex>() as u64,
+            step_mode: wgpu::VertexStepMode::Vertex,
+            attributes: &[
+                wgpu::VertexAttribute {
+                    format: wgpu::VertexFormat::Float32x3,
+                    offset: 0,
+                    shader_location: 0, // @location(0) position
+                },
+                wgpu::VertexAttribute {
+                    format: wgpu::VertexFormat::Float32x3,
+                    offset: 12,
+                    shader_location: 1, // @location(1) color
+                },
+            ],
+        }
+    }
+}
+
+// Three vertices forming a coloured triangle
+const VERTICES: &[Vertex] = &[
+    Vertex { position: [ 0.0,  0.5, 0.0], color: [1.0, 0.0, 0.0] }, // top: red
+    Vertex { position: [-0.5, -0.5, 0.0], color: [0.0, 1.0, 0.0] }, // bottom-left: green
+    Vertex { position: [ 0.5, -0.5, 0.0], color: [0.0, 0.0, 1.0] }, // bottom-right: blue
+];
+
+struct GpuState {
+    surface: wgpu::Surface<'static>,
+    device: wgpu::Device,
+    queue: wgpu::Queue,
+    config: wgpu::SurfaceConfiguration,
+    pipeline: wgpu::RenderPipeline,
+    vertex_buffer: wgpu::Buffer,
+}
+
+struct App {
+    window: Option<Arc<Window>>,
+    gpu: Option<GpuState>,
+}
+
+impl App {
+    fn new() -> Self {
+        Self { window: None, gpu: None }
+    }
+
+    fn init_gpu(&mut self, window: Arc<Window>) {
+        let size = window.inner_size();
+        let instance = wgpu::Instance::new(&Default::default());
+        let surface = instance.create_surface(window.clone()).unwrap();
+
+        let adapter = pollster::block_on(instance.request_adapter(
+            &wgpu::RequestAdapterOptions {
+                compatible_surface: Some(&surface),
+                ..Default::default()
+            },
+        )).unwrap();
+
+        let (device, queue) = pollster::block_on(
+            adapter.request_device(&Default::default(), None)
+        ).unwrap();
+
+        let config = surface
+            .get_default_config(&adapter, size.width.max(1), size.height.max(1))
+            .unwrap();
+        surface.configure(&device, &config);
+
+        // Create the shader module from WGSL source
+        let shader = device.create_shader_module(wgpu::ShaderModuleDescriptor {
+            label: Some("Triangle Shader"),
+            source: wgpu::ShaderSource::Wgsl(include_str!("shader.wgsl").into()),
+        });
+
+        // Create the render pipeline
+        let pipeline_layout = device.create_pipeline_layout(
+            &wgpu::PipelineLayoutDescriptor {
+                label: Some("Pipeline Layout"),
+                bind_group_layouts: &[],
+                push_constant_ranges: &[],
+            },
+        );
+
+        let pipeline = device.create_render_pipeline(
+            &wgpu::RenderPipelineDescriptor {
+                label: Some("Triangle Pipeline"),
+                layout: Some(&pipeline_layout),
+                vertex: wgpu::VertexState {
+                    module: &shader,
+                    entry_point: Some("vs_main"),
+                    buffers: &[Vertex::layout()],
+                    compilation_options: Default::default(),
+                },
+                fragment: Some(wgpu::FragmentState {
+                    module: &shader,
+                    entry_point: Some("fs_main"),
+                    targets: &[Some(wgpu::ColorTargetState {
+                        format: config.format,
+                        blend: Some(wgpu::BlendState::REPLACE),
+                        write_mask: wgpu::ColorWrites::ALL,
+                    })],
+                    compilation_options: Default::default(),
+                }),
+                primitive: wgpu::PrimitiveState {
+                    topology: wgpu::PrimitiveTopology::TriangleList,
+                    strip_index_format: None,
+                    front_face: wgpu::FrontFace::Ccw,
+                    cull_mode: Some(wgpu::Face::Back),
+                    unclipped_depth: false,
+                    polygon_mode: wgpu::PolygonMode::Fill,
+                    conservative: false,
+                },
+                depth_stencil: None,
+                multisample: wgpu::MultisampleState::default(),
+                multiview: None,
+                cache: None,
+            },
+        );
+
+        // Create the vertex buffer
+        let vertex_buffer = device.create_buffer_init(
+            &wgpu::util::BufferInitDescriptor {
+                label: Some("Vertex Buffer"),
+                contents: bytemuck::cast_slice(VERTICES),
+                usage: wgpu::BufferUsages::VERTEX,
+            },
+        );
+
+        self.gpu = Some(GpuState {
+            surface, device, queue, config, pipeline, vertex_buffer,
+        });
+    }
+
+    fn render(&self) {
+        let gpu = self.gpu.as_ref().unwrap();
+        let output = gpu.surface.get_current_texture().unwrap();
+        let view = output.texture.create_view(&Default::default());
+        let mut encoder = gpu.device.create_command_encoder(&Default::default());
+
+        {
+            let mut pass = encoder.begin_render_pass(&wgpu::RenderPassDescriptor {
+                label: Some("Triangle Pass"),
+                color_attachments: &[Some(wgpu::RenderPassColorAttachment {
+                    view: &view,
+                    resolve_target: None,
+                    ops: wgpu::Operations {
+                        load: wgpu::LoadOp::Clear(wgpu::Color {
+                            r: 0.1, g: 0.1, b: 0.1, a: 1.0,
+                        }),
+                        store: wgpu::StoreOp::Store,
+                    },
+                })],
+                depth_stencil_attachment: None,
+                ..Default::default()
+            });
+
+            pass.set_pipeline(&gpu.pipeline);
+            pass.set_vertex_buffer(0, gpu.vertex_buffer.slice(..));
+            pass.draw(0..3, 0..1); // 3 vertices, 1 instance
+        }
+
+        gpu.queue.submit(std::iter::once(encoder.finish()));
+        output.present();
+    }
+}
+
+impl ApplicationHandler for App {
+    fn resumed(&mut self, event_loop: &winit::event_loop::ActiveEventLoop) {
+        if self.window.is_none() {
+            let window = Arc::new(
+                event_loop.create_window(
+                    WindowAttributes::default().with_title("Exercise 2: Coloured Triangle")
+                ).unwrap()
+            );
+            self.init_gpu(window.clone());
+            self.window = Some(window);
+        }
+    }
+
+    fn window_event(
+        &mut self,
+        event_loop: &winit::event_loop::ActiveEventLoop,
+        _id: winit::window::WindowId,
+        event: WindowEvent,
+    ) {
+        match event {
+            WindowEvent::CloseRequested => event_loop.exit(),
+            WindowEvent::Resized(size) => {
+                if let Some(gpu) = &mut self.gpu {
+                    gpu.config.width = size.width.max(1);
+                    gpu.config.height = size.height.max(1);
+                    gpu.surface.configure(&gpu.device, &gpu.config);
+                }
+            }
+            WindowEvent::RedrawRequested => {
+                self.render();
+                if let Some(w) = &self.window { w.request_redraw(); }
+            }
+            _ => {}
+        }
+    }
+}
+
+fn main() {
+    env_logger::init();
+    let event_loop = EventLoop::new().unwrap();
+    event_loop.run_app(&mut App::new()).unwrap();
+}
+```
+
+#### Step 4: run and observe
+
+```sh
+cargo run
+```
+
+You should see a triangle centred in the window with a smooth gradient: red at the top, green at the bottom-left, and blue at the bottom-right. The colours blend smoothly across the surface thanks to the interpolation discussed in section 8.
+
+#### Key concepts demonstrated
+
+1. **Vertex struct** with `#[repr(C)]` and `bytemuck` for safe casting to bytes
+2. **Vertex buffer layout** mapping struct fields to `@location(n)` in the shader
+3. **Shader module** loaded from WGSL source via `include_str!`
+4. **Render pipeline** connecting shaders, vertex layout, and output format
+5. **Draw call** (`pass.draw(0..3, 0..1)`) telling the GPU to process 3 vertices as one triangle
+
+**Challenge**: add three more vertices and draw a second triangle to form a rectangle. You will need 6 vertices total (two triangles of 3 vertices each) and change the draw call to `pass.draw(0..6, 0..1)`. Keep both triangles wound counter-clockwise, otherwise back-face culling will hide one of them.
 
 ---
 
 ### 10. Exercise 3: animate the triangle using a time uniform
 
-🚧 This section is a stub. Full content tracked in [edu-cr0w].
+Static shapes are nice, but animation is where shaders really shine. In this exercise you will pass an elapsed time value to the vertex shader and use it to rotate the triangle.
+
+#### New concepts
+
+- **Uniform buffers**: small, read-only buffers for data that is the same across all vertices/fragments in a draw call (like time, camera matrices, light positions)
+- **Bind groups**: how you connect uniform buffers (and other resources) to shader bindings
+- **Updating buffers**: writing new data to a buffer each frame
+
+#### Step 1: the updated WGSL shader
+
+```wgsl
+struct VertexInput {
+    @location(0) position: vec3<f32>,
+    @location(1) color: vec3<f32>,
+}
+
+struct VertexOutput {
+    @builtin(position) clip_position: vec4<f32>,
+    @location(0) color: vec3<f32>,
+}
+
+// A uniform buffer containing the elapsed time
+@group(0) @binding(0)
+var<uniform> time: f32;
+
+@vertex
+fn vs_main(in: VertexInput) -> VertexOutput {
+    // Rotate the vertex around the Z axis
+    let angle = time;
+    let cos_a = cos(angle);
+    let sin_a = sin(angle);
+
+    let rotated = vec3f(
+        in.position.x * cos_a - in.position.y * sin_a,
+        in.position.x * sin_a + in.position.y * cos_a,
+        in.position.z,
+    );
+
+    var out: VertexOutput;
+    out.clip_position = vec4f(rotated, 1.0);
+    out.color = in.color;
+    return out;
+}
+
+@fragment
+fn fs_main(in: VertexOutput) -> @location(0) vec4<f32> {
+    return vec4f(in.color, 1.0);
+}
+```
+
+The key change is the `time` uniform and the 2D rotation matrix applied to each vertex. The rotation is:
+
+```text
+x' = x * cos(angle) - y * sin(angle)
+y' = x * sin(angle) + y * cos(angle)
+```
+
+This rotates the triangle around the origin (centre of clip space) at one radian per second.
+
+#### Step 2: create the uniform buffer and bind group
+
+On the Rust side, you need to:
+
+1. Create a buffer for the time value
+2. Create a bind group layout describing the binding
+3. Create a bind group linking the buffer to the layout
+4. Update the pipeline layout to include the bind group layout
+
+```rust
+use std::time::Instant;
+
+// Create the uniform buffer (4 bytes for one f32)
+let time_buffer = device.create_buffer(&wgpu::BufferDescriptor {
+    label: Some("Time Uniform Buffer"),
+    size: std::mem::size_of::<f32>() as u64,
+    usage: wgpu::BufferUsages::UNIFORM | wgpu::BufferUsages::COPY_DST,
+    mapped_at_creation: false,
+});
+
+// Create a bind group layout
+let bind_group_layout = device.create_bind_group_layout(
+    &wgpu::BindGroupLayoutDescriptor {
+        label: Some("Time Bind Group Layout"),
+        entries: &[wgpu::BindGroupLayoutEntry {
+            binding: 0,
+            visibility: wgpu::ShaderStages::VERTEX,
+            ty: wgpu::BindingType::Buffer {
+                ty: wgpu::BufferBindingType::Uniform,
+                has_dynamic_offset: false,
+                min_binding_size: None,
+            },
+            count: None,
+        }],
+    },
+);
+
+// Create the bind group
+let time_bind_group = device.create_bind_group(&wgpu::BindGroupDescriptor {
+    label: Some("Time Bind Group"),
+    layout: &bind_group_layout,
+    entries: &[wgpu::BindGroupEntry {
+        binding: 0,
+        resource: time_buffer.as_entire_binding(),
+    }],
+});
+
+// Update the pipeline layout to include our bind group
+let pipeline_layout = device.create_pipeline_layout(
+    &wgpu::PipelineLayoutDescriptor {
+        label: Some("Animated Pipeline Layout"),
+        bind_group_layouts: &[&bind_group_layout],
+        push_constant_ranges: &[],
+    },
+);
+```
+
+#### Step 3: update the buffer each frame
+
+In your render function, before beginning the render pass, write the current time to the buffer. This assumes you have stored `time_buffer` and `time_bind_group` in `GpuState` and added a `start_time: Instant` field to `App`, set when the app is created:
+
+```rust
+let elapsed = self.start_time.elapsed().as_secs_f32();
+gpu.queue.write_buffer(&gpu.time_buffer, 0, bytemuck::cast_slice(&[elapsed]));
+```
+
+`queue.write_buffer` copies data from CPU memory into the GPU buffer. This is the simplest way to update a uniform each frame.
+
+#### Step 4: bind the group in the render pass
+
+Inside your render pass, after setting the pipeline:
+
+```rust
+pass.set_pipeline(&gpu.pipeline);
+pass.set_bind_group(0, &gpu.time_bind_group, &[]); // group 0
+pass.set_vertex_buffer(0, gpu.vertex_buffer.slice(..));
+pass.draw(0..3, 0..1);
+```
+
+The `set_bind_group(0, ...)` call makes the time buffer available to the shader as `@group(0) @binding(0)`.
+
+#### Expected result
+
+When you run the program, you should see the coloured triangle smoothly rotating around the centre of the window. The triangle completes one full rotation every 2*pi (approximately 6.28) seconds.
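
The period can be checked on the CPU with the same rotation formula. This is only a sketch: `rotate` is a hypothetical helper that mirrors the WGSL maths, not code from the exercise itself.

```rust
/// Rotate a 2D point about the origin by `angle` radians
/// (counter-clockwise), using the same formula as the vertex shader.
fn rotate(p: [f32; 2], angle: f32) -> [f32; 2] {
    let (sin_a, cos_a) = angle.sin_cos();
    [
        p[0] * cos_a - p[1] * sin_a,
        p[0] * sin_a + p[1] * cos_a,
    ]
}
```

Feeding the top vertex `(0.0, 0.5)` through `rotate` with `angle = std::f32::consts::TAU` returns it to its starting position up to floating-point error, and a quarter turn (`FRAC_PI_2`) moves it to roughly `(-0.5, 0.0)`.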
+ +#### Understanding the flow + +```text + Each frame: + โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” + โ”‚ โ”‚ + โ”‚ CPU: elapsed = Instant::now() - start โ”‚ + โ”‚ queue.write_buffer(time_buffer, elapsed) โ”‚ + โ”‚ โ”‚ + โ”‚ GPU: time uniform โ† time_buffer โ”‚ + โ”‚ for each vertex: โ”‚ + โ”‚ rotated_pos = rotate(vertex.pos, time) โ”‚ + โ”‚ output clip_position = rotated_pos โ”‚ + โ”‚ โ”‚ + โ”‚ Result: triangle rotates smoothly โ”‚ + โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ +``` + +**Challenge**: instead of (or in addition to) rotating, try making the triangle pulse in size using `sin(time)` as a scale factor. Or make it bounce by adding `sin(time) * 0.3` to the y position. + +> **Key takeaway**: uniform buffers let you pass per-frame data (time, matrices, parameters) to shaders. You create a buffer, describe its layout in a bind group, bind it during the render pass, and access it in WGSL via `@group(n) @binding(n)`. This is how you make shaders dynamic. --- @@ -114,13 +1562,397 @@ This document is a self-guided course on GPU shader programming. It is organised ### 11. Texture coordinates (UVs), texture creation, sampler config -๐Ÿšง This section is a stub. Full content tracked in [edu-bycd]. +Solid colours and gradients are a start, but most real-world graphics use **textures** โ€” images mapped onto surfaces. This section explains how textures work, how UV coordinates map image data onto geometry, and how samplers control the lookup. + +#### What are UV coordinates? + +**UV coordinates** (also called **texture coordinates**) describe where on a texture each vertex should sample from. 
They range from (0, 0) at the top-left of the texture to (1, 1) at the bottom-right: + +```text +Texture UV Space +================ + + (0,0)โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€(1,0) + โ”‚ โ”‚ + โ”‚ โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” โ”‚ + โ”‚ โ”‚ โ”‚ โ”‚ + โ”‚ โ”‚ image โ”‚ โ”‚ + โ”‚ โ”‚ data โ”‚ โ”‚ + โ”‚ โ”‚ โ”‚ โ”‚ + โ”‚ โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ โ”‚ + โ”‚ โ”‚ + (0,1)โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€(1,1) + + Note: in wgpu/WebGPU, (0,0) is the top-left + and v increases downward. +``` + +Each vertex carries a UV coordinate. When the GPU rasterises a triangle, it interpolates these UVs across the surface (just like it interpolates colours). The fragment shader then uses the interpolated UV to look up a colour from the texture. + +```text +Quad with UV mapping +==================== + + Vertex 0: pos=(-0.5, 0.5) uv=(0, 0) โ† top-left + Vertex 1: pos=( 0.5, 0.5) uv=(1, 0) โ† top-right + Vertex 2: pos=(-0.5,-0.5) uv=(0, 1) โ† bottom-left + Vertex 3: pos=( 0.5,-0.5) uv=(1, 1) โ† bottom-right + + The full texture maps exactly onto the quad. +``` + +#### Creating a texture in wgpu + +To use a texture, you need to: + +1. **Create** the texture on the GPU +2. **Upload** the image data +3. **Create a texture view** for accessing it in shaders +4. 
**Create a sampler** that controls how texels are looked up
+
+```rust
+// Step 1: Create the texture
+let texture = device.create_texture(&wgpu::TextureDescriptor {
+    label: Some("My Texture"),
+    size: wgpu::Extent3d {
+        width: img_width,
+        height: img_height,
+        depth_or_array_layers: 1,
+    },
+    mip_level_count: 1,
+    sample_count: 1,
+    dimension: wgpu::TextureDimension::D2,
+    format: wgpu::TextureFormat::Rgba8UnormSrgb,
+    usage: wgpu::TextureUsages::TEXTURE_BINDING | wgpu::TextureUsages::COPY_DST,
+    view_formats: &[],
+});
+
+// Step 2: Upload the pixel data
+queue.write_texture(
+    wgpu::TexelCopyTextureInfo {
+        texture: &texture,
+        mip_level: 0,
+        origin: wgpu::Origin3d::ZERO,
+        aspect: wgpu::TextureAspect::All,
+    },
+    &rgba_bytes, // &[u8] of RGBA pixel data
+    wgpu::TexelCopyBufferLayout {
+        offset: 0,
+        bytes_per_row: Some(4 * img_width),
+        rows_per_image: Some(img_height),
+    },
+    wgpu::Extent3d {
+        width: img_width,
+        height: img_height,
+        depth_or_array_layers: 1,
+    },
+);
+
+// Step 3: Create a view
+let texture_view = texture.create_view(&Default::default());
+```
+
+#### Sampler configuration
+
+A **sampler** controls how the GPU looks up texels (texture pixels) when the UV does not land exactly on a texel centre.
There are two key settings:
+
+**Filtering** controls how texels are blended:
+- `Nearest`: picks the closest texel (pixelated look, fast)
+- `Linear`: blends the four nearest texels (smooth look)
+
+```text
+  Nearest filtering         Linear filtering
+  ==================        =================
+
+  ┌───┬───┬───┐             ┌───┬───┬───┐
+  │ A │ B │   │             │ A │ B │   │
+  ├───┼───┼───┤             ├───┼╌╌╌┼───┤
+  │ C │ D │   │             │ C │avg│   │
+  ├───┼───┼───┤             ├───┼───┼───┤
+  │   │   │   │             │   │   │   │
+  └───┴───┴───┘             └───┴───┴───┘
+
+  Nearest: picks one        Linear: blends A,B,C,D
+  texel (e.g., A)           based on distance
+```
+
+**Address mode** (wrapping) controls what happens when UVs go outside the 0-1 range:
+- `ClampToEdge`: UVs outside 0-1 use the edge colour
+- `Repeat`: the texture tiles
+- `MirrorRepeat`: the texture tiles, flipping every other repetition
+
+```rust
+let sampler = device.create_sampler(&wgpu::SamplerDescriptor {
+    label: Some("Texture Sampler"),
+    address_mode_u: wgpu::AddressMode::ClampToEdge,
+    address_mode_v: wgpu::AddressMode::ClampToEdge,
+    address_mode_w: wgpu::AddressMode::ClampToEdge,
+    mag_filter: wgpu::FilterMode::Linear,
+    min_filter: wgpu::FilterMode::Linear,
+    mipmap_filter: wgpu::FilterMode::Nearest,
+    ..Default::default()
+});
+```
+
+#### Bind groups for textures
+
+Textures and samplers are bound to shaders using bind groups, just like uniform buffers:
+
+```rust
+let bind_group_layout = device.create_bind_group_layout(
+    &wgpu::BindGroupLayoutDescriptor {
+        label: Some("Texture Bind Group Layout"),
+        entries: &[
+            // The texture
+            wgpu::BindGroupLayoutEntry {
+                binding: 0,
+                visibility: wgpu::ShaderStages::FRAGMENT,
+                ty: wgpu::BindingType::Texture {
+                    sample_type: wgpu::TextureSampleType::Float { filterable: true },
+                    view_dimension:
wgpu::TextureViewDimension::D2,
+                    multisampled: false,
+                },
+                count: None,
+            },
+            // The sampler
+            wgpu::BindGroupLayoutEntry {
+                binding: 1,
+                visibility: wgpu::ShaderStages::FRAGMENT,
+                ty: wgpu::BindingType::Sampler(
+                    wgpu::SamplerBindingType::Filtering,
+                ),
+                count: None,
+            },
+        ],
+    },
+);
+```
+
+In WGSL, you access them like this:
+
+```wgsl
+@group(0) @binding(0)
+var t_diffuse: texture_2d<f32>;
+
+@group(0) @binding(1)
+var s_diffuse: sampler;
+
+@fragment
+fn fs_main(in: VertexOutput) -> @location(0) vec4<f32> {
+    return textureSample(t_diffuse, s_diffuse, in.uv);
+}
+```
+
+The `textureSample` function takes a texture, a sampler, and UV coordinates, and returns the sampled colour.
+
+> **Key takeaway**: textures are images stored on the GPU. UV coordinates map texture space onto geometry. Samplers control filtering (nearest vs linear) and wrapping behaviour. The fragment shader uses `textureSample` to look up a colour from the texture at interpolated UV coordinates.

 ---

 ### 12. Exercise 4: render a textured quad

-🚧 This section is a stub. Full content tracked in [edu-xv9j].
+In this exercise you will draw a rectangle (two triangles forming a quad) with a texture mapped onto it. You will create a procedural checkerboard texture in code rather than loading an image file, keeping the exercise self-contained.
+
+#### Step 1: add dependencies
+
+We do not need an image loading crate for this exercise since we generate the texture procedurally. The same `Cargo.toml` from Exercise 2 works, with `bytemuck` already included.
+
+#### Step 2: the WGSL shader
+
+```wgsl
+struct VertexInput {
+    @location(0) position: vec3<f32>,
+    @location(1) uv: vec2<f32>,
+}
+
+struct VertexOutput {
+    @builtin(position) clip_position: vec4<f32>,
+    @location(0) uv: vec2<f32>,
+}
+
+@vertex
+fn vs_main(in: VertexInput) -> VertexOutput {
+    var out: VertexOutput;
+    out.clip_position = vec4f(in.position, 1.0);
+    out.uv = in.uv;
+    return out;
+}
+
+@group(0) @binding(0)
+var t_texture: texture_2d<f32>;
+
+@group(0) @binding(1)
+var s_sampler: sampler;
+
+@fragment
+fn fs_main(in: VertexOutput) -> @location(0) vec4<f32> {
+    return textureSample(t_texture, s_sampler, in.uv);
+}
+```
+
+Note how the vertex now carries a `vec2<f32>` UV coordinate instead of a colour. The fragment shader samples the texture at the interpolated UV.
+
+#### Step 3: vertex data for a quad
+
+A quad is two triangles. We define six vertices (four vertices plus an index buffer would also work; we use six for simplicity):
+
+```rust
+#[repr(C)]
+#[derive(Copy, Clone, bytemuck::Pod, bytemuck::Zeroable)]
+struct Vertex {
+    position: [f32; 3],
+    uv: [f32; 2],
+}
+
+// Two triangles forming a quad
+const VERTICES: &[Vertex] = &[
+    // Triangle 1 (top-left half)
+    Vertex { position: [-0.5,  0.5, 0.0], uv: [0.0, 0.0] }, // top-left
+    Vertex { position: [-0.5, -0.5, 0.0], uv: [0.0, 1.0] }, // bottom-left
+    Vertex { position: [ 0.5,  0.5, 0.0], uv: [1.0, 0.0] }, // top-right
+    // Triangle 2 (bottom-right half)
+    Vertex { position: [ 0.5,  0.5, 0.0], uv: [1.0, 0.0] }, // top-right
+    Vertex { position: [-0.5, -0.5, 0.0], uv: [0.0, 1.0] }, // bottom-left
+    Vertex { position: [ 0.5, -0.5, 0.0], uv: [1.0, 1.0] }, // bottom-right
+];
+```
+
+#### Step 4: generate a procedural checkerboard texture
+
+```rust
+/// Generate a checkerboard pattern as tightly packed RGBA bytes.
+fn make_checkerboard(width: u32, height: u32, cell_size: u32) -> Vec<u8> {
+    let mut pixels = Vec::with_capacity((width * height * 4) as usize);
+    for y in 0..height {
+        for x in 0..width {
+            let is_white = ((x / cell_size) + (y / cell_size)) % 2 == 0;
+            let val = if is_white { 255u8 } else { 80u8 };
+            pixels.push(val); // R
+            pixels.push(val); // G
+            pixels.push(val); // B
+            pixels.push(255); // A
+        }
+    }
+    pixels
+}
+```
+
+Call it with `make_checkerboard(256, 256, 32)` to get a 256x256 texture with 32-pixel checker cells (an 8x8 grid of squares).
+
+#### Step 5: create the texture, view, and sampler
+
+```rust
+let tex_size = 256u32;
+let tex_data = make_checkerboard(tex_size, tex_size, 32);
+
+let texture = device.create_texture(&wgpu::TextureDescriptor {
+    label: Some("Checkerboard Texture"),
+    size: wgpu::Extent3d {
+        width: tex_size,
+        height: tex_size,
+        depth_or_array_layers: 1,
+    },
+    mip_level_count: 1,
+    sample_count: 1,
+    dimension: wgpu::TextureDimension::D2,
+    format: wgpu::TextureFormat::Rgba8UnormSrgb,
+    usage: wgpu::TextureUsages::TEXTURE_BINDING | wgpu::TextureUsages::COPY_DST,
+    view_formats: &[],
+});
+
+queue.write_texture(
+    wgpu::TexelCopyTextureInfo {
+        texture: &texture,
+        mip_level: 0,
+        origin: wgpu::Origin3d::ZERO,
+        aspect: wgpu::TextureAspect::All,
+    },
+    &tex_data,
+    wgpu::TexelCopyBufferLayout {
+        offset: 0,
+        bytes_per_row: Some(4 * tex_size),
+        rows_per_image: Some(tex_size),
+    },
+    wgpu::Extent3d {
+        width: tex_size,
+        height: tex_size,
+        depth_or_array_layers: 1,
+    },
+);
+
+let texture_view = texture.create_view(&Default::default());
+
+let sampler = device.create_sampler(&wgpu::SamplerDescriptor {
+    label: Some("Checkerboard Sampler"),
+    mag_filter: wgpu::FilterMode::Nearest, // crisp pixels for checkerboard
+    min_filter: wgpu::FilterMode::Nearest,
+    ..Default::default()
+});
+```
+
+#### Step 6: bind group setup
+
+```rust
+let bind_group_layout = device.create_bind_group_layout(
+    &wgpu::BindGroupLayoutDescriptor {
+        label: Some("Texture Bind Group
Layout"),
+        entries: &[
+            wgpu::BindGroupLayoutEntry {
+                binding: 0,
+                visibility: wgpu::ShaderStages::FRAGMENT,
+                ty: wgpu::BindingType::Texture {
+                    sample_type: wgpu::TextureSampleType::Float { filterable: true },
+                    view_dimension: wgpu::TextureViewDimension::D2,
+                    multisampled: false,
+                },
+                count: None,
+            },
+            wgpu::BindGroupLayoutEntry {
+                binding: 1,
+                visibility: wgpu::ShaderStages::FRAGMENT,
+                ty: wgpu::BindingType::Sampler(wgpu::SamplerBindingType::Filtering),
+                count: None,
+            },
+        ],
+    },
+);
+
+let bind_group = device.create_bind_group(&wgpu::BindGroupDescriptor {
+    label: Some("Texture Bind Group"),
+    layout: &bind_group_layout,
+    entries: &[
+        wgpu::BindGroupEntry {
+            binding: 0,
+            resource: wgpu::BindingResource::TextureView(&texture_view),
+        },
+        wgpu::BindGroupEntry {
+            binding: 1,
+            resource: wgpu::BindingResource::Sampler(&sampler),
+        },
+    ],
+});
+```
+
+Remember to include `&bind_group_layout` in your pipeline layout's `bind_group_layouts` array, and update the vertex buffer layout to match the new `Vertex` struct (position: Float32x3 at offset 0, uv: Float32x2 at offset 12).
+
+#### Step 7: draw the quad
+
+In your render pass:
+
+```rust
+pass.set_pipeline(&gpu.pipeline);
+pass.set_bind_group(0, &gpu.bind_group, &[]);
+pass.set_vertex_buffer(0, gpu.vertex_buffer.slice(..));
+pass.draw(0..6, 0..1); // 6 vertices = 2 triangles = 1 quad
+```
+
+#### Expected result
+
+You should see a rectangle in the centre of the window showing a checkerboard of white and dark-grey squares (the generator writes 255 and 80, not pure black). The texture is mapped so that the full checkerboard fills the quad exactly.
+
+**Challenge**: try changing the sampler's `mag_filter` from `Nearest` to `Linear` and see how the checkerboard edges become blurred when the quad is large. Then try setting `address_mode_u` and `address_mode_v` to `Repeat`, and change the UVs to go from 0 to 3; the checkerboard should then tile three times across the quad.
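The address modes mentioned in the challenge can be modelled on the CPU to build intuition before touching the sampler. The sketch below is only an illustration: the `AddressMode` enum and `wrap_coord` function are hypothetical names written for this example, not part of wgpu, and real hardware applies the wrapping per texel with its own edge handling.

```rust
/// Hypothetical CPU-side model of sampler address modes (illustration only,
/// not the wgpu type of the same name).
#[derive(Clone, Copy, Debug, PartialEq)]
enum AddressMode {
    ClampToEdge,
    Repeat,
    MirrorRepeat,
}

/// Map an arbitrary texture coordinate into the [0, 1] range.
fn wrap_coord(u: f32, mode: AddressMode) -> f32 {
    match mode {
        // Anything outside the range sticks to the nearest edge.
        AddressMode::ClampToEdge => u.clamp(0.0, 1.0),
        // Keep only the fractional part, so 2.25 lands where 0.25 does.
        AddressMode::Repeat => u - u.floor(),
        // Period of 2: the first half runs forwards, the second half backwards.
        AddressMode::MirrorRepeat => {
            let t = u.rem_euclid(2.0);
            if t <= 1.0 { t } else { 2.0 - t }
        }
    }
}
```

With UVs running from 0 to 3, `wrap_coord(u, AddressMode::Repeat)` sweeps through the 0 to 1 range three times, which is why the checkerboard appears three times across the quad. For example, `wrap_coord(1.25, AddressMode::MirrorRepeat)` gives 0.75, the mirrored position.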
+
+> **Key takeaway**: texturing involves creating a texture from pixel data, configuring a sampler for filtering and wrapping, binding both via a bind group, and sampling in the fragment shader using interpolated UV coordinates. This same pattern applies whether your texture is a checkerboard, a photograph, or a render target from a previous pass.

 ---

@@ -128,19 +1960,503 @@ This document is a self-guided course on GPU shader programming. It is organised

 ### 13. Compute pipelines: dispatching work groups

-🚧 This section is a stub. Full content tracked in [edu-2sqo].
+Compute shaders break free from the graphics pipeline entirely. There are no vertices, no triangles, no pixels, just raw parallel computation. This makes them ideal for physics simulations, image processing, data transformations, and any task that benefits from GPU parallelism.
+
+#### Graphics pipeline vs compute pipeline
+
+```text
+Graphics Pipeline              Compute Pipeline
+=================              ================
+
+  Vertices                       Dispatch(x, y, z)
+     │                                │
+     ▼                                ▼
+  Vertex Shader                  Compute Shader
+     │                                │
+     ▼                                ▼
+  Rasterisation                  Storage buffers /
+     │                           textures (output)
+     ▼
+  Fragment Shader
+     │
+     ▼
+  Framebuffer (pixels)
+
+  Produces images                Produces data
+```
+
+With compute shaders, you do not set up vertex buffers, render passes, or colour attachments. Instead, you **dispatch** work and let the compute shader read/write storage buffers or textures directly.
+
+#### Work groups and invocations
+
+When you dispatch a compute shader, you specify a 3D grid of **work groups**. Each work group contains a fixed number of **invocations** (threads), defined by `@workgroup_size` in the shader.
+
+```text
+Dispatch and Work Groups
+========================
+
+  dispatch(4, 3, 1)  ← 4 x 3 x 1 = 12 work groups
+
+  ┌─────┐ ┌─────┐ ┌─────┐ ┌─────┐
+  │ WG  │ │ WG  │ │ WG  │ │ WG  │   row 0
+  │(0,0)│ │(1,0)│ │(2,0)│ │(3,0)│
+  └─────┘ └─────┘ └─────┘ └─────┘
+  ┌─────┐ ┌─────┐ ┌─────┐ ┌─────┐
+  │ WG  │ │ WG  │ │ WG  │ │ WG  │   row 1
+  │(0,1)│ │(1,1)│ │(2,1)│ │(3,1)│
+  └─────┘ └─────┘ └─────┘ └─────┘
+  ┌─────┐ ┌─────┐ ┌─────┐ ┌─────┐
+  │ WG  │ │ WG  │ │ WG  │ │ WG  │   row 2
+  │(0,2)│ │(1,2)│ │(2,2)│ │(3,2)│
+  └─────┘ └─────┘ └─────┘ └─────┘
+
+  Inside each work group (e.g., @workgroup_size(8, 8, 1)):
+
+  ┌─┬─┬─┬─┬─┬─┬─┬─┐
+  │·│·│·│·│·│·│·│·│   8 invocations wide
+  ├─┼─┼─┼─┼─┼─┼─┼─┤
+  │·│·│·│·│·│·│·│·│   x 8 invocations tall
+  ├─┼─┼─┼─┼─┼─┼─┼─┤
+  │·│·│·│·│·│·│·│·│   = 64 invocations per
+  ├─┼─┼─┼─┼─┼─┼─┼─┤     work group
+  │·│·│·│·│·│·│·│·│
+  ├─┼─┼─┼─┼─┼─┼─┼─┤
+  │·│·│·│·│·│·│·│·│
+  ├─┼─┼─┼─┼─┼─┼─┼─┤
+  │·│·│·│·│·│·│·│·│
+  ├─┼─┼─┼─┼─┼─┼─┼─┤
+  │·│·│·│·│·│·│·│·│
+  ├─┼─┼─┼─┼─┼─┼─┼─┤
+  │·│·│·│·│·│·│·│·│
+  └─┴─┴─┴─┴─┴─┴─┴─┘
+
+  Total invocations = 12 work groups x 64 =
768 threads
+```
+
+#### Built-in IDs
+
+Each invocation knows its position in the grid via built-in variables:
+
+| Built-in | Type | Meaning |
+|----------|------|---------|
+| `global_invocation_id` | `vec3<u32>` | Unique ID across the entire dispatch |
+| `local_invocation_id` | `vec3<u32>` | ID within the work group (0 to workgroup_size-1) |
+| `workgroup_id` | `vec3<u32>` | Which work group this invocation belongs to |
+| `num_workgroups` | `vec3<u32>` | Total number of work groups dispatched |
+
+`global_invocation_id` is the most commonly used: it gives each thread a unique index.
+
+```wgsl
+@compute @workgroup_size(64)
+fn main(@builtin(global_invocation_id) id: vec3<u32>) {
+    let index = id.x;
+    // Process element at `index`
+}
+```
+
+#### Choosing workgroup_size
+
+The `@workgroup_size(x, y, z)` declaration sets how many invocations run per work group. Guidelines:
+
+- **Total invocations per group** (x * y * z) should be a multiple of 32 or 64 for best performance (matching GPU warp/wavefront size)
+- Common choices: `@workgroup_size(64)`, `@workgroup_size(256)`, `@workgroup_size(8, 8)` (for 2D), `@workgroup_size(4, 4, 4)` (for 3D)
+- The maximum total varies by GPU but is typically 256 or 1024
+
+#### Creating a compute pipeline in Rust
+
+```rust
+let compute_shader = device.create_shader_module(wgpu::ShaderModuleDescriptor {
+    label: Some("Compute Shader"),
+    source: wgpu::ShaderSource::Wgsl(shader_source.into()),
+});
+
+let compute_pipeline = device.create_compute_pipeline(
+    &wgpu::ComputePipelineDescriptor {
+        label: Some("Compute Pipeline"),
+        layout: Some(&pipeline_layout),
+        module: &compute_shader,
+        entry_point: Some("main"),
+        compilation_options: Default::default(),
+        cache: None,
+    },
+);
+```
+
+#### Dispatching work
+
+Instead of a render pass, you use a **compute pass**:
+
+```rust
+let mut encoder = device.create_command_encoder(&Default::default());
+{
+    let mut compute_pass = encoder.begin_compute_pass(&Default::default());
+    compute_pass.set_pipeline(&compute_pipeline);
+    compute_pass.set_bind_group(0, &bind_group, &[]);
+    compute_pass.dispatch_workgroups(num_groups_x, num_groups_y, num_groups_z);
+}
+queue.submit(std::iter::once(encoder.finish()));
+```
+
+If you have 1024 elements and your workgroup_size is 64, you dispatch `1024 / 64 = 16` work groups: `dispatch_workgroups(16, 1, 1)`.
+
+> **Key takeaway**: compute shaders run outside the graphics pipeline. You dispatch a 3D grid of work groups, each containing a fixed number of invocations. Every invocation gets a unique `global_invocation_id` to determine which data element to process. This is how you harness the GPU's parallelism for general-purpose computation.

 ---

 ### 14. Storage buffers and read/write access from WGSL

-🚧 This section is a stub. Full content tracked in [edu-uxa1].
+Compute shaders need to read input data and write output data. **Storage buffers** are the primary mechanism for this. Unlike uniform buffers (which are small and read-only), storage buffers can be large and support both reading and writing.
+
+#### Storage buffers vs uniform buffers
+
+| Feature | Uniform Buffer | Storage Buffer |
+|---------|---------------|----------------|
+| Max size | ~64 KB (varies) | Hundreds of MB |
+| Access | Read-only | Read-only or read-write |
+| Speed | Faster (cached aggressively) | Slightly slower |
+| Use case | Small, per-frame constants | Large data arrays |
+
+Use uniform buffers for things like transformation matrices, time values, and camera parameters. Use storage buffers for arrays of particles, pixels, mesh data, or any large dataset.
+
+#### Declaring storage buffers in WGSL
+
+```wgsl
+// Read-only storage buffer
+@group(0) @binding(0)
+var<storage, read> input: array<f32>;
+
+// Read-write storage buffer
+@group(0) @binding(1)
+var<storage, read_write> output: array<f32>;
+```
+
+You can also use structs:
+
+```wgsl
+struct Particle {
+    position: vec2<f32>,
+    velocity: vec2<f32>,
+}
+
+@group(0) @binding(0)
+var<storage, read_write> particles: array<Particle>;
+```
+
+#### Accessing storage buffer data
+
+Storage buffers behave like regular arrays in WGSL:
+
+```wgsl
+@compute @workgroup_size(64)
+fn main(@builtin(global_invocation_id) id: vec3<u32>) {
+    let i = id.x;
+
+    // Bounds check: important when dispatch size
+    // does not evenly divide the data
+    if i >= arrayLength(&particles) {
+        return;
+    }
+
+    // Read
+    let pos = particles[i].position;
+    let vel = particles[i].velocity;
+
+    // Compute (`delta_time` is assumed to be a uniform declared elsewhere)
+    let new_pos = pos + vel * delta_time;
+
+    // Write back
+    particles[i].position = new_pos;
+}
+```
+
+The `arrayLength(&buffer)` function returns the number of elements in a runtime-sized array. Always use it for bounds checking: if your dispatch creates more invocations than data elements, the extra threads must bail out early.
+
+#### Creating storage buffers in Rust
+
+```rust
+// Create a storage buffer from initial data
+let storage_buffer = device.create_buffer_init(&wgpu::util::BufferInitDescriptor {
+    label: Some("Particle Buffer"),
+    contents: bytemuck::cast_slice(&initial_particles),
+    usage: wgpu::BufferUsages::STORAGE
+        | wgpu::BufferUsages::COPY_SRC  // to read back to CPU
+        | wgpu::BufferUsages::COPY_DST, // to write from CPU
+});
+```
+
+The `STORAGE` usage flag is required. Add `COPY_SRC` if you want to read data back to the CPU, and `COPY_DST` if you want to upload data from the CPU.
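Because `bytemuck::cast_slice` reinterprets the Rust structs as raw bytes, the CPU-side struct has to match the layout the shader expects. The following is a minimal standard-library sketch of that idea, assuming the two-`vec2<f32>` `Particle` shown above; `particles_to_bytes` and `buffer_size` are hypothetical helpers written for this illustration (the course code uses `bytemuck` instead of copying).

```rust
use std::mem::size_of;

/// CPU-side mirror of the WGSL `Particle` struct (two `vec2<f32>` fields).
#[repr(C)]
#[derive(Clone, Copy, Debug, PartialEq)]
struct Particle {
    position: [f32; 2],
    velocity: [f32; 2],
}

/// Serialise particles into the byte layout the shader reads:
/// four native-endian f32 values per particle, with no padding.
/// (Hypothetical helper; `bytemuck::cast_slice` does this without copying.)
fn particles_to_bytes(particles: &[Particle]) -> Vec<u8> {
    let mut bytes = Vec::with_capacity(particles.len() * size_of::<Particle>());
    for p in particles {
        for value in p.position.iter().chain(p.velocity.iter()) {
            bytes.extend_from_slice(&value.to_ne_bytes());
        }
    }
    bytes
}

/// Size in bytes of the storage buffer needed for `count` particles.
fn buffer_size(count: usize) -> u64 {
    (count * size_of::<Particle>()) as u64
}
```

Each particle occupies 16 bytes on both sides, so 10,000 particles need a 160,000-byte buffer. The match is not automatic for every struct: a `vec3<f32>` field is 16-byte aligned in WGSL, so a Rust mirror of such a struct would need explicit padding.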
+
+#### Bind group layout for storage buffers
+
+```rust
+wgpu::BindGroupLayoutEntry {
+    binding: 0,
+    visibility: wgpu::ShaderStages::COMPUTE,
+    ty: wgpu::BindingType::Buffer {
+        ty: wgpu::BufferBindingType::Storage {
+            read_only: false, // true for read-only access
+        },
+        has_dynamic_offset: false,
+        min_binding_size: None,
+    },
+    count: None,
+}
+```
+
+#### Reading results back to the CPU
+
+GPU buffers are not directly accessible from CPU memory. To read results back, you copy to a staging buffer with `MAP_READ` usage:
+
+```rust
+// Create a staging buffer
+let staging_buffer = device.create_buffer(&wgpu::BufferDescriptor {
+    label: Some("Staging Buffer"),
+    size: storage_buffer.size(),
+    usage: wgpu::BufferUsages::MAP_READ | wgpu::BufferUsages::COPY_DST,
+    mapped_at_creation: false,
+});
+
+// Copy from storage to staging
+encoder.copy_buffer_to_buffer(
+    &storage_buffer, 0,
+    &staging_buffer, 0,
+    storage_buffer.size(),
+);
+queue.submit(std::iter::once(encoder.finish()));
+
+// Map the staging buffer and read the data
+let slice = staging_buffer.slice(..);
+slice.map_async(wgpu::MapMode::Read, |_| {});
+device.poll(wgpu::Maintain::Wait);
+
+let data = slice.get_mapped_range();
+let result: &[Particle] = bytemuck::cast_slice(&data);
+// Use the result...
+
+drop(data);
+staging_buffer.unmap();
+```
+
+#### Memory considerations
+
+- **Workgroup memory**: WGSL also supports `var<workgroup>` for shared memory within a work group. This is very fast but limited in size (typically 16-48 KB).
+- **Synchronisation**: within a work group, use `workgroupBarrier()` to ensure all threads have finished writing before any thread reads shared data. Across work groups, there is no synchronisation within a single dispatch; use separate dispatches if you need global barriers.
+
+```wgsl
+var<workgroup> shared_data: array<f32, 64>;
+
+@compute @workgroup_size(64)
+fn main(@builtin(local_invocation_id) lid: vec3<u32>) {
+    shared_data[lid.x] = some_computation();
+    workgroupBarrier(); // wait for all threads in this group
+    let neighbour = shared_data[(lid.x + 1u) % 64u];
+}
+```
+
+> **Key takeaway**: storage buffers are the workhorse of compute shaders: they hold large arrays that shaders can read and write. Declare them with `var<storage, read>` or `var<storage, read_write>` in WGSL, create them with `BufferUsages::STORAGE` in Rust, and always bounds-check with `arrayLength`. To read results back to CPU, copy to a staging buffer with `MAP_READ`.

 ---

 ### 15. Exercise 5: GPU-accelerate a particle simulation

-🚧 This section is a stub. Full content tracked in [edu-exby].
+In this exercise you will build a simple particle system where thousands of particles are updated each frame by a compute shader. Particles will have positions and velocities, bounce off the edges of the screen, and be rendered as points.
+
+#### Overview
+
+The architecture is:
+
+```text
+  ┌──────────────┐     ┌──────────────────┐     ┌─────────────┐
+  │ CPU: init    │────►│ GPU: compute pass│────►│ GPU: render │
+  │ particles    │     │ update positions │     │ pass: draw  │
+  │ once         │     │ each frame       │     │ as points   │
+  └──────────────┘     └──────────────────┘     └─────────────┘
+                                │
+                                ▼
+                         Storage buffer
+                         (read/write by
+                         compute shader,
+                         read as vertex
+                         buffer for render)
+```
+
+The same buffer serves double duty: the compute shader writes updated positions into it, and the render pass reads it as a vertex buffer.
+
+#### Step 1: particle data structure
+
+```rust
+#[repr(C)]
+#[derive(Copy, Clone, bytemuck::Pod, bytemuck::Zeroable)]
+struct Particle {
+    position: [f32; 2],
+    velocity: [f32; 2],
+}
+```
+
+#### Step 2: initialise particles
+
+```rust
+use rand::Rng;
+
+fn create_particles(count: usize) -> Vec<Particle> {
+    let mut rng = rand::rng();
+    (0..count)
+        .map(|_| Particle {
+            position: [
+                rng.random_range(-1.0f32..1.0),
+                rng.random_range(-1.0f32..1.0),
+            ],
+            velocity: [
+                rng.random_range(-0.5f32..0.5),
+                rng.random_range(-0.5f32..0.5),
+            ],
+        })
+        .collect()
+}
+```
+
+Add `rand = "0.9"` to your `Cargo.toml`.
+
+#### Step 3: the compute shader (WGSL)
+
+```wgsl
+struct Particle {
+    position: vec2<f32>,
+    velocity: vec2<f32>,
+}
+
+@group(0) @binding(0)
+var<storage, read_write> particles: array<Particle>;
+
+@group(0) @binding(1)
+var<uniform> delta_time: f32;
+
+@compute @workgroup_size(64)
+fn cs_main(@builtin(global_invocation_id) id: vec3<u32>) {
+    let i = id.x;
+    if i >= arrayLength(&particles) {
+        return;
+    }
+
+    var p = particles[i];
+
+    // Update position
+    p.position = p.position + p.velocity * delta_time;
+
+    // Bounce off edges
+    if p.position.x < -1.0 || p.position.x > 1.0 {
+        p.velocity.x = -p.velocity.x;
+        p.position.x = clamp(p.position.x, -1.0, 1.0);
+    }
+    if p.position.y < -1.0 || p.position.y > 1.0 {
+        p.velocity.y = -p.velocity.y;
+        p.position.y = clamp(p.position.y, -1.0, 1.0);
+    }
+
+    particles[i] = p;
+}
+```
+
+#### Step 4: the render shader (WGSL)
+
+To render particles as points, the vertex shader reads the position from the storage buffer.
Each particle becomes one point:
+
+```wgsl
+struct RenderOutput {
+    @builtin(position) pos: vec4<f32>,
+}
+
+// We read the same particle buffer as a read-only storage buffer for rendering
+@group(0) @binding(0)
+var<storage, read> render_particles: array<Particle>;
+
+@vertex
+fn vs_render(@builtin(vertex_index) vi: u32) -> RenderOutput {
+    var out: RenderOutput;
+    let p = render_particles[vi];
+    out.pos = vec4f(p.position, 0.0, 1.0);
+    return out;
+}
+
+@fragment
+fn fs_render() -> @location(0) vec4<f32> {
+    return vec4f(0.2, 0.8, 0.4, 1.0); // green particles
+}
+```
+
+Note: WGSL has no `point_size` built-in, so points drawn with the `PointList` topology are one pixel in size. To draw larger particles, render each one as a small quad using instancing.
+
+#### Step 5: buffer creation
+
+```rust
+let num_particles = 10_000u32;
+let particles = create_particles(num_particles as usize);
+
+let particle_buffer = device.create_buffer_init(&wgpu::util::BufferInitDescriptor {
+    label: Some("Particle Buffer"),
+    contents: bytemuck::cast_slice(&particles),
+    usage: wgpu::BufferUsages::STORAGE | wgpu::BufferUsages::VERTEX,
+});
+
+let dt_buffer = device.create_buffer(&wgpu::BufferDescriptor {
+    label: Some("Delta Time Buffer"),
+    size: 4,
+    usage: wgpu::BufferUsages::UNIFORM | wgpu::BufferUsages::COPY_DST,
+    mapped_at_creation: false,
+});
+```
+
+The particle buffer has both `STORAGE` (for the compute shader) and `VERTEX` (for the render pipeline) usage flags. The render shader above reads the buffer through a storage binding indexed by `vertex_index`; the `VERTEX` flag matters if you instead bind the buffer with `set_vertex_buffer`.
+
+#### Step 6: frame loop
+
+Each frame:
+
+1. Calculate delta time
+2. Write delta time to the uniform buffer
+3. Run the compute pass to update particles
+4.
Run the render pass to draw particles
+
+```rust
+// Compute pass
+{
+    let mut cpass = encoder.begin_compute_pass(&Default::default());
+    cpass.set_pipeline(&compute_pipeline);
+    cpass.set_bind_group(0, &compute_bind_group, &[]);
+    let num_workgroups = (num_particles + 63) / 64; // round up
+    cpass.dispatch_workgroups(num_workgroups, 1, 1);
+}
+
+// Render pass
+{
+    let mut rpass = encoder.begin_render_pass(&wgpu::RenderPassDescriptor {
+        color_attachments: &[Some(wgpu::RenderPassColorAttachment {
+            view: &view,
+            resolve_target: None,
+            ops: wgpu::Operations {
+                load: wgpu::LoadOp::Clear(wgpu::Color::BLACK),
+                store: wgpu::StoreOp::Store,
+            },
+        })],
+        ..Default::default()
+    });
+    rpass.set_pipeline(&render_pipeline);
+    rpass.set_bind_group(0, &render_bind_group, &[]);
+    rpass.draw(0..num_particles, 0..1);
+}
+```
+
+Note how `dispatch_workgroups` rounds up: `(10000 + 63) / 64 = 157` work groups, giving 10048 invocations. The bounds check in the shader (`if i >= arrayLength(&particles)`) prevents the extra 48 threads from accessing out-of-bounds memory.
+
+#### Expected result
+
+You should see thousands of small green particles bouncing around the window, all updated in parallel on the GPU. With 10,000 particles at 60 FPS, the GPU handles 600,000 particle updates per second with ease, and it could handle millions.
+
+**Challenge**: add a gravity force (`p.velocity.y -= 9.8 * delta_time`) and watch the particles fall and bounce off the bottom edge. Or add mouse interaction: pass the mouse position as a uniform and apply a force toward or away from the cursor.
+
+> **Key takeaway**: compute shaders can update large datasets in parallel every frame. By giving a buffer both `STORAGE` and `VERTEX` usage flags, you can update data in a compute pass and render it in a render pass without copying between buffers. This compute-then-render pattern is the foundation of GPU-driven simulations.
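A CPU reference version of the update is a handy way to sanity-check the shader logic and the dispatch arithmetic. The sketch below mirrors the bounce rule from `cs_main` and the round-up formula from the frame loop; `workgroup_count`, `step`, and `step_all` are hypothetical names chosen for this illustration, and the sequential loop stands in for what the GPU runs in parallel.

```rust
/// CPU-side copy of the particle layout used in this exercise.
#[derive(Clone, Copy, Debug, PartialEq)]
struct Particle {
    position: [f32; 2],
    velocity: [f32; 2],
}

/// Number of work groups needed to cover `count` items, rounding up.
fn workgroup_count(count: u32, workgroup_size: u32) -> u32 {
    (count + workgroup_size - 1) / workgroup_size
}

/// One simulation step for a single particle, mirroring `cs_main`:
/// integrate the position, then reflect the velocity and clamp the
/// position on whichever axis left the [-1, 1] range.
fn step(mut p: Particle, dt: f32) -> Particle {
    for axis in 0..2 {
        p.position[axis] += p.velocity[axis] * dt;
        if p.position[axis] < -1.0 || p.position[axis] > 1.0 {
            p.velocity[axis] = -p.velocity[axis];
            p.position[axis] = p.position[axis].clamp(-1.0, 1.0);
        }
    }
    p
}

/// Sequential stand-in for one dispatch over the whole buffer.
fn step_all(particles: &mut [Particle], dt: f32) {
    for p in particles.iter_mut() {
        *p = step(*p, dt);
    }
}
```

For 10,000 particles and a work group size of 64, `workgroup_count` returns 157, which covers 10,048 invocations; the 48 surplus invocations are the ones the shader's bounds check turns away. Running `step_all` on a small buffer and comparing against a GPU readback is one way to confirm the shader behaves as intended.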
 ---

@@ -148,16 +2464,291 @@ This document is a self-guided course on GPU shader programming. It is organised

 ### 16. Post-processing effects (bloom, blur): conceptual overview

-🚧 This section is a stub. Full content tracked in [edu-9lda].
+So far, you have rendered directly to the screen. But many visual effects require **multi-pass rendering**: render the scene to an intermediate texture first, then process that texture in subsequent passes before displaying the final result. This is called **post-processing**.
+
+#### Render-to-texture
+
+Instead of targeting the swap chain texture directly, you create an off-screen texture and render to it:
+
+```text
+Render-to-Texture
+==================
+
+  Pass 1: Render scene                 Pass 2: Post-process
+  ┌──────────────────┐                 ┌──────────────────┐
+  │                  │                 │                  │
+  │  Scene geometry  │──render to──►   │  Full-screen     │──render to──► Screen
+  │  (3D objects)    │  off-screen     │  quad sampling   │  swap chain
+  │                  │  texture        │  the texture     │  texture
+  └──────────────────┘                 └──────────────────┘
+```
+
+In wgpu, this means creating a `wgpu::Texture` with `RENDER_ATTACHMENT | TEXTURE_BINDING` usage. You render to it in pass 1, then sample from it in pass 2.
+
+#### Bloom effect
+
+Bloom makes bright areas of an image glow, simulating how real cameras and eyes perceive very bright light. The algorithm has three stages:
+
+```text
+Bloom Pipeline
+==============
+
+  Scene ──► [1. Threshold] ──► [2. Blur] ──► [3. Composite] ──► Final
+             Extract            Gaussian      Add blurred
+             bright             blur the      bright areas
+             pixels             result        back onto
+             only                             the original
+```
+
+**Stage 1 (Threshold)**: a fragment shader that outputs only pixels brighter than a threshold, and black for everything else.
+
+```wgsl
+@fragment
+fn threshold(in: FullscreenInput) -> @location(0) vec4<f32> {
+    let color = textureSample(scene_texture, samp, in.uv);
+    let brightness = dot(color.rgb, vec3f(0.2126, 0.7152, 0.0722));
+    if brightness > 0.8 {
+        return color;
+    }
+    return vec4f(0.0, 0.0, 0.0, 1.0);
+}
+```
+
+The `dot` with `(0.2126, 0.7152, 0.0722)` computes perceptual luminance: the human eye is most sensitive to green, then red, then blue.
+
+**Stage 2 (Gaussian blur)**: blur the thresholded image so bright spots become soft glows. Gaussian blur is **separable**: you can split a 2D blur into two 1D passes (horizontal then vertical), which is much faster:
+
+```text
+Separable Gaussian Blur
+=======================
+
+  Bright       Horizontal       Vertical        Blurred
+  pixels  ──►  blur pass   ──►  blur pass  ──►  result
+               (1D, left        (1D, up
+               to right)        to down)
+
+  A 9x9 2D kernel = 81 samples per pixel
+  Two 9-wide 1D kernels = 18 samples per pixel
+  Same result, 4.5x fewer texture samples!
+```
+
+A single-direction blur shader samples several neighbouring texels with Gaussian weights:
+
+```wgsl
+@fragment
+fn blur_horizontal(in: FullscreenInput) -> @location(0) vec4<f32> {
+    let texel_size = 1.0 / f32(textureDimensions(source).x);
+    var result = vec4f(0.0);
+
+    // Gaussian weights: the centre tap plus 4 taps on each side (9 taps in total)
+    let weights = array<f32, 5>(0.227, 0.194, 0.122, 0.054, 0.016);
+    let offsets = array<f32, 5>(0.0, 1.0, 2.0, 3.0, 4.0);
+
+    for (var i = 0u; i < 5u; i = i + 1u) {
+        let offset = vec2f(offsets[i] * texel_size, 0.0);
+        result += textureSample(source, samp, in.uv + offset) * weights[i];
+        if i > 0u {
+            result += textureSample(source, samp, in.uv - offset) * weights[i];
+        }
+    }
+    return result;
+}
+```
+
+**Stage 3 (Composite)**: add the blurred bright areas back onto the original scene:
+
+```wgsl
+@fragment
+fn composite(in: FullscreenInput) -> @location(0) vec4<f32> {
+    let scene = textureSample(scene_texture, samp, in.uv);
+    let bloom = textureSample(bloom_texture, samp, in.uv);
+    return scene + bloom *
+        bloom_intensity;
+}
+```
+
+#### Other post-processing effects
+
+The render-to-texture pattern enables many effects:
+
+- **Colour grading**: adjust contrast, saturation, colour curves
+- **Vignette**: darken the edges of the screen
+- **Chromatic aberration**: split RGB channels with slight offsets
+- **Motion blur**: blend the current frame with previous frames
+- **Depth of field**: blur based on distance from a focal point (requires a depth buffer)
+- **Screen-space ambient occlusion (SSAO)**: approximate indirect shadows
+
+Each effect is a fragment shader running on a full-screen quad, sampling from the previous pass's texture.
+
+> **Key takeaway**: post-processing effects are implemented as multi-pass rendering. You render the scene to an off-screen texture, then process it through one or more full-screen fragment shader passes. Bloom is a classic example: threshold bright pixels, blur them with separable Gaussian passes, and composite the glow back onto the original. This pattern is the backbone of modern real-time visual effects.

 ---

 ### 17. Signed Distance Fields for font rendering

-🚧 This section is a stub. Full content tracked in [edu-1nox].
+Rendering crisp text at any size and rotation is surprisingly difficult with traditional bitmap fonts. **Signed Distance Fields** (SDFs) provide an elegant solution that gives resolution-independent, anti-aliased text with a single texture.
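
To make the idea of a distance field concrete, here is a minimal CPU-side sketch in Rust. It is illustrative only and not part of the course's renderer: the function names `circle_sdf` and `render_ascii`, the [-1, 1] coordinate range, and the use of a circle instead of a glyph are all assumptions chosen for this example. It uses the convention that distances are negative inside the shape and positive outside.

```rust
/// Signed distance from the point (x, y) to a circle centred at the origin.
/// Negative inside the circle, zero on its edge, positive outside.
fn circle_sdf(x: f32, y: f32, radius: f32) -> f32 {
    (x * x + y * y).sqrt() - radius
}

/// Draw the circle as ASCII art on a `size` x `size` grid covering [-1, 1] on both axes.
/// (Hypothetical helper for this example only.)
fn render_ascii(size: usize, radius: f32) -> Vec<String> {
    (0..size)
        .map(|row| {
            (0..size)
                .map(|col| {
                    // Map the centre of each cell into the range [-1, 1].
                    let x = (col as f32 + 0.5) / size as f32 * 2.0 - 1.0;
                    let y = (row as f32 + 0.5) / size as f32 * 2.0 - 1.0;
                    // The cell is inside the shape wherever the signed distance is negative.
                    if circle_sdf(x, y, radius) < 0.0 { '#' } else { '.' }
                })
                .collect::<String>()
        })
        .collect()
}
```

Calling `render_ascii(8, 0.75)` and `render_ascii(32, 0.75)` draws the same circle at two resolutions from one distance function. A font SDF relies on the same principle, except that the distances are sampled into a texture ahead of time rather than computed analytically.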
+
+#### The problem with bitmap fonts
+
+A bitmap font is a texture where each character is stored as a grid of pixels:
+
+```text
+  Bitmap "A" at 32px:      Zoomed in (pixelated):
+  ┌────────────┐           ┌──┬──┬──┬──┬──┬──┐
+  │     ██     │           │  │  │██│██│  │  │
+  │    █  █    │           │  │██│  │  │██│  │
+  │   ██████   │           │██│██│██│██│██│██│
+  │  █      █  │           │██│  │  │  │  │██│
+  │  █      █  │           │██│  │  │  │  │██│
+  └────────────┘           └──┴──┴──┴──┴──┴──┘
+
+  Looks fine at 32px.      Looks blocky at 128px.
+```
+
+If you scale the bitmap up, it becomes pixelated. If you scale it down, details are lost. Supporting several sizes well means storing several textures, which wastes memory.
+
+#### What is a Signed Distance Field?
+
+An SDF stores, for each texel, the **distance to the nearest edge** of the shape. Texels inside the shape have negative distances; texels outside have positive distances. The zero-crossing is the exact edge.
+
+```text
+  Simplified SDF for a diamond shape (whole-number, city-block distances):
+
+  +3 +2 +1  0 -1 -2 -1  0 +1 +2 +3
+  +2 +1  0 -1 -2 -3 -2 -1  0 +1 +2
+  +1  0 -1 -2 -3 -4 -3 -2 -1  0 +1
+   0 -1 -2 -3 -4 -5 -4 -3 -2 -1  0
+  +1  0 -1 -2 -3 -4 -3 -2 -1  0 +1
+  +2 +1  0 -1 -2 -3 -2 -1  0 +1 +2
+  +3 +2 +1  0 -1 -2 -1  0 +1 +2 +3
+
+  ← 0 is the edge. Negative = inside. Positive = outside.
+```
+
+The key insight is that this distance information describes the shape at **any resolution**. Distance varies smoothly across the texture, so interpolating between texels recovers the edge position far more accurately than interpolating the on/off pixels of a bitmap. To render, you simply check whether the distance is negative (inside, draw the character) or positive (outside, draw nothing).
+
+#### The smoothstep trick
+
+Hard thresholding (inside vs outside) gives jagged edges.
+The `smoothstep` function provides cheap anti-aliasing by creating a smooth transition in a narrow band around the edge:
+
+```wgsl
+@fragment
+fn sdf_text(in: VertexOutput) -> @location(0) vec4f {
+    // Sample the SDF texture. The stored value is the distance remapped to 0..1:
+    // 0.5 is the edge and values above 0.5 are inside the glyph.
+    let distance = textureSample(sdf_texture, samp, in.uv).r;
+
+    // smoothstep creates a smooth transition near the edge;
+    // the range (0.45, 0.55) is the anti-alias band
+    let alpha = smoothstep(0.45, 0.55, distance);
+
+    return vec4f(text_color.rgb, alpha);
+}
+```
+
+```text
+  smoothstep visualised:
+
+  alpha
+  1.0                        ╭───────────────────
+                            ╱
+                           ╱   ← smooth transition
+                          ╱      (anti-aliased edge)
+  0.0 ───────────────────╯
+          outside        edge        inside
+                   0.45   0.5   0.55
+```
+
+The width of the transition band can be adjusted. A narrower band gives sharper text; a wider band gives softer text. You can even derive the band width from how quickly the sampled distance changes between neighbouring pixels (using `fwidth`), which keeps the anti-aliased edge roughly one pixel wide at any scale:
+
+```wgsl
+let distance = textureSample(sdf_texture, samp, in.uv).r;
+let edge = 0.5;
+let aa_width = fwidth(distance) * 0.75;
+let alpha = smoothstep(edge - aa_width, edge + aa_width, distance);
+```
+
+#### Advantages of SDF text
+
+- **Resolution-independent**: one small texture (e.g., 64x64 per glyph) looks crisp across a wide range of display sizes
+- **Cheap anti-aliasing**: just `smoothstep`, with no multisampling needed
+- **Cheap effects**: outlines, drop shadows, and glow are easy to add by adjusting the distance threshold:
+
+```wgsl
+// Outline effect
+let outline_alpha = smoothstep(0.35, 0.40, distance); // outer edge of outline
+let fill_alpha = smoothstep(0.45, 0.55, distance);    // inner fill
+let color = mix(outline_color, fill_color, fill_alpha);
+let alpha = outline_alpha;
+```
+
+```text
+  SDF effects by varying the threshold:
+
+  ┌──────────────────────────────────────────┐
+  │  dist < 0.35   →  outside (transparent)  │
+  │  0.35 to 0.45  →  outline                │
+  │  dist > 0.45   →  fill (solid text)      │
+  └──────────────────────────────────────────┘
+```
+
+#### Generating SDF textures
+
+SDF textures are typically pre-generated offline. Tools include:
+
+- **msdfgen**: generates multi-channel SDFs that preserve sharp corners better
+- **Hiero** (LibGDX): generates SDF font atlases
+- **fontdue** (Rust crate): can generate SDF glyph bitmaps
+
+A conventional SDF texture is a single-channel (greyscale) image where 0.5 represents the edge, values above 0.5 are inside the glyph, and values below 0.5 are outside. This is the signed distance described earlier (negative inside, positive outside), flipped and remapped into the 0 to 1 range that a texture can store.
+
+> **Key takeaway**: signed distance fields store the distance to a shape's edge at each texel. This allows rendering crisp, anti-aliased shapes at any resolution from a small texture. The `smoothstep` function provides the anti-aliasing, and varying the distance threshold enables outlines, glows, and shadows. SDF-based text rendering is used in game engines, mapping applications, and anywhere resolution-independent text is needed.

 ---

 ### 18. Resources: Learn WGPU, Shadertoy, The Book of Shaders

-🚧 This section is a stub. Full content tracked in [edu-7m8d].
+This section collects useful resources for continuing your shader programming journey. Each resource approaches the topic from a different angle, so use them together for a well-rounded education.
+
+#### Tutorials and courses
+
+**[Learn WGPU](https://sotrh.github.io/learn-wgpu/)**: a comprehensive tutorial for wgpu in Rust. It walks through window setup, textures, camera systems, lighting, instancing, and more, with complete working code at each step. If you want to build on the exercises in this course, this is the natural next step.
+
+**[The Book of Shaders](https://thebookofshaders.com/)** by Patricio Gonzalez Vivo and Jen Lowe: a gentle, visual introduction to fragment shaders. It uses GLSL (not WGSL), but the concepts translate directly: noise functions, patterns, colour mixing, shapes, and animation. The interactive editor lets you experiment in real time. Excellent for building shader intuition.
+
+**[GPU Gems](https://developer.nvidia.com/gpugems/gpugems/contributors)**: NVIDIA's classic book series (available free online). Covers advanced topics like water rendering, subsurface scattering, shadow techniques, and GPU physics. The techniques are presented in HLSL/GLSL but the algorithms are API-agnostic.
+
+**[WebGPU Fundamentals](https://webgpufundamentals.org/)**: explains WebGPU concepts from the ground up with JavaScript examples. Since wgpu implements the WebGPU spec, the API concepts map directly to Rust. Useful for understanding the "why" behind API design decisions.
+
+#### Interactive playgrounds
+
+**[Shadertoy](https://www.shadertoy.com/)**: a web-based shader playground where you write fragment shaders (GLSL) and see results immediately. The community has created incredible effects: raymarched landscapes, fluid simulations, fractal zooms, entire games. Study other people's shaders to learn techniques; the compact format forces creative solutions. You can port Shadertoy ideas to WGSL in your wgpu projects.
+
+**[WGSL Playground](https://google.github.io/tour-of-wgsl/)**: Google's Tour of WGSL. An interactive introduction to the WGSL language with runnable examples. Good for quickly testing WGSL syntax.
+
+#### Specifications and references
+
+**[WebGPU Specification](https://www.w3.org/TR/webgpu/)**: the official W3C specification that wgpu implements. Dense but authoritative. Useful when you need to understand exact behaviour.
+
+**[WGSL Specification](https://www.w3.org/TR/WGSL/)**: the complete language specification for WGSL.
+Use it as the reference for built-in functions, types, the memory model, and grammar.
+
+**[wgpu documentation (docs.rs)](https://docs.rs/wgpu)**: Rust API documentation for the wgpu crate. Essential reference for looking up function signatures, enum variants, and descriptor fields.
+
+#### Advanced topics to explore
+
+Once you are comfortable with the basics covered in this course, here are directions to explore:
+
+- **3D rendering**: model-view-projection matrices, depth buffers, camera systems
+- **Lighting**: Phong, Blinn-Phong, physically-based rendering (PBR)
+- **Shadow mapping**: rendering depth from the light's perspective, shadow comparison
+- **Instancing**: drawing thousands of objects efficiently with a single draw call
+- **Raymarching**: rendering 3D scenes using signed distance functions (no triangles)
+- **Procedural generation**: noise functions (Perlin, Simplex) for terrain, textures, and clouds
+- **Deferred rendering**: separating geometry and lighting into different passes
+- **Skeletal animation**: vertex skinning with bone matrices
+
+#### Community
+
+- **[wgpu GitHub](https://github.com/gfx-rs/wgpu)**: the source code, issue tracker, and examples
+- **[WebGPU Matrix channel](https://matrix.to/#/#webgpu:matrix.org)**: real-time chat with the wgpu developers and community
+- **[r/rust_gamedev](https://www.reddit.com/r/rust_gamedev/)**: Rust game development community on Reddit, where wgpu projects are frequently shared
+
+> **Key takeaway**: shader programming is a vast field. Start with Learn WGPU for Rust-specific guidance, The Book of Shaders for visual intuition, and Shadertoy for inspiration. Keep the WGSL spec and wgpu docs.rs handy as references. The GPU programming community is active and welcoming; share your work and learn from others.