docs(edu): write markov §7 bigram text generator exercise [74be50]

Full Exercise 3 — Bigram Text Generator section replacing the stub:

- Setup instructions (cargo new, cargo add rand)
- Starter code skeleton with BigramModel struct
- Step 1: tokenize corpus with split_whitespace + to_lowercase
- Step 2: build transition table via windows(2) with count tracking
- Step 3: integer weighted sampling helper (sample_weighted)
- Step 4: generate implementation with dead-end handling
- Step 5: run the generator with the Alice corpus and seeded RNG
- Step 6: guidance for trying a larger Project Gutenberg corpus
- Full reference solution in a collapsible details block

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
5 months ago · f78c9ba637
parent e31316f863
commit f78c9ba637
1 changed files with 235 additions and 2 deletions
--- a/edu/markov.md
+++ b/edu/markov.md
@@ -623,9 +623,28 @@ Notice that "cat," "sat," and "on" each have only one possible successor — the

 ### 7. Exercise 3 — Bigram Text Generator

-**Goal:** Build a Markov chain over words from an input text. Each state is a single word; transitions are learned from the corpus. Generate novel sentences of a given length from a chosen seed word.
+**Goal:** Build a Markov chain over words from an input text. Each state is a single word; transitions are learned from the corpus. Generate novel word sequences of a given length from a chosen seed word.
+
+#### Setup
+
+You can extend the `random-walk` project from Exercise 2, or create a fresh binary:
+
+```sh
+cargo new bigram-text
+cd bigram-text
+cargo add rand
+```
+
+You will also need `use std::collections::HashMap;` at the top of `src/main.rs`.
+
+#### Starter Code
+
+Replace the contents of `src/main.rs` with the following skeleton. Do not change the struct layout or function signatures — your task is to fill in the `todo!()` bodies and write `main`.

 ```rust
+use rand::Rng;
+use std::collections::HashMap;
+
 struct BigramModel {
    /// transitions[word] = list of (next_word, weight) pairs
    transitions: HashMap<String, Vec<(String, usize)>>,
@@ -635,9 +654,223 @@ impl BigramModel {
    fn train(corpus: &str) -> Self { todo!() }
    fn generate(&self, seed: &str, length: usize, rng: &mut impl Rng) -> Vec<String> { todo!() }
 }
+
+fn main() { todo!() }
+```
+
+#### Step 1 — Tokenize the corpus
+
+Inside `train`, split the corpus into lowercase words:
+
+```rust
+let words: Vec<String> = corpus
+    .split_whitespace()
+    .map(|s| s.to_lowercase())
+    .collect();
+```
+
+`split_whitespace` handles any run of spaces, newlines, or tabs. Lowercasing ensures "The" and "the" are treated as the same state. For this exercise we leave punctuation attached to words (so "sat." and "sat" are distinct) to keep the code simple — stripping it is a good stretch goal.
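
The punctuation-stripping stretch goal could be approached along these lines. This is only a sketch: `tokenize_stripped` is a hypothetical helper name, not part of the exercise skeleton, and it assumes that trimming non-alphanumeric characters from both ends of each token is good enough (inner apostrophes such as in "don't" are kept).

```rust
/// Hypothetical helper (not part of the exercise skeleton): lowercase each
/// whitespace-separated token and trim punctuation from both ends.
fn tokenize_stripped(corpus: &str) -> Vec<String> {
    corpus
        .split_whitespace()
        .map(|s| {
            // Remove leading and trailing characters that are not letters or digits.
            s.trim_matches(|c: char| !c.is_alphanumeric())
                .to_lowercase()
        })
        // Tokens made only of punctuation (e.g. "--") become empty; drop them.
        .filter(|s| !s.is_empty())
        .collect()
}
```

With this variant, "sat." and "sat" collapse into one state, which increases the counts per transition at the cost of losing sentence-boundary information.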
+
+#### Step 2 — Build the transition table
+
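+Create an empty table, then iterate over every consecutive word pair using `windows(2)`. The first word is the current state; the second is the successor to record:
+
+```rust
+let mut transitions: HashMap<String, Vec<(String, usize)>> = HashMap::new();
+
+for window in words.windows(2) {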
+    let current = window[0].clone();
+    let next    = window[1].clone();
+    let entry = transitions.entry(current).or_default();
+    if let Some(pair) = entry.iter_mut().find(|(w, _)| w == &next) {
+        pair.1 += 1;
+    } else {
+        entry.push((next, 1));
+    }
+}
+```
+
+For each word, build a list of `(successor_word, count)` pairs. If the successor already appears in the list, increment its count; otherwise push a new entry with count 1. After scanning the whole corpus, every word maps to a weighted list of what can follow it — exactly one row of the bigram transition table.
+
+Return `BigramModel { transitions }`.
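
To check the table while developing, it can help to print a single row. The sketch below wraps the same counting loop in a free function and adds a formatter; `build_transitions` and `format_row` are hypothetical names used only for illustration, and the tiny corpus in the note that follows is made up.

```rust
use std::collections::HashMap;

/// Same counting loop as Step 2, wrapped in a free function for illustration.
fn build_transitions(corpus: &str) -> HashMap<String, Vec<(String, usize)>> {
    let words: Vec<String> = corpus
        .split_whitespace()
        .map(|s| s.to_lowercase())
        .collect();

    let mut transitions: HashMap<String, Vec<(String, usize)>> = HashMap::new();
    for window in words.windows(2) {
        let entry = transitions.entry(window[0].clone()).or_default();
        if let Some(pair) = entry.iter_mut().find(|(w, _)| w == &window[1]) {
            pair.1 += 1; // successor already recorded: bump its count
        } else {
            entry.push((window[1].clone(), 1)); // first sighting of this successor
        }
    }
    transitions
}

/// Format one row as "word -> next(count), next(count), ...".
fn format_row(transitions: &HashMap<String, Vec<(String, usize)>>, word: &str) -> String {
    match transitions.get(word) {
        // A word with no recorded successor has no row at all.
        None => format!("{word} -> (dead end)"),
        Some(row) => {
            let cells: Vec<String> = row.iter().map(|(w, n)| format!("{w}({n})")).collect();
            format!("{word} -> {}", cells.join(", "))
        }
    }
}
```

For the corpus `"The cat sat on the mat the cat ran"`, `format_row(&t, "the")` yields `the -> cat(2), mat(1)`, and `format_row(&t, "ran")` reports a dead end because "ran" is the final token.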
+
+#### Step 3 — Implement weighted random sampling
+
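+`generate` must sample the next word proportionally to its observed count in the successor list. Add a helper function outside `impl BigramModel` that performs **integer weighted sampling**. It works directly on the counts, so it needs no conversion to probabilities and avoids the rounding error that can build up when summing floating-point values. `gen_range` panics on an empty range, but `train` never stores an empty successor list, so `total` is always at least 1 here: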
+
+```rust
+fn sample_weighted<'a>(choices: &'a [(String, usize)], rng: &mut impl Rng) -> &'a str {
+    let total: usize = choices.iter().map(|(_, w)| w).sum();
+    let mut r = rng.gen_range(0..total);
+    for (word, weight) in choices {
+        if r < *weight {
+            return word;
+        }
+        r -= weight;
+    }
+    &choices.last().unwrap().0   // unreachable when total > 0; only satisfies the compiler
+}
+```
+
+The `r -= weight` trick avoids floating-point comparisons: draw a random index in `[0, total)`, then walk the list subtracting each bucket's weight until the remaining value lands inside the current entry. This is the discrete-distribution equivalent of the cumulative-probability walk used in Exercise 1, rewritten in integer arithmetic.
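
To see how each value of `r` maps to a bucket, the walk can be separated from the random draw. The sketch below is the same loop with `r` passed in explicitly; `pick` is a hypothetical name, and returning `None` for an out-of-range `r` is an assumption made here for illustration rather than something `sample_weighted` does.

```rust
/// The bucket walk from `sample_weighted`, with the draw `r` supplied by the
/// caller so the mapping can be inspected without an RNG.
/// Returns `None` if `r` is not smaller than the total weight.
fn pick(choices: &[(String, usize)], mut r: usize) -> Option<&str> {
    for (word, weight) in choices {
        if r < *weight {
            // `r` falls inside this entry's bucket.
            return Some(word.as_str());
        }
        // Skip past this bucket and keep walking.
        r -= weight;
    }
    None
}
```

With weights `cat: 2, mat: 1, dog: 3` (total 6), the draws 0 and 1 select "cat", 2 selects "mat", and 3 through 5 select "dog", so each word is chosen in proportion to its count.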
+
+#### Step 4 — Implement `BigramModel::generate`
+
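+`generate` starts from a seed word and appends one word at a time, so the result holds at most `length + 1` words (the seed plus up to `length` sampled words):
+
+1. Push the seed onto `output` and set `current = seed.to_string()`.
+2. Loop up to `length` times: look up `current` in `self.transitions`.
+3. If `current` is not in the transition table, it is a dead end: the word occurs only as the final token of the corpus (or not at all), so no successor was ever recorded. Stop early with `break`.
+4. Otherwise call `sample_weighted` on the successor list, push the sampled word onto `output`, and advance `current` to it.
+
+Because `train` lowercases the corpus, pass the seed in lowercase as well; a seed such as "Alice" would hit a dead end immediately.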
+
+```rust
+let mut output = vec![seed.to_string()];
+let mut current = seed.to_string();
+
+for _ in 0..length {
+    match self.transitions.get(&current) {
+        None => break,
+        Some(choices) => {
+            let next = sample_weighted(choices, rng).to_string();
+            output.push(next.clone());
+            current = next;
+        }
+    }
+}
+output
+```
+
+#### Step 5 — Run the generator
+
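+In `main`, train on a short corpus and generate several sequences from different seed words. Use a seeded RNG for reproducibility (add `use rand::{SeedableRng, rngs::SmallRng};`). If that import fails to resolve, note that rand 0.8 gates `SmallRng` behind the `small_rng` Cargo feature; enable it with `cargo add rand --features small_rng`. On rand 0.9, `gen_range` still compiles but emits a deprecation warning, since it was renamed to `random_range`: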
+
+```rust
+const CORPUS: &str =
+    "alice was beginning to get very tired of sitting by her sister on the \
+     bank and of having nothing to do once or twice she had peeped into the \
+     book her sister was reading but it had no pictures or conversations in \
+     it and what is the use of a book thought alice without pictures or \
+     conversations alice was beginning to get very tired of sitting";
+
+fn main() {
+    let mut rng = SmallRng::seed_from_u64(42);
+    let model = BigramModel::train(CORPUS);
+
+    for seed in &["alice", "the", "book"] {
+        let words = model.generate(seed, 15, &mut rng);
+        println!("{}", words.join(" "));
+    }
+}
+```
+
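+Illustrative output (the exact words depend on the RNG seed and the rand version). In a real run each line has 16 words, the seed word plus 15 sampled ones, because every word in this corpus has at least one successor; some of the lines below are shortened: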
+
+```
+alice was beginning to get very tired of sitting by her sister was beginning
+the use of a book her sister on the bank and of sitting by her sister
+book thought alice without pictures or conversations alice was beginning to get
+```
+
+Observe that every adjacent pair in the output appeared in the training corpus — each local transition is faithful to the source text. But over longer spans the topic shifts erratically, because the model has no memory beyond the immediately preceding word.
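
That observation can be checked mechanically. The following is a minimal sketch, assuming the same lowercase, whitespace-split tokenization as `train`; `all_bigrams_seen` is a hypothetical helper, not part of the exercise.

```rust
use std::collections::HashSet;

/// Hypothetical checker: true if every adjacent pair in `output` also occurs
/// as an adjacent pair in the lowercased, whitespace-split corpus.
fn all_bigrams_seen(corpus: &str, output: &[String]) -> bool {
    let words: Vec<String> = corpus
        .split_whitespace()
        .map(|s| s.to_lowercase())
        .collect();

    // Collect every bigram that appears in the corpus.
    let seen: HashSet<(&str, &str)> = words
        .windows(2)
        .map(|w| (w[0].as_str(), w[1].as_str()))
        .collect();

    // Every bigram of the generated sequence must be one of them.
    output
        .windows(2)
        .all(|w| seen.contains(&(w[0].as_str(), w[1].as_str())))
}
```

Calling `all_bigrams_seen(CORPUS, &words)` on any sequence returned by `generate` should give `true`; a `false` would point to a bug in the transition table or the sampler.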
+
+#### Step 6 — Try a larger corpus
+
+The small Alice snippet produces repetitive output because many words have only one possible successor in a short text. To see genuine branching behaviour, download the first chapter of *Alice's Adventures in Wonderland* from Project Gutenberg (plain text, freely available) and feed the full text to `BigramModel::train`. With more data:
+
+- Common words like "the" and "and" branch to dozens of successors.
+- Generated sequences copy shorter verbatim runs from the source, because more words offer a choice of successor.
+- The difference between the single-word context used here and the two-word context in Exercise 4 becomes immediately obvious.
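
One way to quantify the branching described in the list above is to count how many words have exactly one distinct successor. This is a rough sketch; `branching_stats` is a hypothetical diagnostic that is not part of the exercise, and it uses the same tokenization as `train`.

```rust
use std::collections::{HashMap, HashSet};

/// Hypothetical diagnostic: returns
/// (number of distinct words that have at least one successor,
///  number of those words with exactly one distinct successor).
fn branching_stats(corpus: &str) -> (usize, usize) {
    let words: Vec<String> = corpus
        .split_whitespace()
        .map(|s| s.to_lowercase())
        .collect();

    // Map each word to the set of distinct words that follow it.
    let mut successors: HashMap<&str, HashSet<&str>> = HashMap::new();
    for w in words.windows(2) {
        successors
            .entry(w[0].as_str())
            .or_default()
            .insert(w[1].as_str());
    }

    let single = successors.values().filter(|s| s.len() == 1).count();
    (successors.len(), single)
}
```

Running it on both the short snippet and the full chapter gives a direct comparison: the smaller the share of single-successor words, the more often generation faces a real choice.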
+
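+You can embed the text file directly in the binary with Rust's `include_str!` macro. The path is resolved relative to the source file that contains the macro call, so from `src/main.rs` the example below points at `corpus/alice_ch1.txt` in the project root: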
+
+```rust
+const CORPUS: &str = include_str!("../corpus/alice_ch1.txt");
+```
+
+#### Reference Solution
+
+<details>
+<summary>Show full solution</summary>
+
+```rust
+use rand::{Rng, SeedableRng, rngs::SmallRng};
+use std::collections::HashMap;
+
+struct BigramModel {
+    /// transitions[word] = list of (next_word, weight) pairs
+    transitions: HashMap<String, Vec<(String, usize)>>,
+}
+
+impl BigramModel {
+    fn train(corpus: &str) -> Self {
+        let words: Vec<String> = corpus
+            .split_whitespace()
+            .map(|s| s.to_lowercase())
+            .collect();
+
+        let mut transitions: HashMap<String, Vec<(String, usize)>> = HashMap::new();
+
+        for window in words.windows(2) {
+            let current = window[0].clone();
+            let next    = window[1].clone();
+            let entry = transitions.entry(current).or_default();
+            if let Some(pair) = entry.iter_mut().find(|(w, _)| w == &next) {
+                pair.1 += 1;
+            } else {
+                entry.push((next, 1));
+            }
+        }
+
+        BigramModel { transitions }
+    }
+
+    fn generate(&self, seed: &str, length: usize, rng: &mut impl Rng) -> Vec<String> {
+        let mut output = vec![seed.to_string()];
+        let mut current = seed.to_string();
+
+        for _ in 0..length {
+            match self.transitions.get(&current) {
+                None => break,
+                Some(choices) => {
+                    let next = sample_weighted(choices, rng).to_string();
+                    output.push(next.clone());
+                    current = next;
+                }
+            }
+        }
+        output
+    }
+}
+
+fn sample_weighted<'a>(choices: &'a [(String, usize)], rng: &mut impl Rng) -> &'a str {
+    let total: usize = choices.iter().map(|(_, w)| w).sum();
+    let mut r = rng.gen_range(0..total);
+    for (word, weight) in choices {
+        if r < *weight {
+            return word;
+        }
+        r -= weight;
+    }
+    &choices.last().unwrap().0
+}
+
+const CORPUS: &str =
+    "alice was beginning to get very tired of sitting by her sister on the \
+     bank and of having nothing to do once or twice she had peeped into the \
+     book her sister was reading but it had no pictures or conversations in \
+     it and what is the use of a book thought alice without pictures or \
+     conversations alice was beginning to get very tired of sitting";
+
+fn main() {
+    let mut rng = SmallRng::seed_from_u64(42);
+    let model = BigramModel::train(CORPUS);
+
+    for seed in &["alice", "the", "book"] {
+        let words = model.generate(seed, 15, &mut rng);
+        println!("{}", words.join(" "));
+    }
+}
 ```

-> 🚧 This section is a stub — see nbd ticket `74be50`
+</details>

 ---