# Markov Chain Self-Guided Course

This document is a self-guided course on Markov chains. It is organized into four parts: conceptual foundations, first Rust implementations, text generation, and deeper theory. Each section is either a reading lesson or a hands-on Rust programming exercise. Sections marked 🚧 are stubs whose full content is tracked in an `nbd` ticket — follow the ticket ID to find the detailed learning objectives and instructions.

---

## Table of Contents

**Part 1 — Foundations**

1. [What Is a Markov Chain?](#1-what-is-a-markov-chain)
2. [States and Transitions](#2-states-and-transitions)
3. [Transition Probabilities and Matrices](#3-transition-probabilities-and-matrices)

**Part 2 — First Implementation**

4. [Exercise 1 — Weather Model](#4-exercise-1--weather-model)
5. [Exercise 2 — Simulating a Random Walk](#5-exercise-2--simulating-a-random-walk)

**Part 3 — Text Generation**

6. [Text Generation with Markov Chains](#6-text-generation-with-markov-chains)
7. [Exercise 3 — Bigram Text Generator](#7-exercise-3--bigram-text-generator)
8. [Exercise 4 — N-gram Generalization](#8-exercise-4--n-gram-generalization)

**Part 4 — Deeper Concepts**

9. [Stationary Distributions](#9-stationary-distributions)
10. [Applications and Further Reading](#10-applications-and-further-reading)

---

## Part 1 — Foundations

### 1. What Is a Markov Chain?

A Markov chain is a mathematical model describing a sequence of events where the probability of each event depends only on the state reached in the previous event — not on the full history. This "memoryless" property is called the **Markov property**. You will learn where Markov chains appear in the real world and develop intuition for why the memoryless property is both a useful simplification and a meaningful assumption.

> 🚧 This section is a stub — see nbd ticket `fbf323`

---

### 2. States and Transitions

Every Markov chain consists of a finite (or countably infinite) set of **states** and a rule for how the system moves between them. This section formalizes the vocabulary: states, transitions, directed graphs, and the notion of a chain's **step** or **time**. By the end you will be able to draw a state-transition diagram for a simple real-world system.

> 🚧 This section is a stub — see nbd ticket `738be2`

---

### 3. Transition Probabilities and Matrices

The rules governing how a Markov chain moves are captured in a **transition matrix** *P*, where *P[i][j]* is the probability of moving from state *i* to state *j* in one step. This section covers how to construct *P*, the constraints it must satisfy (rows sum to 1), and how to use matrix multiplication to compute multi-step probabilities.
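
As an illustration of the row-sum constraint and of multi-step probabilities, here is a minimal sketch in plain Rust. The 2×2 matrix and the helper names `is_stochastic` and `multiply` are assumptions made for this example, not part of the course material.

```rust
/// Returns true if the matrix is square, every entry is a probability,
/// and every row sums to 1 (within a small floating-point tolerance).
fn is_stochastic(p: &[Vec<f64>]) -> bool {
    p.iter().all(|row| {
        row.len() == p.len()
            && row.iter().all(|&x| (0.0..=1.0).contains(&x))
            && (row.iter().sum::<f64>() - 1.0).abs() < 1e-9
    })
}

/// Multiplies two square matrices of the same size.
fn multiply(a: &[Vec<f64>], b: &[Vec<f64>]) -> Vec<Vec<f64>> {
    let n = a.len();
    let mut out = vec![vec![0.0; n]; n];
    for i in 0..n {
        for j in 0..n {
            for k in 0..n {
                out[i][j] += a[i][k] * b[k][j];
            }
        }
    }
    out
}

fn main() {
    // Hypothetical two-state chain; the numbers are made up for illustration.
    let p = vec![vec![0.9, 0.1], vec![0.5, 0.5]];
    assert!(is_stochastic(&p));

    // P multiplied by itself holds the two-step probabilities.
    let p2 = multiply(&p, &p);
    println!("P(state 0 -> state 1 in two steps) = {:.2}", p2[0][1]);
}
```

With these hypothetical numbers, the two-step probability of going from state 0 to state 1 is 0.9 × 0.1 + 0.1 × 0.5 = 0.14: either stay in state 0 and then move, or move first and then stay in state 1.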

> 🚧 This section is a stub — see nbd ticket `44ebe7`

---

## Part 2 — First Implementation

### 4. Exercise 1 — Weather Model

**Goal:** Build a two-state Markov chain in Rust that models daily weather as either `Sunny` or `Rainy`, driven by a transition matrix, and simulate 30 days of weather.

```rust
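use rand::Rng; // from the `rand` crate

#[derive(Debug, Clone, Copy, PartialEq)]
enum Weather { Sunny, Rainy }

struct WeatherChain {
    /// transition[current][next] = probability of moving from `current` to `next`
    /// (rows and columns indexed as Sunny = 0, Rainy = 1)
    transition: [[f64; 2]; 2],
}

impl WeatherChain {
    /// Samples the next day's weather from the row for `current`.
    fn step(&self, current: Weather, rng: &mut impl Rng) -> Weather { todo!() }
    /// Runs the chain from `start` for `steps` steps and returns the visited states.
    fn simulate(&self, start: Weather, steps: usize, rng: &mut impl Rng) -> Vec<Weather> { todo!() }
}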
```

> 🚧 This section is a stub — see nbd ticket `257a2a`

---

### 5. Exercise 2 — Simulating a Random Walk

**Goal:** Implement a one-dimensional random walk on the integers (states −*N* … +*N*) with reflecting boundaries, then measure the empirical distribution of positions after *T* steps.

```rust
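use rand::Rng; // from the `rand` crate
use std::collections::HashMap;

struct RandomWalk {
    min: i32,
    max: i32,
    /// prob_right[(i - min) as usize] = probability of stepping right from position i
    prob_right: Vec<f64>,
}

impl RandomWalk {
    /// Takes one step from `pos`, reflecting at `min` and `max`.
    fn step(&self, pos: i32, rng: &mut impl Rng) -> i32 { todo!() }
    /// Runs `trials` walks of `steps` steps from `start` and counts the final positions.
    fn histogram(&self, start: i32, steps: usize, trials: usize, rng: &mut impl Rng)
        -> HashMap<i32, usize> { todo!() }
}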
```

> 🚧 This section is a stub — see nbd ticket `64826a`

---

## Part 3 — Text Generation

### 6. Text Generation with Markov Chains

Text can be modeled as a Markov chain where each state is a word (or sequence of words) and transitions represent which words tend to follow. This section explains the **bigram** model, why it produces surprisingly coherent short sequences, and what its limitations reveal about the relationship between statistical models and language. You write no code in this section; it prepares you for Exercises 3 and 4.

**Words as states.** To model text as a Markov chain, treat each word as a state and the act of writing the next word as a transition. Given the word "cat," what word comes next? A Markov model answers that question by consulting statistics drawn from a training corpus: it scans every occurrence of "cat" and records which words followed it, then samples from that empirical distribution. The result is a sequence of words generated one step at a time, each word chosen probabilistically based only on the current word — not on the full sentence or paragraph that came before. This is the Markov property applied to language.
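
The sampling step can be sketched as follows. This is an optional illustration, not something you need to write yet. The function name `pick`, the successor counts, and the choice to pass the random number in as a parameter `r` are all assumptions made so that the example runs without any external crate; a real generator would draw `r` from a random number generator.

```rust
/// Picks a successor from `(word, count)` pairs in proportion to the counts.
/// `r` should lie in `0..total`, where `total` is the sum of the counts.
/// Returns `None` if `r` is out of range.
fn pick<'a>(successors: &[(&'a str, usize)], mut r: usize) -> Option<&'a str> {
    for &(word, count) in successors {
        if r < count {
            return Some(word);
        }
        r -= count;
    }
    None
}

fn main() {
    // Hypothetical counts: "cat" was followed by "sat" 3 times and by "ran" once.
    let after_cat = [("sat", 3), ("ran", 1)];
    let total: usize = after_cat.iter().map(|&(_, count)| count).sum();

    // Walking through every possible r shows each word's share of the outcomes.
    for r in 0..total {
        println!("r = {} -> {:?}", r, pick(&after_cat, r));
    }
}
```

Three of the four possible values of `r` select "sat" and one selects "ran", matching the 3:1 ratio of the counts. Drawing `r` uniformly at random therefore samples from the empirical distribution.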

**Bigrams and transition tables.** A **bigram** is an ordered pair of adjacent words. To build a bigram model, scan the corpus left to right and record every consecutive word pair. Count how many times each pair appears, then for each word *w* express those counts as a probability distribution over successor words. This distribution forms one row of the **transition table**: a lookup from every word to a weighted list of what can follow it. The table is learned entirely from data — no grammar rules, no meaning, just co-occurrence statistics.

**Worked example.** Consider this two-sentence corpus:

> *"the cat sat on the mat. the cat sat on the hat."*

Scanning the corpus (treating each sentence as a separate word sequence, so that no bigram spans a period) yields the following bigrams and their counts. Dividing each count by the total count for its current word gives the transition probability:

| Current word | Next word | Count | Probability |
|---|---|---|---|
| the | cat | 2 | 0.50 |
| the | mat | 1 | 0.25 |
| the | hat | 1 | 0.25 |
| cat | sat | 2 | 1.00 |
| sat | on  | 2 | 1.00 |
| on  | the | 2 | 1.00 |

Notice that "cat," "sat," and "on" each have only one possible successor: the small corpus left them no choice. Only "the" branches: half the time it is followed by "cat," a quarter of the time by "mat," and a quarter by "hat." Starting from "the," one plausible generated sequence is *the → cat → sat → on → the → hat*. The chain stops there, because "hat" (like "mat") only ever ends a sentence in this corpus and so has no recorded successor.
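
The table above can be reproduced with a short program. The sketch below is one possible way to do the counting, using only the standard library. The function name `count_bigrams` and the choice of `BTreeMap` (which keeps the output in alphabetical order) are assumptions for illustration; Exercise 3 asks you to design your own version.

```rust
use std::collections::BTreeMap;

/// Counts bigrams sentence by sentence, so no pair spans a period.
/// Returns a map: current word -> (next word -> count).
fn count_bigrams(corpus: &str) -> BTreeMap<String, BTreeMap<String, usize>> {
    let mut table: BTreeMap<String, BTreeMap<String, usize>> = BTreeMap::new();
    for sentence in corpus.split('.') {
        let words: Vec<&str> = sentence.split_whitespace().collect();
        // Each window of two adjacent words is one bigram.
        for pair in words.windows(2) {
            *table
                .entry(pair[0].to_string())
                .or_default()
                .entry(pair[1].to_string())
                .or_default() += 1;
        }
    }
    table
}

fn main() {
    let table = count_bigrams("the cat sat on the mat. the cat sat on the hat.");
    for (current, successors) in &table {
        // The row total is the number of times `current` had a successor.
        let total: usize = successors.values().sum();
        for (next, count) in successors {
            let probability = *count as f64 / total as f64;
            println!("{} -> {}: count {}, probability {:.2}", current, next, count, probability);
        }
    }
}
```

The words "mat" and "hat" never appear as keys in the resulting map, which is how the dead ends show up in the data.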

**Why the text sounds strange.** Short stretches of output are surprisingly readable: every adjacent pair the model produces appeared in the training data, so each local transition is both grammatically and semantically plausible. The problem surfaces over longer spans. Because the model has no memory beyond the current word, it cannot maintain a topic, complete a thought, or avoid contradictions introduced a few sentences back. The result resembles text written by someone who half-knows the language: each individual step looks right, but the destination keeps shifting without purpose.

**Order-*n* chains.** A bigram model is an *order-1* Markov chain — the next state depends on exactly one previous word. An *order-2* model (a **trigram** model) conditions on the last two words; order *n* uses the last *n* words as context. Increasing *n* brings sharply improved local coherence — at *n* = 3 or 4 the output begins to reproduce full phrases from the corpus verbatim — but at the cost of novelty. A high-order model has likely seen most of its context only once, so it often has no real choice but to copy the source text rather than recombine it. The right tradeoff depends on corpus size: larger corpora can support higher *n* without the model simply memorising its training data.
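
To make the idea of a context concrete, the sketch below builds the transition counts for an arbitrary order *n* on the two-sentence corpus from the worked example. It is one possible approach under stated assumptions: the name `count_contexts`, the sentence splitting on periods, and the use of `BTreeMap` are choices made for this illustration, and Exercise 4 leaves the actual design to you.

```rust
use std::collections::BTreeMap;

/// Maps each run of `n` consecutive words (the context) to the counts of
/// the words that followed it. Sentences are split on periods.
fn count_contexts(corpus: &str, n: usize) -> BTreeMap<Vec<String>, BTreeMap<String, usize>> {
    let mut table: BTreeMap<Vec<String>, BTreeMap<String, usize>> = BTreeMap::new();
    for sentence in corpus.split('.') {
        let words: Vec<String> = sentence.split_whitespace().map(String::from).collect();
        // Each window holds n context words followed by one successor.
        for window in words.windows(n + 1) {
            let (context, next) = window.split_at(n);
            *table
                .entry(context.to_vec())
                .or_default()
                .entry(next[0].clone())
                .or_default() += 1;
        }
    }
    table
}

fn main() {
    let corpus = "the cat sat on the mat. the cat sat on the hat.";
    for n in 1..=3 {
        let table = count_contexts(corpus, n);
        println!("order {}: {} distinct contexts", n, table.len());
        for (context, successors) in &table {
            println!("  {:?} -> {:?}", context, successors);
        }
    }
}
```

On this corpus the order-2 table has a single branching context, `["on", "the"]`, which can be followed by "mat" or "hat"; every other context has exactly one successor.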

---

### 7. Exercise 3 — Bigram Text Generator

**Goal:** Build a Markov chain over words from an input text. Each state is a single word; transitions are learned from the corpus. Generate novel sentences of a given length from a chosen seed word.

```rust
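use rand::Rng; // from the `rand` crate
use std::collections::HashMap;

struct BigramModel {
    /// transitions[word] = list of (next_word, weight) pairs,
    /// where weight is the number of times next_word followed word
    transitions: HashMap<String, Vec<(String, usize)>>,
}

impl BigramModel {
    /// Builds the transition table from the word pairs in `corpus`.
    fn train(corpus: &str) -> Self { todo!() }
    /// Generates up to `length` words, starting from `seed`.
    fn generate(&self, seed: &str, length: usize, rng: &mut impl Rng) -> Vec<String> { todo!() }
}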
```

> 🚧 This section is a stub — see nbd ticket `74be50`

---

### 8. Exercise 4 — N-gram Generalization

**Goal:** Generalize the bigram model to an order-*n* model (commonly called an *n*-gram model) where each state is a window of *n* consecutive words, so that *n* = 1 reproduces the bigram model. Compare the output quality for *n* = 1, 2, 3, and 4 on the same corpus.

```rust
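use rand::Rng; // from the `rand` crate
use std::collections::HashMap;

struct NgramModel {
    /// number of words in each state (the context length)
    n: usize,
    /// transitions[context] = list of (next_word, weight) pairs
    transitions: HashMap<Vec<String>, Vec<(String, usize)>>,
}

impl NgramModel {
    /// Builds the transition table from every window of `n` words and its successor.
    fn train(corpus: &str, n: usize) -> Self { todo!() }
    /// Generates up to `length` words, starting from the `n`-word `seed`.
    fn generate(&self, seed: Vec<String>, length: usize, rng: &mut impl Rng) -> Vec<String> { todo!() }
}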
```

> 🚧 This section is a stub — see nbd ticket `1f995a`

---

## Part 4 — Deeper Concepts

### 9. Stationary Distributions

A **stationary distribution** *π* is a probability distribution over states that is unchanged by one step of the chain: *π P = π*. This section covers how to find stationary distributions analytically (for small chains) and via power iteration. It also explains when a stationary distribution exists, when it is unique, and when the chain converges to it, introducing the concepts of **irreducibility** and **aperiodicity**.
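
As a preview of power iteration, here is a minimal sketch in plain Rust. The 2×2 matrix, the function names `step` and `power_iteration`, the uniform starting distribution, and the stopping rule are all assumptions chosen for illustration.

```rust
/// Applies one step of the chain to a distribution: returns pi * P.
fn step(pi: &[f64], p: &[Vec<f64>]) -> Vec<f64> {
    let n = pi.len();
    (0..n)
        .map(|j| (0..n).map(|i| pi[i] * p[i][j]).sum::<f64>())
        .collect()
}

/// Repeats `step` until the distribution changes by less than `tol`
/// (summed absolute difference) or `max_iters` is reached.
fn power_iteration(p: &[Vec<f64>], tol: f64, max_iters: usize) -> Vec<f64> {
    let n = p.len();
    let mut pi = vec![1.0 / n as f64; n]; // start from the uniform distribution
    for _ in 0..max_iters {
        let next = step(&pi, p);
        let change: f64 = pi.iter().zip(&next).map(|(a, b)| (a - b).abs()).sum();
        pi = next;
        if change < tol {
            break;
        }
    }
    pi
}

fn main() {
    // Hypothetical two-state chain; the numbers are made up for illustration.
    let p = vec![vec![0.9, 0.1], vec![0.5, 0.5]];
    let pi = power_iteration(&p, 1e-12, 10_000);
    println!("pi ≈ [{:.4}, {:.4}]", pi[0], pi[1]);
}
```

For this hypothetical matrix the answer can be checked by hand: *π P = π* gives 0.1 *π*₀ = 0.5 *π*₁, and together with *π*₀ + *π*₁ = 1 this yields *π* = (5/6, 1/6).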

> 🚧 This section is a stub — see nbd ticket `68ee16`

---

### 10. Applications and Further Reading

Markov chains appear throughout computer science and mathematics: PageRank, MCMC sampling, hidden Markov models, reinforcement learning, and more. This section surveys these applications at a high level and points to books, papers, and courses for learners who want to go deeper on any thread.

> 🚧 This section is a stub — see nbd ticket `5994a6`