vibed/edu/markov.md

# Markov Chain Self-Guided Course

This document is a self-guided course on Markov chains. It is organized into four parts: conceptual foundations, first Rust implementations, text generation, and deeper theory. Each section is either a reading lesson or a hands-on Rust programming exercise. Sections marked 🚧 are stubs whose full content is tracked in an `nbd` ticket — follow the ticket ID to find the detailed learning objectives and instructions.

---

## Table of Contents

**Part 1 — Foundations**

1. [What Is a Markov Chain?](#1-what-is-a-markov-chain)
2. [States and Transitions](#2-states-and-transitions)
3. [Transition Probabilities and Matrices](#3-transition-probabilities-and-matrices)

**Part 2 — First Implementation**

4. [Exercise 1 — Weather Model](#4-exercise-1--weather-model)
5. [Exercise 2 — Simulating a Random Walk](#5-exercise-2--simulating-a-random-walk)

**Part 3 — Text Generation**

6. [Text Generation with Markov Chains](#6-text-generation-with-markov-chains)
7. [Exercise 3 — Bigram Text Generator](#7-exercise-3--bigram-text-generator)
8. [Exercise 4 — N-gram Generalization](#8-exercise-4--n-gram-generalization)

**Part 4 — Deeper Concepts**

9. [Stationary Distributions](#9-stationary-distributions)
10. [Applications and Further Reading](#10-applications-and-further-reading)

---

## Part 1 — Foundations

### 1. What Is a Markov Chain?

A Markov chain is a mathematical model describing a sequence of events where the probability of each event depends only on the state reached in the previous event — not on the full history. This "memoryless" property is called the **Markov property**. You will learn where Markov chains appear in the real world and develop intuition for why the memoryless property is both a useful simplification and a meaningful assumption.

> 🚧 This section is a stub — see nbd ticket `fbf323`

---

### 2. States and Transitions

Every Markov chain consists of a finite (or countably infinite) set of **states** and a rule for how the system moves between them. This section formalizes the vocabulary: states, transitions, directed graphs, and the notion of a chain's **step** or **time**. By the end you will be able to draw a state-transition diagram for a simple real-world system.

> 🚧 This section is a stub — see nbd ticket `738be2`

---

### 3. Transition Probabilities and Matrices

The rules governing how a Markov chain moves are captured in a **transition matrix** *P*, where *P[i][j]* is the probability of moving from state *i* to state *j* in one step. This section covers how to construct *P*, the constraints it must satisfy (rows sum to 1), and how to use matrix multiplication to compute multi-step probabilities.

**Defining the transition matrix.** Label the states 0, 1, …, *n*−1. The transition matrix *P* is an *n* × *n* array where entry *P[i][j]* gives the probability of moving to state *j* on the very next step, given that you are currently in state *i*. Because these are probabilities of mutually exclusive, exhaustive outcomes (from state *i* you must go *somewhere*), every row must sum to exactly 1 and every entry must lie between 0 and 1 inclusive. A matrix satisfying these two constraints is called a **stochastic matrix** (or row-stochastic matrix). Each row is itself a probability distribution over the next state.

**The stochastic-matrix constraints, stated precisely.** For an *n*-state chain:

```
P[i][j] >= 0        for all i, j
sum_j P[i][j] = 1   for every row i
```

A zero entry means the transition is impossible; a one means it is certain. Columns have no such constraint — column sums need not equal 1.

**Multi-step probabilities via matrix multiplication.** Suppose you start in state *i* at time 0. After one step the probability of being in state *j* is *P[i][j]*. After *two* steps you pass through some intermediate state *k*, so:

```
P^2[i][j] = sum_k  P[i][k] * P[k][j]
```

This is exactly the (*i*, *j*) entry of *P* × *P* = *P*². In general, the probability of going from state *i* to state *j* in exactly *k* steps is the (*i*, *j*) entry of *P*^k. If you encode your current uncertainty as a **row vector** π₀ — a probability distribution over all states — then after *k* steps your updated distribution is:

```
π_k = π₀ · P^k
```

Each right-multiplication by *P* advances the clock one tick and blends probabilities according to the transition rules.

**Worked example — a two-state weather chain.** Consider a model with two states: *Sunny* (state 0) and *Rainy* (state 1). From data:

- If today is Sunny, tomorrow is Sunny with probability 0.8 and Rainy with probability 0.2.
- If today is Rainy, tomorrow is Sunny with probability 0.4 and Rainy with probability 0.6.

Writing this as a matrix:

```
        Sunny  Rainy
Sunny [  0.8    0.2 ]
Rainy [  0.4    0.6 ]
```

Row 0 sums to 1.0; row 1 sums to 1.0. All entries are non-negative. *P* is a valid stochastic matrix.

*One step.* Start with certainty in Sunny: π₀ = [1, 0].

```
π₁ = π₀ · P = [1, 0] · [[0.8, 0.2], [0.4, 0.6]] = [0.8, 0.2]
```

Tomorrow: 80 % Sunny, 20 % Rainy.

*Two steps.* Apply *P* again:

```
π₂ = π₁ · P = [0.8, 0.2] · [[0.8, 0.2], [0.4, 0.6]]
             = [0.8*0.8 + 0.2*0.4,  0.8*0.2 + 0.2*0.6]
             = [0.72, 0.28]
```

Equivalently, compute *P*² once and read off the row for state 0:

```
P^2 = [[0.8*0.8 + 0.2*0.4,  0.8*0.2 + 0.2*0.6],
       [0.4*0.8 + 0.6*0.4,  0.4*0.2 + 0.6*0.6]]
    = [[0.72, 0.28],
       [0.56, 0.44]]
```

*P*²[0] = [0.72, 0.28] — matching the step-by-step result. Starting from Rainy gives *P*²[1] = [0.56, 0.44]; the two rows are already noticeably closer to each other than the original [0.8, 0.2] vs [0.4, 0.6]. As *k* grows, both rows converge toward the same limiting vector — the **stationary distribution** that Section 9 analyses in depth. The matrix-multiplication perspective makes this convergence precise and computable.

---

## Part 2 — First Implementation

### 4. Exercise 1 — Weather Model

**Goal:** Build a two-state Markov chain in Rust that models daily weather as either `Sunny` or `Rainy`, driven by a transition matrix, and simulate 30 days of weather.

#### Setup

Create a new Cargo binary project and add the `rand` crate:

```sh
cargo new weather-chain
cd weather-chain
cargo add rand
```

#### Starter Code

Replace the contents of `src/main.rs` with the following skeleton. Do not change the struct layout or function signatures — your task is to fill in the `todo!()` bodies and write `main`.

```rust
use rand::Rng;

#[derive(Debug, Clone, Copy, PartialEq)]
enum Weather { Sunny, Rainy }

struct WeatherChain {
    /// transition[current][next] = probability
    transition: [[f64; 2]; 2],
}

impl WeatherChain {
    fn step(&self, current: Weather, rng: &mut impl Rng) -> Weather { todo!() }
    fn simulate(&self, start: Weather, steps: usize, rng: &mut impl Rng) -> Vec<Weather> { todo!() }
}

fn main() { todo!() }
```

#### Step 1 — Index conversion

`step` needs to index into `self.transition` using the current state, and later convert an index back to a `Weather` value. Add two helper methods to the `Weather` enum:

- `fn index(self) -> usize` — returns `0` for `Sunny`, `1` for `Rainy`
- `fn from_index(i: usize) -> Self` — returns `Sunny` for `0`, `Rainy` for anything else

A simple `match` expression handles both. These are the only tools you need to bridge the enum and the matrix.

#### Step 2 — Implement `WeatherChain::step`

`step` must sample the next state from the probability row for the current state. The technique is a **cumulative-probability walk**:

1. Retrieve the transition row: `let row = self.transition[current.index()];`
2. Draw a uniform float in [0, 1): `let r: f64 = rng.gen();`
3. Walk the row, accumulating probability. When the running total first exceeds `r`, return that state.

For the two-state case this reduces to a single comparison — if `r < row[0]` return `Sunny`, otherwise return `Rainy` — but implementing the loop works for any number of states and is worth practising.

Add a fallback `Weather::from_index(row.len() - 1)` after the loop to satisfy the compiler; floating-point rounding can in rare cases leave the loop without returning.

#### Step 3 — Implement `WeatherChain::simulate`

`simulate` runs the chain for `steps` transitions, collecting every state visited including the start:

1. Allocate a `Vec` with capacity `steps + 1`.
2. Push `start` and set `current = start`.
3. Loop `steps` times: call `self.step(current, rng)`, update `current`, and push.
4. Return the `Vec`.

The returned slice will have length `steps + 1` (the initial state plus one state per step).

#### Step 4 — Run the simulation

In `main`, create a `WeatherChain` with the two-state matrix from Section 3:

```
Sunny [0.8, 0.2]
Rainy [0.4, 0.6]
```

Seed a repeatable RNG with `rand::rngs::SmallRng::seed_from_u64(42)` (add `use rand::SeedableRng;`). Simulate 30 steps starting from `Sunny`, print the resulting sequence, and count how many days were sunny vs rainy.

Expected output structure (exact numbers vary by seed):

```
[Sunny, Sunny, Rainy, Sunny, ...]
Sunny days: 21 (67.7%)
Rainy days: 10 (32.3%)
```

#### Step 5 — Compare to the stationary distribution

The **stationary distribution** π satisfies π = πP, meaning once the chain reaches it, the distribution no longer changes. For this matrix, solve the two equations:

```
π₀ = 0.8·π₀ + 0.4·π₁
π₁ = 0.2·π₀ + 0.6·π₁
π₀ + π₁ = 1
```

The first equation simplifies to `0.2·π₀ = 0.4·π₁`, giving `π₀ = 2·π₁`. Substituting into the normalisation constraint: `π₀ = 2/3 ≈ 66.7%`, `π₁ = 1/3 ≈ 33.3%`.

Print the stationary percentages alongside your simulated counts. With only 31 data points the match will be rough; re-run with 1 000 or 10 000 steps to see the empirical frequencies converge.

#### Reference Solution

<details>
<summary>Show full solution</summary>

```rust
use rand::{Rng, SeedableRng, rngs::SmallRng};

#[derive(Debug, Clone, Copy, PartialEq)]
enum Weather { Sunny, Rainy }

impl Weather {
    fn index(self) -> usize {
        match self {
            Weather::Sunny => 0,
            Weather::Rainy => 1,
        }
    }

    fn from_index(i: usize) -> Self {
        match i {
            0 => Weather::Sunny,
            _ => Weather::Rainy,
        }
    }
}

struct WeatherChain {
    /// transition[current][next] = probability
    transition: [[f64; 2]; 2],
}

impl WeatherChain {
    fn step(&self, current: Weather, rng: &mut impl Rng) -> Weather {
        let row = self.transition[current.index()];
        let r: f64 = rng.gen();
        let mut cumulative = 0.0;
        for (i, &prob) in row.iter().enumerate() {
            cumulative += prob;
            if r < cumulative {
                return Weather::from_index(i);
            }
        }
        Weather::from_index(row.len() - 1)
    }

    fn simulate(&self, start: Weather, steps: usize, rng: &mut impl Rng) -> Vec<Weather> {
        let mut states = Vec::with_capacity(steps + 1);
        states.push(start);
        let mut current = start;
        for _ in 0..steps {
            current = self.step(current, rng);
            states.push(current);
        }
        states
    }
}

fn main() {
    let mut rng = SmallRng::seed_from_u64(42);
    let chain = WeatherChain {
        transition: [[0.8, 0.2], [0.4, 0.6]],
    };

    let states = chain.simulate(Weather::Sunny, 30, &mut rng);
    println!("{:?}", states);

    let total = states.len() as f64;
    let sunny = states.iter().filter(|&&s| s == Weather::Sunny).count();
    let rainy = states.len() - sunny;

    println!("Sunny days: {} ({:.1}%)", sunny, 100.0 * sunny as f64 / total);
    println!("Rainy days: {} ({:.1}%)", rainy, 100.0 * rainy as f64 / total);
    println!("Stationary: Sunny ≈ 66.7%, Rainy ≈ 33.3%");
}
```

</details>

---

### 5. Exercise 2 — Simulating a Random Walk

**Goal:** Implement a one-dimensional random walk on the integers (states −*N* … +*N*) with reflecting boundaries, then measure the empirical distribution of positions after *T* steps.

```rust
struct RandomWalk {
    min: i32,
    max: i32,
    /// prob_right[i] = probability of stepping right from position i
    prob_right: Vec<f64>,
}

impl RandomWalk {
    fn step(&self, pos: i32, rng: &mut impl Rng) -> i32 { todo!() }
    fn histogram(&self, start: i32, steps: usize, trials: usize, rng: &mut impl Rng)
        -> HashMap<i32, usize> { todo!() }
}
```

> 🚧 This section is a stub — see nbd ticket `64826a`

---

## Part 3 — Text Generation

### 6. Text Generation with Markov Chains

Text can be modeled as a Markov chain where each state is a word (or sequence of words) and transitions represent which words tend to follow. This section explains the **bigram** model, why it produces surprisingly coherent short sequences, and what its limitations reveal about the relationship between statistical models and language. No code is written here — it prepares you for Exercises 3 and 4.

**Words as states.** To model text as a Markov chain, treat each word as a state and the act of writing the next word as a transition. Given the word "cat," what word comes next? A Markov model answers that question by consulting statistics drawn from a training corpus: it scans every occurrence of "cat" and records which words followed it, then samples from that empirical distribution. The result is a sequence of words generated one step at a time, each word chosen probabilistically based only on the current word — not on the full sentence or paragraph that came before. This is the Markov property applied to language.

**Bigrams and transition tables.** A **bigram** is an ordered pair of adjacent words. To build a bigram model, scan the corpus left to right and record every consecutive word pair. Count how many times each pair appears, then for each word *w* express those counts as a probability distribution over successor words. This distribution forms one row of the **transition table**: a lookup from every word to a weighted list of what can follow it. The table is learned entirely from data — no grammar rules, no meaning, just co-occurrence statistics.

**Worked example.** Consider this two-sentence corpus:

> *"the cat sat on the mat. the cat sat on the hat."*

Scanning the corpus (treating each sentence as a word sequence and ignoring the periods) yields the following bigrams and their counts. Dividing each count by the row total gives the transition probability:

| Current word | Next word | Count | Probability |
|---|---|---|---|
| the | cat | 2 | 0.50 |
| the | mat | 1 | 0.25 |
| the | hat | 1 | 0.25 |
| cat | sat | 2 | 1.00 |
| sat | on  | 2 | 1.00 |
| on  | the | 2 | 1.00 |

Notice that "cat," "sat," and "on" each have only one possible successor — the small corpus left them no choice. Only "the" branches: half the time it is followed by "cat," a quarter of the time by "mat," and a quarter by "hat." Starting from "the," one plausible generated sequence is: *the → cat → sat → on → the → hat* — and then "hat" would be an end-of-sentence state.

**Why the text sounds strange.** Short stretches of output are surprisingly readable: every adjacent pair the model produces appeared in the training data, so each local transition is both grammatically and semantically plausible. The problem surfaces over longer spans. Because the model has no memory beyond the current word, it cannot maintain a topic, complete a thought, or avoid contradictions introduced a few sentences back. The result resembles text written by someone who half-knows the language: each individual step looks right, but the destination keeps shifting without purpose.

**Order-*n* chains.** A bigram model is an *order-1* Markov chain — the next state depends on exactly one previous word. An *order-2* model (a **trigram** model) conditions on the last two words; order *n* uses the last *n* words as context. Increasing *n* brings sharply improved local coherence — at *n* = 3 or 4 the output begins to reproduce full phrases from the corpus verbatim — but at the cost of novelty. A high-order model has likely seen most of its context only once, so it often has no real choice but to copy the source text rather than recombine it. The right tradeoff depends on corpus size: larger corpora can support higher *n* without the model simply memorising its training data.

---

### 7. Exercise 3 — Bigram Text Generator

**Goal:** Build a Markov chain over words from an input text. Each state is a single word; transitions are learned from the corpus. Generate novel sentences of a given length from a chosen seed word.

```rust
struct BigramModel {
    /// transitions[word] = list of (next_word, weight) pairs
    transitions: HashMap<String, Vec<(String, usize)>>,
}

impl BigramModel {
    fn train(corpus: &str) -> Self { todo!() }
    fn generate(&self, seed: &str, length: usize, rng: &mut impl Rng) -> Vec<String> { todo!() }
}
```

> 🚧 This section is a stub — see nbd ticket `74be50`

---

### 8. Exercise 4 — N-gram Generalization

**Goal:** Generalize the bigram model to an *n*-gram model where each state is a window of *n* consecutive words. Compare the output quality for *n* = 1, 2, 3, and 4 on the same corpus.

```rust
struct NgramModel {
    n: usize,
    transitions: HashMap<Vec<String>, Vec<(String, usize)>>,
}

impl NgramModel {
    fn train(corpus: &str, n: usize) -> Self { todo!() }
    fn generate(&self, seed: Vec<String>, length: usize, rng: &mut impl Rng) -> Vec<String> { todo!() }
}
```

> 🚧 This section is a stub — see nbd ticket `1f995a`

---

## Part 4 — Deeper Concepts

### 9. Stationary Distributions

A **stationary distribution** *π* is a probability distribution over states that is unchanged by one step of the chain: *π P = π*. This section covers how to find stationary distributions analytically (for small chains) and via power iteration, and explains when they exist and are unique — introducing the concepts of **irreducibility** and **aperiodicity**.

> 🚧 This section is a stub — see nbd ticket `68ee16`

---

### 10. Applications and Further Reading

Markov chains appear throughout computer science and mathematics: PageRank, MCMC sampling, hidden Markov models, reinforcement learning, and more. This section surveys these applications at a high level and points to books, papers, and courses for learners who want to go deeper on any thread.

#### Application Survey

**PageRank.** Google's original ranking algorithm modeled the web as a Markov chain: each page is a state, and each hyperlink is a transition with probability proportional to the number of outgoing links on the source page. A small "teleportation" probability was added so a random surfer occasionally jumps to a uniformly random page rather than following a link, ensuring the chain is irreducible and has a unique stationary distribution. The stationary probability of each page — the fraction of time a random surfer spends there in the long run — becomes its rank. Because the web had billions of pages, computing the stationary distribution via power iteration rather than direct matrix inversion was essential; the same convergence guarantee from Section 9 applies at planetary scale.

**Markov Chain Monte Carlo (MCMC).** Bayesian inference often requires integrating over high-dimensional parameter spaces where the posterior distribution has no closed form. MCMC methods solve this by constructing a Markov chain whose stationary distribution *is* the target posterior, then running the chain long enough that its samples approximate draws from that distribution. The **Metropolis-Hastings** algorithm is the foundational recipe: propose a move to a new state, accept it with a probability that preserves detailed balance, and reject it otherwise. Variants such as Gibbs sampling, Hamiltonian Monte Carlo, and the No-U-Turn Sampler (NUTS) power nearly every modern probabilistic programming framework, from Stan to PyMC.

**Hidden Markov Models (HMMs).** An HMM separates a Markov chain into two layers: a hidden state sequence that evolves according to a transition matrix, and an observation sequence where each hidden state emits an observable symbol with some probability. The key insight is that the true states are never directly seen — only the observations are. HMMs were the dominant approach in speech recognition for decades (phonemes as hidden states, acoustic features as observations) and remain central in bioinformatics for gene prediction and sequence segmentation. The Viterbi algorithm finds the most likely hidden state path for a given observation sequence in time linear in the sequence length; the Baum-Welch algorithm trains HMMs from unlabelled data using expectation-maximisation.

**Reinforcement Learning.** Most reinforcement learning problems are formulated as **Markov Decision Processes** (MDPs), which augment a Markov chain with a set of actions and a scalar reward signal. At each step the agent chooses an action, the environment transitions to a new state according to a transition distribution that depends on both the current state and the chosen action, and the agent receives a reward. The goal is to find a policy — a rule mapping states to actions — that maximises cumulative discounted reward. Because the next state depends only on the current state and action (not the history), the Markov property is what makes value-function algorithms like Q-learning and policy-gradient methods tractable. Sutton and Barto's textbook (listed below) treats this connection in full rigour.

**Bioinformatics — Sequence Analysis.** DNA and protein sequences are naturally modelled as Markov chains over an alphabet of bases or amino acids. A simple *k*-th order Markov model assigns probabilities to short subsequences and can distinguish coding regions from non-coding regions: CpG islands in mammalian genomes, for example, have transition probabilities measurably different from the genomic background. Profile HMMs generalise this to align whole families of sequences — each column in a multiple sequence alignment becomes a hidden state with its own emission distribution, allowing robust database search even for distantly related proteins.

**Queueing Theory.** The classic M/M/1 queue — arrivals according to a Poisson process, exponential service times, a single server — is a continuous-time Markov chain on the non-negative integers, where the state is the number of customers in the system. Its stationary distribution is geometric, giving simple closed-form expressions for average queue length and waiting time. More complex queueing networks (multiple servers, priorities, finite buffers) extend the same framework and are used to size data-centre infrastructure, analyse hospital emergency departments, and design network switches. Continuous-time Markov chains replace the transition *matrix* with a **generator matrix** *Q* whose off-diagonal entries are transition *rates* rather than probabilities.

**Language Models.** The n-gram models from Exercises 3 and 4 are finite-order Markov chains over word tokens, and they directly preceded modern neural language models. In the 1990s and 2000s, trigram and 4-gram models with smoothing (Kneser-Ney, Witten-Bell) were the state of the art for machine translation and speech recognition. Neural language models replaced explicit Markov structure with learned representations, but the conceptual scaffolding is the same: predict the next token from a bounded window of context. Understanding the Markov chain view of language — its local coherence, its lack of long-range memory, its reliance on corpus statistics — clarifies both what earlier systems got right and what neural approaches had to learn to transcend.

---

#### Further Reading

**Books**

- **Blitzstein & Hwang — *Introduction to Probability* (2nd ed.)** A beautifully written undergraduate probability text with a full chapter on Markov chains. The authors' lecture videos and course materials are freely available; a free PDF of the book is offered on the book's companion site. Start here if your probability background is thin or if you want every concept illustrated with concrete examples before the formalism arrives.

- **Norris — *Markov Chains*** The standard rigorous treatment at the advanced-undergraduate / early-graduate level. Covers discrete- and continuous-time chains, convergence, reversibility, and applications with full proofs. Dense but thorough; worth working through if you intend to read research papers that use Markov chains as a theoretical tool.

- **Sutton & Barto — *Reinforcement Learning: An Introduction* (2nd ed.)** The canonical RL textbook, freely available as a PDF from the authors. Chapters 3 and 4 formalise MDPs and dynamic programming using exactly the Markov chain machinery developed in this course. Reading those chapters after completing this course is a natural next step toward understanding how modern game-playing and robotics agents are designed.

**Articles**

- **Metropolis-Hastings algorithm — Wikipedia.** A well-maintained article that covers the algorithm statement, acceptance-ratio derivation, intuition for why detailed balance implies the correct stationary distribution, and pseudocode. A good companion to the original 1953 Metropolis *et al.* paper (five pages, freely available) and the 1970 Hastings generalisation.

---

#### Rust Ecosystem Pointers

The exercises in this course used `rand` for sampling. A few other crates are useful as you build more serious probabilistic programs in Rust:

- **`nalgebra`** — a comprehensive linear-algebra library covering vectors, dense matrices, and decompositions. Use it to compute *P*^*k* via repeated matrix multiplication or to solve the stationary-distribution equations *π P = π*, *Σ πᵢ = 1* as a linear system.

- **`petgraph`** — graph data structures and algorithms. Markov chains are directed weighted graphs, and `petgraph` lets you represent them as such, run graph-theoretic algorithms (strongly connected components give you irreducible sub-chains), and visualise the structure via its Graphviz export.

- **`statrs`** — a statistics library providing common probability distributions with their PDFs, CDFs, and samplers. Useful when building emission distributions for HMMs or when you need chi-squared tests to check whether simulated chain frequencies match theoretical stationary probabilities.