You cannot select more than 25 topics Topics must start with a letter or number, can include dashes ('-') and can be up to 35 characters long.

18 KiB

Raw Blame History Unescape Escape

Markov Chain Self-Guided Course

This document is a self-guided course on Markov chains. It is organized into four parts: conceptual foundations, first Rust implementations, text generation, and deeper theory. Each section is either a reading lesson or a hands-on Rust programming exercise. Sections marked 🚧 are stubs whose full content is tracked in an nbd ticket — follow the ticket ID to find the detailed learning objectives and instructions.

Part 1 — Foundations

What Is a Markov Chain?
States and Transitions
Transition Probabilities and Matrices

Part 2 — First Implementation

Exercise 1 — Weather Model
Exercise 2 — Simulating a Random Walk

Part 3 — Text Generation

Text Generation with Markov Chains
Exercise 3 — Bigram Text Generator
Exercise 4 — N-gram Generalization

Part 4 — Deeper Concepts

Stationary Distributions
Applications and Further Reading

Part 1 — Foundations

1. What Is a Markov Chain?

A Markov chain is a mathematical model describing a sequence of events where the probability of each event depends only on the state reached in the previous event — not on the full history. This "memoryless" property is called the Markov property. You will learn where Markov chains appear in the real world and develop intuition for why the memoryless property is both a useful simplification and a meaningful assumption.

🚧 This section is a stub — see nbd ticket fbf323

2. States and Transitions

Every Markov chain consists of a finite (or countably infinite) set of states and a rule for how the system moves between them. This section formalizes the vocabulary: states, transitions, directed graphs, and the notion of a chain's step or time. By the end you will be able to draw a state-transition diagram for a simple real-world system.

🚧 This section is a stub — see nbd ticket 738be2

3. Transition Probabilities and Matrices

The rules governing how a Markov chain moves are captured in a transition matrix P, where P[i][j] is the probability of moving from state i to state j in one step. This section covers how to construct P, the constraints it must satisfy (rows sum to 1), and how to use matrix multiplication to compute multi-step probabilities.

Defining the transition matrix. Label the states 0, 1, …, n−1. The transition matrix P is an n × n array where entry P[i][j] gives the probability of moving to state j on the very next step, given that you are currently in state i. Because these are probabilities of mutually exclusive, exhaustive outcomes (from state i you must go somewhere), every row must sum to exactly 1 and every entry must lie between 0 and 1 inclusive. A matrix satisfying these two constraints is called a stochastic matrix (or row-stochastic matrix). Each row is itself a probability distribution over the next state.

The stochastic-matrix constraints, stated precisely. For an n-state chain:

P[i][j] >= 0        for all i, j
sum_j P[i][j] = 1   for every row i

A zero entry means the transition is impossible; a one means it is certain. Columns have no such constraint — column sums need not equal 1.

Multi-step probabilities via matrix multiplication. Suppose you start in state i at time 0. After one step the probability of being in state j is P[i][j]. After two steps you pass through some intermediate state k, so:

P^2[i][j] = sum_k  P[i][k] * P[k][j]

This is exactly the (i, j) entry of P × P = P². In general, the probability of going from state i to state j in exactly k steps is the (i, j) entry of P^k. If you encode your current uncertainty as a row vector π₀ — a probability distribution over all states — then after k steps your updated distribution is:

π_k = π₀ · P^k

Each right-multiplication by P advances the clock one tick and blends probabilities according to the transition rules.

Worked example — a two-state weather chain. Consider a model with two states: Sunny (state 0) and Rainy (state 1). From data:

If today is Sunny, tomorrow is Sunny with probability 0.8 and Rainy with probability 0.2.
If today is Rainy, tomorrow is Sunny with probability 0.4 and Rainy with probability 0.6.

Writing this as a matrix:

        Sunny  Rainy
Sunny [  0.8    0.2 ]
Rainy [  0.4    0.6 ]

Row 0 sums to 1.0; row 1 sums to 1.0. All entries are non-negative. P is a valid stochastic matrix.

One step. Start with certainty in Sunny: π₀ = [1, 0].

π₁ = π₀ · P = [1, 0] · [[0.8, 0.2], [0.4, 0.6]] = [0.8, 0.2]

Tomorrow: 80 % Sunny, 20 % Rainy.

Two steps. Apply P again:

π₂ = π₁ · P = [0.8, 0.2] · [[0.8, 0.2], [0.4, 0.6]]
             = [0.8*0.8 + 0.2*0.4,  0.8*0.2 + 0.2*0.6]
             = [0.72, 0.28]

Equivalently, compute P² once and read off the row for state 0:

P^2 = [[0.8*0.8 + 0.2*0.4,  0.8*0.2 + 0.2*0.6],
       [0.4*0.8 + 0.6*0.4,  0.4*0.2 + 0.6*0.6]]
    = [[0.72, 0.28],
       [0.56, 0.44]]

P²[0] = [0.72, 0.28] — matching the step-by-step result. Starting from Rainy gives P²[1] = [0.56, 0.44]; the two rows are already noticeably closer to each other than the original [0.8, 0.2] vs [0.4, 0.6]. As k grows, both rows converge toward the same limiting vector — the stationary distribution that Section 9 analyses in depth. The matrix-multiplication perspective makes this convergence precise and computable.

Part 2 — First Implementation

4. Exercise 1 — Weather Model

Goal: Build a two-state Markov chain in Rust that models daily weather as either Sunny or Rainy, driven by a transition matrix, and simulate 30 days of weather.

Setup

Create a new Cargo binary project and add the rand crate:

cargo new weather-chain
cd weather-chain
cargo add rand

Starter Code

Replace the contents of src/main.rs with the following skeleton. Do not change the struct layout or function signatures — your task is to fill in the todo!() bodies and write main.

use rand::Rng;

#[derive(Debug, Clone, Copy, PartialEq)]
enum Weather { Sunny, Rainy }

struct WeatherChain {
    /// transition[current][next] = probability
    transition: [[f64; 2]; 2],
}

impl WeatherChain {
    fn step(&self, current: Weather, rng: &mut impl Rng) -> Weather { todo!() }
    fn simulate(&self, start: Weather, steps: usize, rng: &mut impl Rng) -> Vec<Weather> { todo!() }
}

fn main() { todo!() }

Step 1 — Index conversion

step needs to index into self.transition using the current state, and later convert an index back to a Weather value. Add two helper methods to the Weather enum:

fn index(self) -> usize — returns 0 for Sunny, 1 for Rainy
fn from_index(i: usize) -> Self — returns Sunny for 0, Rainy for anything else

A simple match expression handles both. These are the only tools you need to bridge the enum and the matrix.

Step 2 — Implement `WeatherChain::step`

step must sample the next state from the probability row for the current state. The technique is a cumulative-probability walk:

Retrieve the transition row: let row = self.transition[current.index()];
Draw a uniform float in [0, 1): let r: f64 = rng.gen();
Walk the row, accumulating probability. When the running total first exceeds r, return that state.

For the two-state case this reduces to a single comparison — if r < row[0] return Sunny, otherwise return Rainy — but implementing the loop works for any number of states and is worth practising.

Add a fallback Weather::from_index(row.len() - 1) after the loop to satisfy the compiler; floating-point rounding can in rare cases leave the loop without returning.

Step 3 — Implement `WeatherChain::simulate`

simulate runs the chain for steps transitions, collecting every state visited including the start:

Allocate a Vec with capacity steps + 1.
Push start and set current = start.
Loop steps times: call self.step(current, rng), update current, and push.
Return the Vec.

The returned slice will have length steps + 1 (the initial state plus one state per step).

Step 4 — Run the simulation

In main, create a WeatherChain with the two-state matrix from Section 3:

Sunny [0.8, 0.2]
Rainy [0.4, 0.6]

Seed a repeatable RNG with rand::rngs::SmallRng::seed_from_u64(42) (add use rand::SeedableRng;). Simulate 30 steps starting from Sunny, print the resulting sequence, and count how many days were sunny vs rainy.

Expected output structure (exact numbers vary by seed):

[Sunny, Sunny, Rainy, Sunny, ...]
Sunny days: 21 (67.7%)
Rainy days: 10 (32.3%)

Step 5 — Compare to the stationary distribution

The stationary distribution π satisfies π = πP, meaning once the chain reaches it, the distribution no longer changes. For this matrix, solve the two equations:

π₀ = 0.8·π₀ + 0.4·π₁
π₁ = 0.2·π₀ + 0.6·π₁
π₀ + π₁ = 1

The first equation simplifies to 0.2·π₀ = 0.4·π₁, giving π₀ = 2·π₁. Substituting into the normalisation constraint: π₀ = 2/3 ≈ 66.7%, π₁ = 1/3 ≈ 33.3%.

Print the stationary percentages alongside your simulated counts. With only 31 data points the match will be rough; re-run with 1 000 or 10 000 steps to see the empirical frequencies converge.

Reference Solution

Show full solution

use rand::{Rng, SeedableRng, rngs::SmallRng};

#[derive(Debug, Clone, Copy, PartialEq)]
enum Weather { Sunny, Rainy }

impl Weather {
    fn index(self) -> usize {
        match self {
            Weather::Sunny => 0,
            Weather::Rainy => 1,
        }
    }

    fn from_index(i: usize) -> Self {
        match i {
            0 => Weather::Sunny,
            _ => Weather::Rainy,
        }
    }
}

struct WeatherChain {
    /// transition[current][next] = probability
    transition: [[f64; 2]; 2],
}

impl WeatherChain {
    fn step(&self, current: Weather, rng: &mut impl Rng) -> Weather {
        let row = self.transition[current.index()];
        let r: f64 = rng.gen();
        let mut cumulative = 0.0;
        for (i, &prob) in row.iter().enumerate() {
            cumulative += prob;
            if r < cumulative {
                return Weather::from_index(i);
            }
        }
        Weather::from_index(row.len() - 1)
    }

    fn simulate(&self, start: Weather, steps: usize, rng: &mut impl Rng) -> Vec<Weather> {
        let mut states = Vec::with_capacity(steps + 1);
        states.push(start);
        let mut current = start;
        for _ in 0..steps {
            current = self.step(current, rng);
            states.push(current);
        }
        states
    }
}

fn main() {
    let mut rng = SmallRng::seed_from_u64(42);
    let chain = WeatherChain {
        transition: [[0.8, 0.2], [0.4, 0.6]],
    };

    let states = chain.simulate(Weather::Sunny, 30, &mut rng);
    println!("{:?}", states);

    let total = states.len() as f64;
    let sunny = states.iter().filter(|&&s| s == Weather::Sunny).count();
    let rainy = states.len() - sunny;

    println!("Sunny days: {} ({:.1}%)", sunny, 100.0 * sunny as f64 / total);
    println!("Rainy days: {} ({:.1}%)", rainy, 100.0 * rainy as f64 / total);
    println!("Stationary: Sunny ≈ 66.7%, Rainy ≈ 33.3%");
}

5. Exercise 2 — Simulating a Random Walk

Goal: Implement a one-dimensional random walk on the integers (states −N … +N) with reflecting boundaries, then measure the empirical distribution of positions after T steps.

struct RandomWalk {
    min: i32,
    max: i32,
    /// prob_right[i] = probability of stepping right from position i
    prob_right: Vec<f64>,
}

impl RandomWalk {
    fn step(&self, pos: i32, rng: &mut impl Rng) -> i32 { todo!() }
    fn histogram(&self, start: i32, steps: usize, trials: usize, rng: &mut impl Rng)
        -> HashMap<i32, usize> { todo!() }
}

🚧 This section is a stub — see nbd ticket 64826a

Part 3 — Text Generation

6. Text Generation with Markov Chains

Text can be modeled as a Markov chain where each state is a word (or sequence of words) and transitions represent which words tend to follow. This section explains the bigram model, why it produces surprisingly coherent short sequences, and what its limitations reveal about the relationship between statistical models and language. No code is written here — it prepares you for Exercises 3 and 4.

Words as states. To model text as a Markov chain, treat each word as a state and the act of writing the next word as a transition. Given the word "cat," what word comes next? A Markov model answers that question by consulting statistics drawn from a training corpus: it scans every occurrence of "cat" and records which words followed it, then samples from that empirical distribution. The result is a sequence of words generated one step at a time, each word chosen probabilistically based only on the current word — not on the full sentence or paragraph that came before. This is the Markov property applied to language.

Bigrams and transition tables. A bigram is an ordered pair of adjacent words. To build a bigram model, scan the corpus left to right and record every consecutive word pair. Count how many times each pair appears, then for each word w express those counts as a probability distribution over successor words. This distribution forms one row of the transition table: a lookup from every word to a weighted list of what can follow it. The table is learned entirely from data — no grammar rules, no meaning, just co-occurrence statistics.

Worked example. Consider this two-sentence corpus:

"the cat sat on the mat. the cat sat on the hat."

Scanning the corpus (treating each sentence as a word sequence and ignoring the periods) yields the following bigrams and their counts. Dividing each count by the row total gives the transition probability:

Current word	Next word	Count	Probability
the	cat	2	0.50
the	mat	1	0.25
the	hat	1	0.25
cat	sat	2	1.00
sat	on	2	1.00
on	the	2	1.00

Notice that "cat," "sat," and "on" each have only one possible successor — the small corpus left them no choice. Only "the" branches: half the time it is followed by "cat," a quarter of the time by "mat," and a quarter by "hat." Starting from "the," one plausible generated sequence is: the → cat → sat → on → the → hat — and then "hat" would be an end-of-sentence state.

Why the text sounds strange. Short stretches of output are surprisingly readable: every adjacent pair the model produces appeared in the training data, so each local transition is both grammatically and semantically plausible. The problem surfaces over longer spans. Because the model has no memory beyond the current word, it cannot maintain a topic, complete a thought, or avoid contradictions introduced a few sentences back. The result resembles text written by someone who half-knows the language: each individual step looks right, but the destination keeps shifting without purpose.

Order-n chains. A bigram model is an order-1 Markov chain — the next state depends on exactly one previous word. An order-2 model (a trigram model) conditions on the last two words; order n uses the last n words as context. Increasing n brings sharply improved local coherence — at n = 3 or 4 the output begins to reproduce full phrases from the corpus verbatim — but at the cost of novelty. A high-order model has likely seen most of its context only once, so it often has no real choice but to copy the source text rather than recombine it. The right tradeoff depends on corpus size: larger corpora can support higher n without the model simply memorising its training data.

7. Exercise 3 — Bigram Text Generator

Goal: Build a Markov chain over words from an input text. Each state is a single word; transitions are learned from the corpus. Generate novel sentences of a given length from a chosen seed word.

struct BigramModel {
    /// transitions[word] = list of (next_word, weight) pairs
    transitions: HashMap<String, Vec<(String, usize)>>,
}

impl BigramModel {
    fn train(corpus: &str) -> Self { todo!() }
    fn generate(&self, seed: &str, length: usize, rng: &mut impl Rng) -> Vec<String> { todo!() }
}

🚧 This section is a stub — see nbd ticket 74be50

8. Exercise 4 — N-gram Generalization

Goal: Generalize the bigram model to an n-gram model where each state is a window of n consecutive words. Compare the output quality for n = 1, 2, 3, and 4 on the same corpus.

struct NgramModel {
    n: usize,
    transitions: HashMap<Vec<String>, Vec<(String, usize)>>,
}

impl NgramModel {
    fn train(corpus: &str, n: usize) -> Self { todo!() }
    fn generate(&self, seed: Vec<String>, length: usize, rng: &mut impl Rng) -> Vec<String> { todo!() }
}

🚧 This section is a stub — see nbd ticket 1f995a

Part 4 — Deeper Concepts

9. Stationary Distributions

A stationary distribution π is a probability distribution over states that is unchanged by one step of the chain: π P = π. This section covers how to find stationary distributions analytically (for small chains) and via power iteration, and explains when they exist and are unique — introducing the concepts of irreducibility and aperiodicity.

🚧 This section is a stub — see nbd ticket 68ee16

10. Applications and Further Reading

Markov chains appear throughout computer science and mathematics: PageRank, MCMC sampling, hidden Markov models, reinforcement learning, and more. This section surveys these applications at a high level and points to books, papers, and courses for learners who want to go deeper on any thread.

🚧 This section is a stub — see nbd ticket 5994a6

18 KiB Raw Blame History Unescape Escape