Markov Chain Self-Guided Course
This document is a self-guided course on Markov chains. It is organized into four parts: conceptual foundations, first Rust implementations, text generation, and deeper theory. Each section is either a reading lesson or a hands-on Rust programming exercise. Sections marked 🚧 are stubs whose full content is tracked in an nbd ticket — follow the ticket ID to find the detailed learning objectives and instructions.
Table of Contents
Part 1 — Foundations
- What Is a Markov Chain?
- States and Transitions
- Transition Probabilities and Matrices
Part 2 — First Implementation
- Exercise 1 — Weather Model
- Exercise 2 — Simulating a Random Walk
Part 3 — Text Generation
- Text Generation with Markov Chains
- Exercise 3 — Bigram Text Generator
- Exercise 4 — N-gram Generalization
Part 4 — Deeper Concepts
- Stationary Distributions
- Applications and Further Reading
Part 1 — Foundations
1. What Is a Markov Chain?
A Markov chain is a mathematical model describing a sequence of events where the probability of each event depends only on the state reached in the previous event — not on the full history. This "memoryless" property is called the Markov property. You will learn where Markov chains appear in the real world and develop intuition for why the memoryless property is both a useful simplification and a meaningful assumption.
🚧 This section is a stub; see nbd ticket fbf323.
2. States and Transitions
Every Markov chain consists of a finite (or countably infinite) set of states and a rule for how the system moves between them. This section formalizes the vocabulary: states, transitions, directed graphs, and the notion of a chain's step or time. By the end you will be able to draw a state-transition diagram for a simple real-world system.
🚧 This section is a stub; see nbd ticket 738be2.
3. Transition Probabilities and Matrices
The rules governing how a Markov chain moves are captured in a transition matrix P, where P[i][j] is the probability of moving from state i to state j in one step. This section covers how to construct P, the constraints it must satisfy (rows sum to 1), and how to use matrix multiplication to compute multi-step probabilities.
Defining the transition matrix. Label the states 0, 1, …, n−1. The transition matrix P is an n × n array where entry P[i][j] gives the probability of moving to state j on the very next step, given that you are currently in state i. Because these are probabilities of mutually exclusive, exhaustive outcomes (from state i you must go somewhere), every row must sum to exactly 1 and every entry must lie between 0 and 1 inclusive. A matrix satisfying these two constraints is called a stochastic matrix (or row-stochastic matrix). Each row is itself a probability distribution over the next state.
The stochastic-matrix constraints, stated precisely. For an n-state chain:
P[i][j] >= 0 for all i, j
sum_j P[i][j] = 1 for every row i
A zero entry means the transition is impossible; a one means it is certain. Columns have no such constraint — column sums need not equal 1.
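The two constraints translate directly into a check you can run on any candidate matrix. The following is a minimal sketch, not part of the course skeletons: the function name `is_row_stochastic`, the `eps` tolerance parameter, and the example matrices are assumptions chosen for illustration. The tolerance is there because floating-point row sums are not always exactly 1.

```rust
/// Returns true if `p` is square, every entry lies in [0, 1], and every
/// row sums to 1 within the tolerance `eps`.
/// (Hypothetical helper; the name and signature are illustrative only.)
fn is_row_stochastic(p: &[Vec<f64>], eps: f64) -> bool {
    let n = p.len();
    p.iter().all(|row| {
        row.len() == n
            && row.iter().all(|&x| (0.0..=1.0).contains(&x))
            && (row.iter().sum::<f64>() - 1.0).abs() <= eps
    })
}

fn main() {
    // Rows sum to 1; the columns (0.6 and 1.4) do not, which is allowed.
    let valid = vec![vec![0.5, 0.5], vec![0.1, 0.9]];
    // The first row sums to 1.1, so this is not a stochastic matrix.
    let invalid = vec![vec![0.8, 0.3], vec![0.4, 0.6]];

    println!("valid:   {}", is_row_stochastic(&valid, 1e-9));
    println!("invalid: {}", is_row_stochastic(&invalid, 1e-9));
}
```

Running a check like this on a matrix before simulating with it catches typos in hand-entered probabilities early.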
Multi-step probabilities via matrix multiplication. Suppose you start in state i at time 0. After one step the probability of being in state j is P[i][j]. To reach j in exactly two steps you must pass through some intermediate state m after the first step; summing over every possible m gives:
P^2[i][j] = sum_m P[i][m] * P[m][j]
This is exactly the (i, j) entry of P × P = P². In general, the probability of going from state i to state j in exactly k steps is the (i, j) entry of P^k. If you encode your current uncertainty as a row vector π₀ — a probability distribution over all states — then after k steps your updated distribution is:
π_k = π₀ · P^k
Each right-multiplication by P advances the clock one tick and blends probabilities according to the transition rules.
Worked example — a two-state weather chain. Consider a model with two states: Sunny (state 0) and Rainy (state 1). From data:
- If today is Sunny, tomorrow is Sunny with probability 0.8 and Rainy with probability 0.2.
- If today is Rainy, tomorrow is Sunny with probability 0.4 and Rainy with probability 0.6.
Writing this as a matrix:
        Sunny  Rainy
Sunny [  0.8    0.2  ]
Rainy [  0.4    0.6  ]
Row 0 sums to 1.0; row 1 sums to 1.0. All entries are non-negative. P is a valid stochastic matrix.
One step. Start with certainty in Sunny: π₀ = [1, 0].
π₁ = π₀ · P = [1, 0] · [[0.8, 0.2], [0.4, 0.6]] = [0.8, 0.2]
Tomorrow: 80 % Sunny, 20 % Rainy.
Two steps. Apply P again:
π₂ = π₁ · P = [0.8, 0.2] · [[0.8, 0.2], [0.4, 0.6]]
   = [0.8*0.8 + 0.2*0.4, 0.8*0.2 + 0.2*0.6]
   = [0.72, 0.28]
Equivalently, compute P² once and read off the row for state 0:
P^2 = [[0.8*0.8 + 0.2*0.4, 0.8*0.2 + 0.2*0.6],
       [0.4*0.8 + 0.6*0.4, 0.4*0.2 + 0.6*0.6]]
    = [[0.72, 0.28],
       [0.56, 0.44]]
P²[0] = [0.72, 0.28], matching the step-by-step result. Starting from Rainy gives P²[1] = [0.56, 0.44]; the two rows are already noticeably closer to each other than the original [0.8, 0.2] and [0.4, 0.6]. As k grows, both rows converge toward the same limiting vector, here [2/3, 1/3] ≈ [0.667, 0.333]. This limit is the stationary distribution that Section 9 analyses in depth. The matrix-multiplication perspective makes the convergence precise and computable.
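The arithmetic in the worked example can be reproduced in a few lines of Rust. This is a sketch under stated assumptions: the `Matrix` alias and the helper names `step_distribution` and `mat_mul` are invented here for illustration, and the code is fixed to two states to stay close to the weather chain.

```rust
/// A 2 x 2 transition matrix (illustrative alias, not from the course skeletons).
type Matrix = [[f64; 2]; 2];

/// Row vector times matrix: advances a distribution by one step.
fn step_distribution(pi: [f64; 2], p: &Matrix) -> [f64; 2] {
    let mut next = [0.0; 2];
    for i in 0..2 {
        for j in 0..2 {
            next[j] += pi[i] * p[i][j];
        }
    }
    next
}

/// Matrix product a * b, summing over the intermediate state `mid`.
fn mat_mul(a: &Matrix, b: &Matrix) -> Matrix {
    let mut out = [[0.0; 2]; 2];
    for i in 0..2 {
        for j in 0..2 {
            for mid in 0..2 {
                out[i][j] += a[i][mid] * b[mid][j];
            }
        }
    }
    out
}

fn main() {
    let p: Matrix = [[0.8, 0.2], [0.4, 0.6]];

    // Step-by-step route: start certain of Sunny, apply P twice.
    let pi1 = step_distribution([1.0, 0.0], &p);
    let pi2 = step_distribution(pi1, &p);
    println!("pi1 = [{:.2}, {:.2}]", pi1[0], pi1[1]);
    println!("pi2 = [{:.2}, {:.2}]", pi2[0], pi2[1]);

    // Matrix route: square P once and read off the rows.
    let p2 = mat_mul(&p, &p);
    println!("P^2 row 0 = [{:.2}, {:.2}]", p2[0][0], p2[0][1]);
    println!("P^2 row 1 = [{:.2}, {:.2}]", p2[1][0], p2[1][1]);
}
```

Both routes give the same numbers for a start in Sunny, which is the point of the example: multiplying the distribution by P twice and reading row 0 of P² are the same computation.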
Part 2 — First Implementation
4. Exercise 1 — Weather Model
Goal: Build a two-state Markov chain in Rust that models daily weather as either Sunny or Rainy, driven by a transition matrix, and simulate 30 days of weather.
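// Assumes the `rand` crate is a dependency; `Rng` is its random-generator trait.
use rand::Rng;

#[derive(Debug, Clone, Copy, PartialEq)]
enum Weather { Sunny, Rainy }

struct WeatherChain {
    /// transition[current][next] = probability
    transition: [[f64; 2]; 2],
}

impl WeatherChain {
    /// Samples the next day's weather from the row for `current`.
    fn step(&self, current: Weather, rng: &mut impl Rng) -> Weather { todo!() }
    /// Runs the chain for `steps` transitions starting from `start`.
    fn simulate(&self, start: Weather, steps: usize, rng: &mut impl Rng) -> Vec<Weather> { todo!() }
}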
🚧 This section is a stub; see nbd ticket 257a2a.
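One common way to turn a row of the transition matrix into a random choice is inverse-CDF sampling: draw a uniform number u in [0, 1) and walk along the row until the running total exceeds u. The sketch below shows only that core idea. The function `sample_row` is a hypothetical helper, not part of the skeleton above, and it takes u as a plain argument so the example runs without any crate; in the exercise, u would come from your random number generator.

```rust
/// Picks the index of the next state from one row of a transition matrix,
/// given a uniform draw `u` in [0, 1).
/// (Hypothetical helper; assumes `row` is a valid probability distribution.)
fn sample_row(row: &[f64], u: f64) -> usize {
    let mut cumulative = 0.0;
    for (j, &p) in row.iter().enumerate() {
        cumulative += p;
        if u < cumulative {
            return j;
        }
    }
    // Rounding can leave the running total slightly below 1, so fall back
    // to the last state instead of failing.
    row.len().saturating_sub(1)
}

fn main() {
    // The Sunny row of the weather chain: 0 = Sunny, 1 = Rainy.
    let sunny_row = [0.8, 0.2];
    for u in [0.0, 0.5, 0.79, 0.8, 0.99] {
        println!("u = {:.2} -> state {}", u, sample_row(&sunny_row, u));
    }
}
```

Draws below 0.8 land in state 0 and draws from 0.8 upward land in state 1, so over many uniform draws the states appear with the probabilities in the row.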
5. Exercise 2 — Simulating a Random Walk
Goal: Implement a one-dimensional random walk on the integers (states −N … +N) with reflecting boundaries, then measure the empirical distribution of positions after T steps.
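// Assumes the `rand` crate is a dependency.
use rand::Rng;
use std::collections::HashMap;

struct RandomWalk {
    min: i32,
    max: i32,
    /// Probability of stepping right from each position. Positions can be
    /// negative, so index with `(pos - min) as usize` rather than `pos`.
    prob_right: Vec<f64>,
}

impl RandomWalk {
    fn step(&self, pos: i32, rng: &mut impl Rng) -> i32 { todo!() }
    fn histogram(&self, start: i32, steps: usize, trials: usize, rng: &mut impl Rng)
        -> HashMap<i32, usize> { todo!() }
}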
🚧 This section is a stub; see nbd ticket 64826a.
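"Reflecting boundaries" can be read in more than one way, and the full definition lives in the ticket. The sketch below assumes one common convention: a step that would leave the interval bounces the walker back one position inward. Another convention keeps the walker in place instead. The function `reflecting_step` is a hypothetical helper that isolates the boundary logic from the random choice of direction.

```rust
/// Applies one move on the interval min..=max.
/// Assumption: a step that would cross a boundary is reflected one
/// position back inward. Requires min < max.
fn reflecting_step(pos: i32, min: i32, max: i32, go_right: bool) -> i32 {
    let proposed = if go_right { pos + 1 } else { pos - 1 };
    if proposed > max {
        max - 1
    } else if proposed < min {
        min + 1
    } else {
        proposed
    }
}

fn main() {
    let (min, max) = (-3, 3);
    println!("{}", reflecting_step(0, min, max, true)); // interior move
    println!("{}", reflecting_step(3, min, max, true)); // bounces off the right wall
    println!("{}", reflecting_step(-3, min, max, false)); // bounces off the left wall
}
```

Keeping the boundary rule in a pure function makes it easy to test separately from the random direction choice.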
Part 3 — Text Generation
6. Text Generation with Markov Chains
Text can be modeled as a Markov chain where each state is a word (or sequence of words) and transitions represent which words tend to follow. This section explains the bigram model, why it produces surprisingly coherent short sequences, and what its limitations reveal about the relationship between statistical models and language. No code is written here — it prepares you for Exercises 3 and 4.
Words as states. To model text as a Markov chain, treat each word as a state and the act of writing the next word as a transition. Given the word "cat," what word comes next? A Markov model answers that question by consulting statistics drawn from a training corpus: it scans every occurrence of "cat" and records which words followed it, then samples from that empirical distribution. The result is a sequence of words generated one step at a time, each word chosen probabilistically based only on the current word — not on the full sentence or paragraph that came before. This is the Markov property applied to language.
Bigrams and transition tables. A bigram is an ordered pair of adjacent words. To build a bigram model, scan the corpus left to right and record every consecutive word pair. Count how many times each pair appears, then for each word w express those counts as a probability distribution over successor words. This distribution forms one row of the transition table: a lookup from every word to a weighted list of what can follow it. The table is learned entirely from data — no grammar rules, no meaning, just co-occurrence statistics.
Worked example. Consider this two-sentence corpus:
"the cat sat on the mat. the cat sat on the hat."
Scanning the corpus (treating each sentence as a word sequence and ignoring the periods) yields the following bigrams and their counts. Dividing each count by the row total gives the transition probability:
| Current word | Next word | Count | Probability |
|---|---|---|---|
| the | cat | 2 | 0.50 |
| the | mat | 1 | 0.25 |
| the | hat | 1 | 0.25 |
| cat | sat | 2 | 1.00 |
| sat | on | 2 | 1.00 |
| on | the | 2 | 1.00 |
Notice that "cat," "sat," and "on" each have only one possible successor: the small corpus left them no choice. Only "the" branches: half the time it is followed by "cat," a quarter of the time by "mat," and a quarter by "hat." Starting from "the," one plausible generated sequence is: the → cat → sat → on → the → hat. Generation stops there, because "hat" (like "mat") never appears as a current word in the table; it has no recorded successor and so acts as an end-of-sentence state.
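The counts in the table can be reproduced mechanically. The following is a minimal sketch rather than a solution to Exercise 3: the function name `bigram_counts` is invented for illustration, sentences are split on periods so that no bigram spans a sentence boundary, and a `BTreeMap` is used instead of the `HashMap` in the exercise skeleton only so that the printed order is stable.

```rust
use std::collections::BTreeMap;

/// Counts how often each word is followed by each other word, sentence by
/// sentence. (Hypothetical helper; tokenization is just whitespace splitting.)
fn bigram_counts(corpus: &str) -> BTreeMap<String, BTreeMap<String, usize>> {
    let mut counts: BTreeMap<String, BTreeMap<String, usize>> = BTreeMap::new();
    for sentence in corpus.split('.') {
        let words: Vec<&str> = sentence.split_whitespace().collect();
        // windows(2) yields every adjacent pair; it yields nothing for
        // sentences with fewer than two words.
        for pair in words.windows(2) {
            *counts
                .entry(pair[0].to_string())
                .or_default()
                .entry(pair[1].to_string())
                .or_default() += 1;
        }
    }
    counts
}

fn main() {
    let counts = bigram_counts("the cat sat on the mat. the cat sat on the hat.");
    for (word, successors) in &counts {
        let total: usize = successors.values().sum();
        for (next, count) in successors {
            let prob = *count as f64 / total as f64;
            println!("{} -> {}: {}/{} = {:.2}", word, next, count, total, prob);
        }
    }
}
```

Words such as "mat" and "hat" never appear as keys in the outer map, which is how a generator built on this table can detect that it has reached a dead end.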
Why the text sounds strange. Short stretches of output are surprisingly readable: every adjacent pair the model produces appeared in the training data, so each local transition is both grammatically and semantically plausible. The problem surfaces over longer spans. Because the model has no memory beyond the current word, it cannot maintain a topic, complete a thought, or avoid contradictions introduced a few sentences back. The result resembles text written by someone who half-knows the language: each individual step looks right, but the destination keeps shifting without purpose.
Order-n chains. A bigram model is an order-1 Markov chain — the next state depends on exactly one previous word. An order-2 model (a trigram model) conditions on the last two words; order n uses the last n words as context. Increasing n brings sharply improved local coherence — at n = 3 or 4 the output begins to reproduce full phrases from the corpus verbatim — but at the cost of novelty. A high-order model has likely seen most of its context only once, so it often has no real choice but to copy the source text rather than recombine it. The right tradeoff depends on corpus size: larger corpora can support higher n without the model simply memorising its training data.
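To make the idea of an order-n context concrete, the sketch below keys the transition table on the last `order` words instead of a single word. It is an illustration under assumptions, not the Exercise 4 solution: `context_counts` and the `ContextCounts` alias are hypothetical names, sentences are split on periods, and `BTreeMap` is used only for stable printing.

```rust
use std::collections::BTreeMap;

/// Maps each context (a run of `order` words) to counts of the word that followed it.
type ContextCounts = BTreeMap<Vec<String>, BTreeMap<String, usize>>;

/// Builds the order-`order` transition counts for a corpus.
/// (Hypothetical helper; tokenization is just whitespace splitting.)
fn context_counts(corpus: &str, order: usize) -> ContextCounts {
    let mut counts = ContextCounts::new();
    for sentence in corpus.split('.') {
        let words: Vec<&str> = sentence.split_whitespace().collect();
        // Each window holds `order` context words plus the word that follows.
        for window in words.windows(order + 1) {
            let context: Vec<String> =
                window[..order].iter().map(|w| w.to_string()).collect();
            let next = window[order].to_string();
            *counts.entry(context).or_default().entry(next).or_default() += 1;
        }
    }
    counts
}

fn main() {
    let corpus = "the cat sat on the mat. the cat sat on the hat.";
    for order in 1..=3 {
        let counts = context_counts(corpus, order);
        println!("order {}: {} distinct contexts", order, counts.len());
        for (context, successors) in &counts {
            println!("  {:?} -> {:?}", context, successors);
        }
    }
}
```

On this tiny corpus, "the" has three possible successors at order 1, while at order 2 the only context with any choice is ["on", "the"], which has two. On a larger corpus the same effect is what pushes high-order models toward copying their source.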
7. Exercise 3 — Bigram Text Generator
Goal: Build a Markov chain over words from an input text. Each state is a single word; transitions are learned from the corpus. Generate novel sentences of a given length from a chosen seed word.
// Assumes the `rand` crate is a dependency.
use rand::Rng;
use std::collections::HashMap;

struct BigramModel {
    /// transitions[word] = list of (next_word, weight) pairs
    transitions: HashMap<String, Vec<(String, usize)>>,
}

impl BigramModel {
    fn train(corpus: &str) -> Self { todo!() }
    fn generate(&self, seed: &str, length: usize, rng: &mut impl Rng) -> Vec<String> { todo!() }
}
🚧 This section is a stub; see nbd ticket 74be50.
8. Exercise 4 — N-gram Generalization
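Goal: Generalize the bigram model to an n-gram model where each state is a window of n consecutive words, so that n = 1 reproduces the bigram model from Exercise 3 and n plays the role of the chain's order from Section 6. Compare the output quality for n = 1, 2, 3, and 4 on the same corpus.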
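// Assumes the `rand` crate is a dependency.
use rand::Rng;
use std::collections::HashMap;

struct NgramModel {
    /// Number of words in each state (the context window).
    n: usize,
    /// transitions[context] = list of (next_word, weight) pairs
    transitions: HashMap<Vec<String>, Vec<(String, usize)>>,
}

impl NgramModel {
    fn train(corpus: &str, n: usize) -> Self { todo!() }
    fn generate(&self, seed: Vec<String>, length: usize, rng: &mut impl Rng) -> Vec<String> { todo!() }
}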
🚧 This section is a stub; see nbd ticket 1f995a.
Part 4 — Deeper Concepts
9. Stationary Distributions
A stationary distribution π is a probability distribution over states that is unchanged by one step of the chain: π P = π. This section covers how to find stationary distributions analytically (for small chains) and via power iteration, and explains when they exist and are unique — introducing the concepts of irreducibility and aperiodicity.
🚧 This section is a stub; see nbd ticket 68ee16.
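As a preview of the power-iteration approach, the sketch below repeatedly applies π ← π · P until the distribution stops changing. It is a minimal illustration under assumptions: the function name `power_iteration`, its `tol` and `max_iters` parameters, and the stopping rule are choices made here, and the method is only guaranteed to settle on a unique answer when the chain is irreducible and aperiodic, as the two-state weather chain from Section 3 is.

```rust
/// Repeatedly applies pi <- pi * P until no coordinate changes by more than
/// `tol`, or until `max_iters` steps have run.
/// (Hypothetical helper; assumes `p` is a square row-stochastic matrix.)
fn power_iteration(p: &[Vec<f64>], start: Vec<f64>, tol: f64, max_iters: usize) -> Vec<f64> {
    let n = p.len();
    let mut pi = start;
    for _ in 0..max_iters {
        let mut next = vec![0.0; n];
        for i in 0..n {
            for j in 0..n {
                next[j] += pi[i] * p[i][j];
            }
        }
        // Largest coordinate-wise change between successive distributions.
        let change = pi
            .iter()
            .zip(&next)
            .map(|(a, b)| (a - b).abs())
            .fold(0.0, f64::max);
        pi = next;
        if change < tol {
            break;
        }
    }
    pi
}

fn main() {
    let p = vec![vec![0.8, 0.2], vec![0.4, 0.6]];
    let from_sunny = power_iteration(&p, vec![1.0, 0.0], 1e-12, 1000);
    let from_rainy = power_iteration(&p, vec![0.0, 1.0], 1e-12, 1000);
    println!("from Sunny: [{:.4}, {:.4}]", from_sunny[0], from_sunny[1]);
    println!("from Rainy: [{:.4}, {:.4}]", from_rainy[0], from_rainy[1]);
}
```

For the weather chain both starting points approach [2/3, 1/3], which can be checked by hand: π P = π gives π₀ = 0.8 π₀ + 0.4 π₁, so π₀ = 2 π₁, and the entries must sum to 1.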
10. Applications and Further Reading
Markov chains appear throughout computer science and mathematics: PageRank, MCMC sampling, hidden Markov models, reinforcement learning, and more. This section surveys these applications at a high level and points to books, papers, and courses for learners who want to go deeper on any thread.
🚧 This section is a stub; see nbd ticket 5994a6.