docs(edu): write markov §10 applications and further reading [5994a6]

Survey PageRank, MCMC, HMMs, RL, bioinformatics, queueing theory, and language models; add annotated reading list and Rust ecosystem pointers.
5 months ago · 0f772edb7a
parent 401ad5f750
commit 0f772edb7a
2 changed files with 51 additions and 1 deletions
--- a/edu/.nbd/tickets/5994a6.md
+++ b/edu/.nbd/tickets/5994a6.md
@ -0,0 +1,8 @@
 +++
 title = "Markov lesson: Applications and Further Reading"
 priority = 7
 status = "done"
 ticket_type = "task"
 dependencies = []
 +++
 Write Section 10 of edu/markov.md: 'Applications and Further Reading'\n\nLearning objectives:\n- Survey real-world applications: PageRank, MCMC, HMMs, RL, bioinformatics\n- Give a 2–3 sentence description of each application and how Markov chains appear\n- Point to concrete next steps for learners who want to go deeper\n\nContent to produce:\n- Application survey (5–7 topics, each 2–3 sentences)\n- Annotated reading list:\n  * 'Introduction to Probability' (Blitzstein & Hwang) — free PDF\n  * 'Markov Chains' (Norris) — rigorous treatment\n  * Sutton & Barto 'Reinforcement Learning' — RL connection\n  * The Metropolis-Hastings algorithm article on Wikipedia\n- Rust ecosystem pointers (crates for probabilistic modelling)\n\nTarget: replace the stub in edu/markov.md §10
--- a/edu/markov.md
+++ b/edu/markov.md
@ -416,4 +416,46 @@ A **stationary distribution** *π* is a probability distribution over states tha
 Markov chains appear throughout computer science and mathematics: PageRank, MCMC sampling, hidden Markov models, reinforcement learning, and more. This section surveys these applications at a high level and points to books, papers, and courses for learners who want to go deeper on any thread.
-> 🚧 This section is a stub — see nbd ticket `5994a6`
+#### Application Survey
 **PageRank.** Google's original ranking algorithm modeled the web as a Markov chain: each page is a state, and a random surfer on a page follows each of its outgoing hyperlinks with equal probability, namely 1 divided by the number of outgoing links on that page. A small "teleportation" probability was added so the surfer occasionally jumps to a uniformly random page rather than following a link, ensuring the chain is irreducible and aperiodic and therefore has a unique stationary distribution. The stationary probability of each page (the fraction of time a random surfer spends there in the long run) becomes its rank. Because the web had billions of pages, computing the stationary distribution via power iteration rather than direct matrix inversion was essential; the same convergence guarantee from Section 9 applies at planetary scale.
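The random-surfer model can be sketched in a few lines of Rust. This is a minimal illustration, not Google's implementation: the function name `pagerank`, the adjacency-list layout `links`, and the parameter `damping` (the probability of following a link; 0.85 is a commonly quoted value) are all assumptions made for this example, as is the choice to treat a page with no outgoing links as if it linked to every page.

```rust
/// Power iteration for PageRank on a small link graph (illustrative sketch).
/// `links[i]` lists the pages that page `i` links to.
/// `damping` is the probability of following a link instead of teleporting.
fn pagerank(links: &[Vec<usize>], damping: f64, iterations: usize) -> Vec<f64> {
    let n = links.len();
    // Start from the uniform distribution over pages.
    let mut rank = vec![1.0 / n as f64; n];
    for _ in 0..iterations {
        // Teleportation mass is spread uniformly over all pages.
        let mut next = vec![(1.0 - damping) / n as f64; n];
        for (src, outs) in links.iter().enumerate() {
            if outs.is_empty() {
                // Assumption: a dangling page behaves as if it linked everywhere.
                let share = damping * rank[src] / n as f64;
                for r in next.iter_mut() {
                    *r += share;
                }
            } else {
                // Each outgoing link gets probability 1 / (number of out-links).
                let share = damping * rank[src] / outs.len() as f64;
                for &dst in outs {
                    next[dst] += share;
                }
            }
        }
        rank = next;
    }
    rank
}
```

Calling `pagerank(&[vec![1, 2], vec![2], vec![0]], 0.85, 100)` on a three-page toy graph returns a vector that sums to 1; page 2, which receives links from both other pages, ends up with the highest rank.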
 **Markov Chain Monte Carlo (MCMC).** Bayesian inference often requires integrating over high-dimensional parameter spaces where the posterior distribution has no closed form. MCMC methods solve this by constructing a Markov chain whose stationary distribution *is* the target posterior, then running the chain long enough that its samples approximate draws from that distribution. The **Metropolis-Hastings** algorithm is the foundational recipe: propose a move to a new state, accept it with a probability chosen to preserve detailed balance, and otherwise stay put, recording the current state again as the next sample. Variants such as Gibbs sampling, Hamiltonian Monte Carlo, and the No-U-Turn Sampler (NUTS) power nearly every modern probabilistic programming framework, from Stan to PyMC.
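To make the propose/accept/reject loop concrete, here is a small sketch of the Metropolis special case (a symmetric proposal) on a discrete target. Everything named here is hypothetical: `metropolis`, the unnormalised `weights`, and the tiny `XorShift` generator, which merely stands in for the `rand` crate so the example has no dependencies. The seed must be non-zero and the weights are assumed to be strictly positive.

```rust
/// Tiny xorshift64 generator; a stand-in so the sketch needs no external crate.
struct XorShift(u64);

impl XorShift {
    /// Returns a float in [0, 1). The seed must be non-zero.
    fn next_f64(&mut self) -> f64 {
        self.0 ^= self.0 << 13;
        self.0 ^= self.0 >> 7;
        self.0 ^= self.0 << 17;
        // Keep the top 53 bits so the result fits an f64 mantissa exactly.
        (self.0 >> 11) as f64 / (1u64 << 53) as f64
    }
}

/// Samples states 0..n with probability proportional to `weights`
/// and returns the observed visit frequencies.
fn metropolis(weights: &[f64], steps: usize, seed: u64) -> Vec<f64> {
    let n = weights.len();
    let mut rng = XorShift(seed);
    let mut state = 0usize;
    let mut visits = vec![0usize; n];
    for _ in 0..steps {
        // Symmetric proposal: one step left or right on a ring of states.
        let proposal = if rng.next_f64() < 0.5 {
            (state + 1) % n
        } else {
            (state + n - 1) % n
        };
        // Because the proposal is symmetric, the acceptance probability
        // reduces to min(1, target(proposal) / target(state)).
        let accept = (weights[proposal] / weights[state]).min(1.0);
        if rng.next_f64() < accept {
            state = proposal;
        }
        // A rejected move still counts: the current state is recorded again.
        visits[state] += 1;
    }
    visits.iter().map(|&v| v as f64 / steps as f64).collect()
}
```

With weights `[1.0, 2.0, 3.0, 4.0]` and a few hundred thousand steps, the returned frequencies should settle near 0.1, 0.2, 0.3, and 0.4, even though the normalising constant was never computed.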
 **Hidden Markov Models (HMMs).** An HMM separates a Markov chain into two layers: a hidden state sequence that evolves according to a transition matrix, and an observation sequence where each hidden state emits an observable symbol with some probability. The key insight is that the true states are never directly seen; only the observations are. HMMs were the dominant approach in speech recognition for decades (phonemes as hidden states, acoustic features as observations) and remain central in bioinformatics for gene prediction and sequence segmentation. The Viterbi algorithm finds the most likely hidden state path for a given observation sequence in time linear in the sequence length and quadratic in the number of hidden states; the Baum-Welch algorithm trains HMMs from unlabelled data using expectation-maximisation.
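A compact sketch of Viterbi decoding follows. The function name `viterbi` and the parameter layout (`init`, `trans`, `emit` as dense probability tables, observations as symbol indices) are assumptions chosen for this example, and the code assumes at least one hidden state. It works in log space, a common precaution against underflow on long sequences.

```rust
/// Most likely hidden-state path for `obs` (illustrative sketch).
/// init[s]     = P(first state is s)
/// trans[r][s] = P(next state is s | current state is r)
/// emit[s][o]  = P(observing symbol o | state is s)
fn viterbi(init: &[f64], trans: &[Vec<f64>], emit: &[Vec<f64>], obs: &[usize]) -> Vec<usize> {
    let n = init.len();
    if obs.is_empty() {
        return Vec::new();
    }
    // score[s] = log-probability of the best path ending in state s.
    let mut score: Vec<f64> = (0..n)
        .map(|s| init[s].ln() + emit[s][obs[0]].ln())
        .collect();
    // back[t][s] = best predecessor of state s at step t + 1.
    let mut back: Vec<Vec<usize>> = Vec::new();
    for &o in &obs[1..] {
        let mut next = vec![f64::NEG_INFINITY; n];
        let mut ptr = vec![0usize; n];
        for s in 0..n {
            for r in 0..n {
                let cand = score[r] + trans[r][s].ln();
                if cand > next[s] {
                    next[s] = cand;
                    ptr[s] = r;
                }
            }
            next[s] += emit[s][o].ln();
        }
        score = next;
        back.push(ptr);
    }
    // Pick the best final state, then follow the back-pointers.
    let mut state = (0..n)
        .max_by(|&a, &b| score[a].partial_cmp(&score[b]).unwrap())
        .unwrap();
    let mut path = vec![state];
    for ptr in back.iter().rev() {
        state = ptr[state];
        path.push(state);
    }
    path.reverse();
    path
}
```

The two nested loops over states inside the loop over observations are where the linear-in-length, quadratic-in-states running time comes from.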
 **Reinforcement Learning.** Most reinforcement learning problems are formulated as **Markov Decision Processes** (MDPs), which augment a Markov chain with a set of actions and a scalar reward signal. At each step the agent chooses an action, the environment transitions to a new state according to a transition distribution that depends on both the current state and the chosen action, and the agent receives a reward. The goal is to find a policy (a rule mapping states to actions) that maximises cumulative discounted reward. Because the next state depends only on the current state and action, not on the history, the Markov property is what makes algorithms such as Q-learning (a value-function method) and policy-gradient methods tractable. Sutton and Barto's textbook (listed below) treats this connection in depth.
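When the transition distribution and rewards are known, the optimal state values can be computed by value iteration, one of the dynamic-programming methods for MDPs. The sketch below is illustrative only: the `Mdp` type alias, the tuple layout `(probability, next_state, reward)`, and the function name `value_iteration` are assumptions for this example, and every state is assumed to have at least one action.

```rust
/// mdp[s][a] lists the possible outcomes of taking action `a` in state `s`,
/// each as (probability, next_state, reward). Layout chosen for this sketch.
type Mdp = Vec<Vec<Vec<(f64, usize, f64)>>>;

/// Repeatedly applies the Bellman optimality update
/// V(s) <- max_a sum_{s'} p(s' | s, a) * (r + gamma * V(s')).
fn value_iteration(mdp: &Mdp, gamma: f64, sweeps: usize) -> Vec<f64> {
    let mut v = vec![0.0; mdp.len()];
    for _ in 0..sweeps {
        let next: Vec<f64> = mdp
            .iter()
            .map(|actions| {
                actions
                    .iter()
                    .map(|outcomes| {
                        // Expected return of one action under the current estimate.
                        outcomes
                            .iter()
                            .map(|&(p, s2, r)| p * (r + gamma * v[s2]))
                            .sum::<f64>()
                    })
                    // Best action value for this state.
                    .fold(f64::NEG_INFINITY, f64::max)
            })
            .collect();
        v = next;
    }
    v
}
```

The update only ever looks at `v[s2]` for the immediate successor state, never at how the agent arrived there, which is the Markov property at work.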
 **Bioinformatics: Sequence Analysis.** DNA and protein sequences are naturally modelled as Markov chains over an alphabet of bases or amino acids. A simple *k*-th order Markov model assigns probabilities to short subsequences and can help distinguish coding from non-coding regions or flag stretches with unusual composition: CpG islands in mammalian genomes, for example, have transition probabilities measurably different from the genomic background. Profile HMMs generalise this to align whole families of sequences: each column in a multiple sequence alignment becomes a hidden state with its own emission distribution, allowing robust database search even for distantly related proteins.
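One way to turn "measurably different transition probabilities" into a classifier is a log-odds score between two first-order models. The sketch below estimates each model from a training string and scores a query against both. The names `train` and `log_odds`, the add-one smoothing, and the decision to skip characters other than A, C, G, and T (which treats the bases on either side of a skipped character as adjacent) are simplifying assumptions for this example, not a description of any particular tool.

```rust
/// Maps a base to a row/column index; other characters are skipped.
fn base_index(b: u8) -> Option<usize> {
    match b {
        b'A' => Some(0),
        b'C' => Some(1),
        b'G' => Some(2),
        b'T' => Some(3),
        _ => None,
    }
}

/// First-order transition matrix estimated from `seq`.
/// Counts start at 1 (add-one smoothing) so no probability is zero.
fn train(seq: &str) -> [[f64; 4]; 4] {
    let mut counts = [[1.0f64; 4]; 4];
    let idx: Vec<usize> = seq.bytes().filter_map(base_index).collect();
    for w in idx.windows(2) {
        counts[w[0]][w[1]] += 1.0;
    }
    // Normalise each row into a probability distribution.
    for row in counts.iter_mut() {
        let total: f64 = row.iter().sum();
        for p in row.iter_mut() {
            *p /= total;
        }
    }
    counts
}

/// Sum of log(P_model / P_background) over consecutive base pairs.
/// A positive score means `seq` looks more like `model` than `background`.
fn log_odds(seq: &str, model: &[[f64; 4]; 4], background: &[[f64; 4]; 4]) -> f64 {
    let idx: Vec<usize> = seq.bytes().filter_map(base_index).collect();
    idx.windows(2)
        .map(|w| (model[w[0]][w[1]] / background[w[0]][w[1]]).ln())
        .sum()
}
```

Training one model on a made-up CG-rich string such as `"CGCGCGCGCG"` and another on `"ATATATATAT"` gives a positive score for the query `"CGCG"` and a negative one for `"ATAT"`; real use would train on annotated genomic regions instead.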
 **Queueing Theory.** The classic M/M/1 queue (arrivals according to a Poisson process, exponential service times, a single server) is a continuous-time Markov chain on the non-negative integers, where the state is the number of customers in the system. Provided the arrival rate is strictly below the service rate, the chain has a stationary distribution, and it is geometric, giving simple closed-form expressions for average queue length and waiting time. More complex queueing networks (multiple servers, priorities, finite buffers) extend the same framework and are used to size data-centre infrastructure, analyse hospital emergency departments, and design network switches. Continuous-time Markov chains replace the transition *matrix* with a **generator matrix** *Q* whose off-diagonal entries are transition *rates* rather than probabilities.
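The closed-form M/M/1 results are short enough to write out directly. In the sketch below, the struct `Mm1`, its field names, and the functions `mm1` and `mm1_stationary` are hypothetical names for this example. With utilisation ρ = λ/μ, the stationary probability of having *n* customers in the system is (1 − ρ)ρⁿ, the mean number in the system is ρ/(1 − ρ), and the mean time in the system is 1/(μ − λ).

```rust
/// Steady-state summary of an M/M/1 queue (names are illustrative).
struct Mm1 {
    /// rho = arrival_rate / service_rate
    utilisation: f64,
    /// Mean number of customers in the system: rho / (1 - rho)
    mean_in_system: f64,
    /// Mean time a customer spends in the system: 1 / (mu - lambda)
    mean_time_in_system: f64,
}

/// Returns None when the queue is unstable (arrival rate >= service rate)
/// or the arrival rate is not positive, since no stationary distribution
/// of this form exists in that case.
fn mm1(arrival_rate: f64, service_rate: f64) -> Option<Mm1> {
    if !(arrival_rate > 0.0 && arrival_rate < service_rate) {
        return None;
    }
    let rho = arrival_rate / service_rate;
    Some(Mm1 {
        utilisation: rho,
        mean_in_system: rho / (1.0 - rho),
        mean_time_in_system: 1.0 / (service_rate - arrival_rate),
    })
}

/// Stationary probability of exactly `n` customers: (1 - rho) * rho^n.
fn mm1_stationary(rho: f64, n: u32) -> f64 {
    (1.0 - rho) * rho.powi(n as i32)
}
```

For an arrival rate of 2 and a service rate of 4, the utilisation is 0.5, the mean number in the system is 1, and the mean time in the system is 0.5, which agrees with Little's law (mean number = arrival rate × mean time).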
 **Language Models.** The n-gram models from Exercises 3 and 4 are finite-order Markov chains over word tokens, and they directly preceded modern neural language models. In the 1990s and 2000s, trigram and 4-gram models with smoothing (Kneser-Ney, Witten-Bell) were the state of the art for machine translation and speech recognition. Neural language models replaced explicit Markov structure with learned representations, but the conceptual scaffolding is the same: predict the next token from a bounded window of context. Understanding the Markov chain view of language — its local coherence, its lack of long-range memory, its reliance on corpus statistics — clarifies both what earlier systems got right and what neural approaches had to learn to transcend.
 ---
 #### Further Reading
 **Books**
 - **Blitzstein & Hwang — *Introduction to Probability* (2nd ed.)** A beautifully written undergraduate probability text with a full chapter on Markov chains. The authors' lecture videos and course materials are freely available; a free PDF of the book is offered on the book's companion site. Start here if your probability background is thin or if you want every concept illustrated with concrete examples before the formalism arrives.
 - **Norris — *Markov Chains*** The standard rigorous treatment at the advanced-undergraduate / early-graduate level. Covers discrete- and continuous-time chains, convergence, reversibility, and applications with full proofs. Dense but thorough; worth working through if you intend to read research papers that use Markov chains as a theoretical tool.
 - **Sutton & Barto — *Reinforcement Learning: An Introduction* (2nd ed.)** The canonical RL textbook, freely available as a PDF from the authors. Chapters 3 and 4 formalise MDPs and dynamic programming using exactly the Markov chain machinery developed in this course. Reading those chapters after completing this course is a natural next step toward understanding how modern game-playing and robotics agents are designed.
 **Articles**
 - **Metropolis-Hastings algorithm (Wikipedia).** A well-maintained article that covers the algorithm statement, acceptance-ratio derivation, intuition for why detailed balance implies the correct stationary distribution, and pseudocode. A good companion to the original 1953 Metropolis *et al.* paper (only a few pages long) and the 1970 Hastings generalisation.
 ---
 #### Rust Ecosystem Pointers
 The exercises in this course used `rand` for sampling. A few other crates are useful as you build more serious probabilistic programs in Rust:
 - **`nalgebra`** — a comprehensive linear-algebra library covering vectors, dense matrices, and decompositions. Use it to compute *P*^*k* via repeated matrix multiplication or to solve the stationary-distribution equations *π P = π*, *Σ πᵢ = 1* as a linear system.
 - **`petgraph`** — graph data structures and algorithms. Markov chains are directed weighted graphs, and `petgraph` lets you represent them as such, run graph-theoretic algorithms (strongly connected components give you irreducible sub-chains), and visualise the structure via its Graphviz export.
 - **`statrs`**: a statistics library providing common probability distributions with their PDFs, CDFs, and samplers. Useful when building emission distributions for HMMs or when you need the chi-squared distribution for a goodness-of-fit check of whether simulated chain frequencies match theoretical stationary probabilities.