docs(edu): outline ML self-play chapter and create section tickets [edu-coqp]

Add edu/src/ml-self-play.md with 14 stubbed sections across 5 parts: foundations (RL, MCTS, self-play), game engine (Tic-Tac-Toe), MCTS implementation, neural network integration, and the full self-play loop. Create one beans ticket per section (edu-wobk through edu-brtk).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

main
parent 5a1fb26927
commit 818444962c
@@ -0,0 +1,130 @@
# Machine Learning: Training a Game AI Through Self-Play

This is a self-guided course on reinforcement learning through the lens of a concrete goal: training a Rust program to play Tic-Tac-Toe by playing against itself, in the style of AlphaGo Zero. No prior ML experience is assumed. You will build everything from scratch: a game engine, a search algorithm, and eventually a neural network that guides that search. Sections marked 🚧 are stubs whose full content is tracked in a beans ticket.

---
## Table of Contents

**Part 1 — Foundations**

1. [What is reinforcement learning?](#1-what-is-reinforcement-learning)
2. [Monte Carlo Tree Search — algorithm explained](#2-monte-carlo-tree-search--algorithm-explained)
3. [Why self-play? The AlphaGo Zero insight](#3-why-self-play-the-alphago-zero-insight)

**Part 2 — The Game**

4. [Choosing a simple game: Tic-Tac-Toe](#4-choosing-a-simple-game-tic-tac-toe)
5. [Representing game state in Rust](#5-representing-game-state-in-rust)
6. [Exercise 1: implement the game logic](#6-exercise-1-implement-the-game-logic)

**Part 3 — MCTS**

7. [Implementing MCTS in Rust](#7-implementing-mcts-in-rust)
8. [Exercise 2: play Tic-Tac-Toe with pure MCTS](#8-exercise-2-play-tic-tac-toe-with-pure-mcts)

**Part 4 — Neural Network Policy/Value Head**

9. [Neural network architecture overview](#9-neural-network-architecture-overview)
10. [Integrating a neural network crate](#10-integrating-a-neural-network-crate)
11. [Exercise 3: train the network on MCTS data](#11-exercise-3-train-the-network-on-mcts-data)
12. [Exercise 4: replace rollout with the value network](#12-exercise-4-replace-rollout-with-the-value-network)

**Part 5 — Self-Play Loop**

13. [The full AlphaGo Zero training loop](#13-the-full-alphago-zero-training-loop)
14. [Exercise 5: 1000 self-play games; observe improvement](#14-exercise-5-1000-self-play-games-observe-improvement)

---
## Part 1 — Foundations

### 1. What is reinforcement learning?

🚧 This section is a stub. Full content tracked in [edu-wobk].

---
### 2. Monte Carlo Tree Search — algorithm explained

🚧 This section is a stub. Full content tracked in [edu-3yw9].

---
### 3. Why self-play? The AlphaGo Zero insight

🚧 This section is a stub. Full content tracked in [edu-5go8].

---
## Part 2 — The Game

### 4. Choosing a simple game: Tic-Tac-Toe

🚧 This section is a stub. Full content tracked in [edu-k3tq].

---
### 5. Representing game state in Rust

🚧 This section is a stub. Full content tracked in [edu-e39n].

---
### 6. Exercise 1: implement the game logic

🚧 This section is a stub. Full content tracked in [edu-ymux].

---
## Part 3 — MCTS

### 7. Implementing MCTS in Rust

🚧 This section is a stub. Full content tracked in [edu-of9y].

---
### 8. Exercise 2: play Tic-Tac-Toe with pure MCTS

🚧 This section is a stub. Full content tracked in [edu-4v13].

---
## Part 4 — Neural Network Policy/Value Head

### 9. Neural network architecture overview

🚧 This section is a stub. Full content tracked in [edu-iv0k].

---
### 10. Integrating a neural network crate

🚧 This section is a stub. Full content tracked in [edu-pvou].

---
### 11. Exercise 3: train the network on MCTS data

🚧 This section is a stub. Full content tracked in [edu-lqky].

---
### 12. Exercise 4: replace rollout with the value network

🚧 This section is a stub. Full content tracked in [edu-7lu6].

---
## Part 5 — Self-Play Loop

### 13. The full AlphaGo Zero training loop

🚧 This section is a stub. Full content tracked in [edu-453h].

---
### 14. Exercise 5: 1000 self-play games; observe improvement

🚧 This section is a stub. Full content tracked in [edu-brtk].