From 955cf029ab0fbed4fc8a0693c07b381e7353561a Mon Sep 17 00:00:00 2001 From: Elijah Voigt Date: Fri, 27 Feb 2026 20:44:17 -0800 Subject: [PATCH] docs(edu): add lisp-to-C compiler course with stubs and tickets [67e284] 18-section interactive course teaching compiler construction in Rust using nom. Covers MiniLisp parsing, AST design, semantic analysis, and C code generation. All sections stubbed; one nbd task ticket per section plus a project ticket (67e284) tracking completion. Co-Authored-By: Claude Sonnet 4.6 --- PROJECTS.md | 9 -- edu/.nbd/tickets/081a55.md | 74 ++++++++++ edu/.nbd/tickets/1d16da.md | 80 ++++++++++ edu/.nbd/tickets/1eb794.md | 155 ++++++++++++++++++++ edu/.nbd/tickets/1ef9f4.md | 80 ++++++++++ edu/.nbd/tickets/21d9be.md | 18 +++ edu/.nbd/tickets/37cdd5.md | 21 +++ edu/.nbd/tickets/3aeb62.md | 101 +++++++++++++ edu/.nbd/tickets/3dc36b.md | 131 +++++++++++++++++ edu/.nbd/tickets/3e1250.md | 125 ++++++++++++++++ edu/.nbd/tickets/4c961f.md | 63 ++++++++ edu/.nbd/tickets/5674ce.md | 65 ++++++++ edu/.nbd/tickets/5835e9.md | 166 +++++++++++++++++++++ edu/.nbd/tickets/584e0c.md | 37 +++++ edu/.nbd/tickets/58b37a.md | 162 ++++++++++++++++++++ edu/.nbd/tickets/5ed295.md | 90 ++++++++++++ edu/.nbd/tickets/67e284.md | 54 +++++++ edu/.nbd/tickets/685f5e.md | 166 +++++++++++++++++++++ edu/.nbd/tickets/6d40a7.md | 123 ++++++++++++++++ edu/.nbd/tickets/6ec5ff.md | 61 ++++++++ edu/.nbd/tickets/8fa47a.md | 176 ++++++++++++++++++++++ edu/.nbd/tickets/99e1d9.md | 37 +++++ edu/.nbd/tickets/a1a827.md | 151 +++++++++++++++++++ edu/.nbd/tickets/a4c9f8.md | 192 ++++++++++++++++++++++++ edu/.nbd/tickets/a93829.md | 125 ++++++++++++++++ edu/.nbd/tickets/b6c9ad.md | 137 +++++++++++++++++ edu/.nbd/tickets/b7c95f.md | 50 +++++++ edu/.nbd/tickets/cbc6e3.md | 157 ++++++++++++++++++++ edu/.nbd/tickets/d0b9f8.md | 154 +++++++++++++++++++ edu/.nbd/tickets/d9f850.md | 46 ++++++ edu/.nbd/tickets/de82f1.md | 157 ++++++++++++++++++++ edu/.nbd/tickets/e8be9a.md | 71 +++++++++ 
edu/.nbd/tickets/e8da8b.md | 77 ++++++++++ edu/TODO.md | 17 +++ edu/flake.lock | 29 +++- edu/flake.nix | 22 ++- edu/src/SUMMARY.md | 8 + edu/src/lisp-compiler.md | 194 ++++++++++++++++++++++++ edu/src/vector-db.md | 293 +++++++++++++++++++++++++++++++++++++ 39 files changed, 3859 insertions(+), 15 deletions(-) create mode 100644 edu/.nbd/tickets/081a55.md create mode 100644 edu/.nbd/tickets/1d16da.md create mode 100644 edu/.nbd/tickets/1eb794.md create mode 100644 edu/.nbd/tickets/1ef9f4.md create mode 100644 edu/.nbd/tickets/21d9be.md create mode 100644 edu/.nbd/tickets/37cdd5.md create mode 100644 edu/.nbd/tickets/3aeb62.md create mode 100644 edu/.nbd/tickets/3dc36b.md create mode 100644 edu/.nbd/tickets/3e1250.md create mode 100644 edu/.nbd/tickets/4c961f.md create mode 100644 edu/.nbd/tickets/5674ce.md create mode 100644 edu/.nbd/tickets/5835e9.md create mode 100644 edu/.nbd/tickets/584e0c.md create mode 100644 edu/.nbd/tickets/58b37a.md create mode 100644 edu/.nbd/tickets/5ed295.md create mode 100644 edu/.nbd/tickets/67e284.md create mode 100644 edu/.nbd/tickets/685f5e.md create mode 100644 edu/.nbd/tickets/6d40a7.md create mode 100644 edu/.nbd/tickets/6ec5ff.md create mode 100644 edu/.nbd/tickets/8fa47a.md create mode 100644 edu/.nbd/tickets/99e1d9.md create mode 100644 edu/.nbd/tickets/a1a827.md create mode 100644 edu/.nbd/tickets/a4c9f8.md create mode 100644 edu/.nbd/tickets/a93829.md create mode 100644 edu/.nbd/tickets/b6c9ad.md create mode 100644 edu/.nbd/tickets/b7c95f.md create mode 100644 edu/.nbd/tickets/cbc6e3.md create mode 100644 edu/.nbd/tickets/d0b9f8.md create mode 100644 edu/.nbd/tickets/d9f850.md create mode 100644 edu/.nbd/tickets/de82f1.md create mode 100644 edu/.nbd/tickets/e8be9a.md create mode 100644 edu/.nbd/tickets/e8da8b.md create mode 100644 edu/TODO.md create mode 100644 edu/src/lisp-compiler.md create mode 100644 edu/src/vector-db.md diff --git a/PROJECTS.md b/PROJECTS.md index 95d2cc2..4d654db 100644 --- a/PROJECTS.md +++ 
b/PROJECTS.md @@ -12,12 +12,3 @@ - [ ] Create an OpenTofu Claude skill - [ ] Create a Nix Flake skill - [ ] Local-first app development - -## Interactive Learning and Education -- [x] Git worktrees, how do use please -- [ ] How to structure a co-op profit sharing worker owned business -- [x] Hands-on: Markov Chains -- [ ] Hands-on: Vector Databases -- [ ] Hands-on: Creating and training a simple LLM -- [ ] Hands-on: Writing your own language (lisp, interpreted, compiled to C) -- [ ] Hands-on: Shader programming diff --git a/edu/.nbd/tickets/081a55.md b/edu/.nbd/tickets/081a55.md new file mode 100644 index 0000000..e7373c3 --- /dev/null +++ b/edu/.nbd/tickets/081a55.md @@ -0,0 +1,74 @@ ++++ +title = "§7 Exercise 1: Storing and Retrieving Vectors" +priority = 5 +status = "todo" +ticket_type = "task" +dependencies = [] ++++ +## §7 Exercise 1 — Storing and Retrieving Vectors — Stub to fill + +File: `edu/src/vector-db.md`, section `### 7. Exercise 1 — Storing and Retrieving Vectors` + +Replace this stub line with the full exercise: +> **Goal:** Insert a small set of labelled vectors [...] 🚧 Full content tracked in [nbd:081a55]. + +Follow the exercise format from `edu/src/markov.md`: Goal, Setup, Starter Code skeleton, numbered Steps, Reference Solution in `
<details><summary>Show full solution</summary>`. + +## Prerequisites (established in §6) + +Reader has the `vec-demo` project with `libsql = "0.9"` and `tokio`. The `main` function opens a local connection via `Builder::new_local("vectors.db").build().await?` and has already created the `items` table (`id INTEGER PRIMARY KEY, label TEXT NOT NULL, embedding F32_BLOB(3) NOT NULL`) and the HNSW index. + +## Goal + +Insert 6 labelled 3-dimensional vectors, then SELECT all rows and print each label alongside its deserialized `Vec<f32>`. + +## Vectors to use + +| id | label | embedding | +|---|---|---| +| 1 | "cat" | [0.9, 0.1, 0.2] | +| 2 | "dog" | [0.8, 0.2, 0.3] | +| 3 | "car" | [0.1, 0.9, 0.1] | +| 4 | "truck" | [0.2, 0.8, 0.2] | +| 5 | "python" | [0.15, 0.1, 0.95] | +| 6 | "rust" | [0.1, 0.05, 0.9] | + +Hand-crafted so animals cluster near [high, low, low], vehicles near [low, high, low], and programming languages near [low, low, high]. The §8 KNN exercise uses these clusters to verify correct nearest-neighbour results. + +## Steps to cover + +**Step 1 — Formatting a vector for INSERT.** Explain that `vector(?)` in SQL accepts a JSON array string. Show how to format a `Vec<f32>` in Rust: + +```rust +fn vec_to_json(v: &[f32]) -> String { + format!("[{}]", v.iter().map(|x| x.to_string()).collect::<Vec<String>>().join(",")) +} +``` + +**Step 2 — Inserting rows.** Use `INSERT OR IGNORE` so re-running the program is idempotent: + +```sql +INSERT OR IGNORE INTO items (id, label, embedding) VALUES (?, ?, vector(?)) +``` + +Loop over a `Vec<(i64, &str, Vec<f32>)>` and call `conn.execute` for each row, passing id, label, and the JSON string as parameters. + +**Step 3 — Selecting and deserialising.** Query all rows: + +```sql +SELECT id, label, vector_extract(embedding) FROM items ORDER BY id +``` + +`vector_extract` returns a JSON array string (e.g. `"[0.9,0.1,0.2]"`). Add `serde_json = "1"` to Cargo.toml and parse it: `serde_json::from_str::<Vec<f32>>(&json_str)?`. 
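The formatting in Step 1 and the parsing in Step 3 are two halves of a round trip, and it may help to see them together without a database. The sketch below is a dependency-free illustration only: `json_to_vec` is a hypothetical hand-rolled stand-in for the `serde_json` call the ticket asks for, and it handles just the flat numeric arrays used in this exercise, not general JSON.

```rust
/// Format a slice of f32 values as a JSON array string, e.g. "[0.9,0.1,0.2]".
fn vec_to_json(v: &[f32]) -> String {
    let parts: Vec<String> = v.iter().map(|x| x.to_string()).collect();
    format!("[{}]", parts.join(","))
}

/// Hypothetical stand-in for `serde_json::from_str::<Vec<f32>>`.
/// Only understands a flat array of numbers; returns None on anything else.
fn json_to_vec(s: &str) -> Option<Vec<f32>> {
    let inner = s.trim().strip_prefix('[')?.strip_suffix(']')?;
    if inner.trim().is_empty() {
        return Some(Vec::new());
    }
    // Collecting Option<f32> items into Option<Vec<f32>> stops at the first failure.
    inner
        .split(',')
        .map(|token| token.trim().parse::<f32>().ok())
        .collect()
}

fn main() {
    let embedding = vec![0.9_f32, 0.1, 0.2];
    let json = vec_to_json(&embedding);
    println!("{json}");

    // The string survives a round trip back to the original values.
    let back = json_to_vec(&json).expect("well-formed array");
    assert_eq!(back, embedding);
}
```

In the real exercise the parsing half would be replaced by `serde_json`, which also rejects malformed input with a descriptive error rather than `None`.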
+ +**Step 4 — Print results.** Format output as: + +``` +1 cat [0.9, 0.1, 0.2] +2 dog [0.8, 0.2, 0.3] +... +``` + +## Cargo.toml additions + +Add `serde_json = "1"` for JSON array parsing. \ No newline at end of file diff --git a/edu/.nbd/tickets/1d16da.md b/edu/.nbd/tickets/1d16da.md new file mode 100644 index 0000000..d4bb147 --- /dev/null +++ b/edu/.nbd/tickets/1d16da.md @@ -0,0 +1,80 @@ ++++ +title = "§18 What's Next: Extensions and Further Reading" +priority = 5 +status = "todo" +ticket_type = "task" +dependencies = [] ++++ + +## §18 What's Next: Extensions and Further Reading — Stub to fill + +File: `edu/src/lisp-compiler.md`, section `### 18. What's Next: Extensions and Further Reading` + +Replace the stub line with full content. Target 600–800 words. Survey the directions the compiler can be taken and provide a curated reading list. Reading-only, no code. + +## Learning objectives + +- Understand what limitations the current compiler has and why +- Know the conceptual approaches for each major extension +- Have a reading list for going deeper into compiler theory and Lisp implementation + +## Content to write + +### Congratulations — and what you skipped + +Open by acknowledging what the reader has built: a complete compiler with a lexer, parser, semantic analyser, code generator, and test suite. Then honestly catalog what was left out: + +### Extension 1: Closures and Lambda Lifting + +The current compiler does not support closures — lambdas cannot capture variables from enclosing functions. Adding closures requires **lambda lifting**: transforming each lambda that captures free variables into a top-level function that takes those variables as extra parameters. This is a classical technique. Real Lisp runtimes use **closure records** (a struct containing the function pointer and captured values) allocated on the heap. 
+ +### Extension 2: Tail-Call Optimisation (TCO) + +`(define (loop n) (if (= n 0) n (loop (- n 1))))` will overflow the stack for large `n` in the current compiler because each recursive call pushes a new C stack frame. TCO transforms tail calls into jumps. In C, this can be approximated with the GCC-specific `__attribute__((optimize("O2")))` function attribute, which enables but does not guarantee tail-call elimination, or by using a trampoline pattern. A proper solution requires detecting tail-call position during code generation and emitting a `goto` loop. + +### Extension 3: A Type System + +Add type inference (Hindley-Milner or a simpler bidirectional checker) so that type errors are caught before C is emitted. This would allow the code generator to choose the correct `ml_display_*` variant and generate proper C function signatures for string-returning functions. + +### Extension 4: Pairs, Lists, and a Runtime + +`(cons a b)`, `(car p)`, `(cdr p)` require heap-allocated pair objects — a proper C struct. This opens the door to proper Lisp list processing. Once you have heap allocation, you need a garbage collector. The simplest GC is reference counting; a more robust approach is mark-and-sweep. + +### Extension 5: Macros + +Lisp macros transform code before compilation. A simple approach is **syntax transformers**: functions that run at compile time and return transformed AST nodes. This requires a small interpreter for the macro language. Hygienic macros (as in Scheme) are significantly more complex. + +### Extension 6: A REPL + +A read-eval-print loop compiles and runs one expression at a time. This requires either an interpreter (easier) or incremental native code emission (harder). An interpreter over the AST is a natural extension once the parser is complete — it's essentially the code generator replaced with a recursive evaluator. + +### Extension 7: Self-Hosting + +The ultimate milestone: rewrite the MiniLisp compiler in MiniLisp itself. 
This requires the language to be expressive enough (strings, I/O, some form of list processing) and the compiler to be complete enough to compile itself. Self-hosting is the proof that you've really built something. + +### Further Reading + +**Compiler theory:** +- *Crafting Interpreters* by Robert Nystrom — free online; builds a language in two complete implementations (tree-walking and bytecode) +- *Modern Compiler Implementation in ML/Java/C* by Andrew Appel — classic academic compiler textbook +- *Engineering a Compiler* by Cooper & Torczon — comprehensive modern treatment + +**Lisp implementation:** +- *Structure and Interpretation of Computer Programs* (SICP) — chapters 4 and 5 cover interpreters and compilers for Scheme +- *Lisp in Small Pieces* by Christian Queinnec — 11 different Lisp implementations, from interpreter to compiler +- *Build Your Own Lisp* by Daniel Holden — free online, C implementation + +**Parsing:** +- *Parsing Techniques* by Grune & Jacobs — comprehensive reference (free PDF) +- nom documentation and recipes: https://github.com/rust-bakery/nom/tree/main/doc + +**Rust and compilers:** +- The `cranelift` crate — a code generator backend (Rust, used in Wasmtime) +- `inkwell` crate — safe Rust bindings to LLVM for native code generation + +## Style notes + +- Open warmly — the reader has accomplished something real +- Each extension should be a paragraph: what it is, why it is non-trivial, and the key technique +- The reading list is the most durable part of this section; keep it current and annotated +- Close the course with encouragement: the concepts learned here (parsing, AST manipulation, code generation) apply to every compiler, transpiler, and language tool the reader will ever build diff --git a/edu/.nbd/tickets/1eb794.md b/edu/.nbd/tickets/1eb794.md new file mode 100644 index 0000000..b988c50 --- /dev/null +++ b/edu/.nbd/tickets/1eb794.md @@ -0,0 +1,155 @@ ++++ +title = "§13 Generating C: Atoms and Expressions" +priority = 5 
+status = "todo" +ticket_type = "task" +dependencies = [] ++++ + +## §13 Generating C: Atoms and Expressions — Stub to fill + +File: `edu/src/lisp-compiler.md`, section `### 13. Generating C: Atoms and Expressions` + +Replace the stub line with full content. Target 700–900 words. Implement the expression code generator — the recursive function that turns any `Expr` into a C expression string. + +## Learning objectives + +- Implement `gen_expr` as a recursive function over `Expr` +- Know how each atom type maps to a C literal +- Understand how binary operator calls map to C infix expressions +- Handle `display`, `newline`, `error` as special cases in call generation +- Understand why all output is C *expressions* (not statements) at this level + +## Content to write + +### Expressions, not statements + +In C, everything that produces a value is an expression. At this stage, the code generator works entirely with expressions — `gen_expr` always returns a C expression string that can appear on the right-hand side of an assignment or as a function argument. Statement generation (for sequencing and side effects) comes in §15. + +### `gen_expr` — the core function + +```rust +/// Generate a C expression from a MiniLisp `Expr`. +/// +/// Returns a `String` containing valid C code that evaluates to the +/// same value as the original expression. +pub fn gen_expr(expr: &Expr) -> String { + match expr { + Expr::Int(n) => n.to_string(), + Expr::Bool(b) => if *b { "ML_TRUE".to_string() } else { "ML_FALSE".to_string() }, + Expr::Str(s) => format!("\"{}\"", s.escape_default()), + Expr::Symbol(name) => mangle(name), + Expr::If { cond, then, else_ } => + format!("({} ? {} : {})", gen_expr(cond), gen_expr(then), gen_expr(else_)), + Expr::Call { func, args } => gen_call(func, args), + // These should not appear at expression level — handled as statements in §15 + Expr::Begin(_) | Expr::Define { .. } | Expr::Lambda { .. } | Expr::Let { .. 
} => + panic!("gen_expr called on a statement-level form"), + } +} +``` + +Walk through each arm: + +**`Int(n)`** → decimal string. `42` → `"42"`, `-7` → `"-7"`. + +**`Bool(b)`** → `"ML_TRUE"` or `"ML_FALSE"` (the `#define`s from the preamble). + +**`Str(s)`** → a C string literal. Use Rust's `escape_default()` to re-escape the string, then wrap in double quotes. This safely handles embedded newlines and quotes. + +**`Symbol(name)`** → `mangle(name)`. A symbol in expression position is a variable reference; mangling produces the correct C identifier. + +**`If { cond, then, else_ }`** → C ternary: `(cond ? then : else)`. Parenthesised to avoid operator precedence issues. + +**`Call { func, args }`** → delegate to `gen_call`. + +### `gen_call` — operator and function calls + +```rust +fn gen_call(func: &Expr, args: &[Expr]) -> String { + // Built-in binary operators + if let Expr::Symbol(op) = func { + match op.as_str() { + "+" | "-" | "*" | "/" => { + let a = gen_expr(&args[0]); + let b = gen_expr(&args[1]); + return format!("({} {} {})", a, op, b); + } + "=" => return format!("({} == {})", gen_expr(&args[0]), gen_expr(&args[1])), + "<" | ">" | "<=" | ">=" => { + return format!("({} {} {})", gen_expr(&args[0]), op, gen_expr(&args[1])); + } + "not" => return format!("(!{})", gen_expr(&args[0])), + // display / newline / error are statement-level; handled in gen_stmt + "display" | "newline" | "error" => { + // When called in expression position (inside an if branch, etc.), + // emit as a comma expression: (side_effect, 0) + return format!("({}, 0)", gen_display_stmt(&args[0])); + } + _ => {} + } + } + // General function call + let func_c = gen_expr(func); + let args_c: Vec<String> = args.iter().map(gen_expr).collect(); + format!("{}({})", func_c, args_c.join(", ")) +} +``` + +Explain the "comma expression" trick for `display` in expression position: `(printf(...), 0)` is valid C — the comma operator evaluates both sides and returns the right-hand value (0 here, which acts as 
a placeholder integer). + +Note that the arity guarantees from §11 mean we can safely index `args[0]` and `args[1]` without bounds checking. + +### String escaping + +Show the `escape_for_c` helper that the string code path uses: + +```rust +fn escape_for_c(s: &str) -> String { + s.chars().flat_map(|c| match c { + '"' => vec!['\\', '"'], + '\\' => vec!['\\', '\\'], + '\n' => vec!['\\', 'n'], + '\t' => vec!['\\', 't'], + c => vec![c], + }).collect() +} +``` + +Use this instead of `escape_default()` which uses Rust escape syntax (`\u{...}`) that is not valid C. + +### Tests + +```rust +#[test] +fn test_gen_int() { + assert_eq!(gen_expr(&Expr::Int(42)), "42"); + assert_eq!(gen_expr(&Expr::Int(-7)), "-7"); +} + +#[test] +fn test_gen_add() { + let expr = Expr::Call { + func: Box::new(Expr::Symbol("+".into())), + args: vec![Expr::Int(1), Expr::Int(2)], + }; + assert_eq!(gen_expr(&expr), "(1 + 2)"); +} + +#[test] +fn test_gen_if() { + let expr = Expr::If { + cond: Box::new(Expr::Bool(true)), + then: Box::new(Expr::Int(1)), + else_: Box::new(Expr::Int(0)), + }; + assert_eq!(gen_expr(&expr), "(ML_TRUE ? 
1 : 0)"); +} +``` + +## Style notes + +- Emphasise "C expressions only" at the top — this is the key architectural decision for this section +- Walk through each operator conversion explicitly; readers need to see the `=` → `==` translation noted +- The comma-expression trick for `display` in expression position is an interesting C technique — explain it clearly +- Note that the `panic!` for statement-level forms is a programming error guard, not a user-facing error diff --git a/edu/.nbd/tickets/1ef9f4.md b/edu/.nbd/tickets/1ef9f4.md new file mode 100644 index 0000000..edb6308 --- /dev/null +++ b/edu/.nbd/tickets/1ef9f4.md @@ -0,0 +1,80 @@ ++++ +title = "§10 Exercise 3: Semantic Document Search" +priority = 5 +status = "todo" +ticket_type = "task" +dependencies = [] ++++ +## §10 Exercise 3 — Semantic Document Search — Stub to fill + +File: `edu/src/vector-db.md`, section `### 10. Exercise 3 — Semantic Document Search` + +Replace this stub line with the full exercise: +> **Goal:** Build a complete semantic search pipeline [...] 🚧 Full content tracked in [nbd:1ef9f4]. + +Follow the exercise format from `edu/src/markov.md`. This is the first exercise using real embeddings — it combines §6 (Turso setup), §8 (KNN search), and §9 (fastembed) into a complete pipeline. + +## Goal + +Embed a corpus of 15 short text passages with fastembed-rs, store the embeddings in Turso, then accept a natural-language query, embed it, and return the top-5 most semantically relevant passages — with no keyword matching. + +## Setup + +New project or extend vec-demo. 
Cargo.toml: +```toml +[dependencies] +libsql = "0.9" +fastembed = "4" +tokio = { version = "1", features = ["full"] } +``` + +Table schema uses `F32_BLOB(384)` (BGE-Small-EN-v1.5 output dimension): +```sql +CREATE TABLE IF NOT EXISTS docs ( + id INTEGER PRIMARY KEY, + passage TEXT NOT NULL, + embedding F32_BLOB(384) NOT NULL +) +``` + +## Corpus to use (15 passages across 3 topics) + +**Rust programming (5):** +- "Rust uses an ownership system to guarantee memory safety without a garbage collector." +- "The borrow checker enforces that references do not outlive the data they point to." +- "Cargo is Rust's build system and package manager, used to manage dependencies and run tests." +- "Rust's trait system enables zero-cost abstractions and compile-time polymorphism." +- "Async Rust uses futures and the tokio runtime to handle concurrent I/O efficiently." + +**Astronomy (5):** +- "A black hole is a region of spacetime where gravity is so strong that nothing can escape." +- "The Milky Way galaxy contains an estimated 100 to 400 billion stars." +- "Neutron stars are the collapsed cores of massive stars, with densities exceeding atomic nuclei." +- "The cosmic microwave background is the thermal radiation left over from the early universe." +- "Exoplanets are planets outside our solar system, detected via transit photometry or radial velocity." + +**Cooking (5):** +- "Maillard reaction gives browned foods their distinctive flavour through amino acid and sugar reactions." +- "Sous vide cooking involves sealing food in vacuum bags and cooking at precise low temperatures." +- "Emulsification combines two immiscible liquids, such as oil and water, using an emulsifier like lecithin." +- "Fermentation converts sugars to acids or alcohol using microorganisms, used in bread, beer, and yogurt." +- "Knife skills — julienne, brunoise, chiffonade — determine the surface area and cooking time of vegetables." 
+ +## Steps to cover + +**Step 1 — Embed the corpus.** Use fastembed to produce a `Vec<Vec<f32>>` for all 15 passages in one `model.embed()` call (batch is more efficient than one-at-a-time). + +**Step 2 — Insert into Turso.** Loop over passages and embeddings together. Format `Vec<f32>` as JSON string for `vector(?)`. Use `INSERT OR IGNORE` so re-running is idempotent. + +**Step 3 — Embed the query and search.** Embed a single query string (same model, `model.embed(vec![query], None)?`), then run `vector_top_k` with k=5 and join to get passage text and cosine distance. + +**Step 4 — Run three queries and verify results.** Verify the correct topic cluster surfaces: +- `"memory safety in systems programming"` → Rust passages +- `"stars and galaxies"` → astronomy passages +- `"fermentation and cooking techniques"` → cooking passages + +Print results ranked by distance with the passage text. + +## Reference solution + +Full self-contained `main.rs` inside `<details>
`: creates table, embeds and inserts all 15 passages, runs three queries, prints results. \ No newline at end of file diff --git a/edu/.nbd/tickets/21d9be.md b/edu/.nbd/tickets/21d9be.md new file mode 100644 index 0000000..3aea70e --- /dev/null +++ b/edu/.nbd/tickets/21d9be.md @@ -0,0 +1,18 @@ ++++ +title = "§1 What Is a Vector?" +priority = 5 +status = "todo" +ticket_type = "task" +dependencies = [] ++++ +## §1 What Is a Vector? — ALREADY COMPLETE + +This section is fully written in `edu/src/vector-db.md` under `### 1. What Is a Vector?`. No further content work is needed. Mark this ticket done. + +## What was written + +- Definition: a vector is an ordered list of numbers; each element is a coordinate along one axis of a space +- Geometric intuition in 2D and 3D: magnitude (`‖v‖ = √(Σ vᵢ²)`), direction, orthogonality via dot product +- High-dimensional spaces: the curse of dimensionality, normalisation onto the unit hypersphere, dimensions are not individually interpretable features +- Vectors as representations: the key insight that similarity in meaning corresponds to proximity in vector space — the basis of everything that follows +- Notation used throughout the course: **v**, `v[i]`, `‖v‖`, *d* for dimension, *n* for number of stored vectors \ No newline at end of file diff --git a/edu/.nbd/tickets/37cdd5.md b/edu/.nbd/tickets/37cdd5.md new file mode 100644 index 0000000..72fb454 --- /dev/null +++ b/edu/.nbd/tickets/37cdd5.md @@ -0,0 +1,21 @@ ++++ +title = "§6 Setting Up Turso + sqlite-vec" +priority = 5 +status = "todo" +ticket_type = "task" +dependencies = [] ++++ +## §6 Setting Up Turso + sqlite-vec — ALREADY COMPLETE + +This section is fully written in `edu/src/vector-db.md` under `### 6. Setting Up`. No further content work is needed. Mark this ticket done. 
+ +## What was written + +- Project setup: `cargo new vec-demo`, `libsql = "0.9"`, `tokio` with full features +- Release profile: `opt-level = "z"`, `lto = true`, `strip = true`, `codegen-units = 1` +- Opening a local connection: `Builder::new_local("vectors.db").build().await?` +- Verifying the connection with `SELECT sqlite_version()` +- Reference table of all sqlite-vec constructs used in later exercises: `F32_BLOB(d)`, `vector()`, `vector_extract()`, `vector_distance_cos()`, `libsql_vector_idx`, `vector_top_k()` +- Creating a `F32_BLOB(3)` vector table +- Creating an HNSW index with `USING libsql_vector_idx(embedding)` +- Complete working `main.rs` listing — produces "SQLite version: x.y.z" and "Database ready." \ No newline at end of file diff --git a/edu/.nbd/tickets/3aeb62.md b/edu/.nbd/tickets/3aeb62.md new file mode 100644 index 0000000..b86c564 --- /dev/null +++ b/edu/.nbd/tickets/3aeb62.md @@ -0,0 +1,101 @@ ++++ +title = "§3 Compiler Architecture: The Pipeline" +priority = 5 +status = "todo" +ticket_type = "task" +dependencies = [] ++++ + +## §3 Compiler Architecture: The Pipeline — Stub to fill + +File: `edu/src/lisp-compiler.md`, section `### 3. Compiler Architecture: The Pipeline` + +Replace the stub line with full content. Target 500–700 words. Design overview — ASCII diagrams, brief stage descriptions, Rust module layout, error philosophy. No code yet. 
+ +## Learning objectives + +- Understand the four compilation stages and what each produces +- Know the Rust type that flows between each stage +- Understand where errors originate and how they are reported +- See the module structure before writing any code + +## Content to write + +### Pipeline Diagram + +``` +Source text (&str) + │ + ▼ + ┌──────────┐ + │ Parser │ src/parser.rs + └──────────┘ + │ Vec<Expr> + ▼ + ┌───────────────────┐ + │ Semantic Analyser │ src/analyser.rs + └───────────────────┘ + │ Vec<Expr> (validated) + ▼ + ┌──────────────────┐ + │ Code Generator │ src/codegen.rs + └──────────────────┘ + │ String (C source) + ▼ + stdout / output file +``` + +### Stage Descriptions + +**Parser** (`src/parser.rs`). Accepts `&str` and produces `Vec<Expr>`. Uses nom combinators. Fails on syntax errors: unmatched parentheses, invalid tokens, unexpected EOF. + +**Semantic Analyser** (`src/analyser.rs`). Walks `Vec<Expr>` and checks: every symbol reference resolves to a definition, every special form has the correct shape and arity, lambda bodies are non-empty. Returns the same `Vec<Expr>` on success; returns `CompileError` on failure. Does not do type inference — type errors surface as C compiler errors. + +**Code Generator** (`src/codegen.rs`). Walks validated `Vec<Expr>` and produces a `String` of C source. This stage is pure — it cannot fail for valid input. Emits the preamble, forward declarations, and top-level forms in order. + +**Error type** (`src/error.rs`). A `CompileError` enum with variants for each stage. Uniform error handling across the pipeline. Each variant carries enough context for a useful message (e.g., the undefined symbol name). 
+ +### Module Layout + +``` +src/ +├── main.rs # CLI: read input, call compile(), write output +├── ast.rs # Expr enum and Display impl +├── parser.rs # nom parsers → Vec<Expr> +├── analyser.rs # scope checking and form validation +├── codegen.rs # AST → C string +└── error.rs # CompileError enum +``` + +### The `compile` Function + +Show the top-level function signature the reader will implement in §16: + +```rust +pub fn compile(source: &str) -> Result<String, CompileError> { + let exprs = parser::parse(source)?; + let exprs = analyser::analyse(exprs)?; + let c_source = codegen::generate(exprs); + Ok(c_source) +} +``` + +This makes explicit that parsing and analysis are fallible but code generation is not. + +### Error Reporting Philosophy + +The compiler reports the first error it encounters and stops. It does not attempt to recover and continue after a syntax error. nom's `cut` combinator is used at commit points to produce better error messages. A production compiler would collect multiple errors — this is a deliberate simplification. + +Errors include enough context to be actionable: +- Syntax errors: what character was unexpected and approximately where +- Semantic errors: the name of the undefined symbol or the malformed form + +### How Sections Map to the Diagram + +Tell the reader: Sections 4–9 fill in the parser box. Sections 10–11 fill in the analyser box. Sections 12–15 fill in the code generator box. Section 16 wires them together. 
+ +## Style notes + +- Open with the pipeline diagram; it's the most information-dense single element in the section +- Keep prose tight — the diagram does the heavy lifting +- The `compile` function signature is the key insight: two fallible stages, one infallible stage diff --git a/edu/.nbd/tickets/3dc36b.md b/edu/.nbd/tickets/3dc36b.md new file mode 100644 index 0000000..a992ea9 --- /dev/null +++ b/edu/.nbd/tickets/3dc36b.md @@ -0,0 +1,131 @@ ++++ +title = "§5 Setting Up the Project" +priority = 5 +status = "todo" +ticket_type = "task" +dependencies = [] ++++ + +## §5 Setting Up the Project — Stub to fill + +File: `edu/src/lisp-compiler.md`, section `### 5. Setting Up the Project` + +Replace the stub line with full content. Target 400–600 words. This is a pure hands-on section: create the project, add dependencies, lay out modules, verify it compiles. + +## Learning objectives + +- Know exactly which dependencies to add and why +- Have the module skeleton in place before any parsing logic is written +- Understand the release profile settings required by the project conventions + +## Content to write + +### Create the project + +```sh +cargo new minilisp --bin +cd minilisp +``` + +### Dependencies (`Cargo.toml`) + +```toml +[package] +name = "minilisp" +version = "0.1.0" +edition = "2021" + +[dependencies] +nom = "8" + +[profile.release] +opt-level = "z" +lto = true +strip = true +codegen-units = 1 +``` + +Explain each dependency: +- `nom = "8"` — the parser combinator library. Pinned to major version 8 because nom 8 introduced breaking API changes from nom 7. If the reader sees different combinator signatures in other resources, they may be looking at nom 7 documentation. 
+ +### Module skeleton + +Create the following files with stub content (empty public functions or empty modules): + +**`src/main.rs`** +```rust +mod ast; +mod parser; +mod analyser; +mod codegen; +mod error; + +fn main() { + // §16: wire up the CLI here + println!("MiniLisp compiler"); +} +``` + +**`src/error.rs`** +```rust +#[derive(Debug)] +pub enum CompileError { + ParseError(String), + SemanticError(String), +} + +impl std::fmt::Display for CompileError { + fn fmt(&self, f: &mut std::fmt::Formatter<'_>) -> std::fmt::Result { + match self { + CompileError::ParseError(msg) => write!(f, "parse error: {}", msg), + CompileError::SemanticError(msg) => write!(f, "semantic error: {}", msg), + } + } +} +``` + +**`src/ast.rs`** — stub (§7 will fill this in): +```rust +// Defined in §7 +``` + +**`src/parser.rs`** — stub: +```rust +// Defined in §8–9 +``` + +**`src/analyser.rs`** — stub: +```rust +// Defined in §10–11 +``` + +**`src/codegen.rs`** — stub: +```rust +// Defined in §12–15 +``` + +### Verify it compiles + +```sh +cargo check +``` + +Should produce no errors (warnings about unused items are expected). + +### Run the validation suite + +Introduce the four commands from the repo conventions that must pass before any commit: +```sh +cargo fmt +cargo check +cargo clippy +cargo test +``` + +Explain that `cargo test` will report zero tests at this stage — that is fine. As modules are filled in, the test count will grow. 
+ +## Style notes + +- Show the complete `Cargo.toml` up front — readers frequently have version issues if they pick the wrong nom major version +- Keep the module stubs minimal — just enough for `cargo check` to pass +- End with a checkpoint: "you should now have a project that compiles" — sets a clear success criterion diff --git a/edu/.nbd/tickets/3e1250.md b/edu/.nbd/tickets/3e1250.md new file mode 100644 index 0000000..06d557c --- /dev/null +++ b/edu/.nbd/tickets/3e1250.md @@ -0,0 +1,125 @@ ++++ +title = "§12 The C Runtime Preamble" +priority = 5 +status = "todo" +ticket_type = "task" +dependencies = [] ++++ + +## §12 The C Runtime Preamble — Stub to fill + +File: `edu/src/lisp-compiler.md`, section `### 12. The C Runtime Preamble` + +Replace the stub line with full content. Target 500–700 words. Design and write the C preamble string. Explain every line so the reader understands what the generated C file contains before their code begins. + +## Learning objectives + +- Understand what the preamble provides and why each part is needed +- Know the C type system used for MiniLisp values +- See the complete runtime helper functions for `display`, `newline`, `error` +- Understand name mangling: why all generated names are prefixed with `ml_` + +## Content to write + +### What the preamble does + +Every MiniLisp program compiles to a single C file. The preamble is a fixed block of C text emitted at the top of that file before any user-defined code. It provides: + +1. Standard library includes +2. Type definitions +3. Boolean constants +4. 
Runtime helper functions for built-ins + +### The complete preamble + +Present as a Rust `const` string: + +```rust +pub const PREAMBLE: &str = r#"#include +#include +#include +#include + +/* MiniLisp runtime types */ +typedef int64_t ml_int; +typedef int ml_bool; +typedef char* ml_str; + +#define ML_TRUE 1 +#define ML_FALSE 0 + +/* ml_display: print a value to stdout */ +static void ml_display_int(ml_int v) { printf("%ld", v); } +static void ml_display_bool(ml_bool v) { printf("%s", v ? "true" : "false"); } +static void ml_display_str(ml_str v) { printf("%s", v); } + +/* ml_newline: print a newline */ +static void ml_newline(void) { printf("\n"); } + +/* ml_error: print a message to stderr and exit */ +static void ml_error(ml_str msg) { + fprintf(stderr, "error: %s\n", msg); + exit(1); +} +"#; +``` + +Explain each section: + +**Type definitions.** `ml_int` is `int64_t` (64-bit signed integer). `ml_bool` is `int` (C does not have a native boolean type; 0 is false, non-zero is true — we use 1 for true). `ml_str` is `char*`. The `ml_` prefix prevents name collisions with any C standard library names. + +**Boolean constants.** `ML_TRUE` and `ML_FALSE` are `#define` macros. Arithmetic operations on booleans (e.g., `(+ #t 1)`) are undefined in MiniLisp but will silently work in the generated C — this is acceptable for a minimal compiler. + +**Display functions.** There are three variants because C does not have dynamic dispatch. The code generator picks the right variant based on the expression being displayed (or, since we have no type inference, it may emit `ml_display_int` for all non-string, non-bool expressions). Note: this design decision belongs to §13 — mention that the code generator chooses the variant. + +**`ml_error`.** Writes to stderr and calls `exit(1)`. The message is always a string literal in MiniLisp. + +**`static` linkage.** The helper functions are declared `static` to prevent symbol conflicts if the generated C file is ever linked with other files. 
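As a rough illustration of how a code generator could choose among the three display helpers, here is a minimal sketch. The `DisplayKind` enum and both function names are hypothetical; the actual selection logic is a §13 decision and may look quite different.

```rust
/// Hypothetical tag describing what kind of value a `display` argument is.
/// Illustration only: the real decision is made in §13.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum DisplayKind {
    Int,
    Bool,
    Str,
}

/// Pick the preamble helper that matches the value kind.
pub fn display_helper(kind: DisplayKind) -> &'static str {
    match kind {
        DisplayKind::Int => "ml_display_int",
        DisplayKind::Bool => "ml_display_bool",
        DisplayKind::Str => "ml_display_str",
    }
}

/// Emit a C call expression such as `ml_display_int(42)`.
/// `arg_c` is assumed to be an already-generated C expression.
pub fn gen_display_call(kind: DisplayKind, arg_c: &str) -> String {
    format!("{}({})", display_helper(kind), arg_c)
}
```

Keeping the helper names in one `match` means that a change to the preamble's function names only has to be mirrored in one place.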
+ +### Name mangling + +All generated C identifiers for user-defined symbols are prefixed with `ml_` and have hyphens replaced with underscores (since `-` is not valid in a C identifier). For example: +- `factorial` → `ml_factorial` +- `my-var` → `ml_my_var` +- `even?` → `ml_even_p` (using `_p` suffix for `?`) +- `set!` → `ml_set_e` (using `_e` suffix for `!`) + +Present the mangling rules as a table and show the Rust function that performs the translation: + +```rust +/// Mangle a MiniLisp symbol name into a valid C identifier. +pub fn mangle(name: &str) -> String { + let mut result = String::from("ml_"); + for c in name.chars() { + match c { + '-' => result.push('_'), + '?' => result.push_str("_p"), + '!' => result.push_str("_e"), + c if c.is_alphanumeric() || c == '_' => result.push(c), + _ => result.push_str(&format!("_{:x}", c as u32)), + } + } + result +} +``` + +Explain: this function is used throughout `codegen.rs` whenever a symbol reference or definition needs to be emitted. + +### Emitting the preamble + +In `codegen.rs`: + +```rust +pub fn generate(exprs: Vec<Expr>) -> String { + let mut out = String::new(); + out.push_str(PREAMBLE); + // ... rest of generation in §13–15 + out +} +``` + +## Style notes + +- Show the full preamble string as a verbatim block — readers need to see exactly what gets emitted +- The name-mangling table is the most referenceable thing in this section; present it prominently +- Note that the `display` overloading decision (three variants) is a simplification — a real Lisp runtime would use tagged unions or polymorphism diff --git a/edu/.nbd/tickets/4c961f.md b/edu/.nbd/tickets/4c961f.md new file mode 100644 index 0000000..88efea1 --- /dev/null +++ b/edu/.nbd/tickets/4c961f.md @@ -0,0 +1,63 @@ ++++ +title = "§9 Generating Embeddings in Rust" +priority = 5 +status = "todo" +ticket_type = "task" +dependencies = [] ++++ +## §9 Generating Embeddings in Rust — Stub to fill + +File: `edu/src/vector-db.md`, section `### 9. 
Generating Embeddings in Rust` + +Replace this stub line with full section content: +> Before you can search by meaning, you need a way to convert text into vectors. [...] 🚧 Full content tracked in [nbd:4c961f]. + +This is a **reading lesson with short code snippets** — not a full exercise. The reader should finish with working embedding code they can use in §10–§12. Target 400–600 words. + +## Learning objectives + +- Know two approaches to generating embeddings in Rust: local model (fastembed) and HTTP API +- Understand how to call fastembed to embed strings locally with no API key +- Understand how to call an OpenAI-compatible embeddings endpoint +- Know which approach to use for the exercises and why (fastembed — offline, deterministic) +- Understand how embedding dimension affects the `F32_BLOB(d)` column type + +## Content to write + +**Option A — fastembed-rs (local, recommended for exercises).** No API key required, works offline, CPU-only, deterministic results. + +Cargo.toml addition: +```toml +fastembed = "4" +``` + +Basic usage (BGE-Small-EN-v1.5, 384 dimensions, ~130MB model downloaded once to `~/.cache/huggingface/hub/` on first run): +```rust +use fastembed::{TextEmbedding, InitOptions, EmbeddingModel}; + +let model = TextEmbedding::try_new( + InitOptions::new(EmbeddingModel::BGESmallENV15) + .with_show_download_progress(true), +)?; + +let docs = vec!["hello world", "Rust is fast"]; +let embeddings: Vec<Vec<f32>> = model.embed(docs, None)?; +// embeddings[0].len() == 384 +``` + +Batch embedding (passing multiple strings at once) is more efficient than embedding one at a time — use a single `model.embed()` call for the whole corpus. + +**Option B — HTTP API (OpenAI-compatible).** Higher quality models, requires API key and network access. 
+ +Cargo.toml additions: +```toml +reqwest = { version = "0.12", features = ["json"] } +serde = { version = "1", features = ["derive"] } +serde_json = "1" +``` + +Show a minimal async function that POSTs to the embeddings endpoint and returns `Vec<Vec<f32>>`. Include the request/response struct definitions with serde derives. API key from `std::env::var("OPENAI_API_KEY")`. + +**Choosing between them.** Recommend fastembed for §10–§12 because: no API key, no network dependency, deterministic results (important for exercises), sub-100ms per batch on CPU. Use the HTTP approach when you need a specific production-grade model or have one already deployed. + +**Dimensionality note.** The `F32_BLOB(d)` column type must match the model's output dimension exactly — you cannot mix dimensions. Change the DDL from the toy `F32_BLOB(3)` used in §6–§8 to `F32_BLOB(384)` for BGE-Small (all-MiniLM-L6-v2 is also 384-dimensional), `F32_BLOB(768)` for BERT-base-sized models such as all-mpnet-base-v2, or `F32_BLOB(1536)` for OpenAI text-embedding-3-small. The index must also be dropped and recreated if you change the dimension. \ No newline at end of file diff --git a/edu/.nbd/tickets/5674ce.md b/edu/.nbd/tickets/5674ce.md new file mode 100644 index 0000000..fe7d18e --- /dev/null +++ b/edu/.nbd/tickets/5674ce.md @@ -0,0 +1,65 @@ ++++ +title = "§8 Exercise 2: K-Nearest Neighbor Search" +priority = 5 +status = "todo" +ticket_type = "task" +dependencies = [] ++++ +## §8 Exercise 2 — K-Nearest Neighbor Search — Stub to fill + +File: `edu/src/vector-db.md`, section `### 8. Exercise 2 — K-Nearest Neighbor Search` + +Replace this stub line with the full exercise: +> **Goal:** Use `vector_top_k` and `vector_distance_cos` [...] 🚧 Full content tracked in [nbd:5674ce]. + +Follow the exercise format from `edu/src/markov.md`. + +## Prerequisites (established in §7) + +Reader has the `vec-demo` project and has 6 rows in the `items` table: cat, dog, car, truck, python, rust with 3-dimensional embeddings. 
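The exercise ranks rows by cosine distance, so it may help to see that quantity computed by hand first. The sketch below assumes `vector_distance_cos` follows the usual definition (1 minus cosine similarity) and does a brute-force top-k over an in-memory slice. The function names are hypothetical, and this is a mental model rather than a description of how the database implements its index.

```rust
/// Cosine distance between two vectors: 1 - (a·b) / (|a| |b|).
/// Returns None for mismatched lengths or a zero-magnitude vector.
pub fn cosine_distance(a: &[f32], b: &[f32]) -> Option<f32> {
    if a.len() != b.len() {
        return None;
    }
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let norm_a = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let norm_b = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    if norm_a == 0.0 || norm_b == 0.0 {
        return None;
    }
    Some(1.0 - dot / (norm_a * norm_b))
}

/// Exact (brute-force) top-k: score every row, sort ascending, keep k.
pub fn top_k<'a>(
    rows: &'a [(&'a str, Vec<f32>)],
    query: &[f32],
    k: usize,
) -> Vec<(&'a str, f32)> {
    let mut scored: Vec<(&str, f32)> = rows
        .iter()
        .filter_map(|(label, v)| cosine_distance(v, query).map(|d| (*label, d)))
        .collect();
    // Smaller distance means more similar, so sort ascending.
    scored.sort_by(|x, y| x.1.total_cmp(&y.1));
    scored.truncate(k);
    scored
}
```

A full scan like this is linear in the number of rows, which is the cost an approximate index is meant to avoid at scale.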
+ +## Goal + +Given a query vector, use `vector_top_k` to find the 3 most similar items, join with the `items` table to retrieve labels and exact cosine distances, and display the results ranked by distance. + +## Steps to cover + +**Step 1 — Introduce `vector_top_k`.** Explain that this is a table-valued function (TVF) that returns row IDs of approximate nearest neighbours without a full table scan. Syntax: + +```sql +SELECT i.rowid FROM vector_top_k('items', vector(?), ?) i +``` + +The first argument is the table name (string literal), second is the query vector, third is k. Returns `rowid` values only — join to get other columns. + +**Step 2 — Full KNN query.** Show the complete query combining the TVF with a JOIN and exact distance computation: + +```sql +SELECT items.id, items.label, vector_distance_cos(items.embedding, vector(?)) AS dist +FROM vector_top_k('items', vector(?), ?) AS knn +JOIN items ON items.rowid = knn.rowid +ORDER BY dist ASC +``` + +Note: the query vector must be passed twice — once for `vector_top_k` (index traversal) and once for `vector_distance_cos` (exact distance). Both are the same JSON array string. + +**Step 3 — Run three queries and print results.** + +Query vectors to use: +- `[0.85, 0.15, 0.25]` → should be nearest cat and dog (animal cluster) +- `[0.15, 0.85, 0.15]` → should be nearest car and truck (vehicle cluster) +- `[0.1, 0.05, 0.92]` → should be nearest rust and python (language cluster) + +Expected output format: +``` +Query: [0.85, 0.15, 0.25] + 1. cat dist=0.0023 + 2. dog dist=0.0089 + 3. python dist=0.1834 +``` + +**Step 4 — Explain ANN vs. exact search.** For 6 rows, `vector_top_k` falls back to exact search anyway — the HNSW index has too few nodes to offer a shortcut. Note that at scale (millions of rows), it returns approximate results; some true nearest neighbours may be missed. `vector_distance_cos` always gives the exact distance for any specific pair. + +## Reference solution + +Full `main.rs` inside `
Show full solution`. The solution should re-run setup from §7 (create table, insert data) then run the three KNN queries. \ No newline at end of file diff --git a/edu/.nbd/tickets/5835e9.md b/edu/.nbd/tickets/5835e9.md new file mode 100644 index 0000000..d98e438 --- /dev/null +++ b/edu/.nbd/tickets/5835e9.md @@ -0,0 +1,166 @@ ++++ +title = "§4 Introduction to nom: Parser Combinators" +priority = 5 +status = "todo" +ticket_type = "task" +dependencies = [] ++++ + +## §4 Introduction to nom: Parser Combinators — Stub to fill + +File: `edu/src/lisp-compiler.md`, section `### 4. Introduction to nom: Parser Combinators` + +Replace the stub line with full content. Target 900–1200 words. This is the conceptual and practical foundation for all parsing in the course. The reader needs to understand nom well enough to write parsers without hand-holding by §8. + +## Learning objectives + +- Understand what a parser combinator is and why it is better than hand-rolling a recursive descent parser for our purposes +- Understand `IResult` and what its three variants mean +- Know and be able to use: `tag`, `char`, `alpha1`, `digit1`, `multispace0`, `alt`, `many0`, `map`, `map_res`, `tuple`, `delimited`, `preceded`, `terminated`, `opt`, `recognize`, `verify`, `cut` +- Know how to write a parser function, call it, and test it +- Know how to use the `ws` whitespace-wrapper pattern + +## Content to write + +### What is a parser combinator? + +A parser combinator is a function that takes one or more parsers and returns a new parser. Individual parsers handle small fragments of input; combinators compose them into larger parsers. The result is a parser written entirely in the host language (Rust), with no grammar files, no code generation, and no build-time magic. + +Contrast with traditional parser generators (ANTLR, yacc): those require a separate grammar file, a code-generation step, and often a bespoke DSL for semantic actions. nom parsers are plain Rust functions. 
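To make the idea concrete before introducing nom, here is a toy combinator written with only the standard library. The names `ParseResult`, `literal`, and `then` are invented for this sketch and are not nom's API; nom's equivalents are richer (error types, generic input) but follow the same shape.

```rust
/// A parser returns (remaining input, output) on success, or None on failure.
/// Toy stand-in for nom's `IResult`, not nom's actual type.
type ParseResult<'a, O> = Option<(&'a str, O)>;

/// Primitive parser: match a literal prefix and return the matched slice.
fn literal<'a>(expected: &'static str) -> impl Fn(&'a str) -> ParseResult<'a, &'a str> {
    move |input: &'a str| {
        input
            .strip_prefix(expected)
            .map(|rest| (rest, &input[..expected.len()]))
    }
}

/// Combinator: run `first`, then run `second` on whatever `first` left behind.
/// It takes two parsers and returns a new parser for the sequence.
fn then<'a, A, B>(
    first: impl Fn(&'a str) -> ParseResult<'a, A>,
    second: impl Fn(&'a str) -> ParseResult<'a, B>,
) -> impl Fn(&'a str) -> ParseResult<'a, (A, B)> {
    move |input: &'a str| {
        let (rest, a) = first(input)?;
        let (rest, b) = second(rest)?;
        Some((rest, (a, b)))
    }
}
```

Calling `then(literal("("), literal("define"))` yields a new parser that accepts `(define …` and rejects `(lambda …`, without either `literal` parser knowing about the other.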
+ +### The `IResult` Type + +```rust +type IResult<I, O, E = nom::error::Error<I>> = Result<(I, O), nom::Err<E>>; +``` + +On success: `Ok((remaining_input, output))`. The parser consumed some input and produced a value; `remaining_input` is whatever was left. + +On failure (recoverable): `Err(nom::Err::Error(e))`. The parser tried and failed; the caller can try an alternative. + +On failure (unrecoverable): `Err(nom::Err::Failure(e))`. The parser is committed — no alternatives should be tried. Triggered by `cut`. + +The key insight: parsers return the *remaining* input. This is what makes composition work — one parser's remaining input is the next parser's input. + +### Writing a Parser + +Show the anatomy of a parser function: + +```rust +use nom::{IResult, bytes::complete::tag}; + +fn parse_hello(input: &str) -> IResult<&str, &str> { + tag("hello")(input) +} + +#[test] +fn test_parse_hello() { + assert_eq!(parse_hello("hello world"), Ok((" world", "hello"))); + assert!(parse_hello("goodbye").is_err()); +} +``` + +### Essential Combinators + +Work through each combinator with a small standalone example: + +**`tag(s)`** — match a literal string. +```rust +tag("(")(input) // matches the literal "(" +``` + +**`char(c)`** — match a single character. +```rust +char('(')(input) +``` + +**`alpha1`, `digit1`, `alphanumeric1`** — match one or more letters/digits/alphanumerics. + +**`multispace0`, `multispace1`** — match zero or more / one or more whitespace characters. + +**`alt((p1, p2, ...))`** — try each parser in order; return the first success. +```rust +alt((tag("true"), tag("false")))(input) +``` + +**`many0(p)`** — apply `p` zero or more times; return `Vec<O>`. + +**`map(p, f)`** — transform a parser's output. +```rust +map(digit1, |s: &str| s.parse::<i64>().unwrap()) +``` + +**`map_res(p, f)`** — like `map` but `f` returns `Result`; propagates errors. +```rust +map_res(digit1, |s: &str| s.parse::<i64>()) +``` + +**`tuple((p1, p2, ...))`** — run parsers in sequence; collect outputs as a tuple. 
+ +**`delimited(open, inner, close)`** — parse `open`, `inner`, `close`; return only `inner`'s output. Perfect for parenthesised expressions. +```rust +delimited(char('('), inner_parser, char(')'))(input) +``` + +**`preceded(prefix, inner)`** — parse `prefix` then `inner`; return only `inner`. + +**`terminated(inner, suffix)`** — parse `inner` then `suffix`; return only `inner`. + +**`opt(p)`** — make `p` optional; returns `Option`. + +**`recognize(p)`** — run `p` but return the input slice it consumed rather than its output. Useful for building string slices from composed parsers. + +**`verify(p, pred)`** — run `p`, then apply predicate `pred`; fail if predicate returns false. + +**`cut(p)`** — mark this branch as committed; convert recoverable errors into unrecoverable ones. Use after a discriminating tag (e.g., after matching `(define`, commit to parsing a define form). + +### The `ws` Combinator Pattern + +Whitespace appears between any two tokens in Lisp. Define a helper that strips whitespace before and after any parser: + +```rust +use nom::{Parser, IResult, character::complete::multispace0, sequence::delimited}; +use nom::error::ParseError; + +pub fn ws<'a, O, E, F>(inner: F) -> impl Parser<&'a str, Output = O, Error = E> +where + E: ParseError<&'a str>, + F: Parser<&'a str, Output = O, Error = E>, +{ + delimited(multispace0, inner, multispace0) +} +``` + +### Testing parsers + +Show the pattern: use `assert_eq!` on `Ok((remaining, output))` for success cases, `assert!(result.is_err())` for failure cases. Note that remaining input is part of the assertion — it is easy to accidentally under-consume. + +### nom 8 API note + +nom 8 changed the parser API: combinators now return types that implement `Parser` rather than closures. Call `.parse(input)` on them, or pass input directly as `combinator(args)(input)`. The `Parser` trait is in scope with `use nom::Parser;`. Reference: [nom changelog](https://github.com/rust-bakery/nom/blob/main/CHANGELOG.md). 
+ +## Key references + +- nom README: https://github.com/rust-bakery/nom +- `nom::bytes::complete` module (tag, take_while, take_until, is_not) +- `nom::character::complete` module (char, alpha1, digit1, multispace0) +- `nom::sequence` module (delimited, preceded, terminated, tuple, pair) +- `nom::multi` module (many0, many1, separated_list0) +- `nom::combinator` module (map, map_res, opt, recognize, verify, cut, value) +- `nom::branch` module (alt) +- Recipes: https://github.com/rust-bakery/nom/blob/main/doc/nom_recipes.md + +## Exercises to include + +1. Write a parser for `#t` and `#f` booleans using `alt` and `tag` +2. Write a parser for a C-style identifier (starts with letter or `_`, then alphanumeric or `_`) +3. Write a parser for a decimal integer using `recognize`, `opt(char('-'))`, and `digit1` +4. Compose the above three into an `alt` that returns a string slice matching any of them + +Each exercise should have a collapsible reference solution. + +## Style notes + +- Introduce `IResult` before showing any combinator — readers need to understand the return type to understand what combinators are doing +- Show every combinator with a working code snippet, not just a description +- Make the `ws` wrapper a "save this — you will use it throughout" moment diff --git a/edu/.nbd/tickets/584e0c.md b/edu/.nbd/tickets/584e0c.md new file mode 100644 index 0000000..552f2a3 --- /dev/null +++ b/edu/.nbd/tickets/584e0c.md @@ -0,0 +1,37 @@ ++++ +title = "§2 Embeddings" +priority = 5 +status = "todo" +ticket_type = "task" +dependencies = [] ++++ +## §2 Embeddings — Stub to fill + +File: `edu/src/vector-db.md`, section `### 2. Embeddings` + +Replace this stub line with full content: +> Embeddings are the bridge between raw data and vector space. [...] 🚧 Full content tracked in [nbd:584e0c]. + +This is a **reading lesson** — no Rust code. Target 400–700 words. Follow the style of §1 in the same file: prose paragraphs with bold lead phrases. 
+ +## Learning objectives + +- Understand what an embedding is: a learned function E(x) → ℝᵈ mapping inputs to vectors +- Know why geometric proximity in embedding space corresponds to semantic similarity (training objective) +- Understand how contextual encoder models (BERT-style) produce sentence embeddings +- Know typical output dimensionalities (384, 768, 1536) and what influences them +- Understand that embedding axes are not individually interpretable — meaning lives in relative positions + +## Content to write + +**What an embedding is.** A function learned from data that maps an input (word, sentence, image, product) to a fixed-size float vector. The function is trained so that similar inputs produce nearby vectors — semantically related sentences end up close in vector space; unrelated sentences end up far apart. + +**Word embeddings (brief history).** Word2Vec (2013) showed that word meaning could be encoded as static vectors where arithmetic worked (king − man + woman ≈ queen). These assign one vector per word regardless of context. + +**Contextual embeddings from encoder models.** Modern models (sentence-transformers, OpenAI text-embedding-3-small) produce one vector for the entire input sentence via mean-pooling or a [CLS] token. The reader does not need to understand transformer internals — just: input is a string, output is a `Vec` of fixed length. The same word in different contexts produces different vectors. + +**What makes a good embedding model.** Training uses contrastive learning: pull similar pairs together, push dissimilar pairs apart. Models are evaluated on MTEB (Massive Text Embedding Benchmark). Larger models generally produce better embeddings at higher cost. + +**Practical dimensionalities.** 384 (MiniLM, fast, ~130MB), 768 (BERT-base, sentence-transformers default), 1536 (OpenAI text-embedding-3-small), 3072 (text-embedding-3-large). Larger is not always better — depends on task and dataset. 
+ +**Embeddings for non-text data.** Brief mention: CLIP produces image embeddings comparable to text embeddings in the same space (enabling text-to-image search). Product embeddings can be learned from purchase co-occurrence. The vector database stores float arrays regardless of modality. \ No newline at end of file diff --git a/edu/.nbd/tickets/58b37a.md b/edu/.nbd/tickets/58b37a.md new file mode 100644 index 0000000..b91359b --- /dev/null +++ b/edu/.nbd/tickets/58b37a.md @@ -0,0 +1,162 @@ ++++ +title = "§16 The Compilation Pipeline" +priority = 5 +status = "todo" +ticket_type = "task" +dependencies = [] ++++ + +## §16 The Compilation Pipeline — Stub to fill + +File: `edu/src/lisp-compiler.md`, section `### 16. The Compilation Pipeline` + +Replace the stub line with full content. Target 600–800 words. Wire all stages into a working CLI binary. Trace the complete factorial example end-to-end. + +## Learning objectives + +- Implement the `compile` function that chains parse → analyse → generate +- Write a CLI `main.rs` that reads from a file or stdin and writes to stdout +- Handle and display errors from any stage +- Demonstrate the complete workflow: `.lisp` → `.c` → compile → run + +## Content to write + +### The `compile` function + +In `src/main.rs` (or a `src/lib.rs`): + +```rust +pub fn compile(source: &str) -> Result<String, CompileError> { + let exprs = parser::parse(source)?; + let exprs = analyser::analyse(exprs)?; + let c = codegen::generate(exprs); + Ok(c) +} +``` + +### Handling top-level non-define expressions + +The code generator in §14 skipped top-level expressions that are not `define` forms (e.g., `(display (factorial 10))`). These must be emitted inside a C `main` function. 
Complete the `generate` function: + +```rust +pub fn generate(exprs: Vec<Expr>) -> String { + let mut out = String::new(); + out.push_str(PREAMBLE); + + // Forward declarations + for expr in &exprs { + if let Expr::Define { name, value } = expr { + out.push_str(&gen_forward_decl(name, value)); + } + } + out.push('\n'); + + // Function and variable definitions + for expr in &exprs { + if let Expr::Define { name, value } = expr { + match value.as_ref() { + Expr::Lambda { params, body } => + out.push_str(&gen_function_def(name, params, body)), + _ => + out.push_str(&gen_variable_def(name, value)), + } + } + } + + // main(): emit top-level non-define expressions + out.push_str("\nint main(void) {\n"); + for expr in &exprs { + if !matches!(expr, Expr::Define { .. }) { + out.push_str(&format!(" {};\n", gen_stmt(expr))); + } + } + out.push_str(" return 0;\n}\n"); + + out +} +``` + +### The CLI: `src/main.rs` + +```rust +use std::{env, fs, io::{self, Read}, process}; + +fn main() { + let args: Vec<String> = env::args().collect(); + let source = match args.get(1) { + Some(path) => fs::read_to_string(path).unwrap_or_else(|e| { + eprintln!("error reading {}: {}", path, e); + process::exit(1); + }), + None => { + let mut buf = String::new(); + io::stdin().read_to_string(&mut buf).unwrap(); + buf + } + }; + + match compile(&source) { + Ok(c_source) => print!("{}", c_source), + Err(e) => { + eprintln!("{}", e); + process::exit(1); + } + } +} +``` + +Explain: if a file path is given as `argv[1]`, read from it; otherwise read from stdin. Always write C to stdout. This allows `minilisp factorial.lisp > factorial.c`. + +### The end-to-end workflow + +Walk through the complete factorial example step by step: + +```sh +# 1. Write a MiniLisp program +cat > factorial.lisp <<'EOF' +(define (factorial n) + (if (= n 0) + 1 + (* n (factorial (- n 1))))) + +(display (factorial 10)) +(newline) +EOF + +# 2. Compile to C +cargo run -- factorial.lisp > factorial.c + +# 3. 
Compile the C +cc -o factorial factorial.c + +# 4. Run +./factorial +# Output: 3628800 +``` + +Show the generated `factorial.c` in full so the reader can verify the output looks correct. + +### Error handling demo + +Show what happens when the compiler rejects invalid input: + +```sh +echo "(define (f x) (g x))" | cargo run # g is undefined +# stderr: semantic error: undefined symbol: `g` +# exit code: 1 +``` + +### Build and validate + +```sh +cargo fmt && cargo check && cargo clippy && cargo test +``` + +All should pass. This is the project's first fully working state. + +## Style notes + +- The end-to-end workflow walkthrough is the climax of the implementation sections — give it space +- Show the generated C in full; readers deserve to see the fruit of their work +- The error demo is quick but important — confirm the pipeline fails gracefully +- End with a note of congratulation: the reader has just built a compiler diff --git a/edu/.nbd/tickets/5ed295.md b/edu/.nbd/tickets/5ed295.md new file mode 100644 index 0000000..63c604d --- /dev/null +++ b/edu/.nbd/tickets/5ed295.md @@ -0,0 +1,90 @@ ++++ +title = "§12 Exercise 5: Retrieval-Augmented Generation" +priority = 5 +status = "todo" +ticket_type = "task" +dependencies = [] ++++ +## §12 Exercise 5 — Retrieval-Augmented Generation — Stub to fill + +File: `edu/src/vector-db.md`, section `### 12. Exercise 5 — Retrieval-Augmented Generation` + +Replace this stub line with the full exercise: +> **Goal:** Combine vector search with a language model to build a retrieval-augmented generation (RAG) pipeline [...] 🚧 Full content tracked in [nbd:5ed295]. + +Follow the exercise format from `edu/src/markov.md`. This is the capstone exercise — it combines Turso vector search (§7–§8), fastembed (§9), and semantic search (§10), adding an LLM API call to ground answers in retrieved context. + +## Goal + +1. Store the 15-passage corpus from §10 in Turso +2. Accept a natural-language question +3. 
Retrieve the top-3 most relevant passages using vector KNN +4. Inject the passages into a prompt as context +5. Send the prompt to an OpenAI-compatible LLM API +6. Print the grounded answer + +## Setup + +```toml +[dependencies] +libsql = "0.9" +fastembed = "4" +reqwest = { version = "0.12", features = ["json"] } +serde = { version = "1", features = ["derive"] } +serde_json = "1" +tokio = { version = "1", features = ["full"] } +``` + +API key from environment: `std::env::var("OPENAI_API_KEY")`. Tell the reader they can use any OpenAI-compatible provider (OpenAI, Groq, Together AI, or local Ollama with base URL `http://localhost:11434/v1` and model `llama3.2`). + +## Steps to cover + +**Step 1 — Retrieval function.** Reuse the semantic search logic from §10. Signature: + +```rust +async fn retrieve( + conn: &libsql::Connection, + model: &TextEmbedding, + query: &str, + k: usize, +) -> Result<Vec<String>, Box<dyn std::error::Error>> +``` + +Returns the top-k passage texts ordered by cosine distance. + +**Step 2 — Prompt construction.** Build a prompt string: + +``` +You are a helpful assistant. Answer the question using only the provided context. +If the context does not contain enough information, say so. + +Context: +[passage 1] + +[passage 2] + +[passage 3] + +Question: {question} + +Answer: +``` + +**Step 3 — LLM API call.** POST to `https://api.openai.com/v1/chat/completions` with model `gpt-4o-mini`. Show the request/response structs with serde derives and return the `content` of the first choice message. Use `reqwest::Client` with a bearer token Authorization header. 
+ +**Step 4 — Wire it together and run.** Three example questions using the §10 corpus: +- `"How does Rust ensure memory safety?"` → should answer using Rust passages +- `"What is a black hole?"` → should answer using astronomy passages +- `"What is the Maillard reaction?"` → should answer using cooking passages + +Print the retrieved passages first (so the reader can see what context was used), then the LLM's answer. 
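The prompt template from Step 2 could be assembled with plain string formatting, as in the sketch below. The function name `build_prompt` matches the one the reference solution is asked to use, but the signature is an assumption, and the passages in any example call would be placeholders rather than the real §10 corpus.

```rust
/// Assemble the RAG prompt from retrieved passages and the user's question.
/// Signature is an assumption for illustration; adjust to fit `retrieve`.
pub fn build_prompt(passages: &[String], question: &str) -> String {
    // One blank line between passages, matching the template's layout.
    let context = passages.join("\n\n");
    format!(
        "You are a helpful assistant. Answer the question using only the provided context.\n\
         If the context does not contain enough information, say so.\n\n\
         Context:\n{context}\n\n\
         Question: {question}\n\n\
         Answer:"
    )
}
```

Keeping prompt assembly in a pure function makes it easy to unit-test without touching the database or the network.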
+ +**Step 5 — Discussion: RAG patterns.** After the reference solution, add a prose section (not numbered steps) covering: +- Chunk size and overlap: why long documents are split into overlapping passages before embedding +- Re-ranking: a cross-encoder can re-rank the top-k ANN results for better precision +- Hybrid search: combining BM25 (keyword) and ANN (semantic) often outperforms either alone +- Context window limits: number of passages to inject depends on the model's context length and passage length + +## Reference solution + +Full `main.rs` inside `
`. Keep `retrieve`, `build_prompt`, and `call_llm` as clearly named separate functions. The `main` function should be a thin orchestrator. \ No newline at end of file diff --git a/edu/.nbd/tickets/67e284.md b/edu/.nbd/tickets/67e284.md new file mode 100644 index 0000000..c9feded --- /dev/null +++ b/edu/.nbd/tickets/67e284.md @@ -0,0 +1,54 @@ ++++ +title = "Course: Writing a Lisp-to-C Compiler in Rust" +priority = 5 +status = "todo" +ticket_type = "project" +dependencies = ["e8da8b", "a93829", "3aeb62", "5835e9", "3dc36b", "685f5e", "a1a827", "b6c9ad", "a4c9f8", "d0b9f8", "6d40a7", "3e1250", "1eb794", "cbc6e3", "de82f1", "58b37a", "8fa47a", "1d16da"] ++++ + +## Course: Writing a Lisp-to-C Compiler in Rust + +A complete self-guided interactive course teaching how to build a compiler from scratch in Rust using the nom parser-combinator library. The source language is MiniLisp — a minimal Lisp dialect. The compilation target is human-readable C. + +## Course file + +`edu/src/lisp-compiler.md` + +## Section inventory + +| § | Title | Ticket | +|---|---|---| +| 1 | Introduction: What We're Building | e8da8b | +| 2 | MiniLisp Language Specification | a93829 | +| 3 | Compiler Architecture: The Pipeline | 3aeb62 | +| 4 | Introduction to nom: Parser Combinators | 5835e9 | +| 5 | Setting Up the Project | 3dc36b | +| 6 | Recognizing Atoms: Integers, Booleans, Strings, Symbols | 685f5e | +| 7 | The Abstract Syntax Tree | a1a827 | +| 8 | Parsing Atoms with nom | b6c9ad | +| 9 | Parsing S-Expressions and Special Forms | a4c9f8 | +| 10 | Symbol Tables and Scope | d0b9f8 | +| 11 | Checking Special Forms | 6d40a7 | +| 12 | The C Runtime Preamble | 3e1250 | +| 13 | Generating C: Atoms and Expressions | 1eb794 | +| 14 | Generating C: Definitions and Functions | cbc6e3 | +| 15 | Generating C: Control Flow and Sequencing | de82f1 | +| 16 | The Compilation Pipeline | 58b37a | +| 17 | Testing the Compiler | 8fa47a | +| 18 | What's Next: Extensions and Further Reading | 1d16da | + +## 
MiniLisp feature set + +- **Types:** integers (`int64_t`), booleans (`#t`/`#f`), strings (`const char*`) +- **Special forms:** `define`, `lambda`, `if`, `let`, `begin` +- **Built-in operators:** `+`, `-`, `*`, `/`, `=`, `<`, `>`, `<=`, `>=`, `not` +- **Built-in functions:** `display`, `newline`, `error` +- **Comments:** `;` to end of line + +## Explicitly out of scope + +Closures, tail-call optimisation, pairs/lists, garbage collection, macros, variadic functions, floating-point. Each is discussed as a potential extension in §18. + +## Definition of done + +All 18 section tickets are `done` and `mdbook build` succeeds with no 🚧 stubs remaining in the output. diff --git a/edu/.nbd/tickets/685f5e.md b/edu/.nbd/tickets/685f5e.md new file mode 100644 index 0000000..4a2749c --- /dev/null +++ b/edu/.nbd/tickets/685f5e.md @@ -0,0 +1,166 @@ ++++ +title = "§6 Recognizing Atoms: Integers, Booleans, Strings, Symbols" +priority = 5 +status = "todo" +ticket_type = "task" +dependencies = [] ++++ + +## §6 Recognizing Atoms: Integers, Booleans, Strings, Symbols — Stub to fill + +File: `edu/src/lisp-compiler.md`, section `### 6. Recognizing Atoms: Integers, Booleans, Strings, Symbols` + +Replace the stub line with full content. Target 800–1100 words. This is a hands-on section that builds one atom parser at a time. Each parser is developed in isolation before being combined in §8. + +## Learning objectives + +- Write a nom parser for each MiniLisp atom type +- Use `map_res`, `recognize`, `opt`, `alt`, `tag`, `char`, `take_while1`, `is_not`, `escaped_transform` +- Understand how to test parsers with `assert_eq!` on the full `IResult` +- Know the tricky cases: negative integers vs symbol `-`, `#t`/`#f` ambiguity, string escapes + +## Content to write + +Work through each atom parser in a subsection with: explanation, full code, tricky cases, and a test block. + +### Integer parser + +A signed decimal integer: optional `-`, then one or more digits, converted to `i64`. 
+ +```rust +use nom::{IResult, combinator::{map_res, recognize, opt}, character::complete::{char, digit1}, sequence::pair}; + +pub fn parse_integer(input: &str) -> IResult<&str, i64> { + map_res( + recognize(pair(opt(char('-')), digit1)), + |s: &str| s.parse::<i64>() + )(input) +} +``` + +Tricky case: the symbol `-` and negative integers. `opt(char('-'))` consumes the leading `-`, but `digit1` then requires at least one digit, so `parse_integer("-")` fails inside `pair` before `map_res` ever runs. This is fine: the failure is recoverable and `alt` in the atom parser will fall through to the symbol parser. Because `-` is also a valid symbol-start character, the integer parser must be tried *before* the symbol parser in the `alt`; otherwise `-7` would be read as a symbol. + +Tests: +```rust +assert_eq!(parse_integer("42 rest"), Ok((" rest", 42))); +assert_eq!(parse_integer("-7"), Ok(("", -7))); +assert!(parse_integer("abc").is_err()); +``` + +### Boolean parser + +```rust +use nom::{IResult, branch::alt, bytes::complete::tag, combinator::value}; + +pub fn parse_bool(input: &str) -> IResult<&str, bool> { + alt(( + value(true, tag("#t")), + value(false, tag("#f")), + ))(input) +} +``` + +Explain `value(output, parser)` — discards the parser's output and returns a fixed value instead. This avoids a `map` that ignores its argument. + +Tricky case: `#` must not be a valid symbol character, otherwise `#t` and `#f` would be ambiguous with symbols that start with `#`. Confirm that `#` is not in the symbol character set (per §2). + +### Symbol parser + +Symbols start with a `sym_start` character and continue with zero or more `sym_cont` characters. Use `recognize` to return the input slice. 
+ +```rust +use nom::{IResult, combinator::recognize, sequence::pair, + character::complete::{alpha1, alphanumeric1}, + bytes::complete::take_while1, branch::alt}; + +fn is_sym_start(c: char) -> bool { + c.is_alphabetic() || "-_?!+*/=<>".contains(c) +} + +fn is_sym_cont(c: char) -> bool { + c.is_alphanumeric() || "-_?!+*/=<>".contains(c) +} + +pub fn parse_symbol(input: &str) -> IResult<&str, &str> { + recognize(pair( + nom::bytes::complete::take_while_m_n(1, 1, is_sym_start), + nom::bytes::complete::take_while(is_sym_cont), + ))(input) +} +``` + +Tricky case: `+`, `*`, `/`, `=`, `<`, `>` are valid single-character symbols (used as operator names). The parser must handle them. + +Tests: +```rust +assert_eq!(parse_symbol("my-var rest"), Ok((" rest", "my-var"))); +assert_eq!(parse_symbol("+"), Ok(("", "+"))); +assert!(parse_symbol("42").is_err()); +``` + +### String parser + +Double-quoted strings with escape sequences `\"`, `\\`, `\n`, `\t`. + +```rust +use nom::{IResult, bytes::complete::{tag, is_not}, sequence::delimited, + combinator::map, branch::alt}; +use nom::bytes::complete::escaped_transform; +use nom::character::complete::char; + +pub fn parse_string(input: &str) -> IResult<&str, String> { + delimited( + char('"'), + escaped_transform( + is_not("\\\""), + '\\', + alt(( + map(char('"'), |_| "\""), + map(char('\\'), |_| "\\"), + map(char('n'), |_| "\n"), + map(char('t'), |_| "\t"), + )) + ), + char('"'), + )(input) +} +``` + +Note: `escaped_transform` returns `String` (owned), not `&str`, because it must allocate when escape sequences are expanded. + +Tricky case: an empty string `""` — `is_not` requires at least one character. Test it explicitly. + +Tests: +```rust +assert_eq!(parse_string(r#""hello""#), Ok(("", "hello".to_string()))); +assert_eq!(parse_string(r#""a\nb""#), Ok(("", "a\nb".to_string()))); +assert_eq!(parse_string(r#""""#), Ok(("", "".to_string()))); +``` + +### Comment parser + +Comments are consumed and discarded — they produce no AST node. 
+ +```rust +use nom::{IResult, bytes::complete::is_not, sequence::pair, + character::complete::{char, line_ending}, combinator::opt, + combinator::value}; + +pub fn parse_comment(input: &str) -> IResult<&str, ()> { + value((), pair(char(';'), opt(is_not("\n\r"))))(input) +} +``` + +## Exercises + +1. Extend the integer parser to also recognise hexadecimal literals prefixed with `0x` — use `alt` and `map_res` with `i64::from_str_radix`. +2. Extend the symbol parser to reject the single character `-` followed immediately by a digit (since that should be parsed as a negative integer). + +Both exercises should have collapsible reference solutions. + +## Style notes + +- One subsection per atom type, in the order they will appear in the `alt` in §8 +- Every code block must be self-contained with `use` statements +- Show tricky cases and why they are tricky before showing the solution — the reader should understand the pitfall, not just copy the fix +- nom version note: use `nom::bytes::complete` (not `nom::bytes::streaming`) throughout diff --git a/edu/.nbd/tickets/6d40a7.md b/edu/.nbd/tickets/6d40a7.md new file mode 100644 index 0000000..197e8c9 --- /dev/null +++ b/edu/.nbd/tickets/6d40a7.md @@ -0,0 +1,123 @@ ++++ +title = "§11 Checking Special Forms" +priority = 5 +status = "todo" +ticket_type = "task" +dependencies = [] ++++ + +## §11 Checking Special Forms — Stub to fill + +File: `edu/src/lisp-compiler.md`, section `### 11. Checking Special Forms` + +Replace the stub line with full content. Target 500–700 words. Extend the analyser to validate the shape and arity of each special form. Moderate code, conceptually straightforward. 
+ +## Learning objectives + +- Understand what "shape checking" means for special forms +- Add arity and constraint checks for each special form in the analyser +- Produce actionable error messages that identify the problematic form + +## Content to write + +### Why the parser isn't enough + +The parser already enforces some structure — `parse_if` requires exactly `(if cond then else)`. But because we use `cut` at certain points and `many0`/`many1` at others, some invalid inputs may slip through to produce odd AST nodes. Shape checking in the analyser catches these and provides better error messages than a nom parse failure would. + +More importantly, shape constraints that are hard to encode in a parser (like "a lambda must have at least one body expression") are cleanly expressed as analyser rules. + +### Checks to add + +Extend `check_expr` from §10 with these constraints: + +**`Define`**: the `value` expression must not itself be another bare `Define` (nested defines are disallowed in MiniLisp — only top-level defines are valid). Emit: `"define is only allowed at the top level"`. + +**`Lambda`**: body must be non-empty (guaranteed by `many1` in the parser, but double-check). Parameters must be unique — duplicate parameter names are an error. Emit: `"duplicate parameter name: \`{name}\`"`. + +**`If`**: no checks beyond what the parser enforced (exactly three sub-expressions). This is already correct by construction from the AST variant. + +**`Let`**: binding names must be unique within the `let` form. Each binding value is evaluated in the *outer* scope (not the let scope) — confirm that binding values reference the outer env, not the inner one. + +**`Call` with built-in operators**: check arity. +- Binary operators (`+`, `-`, `*`, `/`, `=`, `<`, `>`, `<=`, `>=`): exactly 2 arguments. +- Unary operator (`not`): exactly 1 argument. +- `display`: exactly 1 argument. +- `newline`: exactly 0 arguments. +- `error`: exactly 1 argument (the message string). 
+ +```rust +fn check_call_arity(func: &Expr, args: &[Expr]) -> Result<(), CompileError> { + let name = match func { + Expr::Symbol(s) => s.as_str(), + _ => return Ok(()), // user-defined function call; arity checked at runtime + }; + let expected = match name { + "+" | "-" | "*" | "/" | "=" | "<" | ">" | "<=" | ">=" => Some(2), + "not" | "display" | "error" => Some(1), + "newline" => Some(0), + _ => None, // user-defined; no static arity check + }; + if let Some(n) = expected { + if args.len() != n { + return Err(CompileError::SemanticError(format!( + "`{}` expects {} argument(s), got {}", + name, n, args.len() + ))); + } + } + Ok(()) +} +``` + +Call `check_call_arity` inside the `Expr::Call` arm of `check_expr`. + +### Disallowing nested defines + +Top-level `Define` inside a function body should be rejected: + +```rust +fn check_expr_in_body(expr: &Expr, env: &Env) -> Result<(), CompileError> { + if let Expr::Define { .. } = expr { + return Err(CompileError::SemanticError( + "define is only allowed at the top level".to_string() + )); + } + check_expr(expr, env) +} +``` + +Use `check_expr_in_body` when checking lambda bodies, let bodies, and begin expressions. 
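
The uniqueness rules for `Lambda` parameters and `Let` binding names are listed above without code. A minimal sketch of one way to express them, assuming a hypothetical helper `first_duplicate` (not part of the course code) and assuming names are stored as `String` in the AST:

```rust
use std::collections::HashSet;

/// Hypothetical helper: returns the first name that appears more than once,
/// or `None` if every name is distinct.
fn first_duplicate(names: &[String]) -> Option<&str> {
    let mut seen = HashSet::new();
    for name in names {
        // `HashSet::insert` returns false when the value was already present.
        if !seen.insert(name.as_str()) {
            return Some(name.as_str());
        }
    }
    None
}
```

The analyser could call this on a lambda's parameter list (or on the names collected from a `let` binding list) and, on `Some(name)`, emit the "duplicate parameter name" error with that name filled in.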
+ +### Unit tests + +```rust +#[test] +fn test_duplicate_params_rejected() { + let exprs = parse("(define (f x x) x)").unwrap(); + assert!(analyse(exprs).is_err()); +} + +#[test] +fn test_wrong_arity_rejected() { + let exprs = parse("(+ 1 2 3)").unwrap(); + assert!(analyse(exprs).is_err()); +} + +#[test] +fn test_nested_define_rejected() { + let exprs = parse("(define (f x) (define y 1) y)").unwrap(); + assert!(analyse(exprs).is_err()); +} + +#[test] +fn test_valid_program_passes() { + let exprs = parse("(define (factorial n) (if (= n 0) 1 (* n (factorial (- n 1)))))").unwrap(); + assert!(analyse(exprs).is_ok()); +} +``` + +## Style notes + +- This section is shorter than §10 — it is an extension, not a redesign +- The built-in arity table is the most useful part; present it as a reference table in the text +- End with a checkpoint: run the full analyser on the factorial example and verify it passes diff --git a/edu/.nbd/tickets/6ec5ff.md b/edu/.nbd/tickets/6ec5ff.md new file mode 100644 index 0000000..4332eef --- /dev/null +++ b/edu/.nbd/tickets/6ec5ff.md @@ -0,0 +1,61 @@ ++++ +title = "§5 Under the Hood: ANN Algorithms" +priority = 5 +status = "todo" +ticket_type = "task" +dependencies = [] ++++ +## §5 Under the Hood: ANN Algorithms — Stub to fill + +File: `edu/src/vector-db.md`, section `### 5. Under the Hood: ANN Algorithms` + +Replace this stub line with full content: +> Exact nearest-neighbour search over millions of high-dimensional vectors is too slow [...] 🚧 Full content tracked in [nbd:6ec5ff]. + +This is a **reading lesson** — no Rust code. Target 600–800 words. Include the summary table below. + +## Learning objectives + +- Understand why exact KNN is impractical at scale (O(n·d) per query) +- Understand how HNSW works conceptually (multi-level navigable graph, greedy search) +- Understand how IVFFlat works conceptually (k-means clustering, inverted index) +- Know the key tuning parameters for each and what they control +- Understand the recall vs. 
latency trade-off +- Know that sqlite-vec uses HNSW via `libsql_vector_idx` + +## Content to write + +**Why not exact search?** Brute-force KNN computes distance from the query to every stored vector: O(n·d) per query. At n=1M vectors, d=768 dimensions, and 1000 QPS this is ~768B operations/second — infeasible on a CPU. ANN algorithms find approximate results in O(log n) or sub-linear time at the cost of occasionally missing a few true nearest neighbours. + +**HNSW — Hierarchical Navigable Small World.** The dominant algorithm for in-memory ANN, used by sqlite-vec. + +Intuition: imagine a multi-level skip list where each level is a proximity graph. The top level is sparse with long-range connections (fast coarse navigation). The bottom level is dense with short-range connections (precise local search). A query starts at the top, greedily moves to whichever neighbour is closest to the query, descends when stuck, and repeats down to the bottom level where the k nearest candidates are collected. + +Key parameters: +- `M`: number of bidirectional connections per node. Higher M → better recall, more memory, slower inserts. Typical: 16. +- `ef_construction`: candidate list size during index build. Higher → better index quality, slower build. Typical: 200. +- `ef_search`: candidate list size during query. Higher → better recall, slower query. Often defaults to k. + +HNSW supports incremental inserts with no full rebuild. Memory cost is O(n·M·4 bytes). + +**IVFFlat — Inverted File with flat quantisation.** The dominant approach for disk-based or GPU-accelerated ANN (used by Faiss, pgvector default). + +Intuition: cluster the dataset into `nlist` Voronoi cells using k-means. At query time, find the `nprobe` nearest cell centroids, then do exact search within those cells only — skipping the rest of the dataset. + +Key parameters: +- `nlist`: number of clusters. Typical: √n. +- `nprobe`: number of clusters searched at query time. Higher → better recall, slower query. 
+ +IVFFlat requires a training step before data can be inserted. Incremental inserts require reassigning to clusters (or periodic retraining). Lower memory than HNSW for the same n. + +**sqlite-vec uses HNSW.** The `libsql_vector_idx` index type creates an HNSW index — which is why §6 can insert rows incrementally without a training step. The current API does not expose M or ef parameters; defaults are chosen for broad applicability. + +**Summary table.** + +| Property | HNSW | IVFFlat | +|---|---|---| +| Query time | O(log n) | O(nprobe · n/nlist) | +| Insert | Incremental | Batch (requires training) | +| Memory | Higher (graph edges) | Lower | +| Recall@10 at defaults | ~0.95+ | ~0.90+ (depends on nprobe) | +| Used by | sqlite-vec, Qdrant, Weaviate | pgvector, Faiss | \ No newline at end of file diff --git a/edu/.nbd/tickets/8fa47a.md b/edu/.nbd/tickets/8fa47a.md new file mode 100644 index 0000000..0b782dc --- /dev/null +++ b/edu/.nbd/tickets/8fa47a.md @@ -0,0 +1,176 @@ ++++ +title = "§17 Testing the Compiler" +priority = 5 +status = "todo" +ticket_type = "task" +dependencies = [] ++++ + +## §17 Testing the Compiler — Stub to fill + +File: `edu/src/lisp-compiler.md`, section `### 17. Testing the Compiler` + +Replace the stub line with full content. Target 700–900 words. Add unit tests per stage and integration tests that compile MiniLisp → C → binary → run and assert on output. + +## Learning objectives + +- Know which unit tests to write for each compiler stage +- Write integration tests that invoke `cc` and run the result +- Build a test corpus covering all language features +- Understand how to test error paths (invalid programs should produce errors, not panics) + +## Content to write + +### Testing strategy + +Three levels: + +1. **Unit tests** per module: already written in §8–15 for individual parsers and code-gen functions. This section consolidates and supplements them. +2. 
**`compile()` round-trip tests**: call `compile(source)` and assert on the C output string (does it contain expected fragments?). +3. **Integration tests**: compile → write to temp file → `cc` → run → assert stdout. + +### Additional unit tests + +Supplement the tests from earlier sections with edge cases: + +**Parser edge cases:** +```rust +#[test] fn empty_program() { assert_eq!(parse("").unwrap(), vec![]); } +#[test] fn only_comments() { assert!(parse("; hello\n; world").unwrap().is_empty()); } +#[test] fn nested_parens() { parse("(f (g (h 1)))").unwrap(); } +#[test] fn string_escapes() { assert_eq!(parse(r#""\n""#).unwrap()[0], Expr::Str("\n".into())); } +``` + +**Analyser error cases:** +```rust +#[test] fn undefined_rejects() { assert!(analyse(parse("x").unwrap()).is_err()); } +#[test] fn bad_arity_rejects() { assert!(analyse(parse("(+ 1 2 3)").unwrap()).is_err()); } +#[test] fn valid_mutual_rec() { assert!(analyse(parse(MUTUAL_REC_SRC).unwrap()).is_ok()); } +``` + +### Integration tests + +Write an integration test helper that: +1. Calls `compile(source)` to get C source +2. Writes C source to a temp file (`/tmp/ml_test_{N}.c`) +3. Invokes `cc -o /tmp/ml_test_{N} /tmp/ml_test_{N}.c` +4. Runs the binary and captures stdout +5. Asserts stdout matches expected output +6. 
Cleans up temp files + +```rust +#[cfg(test)] +fn run_minilisp(source: &str) -> String { + use std::process::Command; + let c_source = crate::compile(source).expect("compile failed"); + let c_path = "/tmp/ml_test.c"; + let bin_path = "/tmp/ml_test"; + std::fs::write(c_path, &c_source).unwrap(); + let cc_status = Command::new("cc") + .args([c_path, "-o", bin_path]) + .status() + .expect("cc not found"); + assert!(cc_status.success(), "C compilation failed:\n{}", c_source); + let output = Command::new(bin_path).output().expect("run failed"); + String::from_utf8(output.stdout).unwrap() +} +``` + +### Test corpus + +Write integration tests for each major language feature: + +```rust +#[test] +fn test_display_integer() { + assert_eq!(run_minilisp("(display 42)(newline)"), "42\n"); +} + +#[test] +fn test_arithmetic() { + assert_eq!(run_minilisp("(display (+ (* 3 4) (- 10 5)))(newline)"), "17\n"); +} + +#[test] +fn test_boolean_if() { + let src = "(display (if #t 1 2))(newline)"; + assert_eq!(run_minilisp(src), "1\n"); +} + +#[test] +fn test_recursive_factorial() { + let src = r#" + (define (factorial n) + (if (= n 0) 1 (* n (factorial (- n 1))))) + (display (factorial 10))(newline) + "#; + assert_eq!(run_minilisp(src), "3628800\n"); +} + +#[test] +fn test_let() { + let src = "(display (let ((x 3) (y 4)) (+ x y)))(newline)"; + assert_eq!(run_minilisp(src), "7\n"); +} + +#[test] +fn test_mutual_recursion() { + let src = r#" + (define (even? n) (if (= n 0) #t (odd? (- n 1)))) + (define (odd? n) (if (= n 0) #f (even? (- n 1)))) + (display (even? 10))(newline) + (display (odd? 
7))(newline) + "#; + assert_eq!(run_minilisp(src), "true\ntrue\n"); +} + +#[test] +fn test_higher_order() { + let src = r#" + (define (apply f x) (f x)) + (define (double x) (* x 2)) + (display (apply double 5))(newline) + "#; + assert_eq!(run_minilisp(src), "10\n"); +} + +#[test] +fn test_begin() { + let src = "(begin (display 1)(display 2)(display 3)(newline))"; + assert_eq!(run_minilisp(src), "123\n"); +} +``` + +### Testing error paths + +```rust +#[test] +fn test_undefined_symbol_error() { + assert!(crate::compile("(display undefined-var)").is_err()); +} + +#[test] +fn test_unmatched_paren_error() { + assert!(crate::compile("(define x 1").is_err()); +} + +#[test] +fn test_wrong_arity_error() { + assert!(crate::compile("(+ 1 2 3)").is_err()); +} +``` + +### Running the tests + +```sh +cargo test +``` + +Note: integration tests require `cc` (or `gcc`/`clang`) to be installed. On a system without a C compiler, they will panic with "cc not found". Consider gating them with `#[ignore]` or a feature flag. + +## Style notes + +- The `run_minilisp` helper is the key piece of infrastructure — spend time on it +- The test corpus should cover every special form (define, lambda, if, let, begin) and every built-in +- Temporary file management in tests is inelegant but explicit; mention that `tempfile` crate would be cleaner in production +- Flag the `cc` dependency clearly — integration tests are not hermetic diff --git a/edu/.nbd/tickets/99e1d9.md b/edu/.nbd/tickets/99e1d9.md new file mode 100644 index 0000000..2555b21 --- /dev/null +++ b/edu/.nbd/tickets/99e1d9.md @@ -0,0 +1,37 @@ ++++ +title = "§3 Vector Similarity" +priority = 5 +status = "todo" +ticket_type = "task" +dependencies = [] ++++ +## §3 Vector Similarity — Stub to fill + +File: `edu/src/vector-db.md`, section `### 3. Vector Similarity` + +Replace this stub line with full content: +> Once you have two vectors, how do you measure how alike they are? [...] 🚧 Full content tracked in [nbd:99e1d9]. 
+ +This is a **reading lesson with inline math** — no Rust code. Target 400–600 words. Bold lead phrases, inline math using Unicode (not LaTeX). Include a small worked example with concrete 3D numbers. + +## Learning objectives + +- Know the three main similarity/distance functions: cosine similarity, dot product, Euclidean distance +- Understand the formula and geometric meaning of each +- Know the relationship between cosine similarity and cosine distance (what `vector_distance_cos` actually returns) +- Know when each metric is appropriate +- Understand why normalised vectors simplify the choice + +## Content to write + +**Cosine similarity.** Formula: cos(θ) = (a · b) / (‖a‖ · ‖b‖). Range −1 to 1 (1 = same direction, 0 = orthogonal, −1 = opposite). Measures the angle between vectors, ignoring magnitude. Ideal for text embeddings: a short and long document on the same topic produce vectors that differ in magnitude but not direction. + +**Cosine distance.** 1 − cosine_similarity. Range 0 to 2. This is what sqlite-vec's `vector_distance_cos` returns (0 = identical, 2 = fully opposite). Clarify the naming: the function name says "cos" but returns a *distance*, not a similarity — smaller is more similar. + +**Dot product.** Formula: a · b = Σᵢ aᵢbᵢ. For unit-normalised vectors, dot product equals cosine similarity (since ‖a‖ = ‖b‖ = 1 cancels out). For unnormalised vectors, it conflates magnitude and angle. Some models are trained specifically for maximum inner product search (MIPS) — their documentation will say so. + +**Euclidean (L2) distance.** Formula: ‖a − b‖ = √(Σᵢ (aᵢ − bᵢ)²). Range 0 to ∞. Sensitive to vector magnitude. Appropriate for low-dimensional geometric/tabular data where absolute coordinate values carry meaning. + +**When to use each.** Text and sentence embeddings: cosine (or dot product if model outputs unit vectors, which many do). Follow the model card's recommendation when specified. Low-dimensional geometric features: L2. 
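
For checking the arithmetic before it goes into the lesson text, the formulas above can be sketched in a few lines of Rust. This is a scratch aid for whoever fills in the stub, not lesson content (the lesson itself stays code-free). The function names are illustrative, and `f64` slices of equal length are assumed.

```rust
/// Dot product: a · b = Σᵢ aᵢbᵢ. Assumes both slices have the same length.
fn dot(a: &[f64], b: &[f64]) -> f64 {
    a.iter().zip(b).map(|(x, y)| x * y).sum()
}

/// Euclidean norm ‖a‖.
fn norm(a: &[f64]) -> f64 {
    dot(a, a).sqrt()
}

/// Cosine similarity: (a · b) / (‖a‖ · ‖b‖). Yields NaN for a zero vector.
fn cosine_similarity(a: &[f64], b: &[f64]) -> f64 {
    dot(a, b) / (norm(a) * norm(b))
}

/// Cosine distance: 1 − cosine similarity, so smaller means more similar.
fn cosine_distance(a: &[f64], b: &[f64]) -> f64 {
    1.0 - cosine_similarity(a, b)
}

/// Euclidean (L2) distance: ‖a − b‖.
fn l2_distance(a: &[f64], b: &[f64]) -> f64 {
    a.iter().zip(b).map(|(x, y)| (x - y).powi(2)).sum::<f64>().sqrt()
}
```

With a = [1, 0, 1] and b = [1, 1, 0] these should agree with the hand-computed values in the worked example, up to floating-point rounding.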
+
+**Worked example.** Use vectors a = [1, 0, 1] and b = [1, 1, 0]. Compute all three by hand and show the arithmetic step by step. Cosine similarity = 0.5, L2 distance ≈ 1.414, dot product = 1. This concretises the formulas before the reader sees them in SQL queries.
\ No newline at end of file
diff --git a/edu/.nbd/tickets/a1a827.md b/edu/.nbd/tickets/a1a827.md
new file mode 100644
index 0000000..b8f6dec
--- /dev/null
+++ b/edu/.nbd/tickets/a1a827.md
@@ -0,0 +1,151 @@
++++
+title = "§7 The Abstract Syntax Tree"
+priority = 5
+status = "todo"
+ticket_type = "task"
+dependencies = []
++++
+
+## §7 The Abstract Syntax Tree — Stub to fill
+
+File: `edu/src/lisp-compiler.md`, section `### 7. The Abstract Syntax Tree`
+
+Replace the stub line with full content. Target 600–800 words. Define the complete `Expr` enum, explain the design choices, implement `Display`, and show how a MiniLisp program maps to the AST.
+
+## Learning objectives
+
+- Understand what an AST is and why it is separate from the concrete syntax
+- Know the complete `Expr` enum and all its variants
+- Understand the design trade-off between a generic `List` variant and specific special-form variants
+- Implement `Display` for `Expr` to enable debugging
+
+## Content to write
+
+### What is an AST?
+
+An Abstract Syntax Tree strips away syntactic noise — parentheses, whitespace, comments — and represents only the semantic structure of a program. Two programs with different whitespace or comment placement produce identical ASTs. The AST is the compiler's internal representation from the parser forward.
+
+### Design Decision: Generic vs. Specific Variants
+
+Two approaches for representing Lisp forms in the AST:
+
+**Option A — Generic list**: everything is either an atom or a `List(Vec<Expr>)`. Special forms are recognized during semantic analysis or code generation.
+
+**Option B — Specific variants**: each special form (`Define`, `If`, `Lambda`, etc.) gets its own enum variant, recognized during parsing.
+
+We use Option B. It means the parser does more work, but the analyser and code generator deal with well-typed data rather than raw lists. Exhaustive pattern matching catches missed cases at compile time.
+
+### The `Expr` Enum
+
+Define in `src/ast.rs`:
+
+```rust
+/// A MiniLisp expression — the core AST node type.
+#[derive(Debug, Clone, PartialEq)]
+pub enum Expr {
+    /// Integer literal: `42`, `-7`
+    Int(i64),
+    /// Boolean literal: `#t`, `#f`
+    Bool(bool),
+    /// String literal: `"hello"`
+    Str(String),
+    /// Symbol (variable name or operator): `x`, `+`, `my-var`
+    Symbol(String),
+    /// Variable binding: `(define name expr)`
+    Define {
+        name: String,
+        value: Box<Expr>,
+    },
+    // Function definition shorthand: `(define (name params...) body...)`
+    // Desugared by the parser into a `Define` wrapping a `Lambda`.
+    // (No separate variant needed.)
+
+    /// Anonymous function: `(lambda (params...) body...)`
+    Lambda {
+        params: Vec<String>,
+        body: Vec<Expr>,
+    },
+    /// Conditional: `(if cond then else)`
+    If {
+        cond: Box<Expr>,
+        then: Box<Expr>,
+        else_: Box<Expr>,
+    },
+    /// Local bindings: `(let ((x 1) (y 2)) body...)`
+    Let {
+        bindings: Vec<(String, Expr)>,
+        body: Vec<Expr>,
+    },
+    /// Sequencing: `(begin expr1 expr2 ...)`
+    Begin(Vec<Expr>),
+    /// Function or operator call: `(f arg1 arg2 ...)`
+    Call {
+        func: Box<Expr>,
+        args: Vec<Expr>,
+    },
+}
+```
+
+For each variant, explain:
+- What MiniLisp syntax it represents
+- Why `Box<Expr>` is needed for recursive fields (Rust requires known size for enum variants)
+- Why `body` in `Lambda` and `Let` is `Vec<Expr>` (multiple expressions, last one is the return value)
+
+### `Display` implementation
+
+Implement `Display` for `Expr` so you can print ASTs during development:
+
+```rust
+impl std::fmt::Display for Expr {
+    fn fmt(&self, f: &mut std::fmt::Formatter<'_>) -> std::fmt::Result {
+        match self {
+            Expr::Int(n) => write!(f, "{}", n),
+            Expr::Bool(b) => write!(f, "{}", if *b { "#t" } else { "#f" }),
+            Expr::Str(s) => write!(f, "\"{}\"", s.escape_default()),
+
Expr::Symbol(s) => write!(f, "{}", s), + Expr::Define { name, value } => write!(f, "(define {} {})", name, value), + Expr::Lambda { params, body } => { + write!(f, "(lambda ({}) ", params.join(" "))?; + for (i, e) in body.iter().enumerate() { + if i > 0 { write!(f, " ")?; } + write!(f, "{}", e)?; + } + write!(f, ")") + } + Expr::If { cond, then, else_ } => write!(f, "(if {} {} {})", cond, then, else_), + Expr::Let { bindings, body } => { + write!(f, "(let (")?; + for (name, val) in bindings { + write!(f, "({} {})", name, val)?; + } + write!(f, ") ")?; + for e in body { write!(f, "{}", e)?; } + write!(f, ")") + } + Expr::Begin(exprs) => { + write!(f, "(begin ")?; + for (i, e) in exprs.iter().enumerate() { + if i > 0 { write!(f, " ")?; } + write!(f, "{}", e)?; + } + write!(f, ")") + } + Expr::Call { func, args } => { + write!(f, "({}", func)?; + for a in args { write!(f, " {}", a)?; } + write!(f, ")") + } + } + } +} +``` + +### Mapping Example + +Show how the factorial program from §1 maps to AST values. Write out the `Expr` tree for `(define (factorial n) (if (= n 0) 1 (* n (factorial (- n 1)))))` in Rust `Expr` literal notation. This makes the structure concrete. + +## Style notes + +- The design-decision discussion (generic vs. specific) should come before the code — readers should understand *why* we chose specific variants +- Every variant should have a comment showing the corresponding MiniLisp syntax +- The `Display` impl is a debugging aid; note that it is not tested for correctness beyond "it does not panic" diff --git a/edu/.nbd/tickets/a4c9f8.md b/edu/.nbd/tickets/a4c9f8.md new file mode 100644 index 0000000..cefe01d --- /dev/null +++ b/edu/.nbd/tickets/a4c9f8.md @@ -0,0 +1,192 @@ ++++ +title = "§9 Parsing S-Expressions and Special Forms" +priority = 5 +status = "todo" +ticket_type = "task" +dependencies = [] ++++ + +## §9 Parsing S-Expressions and Special Forms — Stub to fill + +File: `edu/src/lisp-compiler.md`, section `### 9. 
Parsing S-Expressions and Special Forms` + +Replace the stub line with full content. Target 1000–1300 words. This is the hardest parsing section — recursive parsers, special-form recognition, and the top-level `parse` entry point. + +## Learning objectives + +- Write a recursive parser in nom (handling the recursion challenge) +- Distinguish special forms from generic calls during parsing and produce typed AST variants +- Parse `define`, `lambda`, `if`, `let`, `begin` into the correct `Expr` variants +- Implement the top-level `parse` function +- Understand when to use `cut` to commit to a parse branch + +## Content to write + +### The Recursion Problem in nom + +nom parsers must have known types at compile time, but a parser for S-expressions is recursive: an expression is either an atom or a list of expressions. Rust's type system normally prevents this with "infinite type" errors. + +Solution: use a function definition rather than a closure, and break the cycle with a forward reference. In Rust, a named function works because the function pointer has a known size. + +```rust +pub fn parse_expr(input: &str) -> IResult<&str, Expr> { + ws(alt(( + parse_list, + parse_atom, + )))(input) +} +``` + +`parse_list` calls `parse_expr` recursively. Because `parse_expr` is a named function (not a closure), its type is `fn(&str) -> IResult<&str, Expr>` — a known size — so the recursion is fine. + +### Parsing Generic Lists → Calls + +A generic list `(func arg1 arg2 ...)` is parsed into `Expr::Call`: + +```rust +fn parse_call(input: &str) -> IResult<&str, Expr> { + let (input, exprs) = delimited( + ws(char('(')), + many1(ws(parse_expr)), + ws(char(')')), + )(input)?; + let mut iter = exprs.into_iter(); + let func = iter.next().unwrap(); // safe: many1 guarantees >= 1 + let args = iter.collect(); + Ok((input, Expr::Call { func: Box::new(func), args })) +} +``` + +### Recognizing Special Forms + +Special forms are lists that begin with a specific keyword. 
Recognize them *inside* the list parser by peeking at the first token. The cleanest approach: try each special-form parser in an `alt` before falling back to `parse_call`. + +```rust +fn parse_list(input: &str) -> IResult<&str, Expr> { + alt(( + parse_define, + parse_lambda, + parse_if, + parse_let, + parse_begin, + parse_call, + ))(input) +} +``` + +### Parsing `define` + +Two shapes: `(define name expr)` and `(define (name params...) body...)`. Parse both; the second desugars into a `Define` wrapping a `Lambda`. + +```rust +fn parse_define(input: &str) -> IResult<&str, Expr> { + let (input, _) = ws(char('('))(input)?; + let (input, _) = ws(tag("define"))(input)?; + // Use cut here: we've seen "(define", so commit to this branch + cut(|input| { + alt(( + // Function shorthand: (define (name params...) body...) + |input| { + let (input, _) = ws(char('('))(input)?; + let (input, name) = ws(parse_symbol_str)(input)?; + let (input, params) = many0(ws(parse_symbol_str))(input)?; + let (input, _) = ws(char(')'))(input)?; + let (input, body) = many1(ws(parse_expr))(input)?; + let (input, _) = ws(char(')'))(input)?; + let lambda = Expr::Lambda { params, body }; + Ok((input, Expr::Define { name: name.to_string(), value: Box::new(lambda) })) + }, + // Variable binding: (define name expr) + |input| { + let (input, name) = ws(parse_symbol_str)(input)?; + let (input, value) = ws(parse_expr)(input)?; + let (input, _) = ws(char(')'))(input)?; + Ok((input, Expr::Define { name: name.to_string(), value: Box::new(value) })) + }, + ))(input) + })(input) +} +``` + +Explain `cut`: after matching `(define`, we are committed to this branch. If the body is malformed, `cut` converts recoverable errors to failures, producing better error messages and preventing backtracking to `parse_call`. + +### Parsing `lambda`, `if`, `let`, `begin` + +Show each parser in similar style. Key details: + +**`lambda`**: `(lambda (params...) 
body...)` — use `many0` for params (zero-parameter functions are valid), `many1` for body.
+
+**`if`**: `(if cond then else)` — exactly three sub-expressions; the third (`else`) is required in MiniLisp.
+
+**`let`**: `(let ((name expr)...) body...)` — parse a list of `(name expr)` pairs, collect into `Vec<(String, Expr)>`.
+
+**`begin`**: `(begin expr...)` — one or more expressions.
+
+### Comments in the expression parser
+
+Comments must be silently consumed wherever whitespace is allowed. Update `ws` (or create a separate `skip` combinator) to skip both whitespace and comments:
+
+```rust
+fn skip(input: &str) -> IResult<&str, ()> {
+    value((), many0(alt((
+        value((), multispace1),
+        value((), pair(char(';'), opt(is_not("\n\r")))),
+    ))))(input)
+}
+```
+
+Then use `skip` in place of `multispace0` in the `ws` wrapper.
+
+### The top-level `parse` function
+
+```rust
+/// Parse a complete MiniLisp program (zero or more top-level expressions).
+pub fn parse(source: &str) -> Result<Vec<Expr>, crate::error::CompileError> {
+    let (remaining, exprs) = many0(ws(parse_expr))(source)
+        .map_err(|e| crate::error::CompileError::ParseError(e.to_string()))?;
+    if !remaining.trim().is_empty() {
+        return Err(crate::error::CompileError::ParseError(
+            format!("unexpected input: {:?}", &remaining[..remaining.len().min(20)])
+        ));
+    }
+    Ok(exprs)
+}
+```
+
+### Unit tests
+
+```rust
+#[test]
+fn test_parse_if() {
+    let src = "(if #t 1 2)";
+    let result = parse(src).unwrap();
+    assert_eq!(result.len(), 1);
+    assert!(matches!(result[0], Expr::If { .. }));
+}
+
+#[test]
+fn test_parse_define_fn() {
+    let src = "(define (add a b) (+ a b))";
+    let result = parse(src).unwrap();
+    assert!(matches!(&result[0], Expr::Define { name, ..
} if name == "add")); +} + +#[test] +fn test_nested_calls() { + let src = "(display (* 2 (+ 3 4)))"; + assert!(parse(src).is_ok()); +} + +#[test] +fn test_comments_skipped() { + let src = "; this is a comment\n(define x 42)"; + assert!(parse(src).is_ok()); +} +``` + +## Style notes + +- The recursion problem is the hardest conceptual moment — explain it thoroughly before showing the solution +- `cut` is essential for good error messages; explain why each use of `cut` is there +- The top-level `parse` function must check for unconsumed input — show why (trailing garbage would otherwise be silently ignored) +- End with a checkpoint: parse the complete factorial example and print the AST using the `Display` impl from §7 diff --git a/edu/.nbd/tickets/a93829.md b/edu/.nbd/tickets/a93829.md new file mode 100644 index 0000000..f2ce56f --- /dev/null +++ b/edu/.nbd/tickets/a93829.md @@ -0,0 +1,125 @@ ++++ +title = "§2 MiniLisp Language Specification" +priority = 5 +status = "todo" +ticket_type = "task" +dependencies = [] ++++ + +## §2 MiniLisp Language Specification — Stub to fill + +File: `edu/src/lisp-compiler.md`, section `### 2. MiniLisp Language Specification` + +Replace the stub line with full content. Target 700–1000 words. This is a reference section — clear, precise, complete. Define the entire language before the reader writes a single parser. + +## Learning objectives + +- Know every data type and its literal syntax +- Understand every special form and its evaluation rule +- Know which built-in operators and functions exist +- Know what is explicitly out of scope +- Have seen a complete, realistic MiniLisp program + +## Content to write + +### EBNF Grammar + +Lead with the grammar as a compact reference: + +```ebnf +program = expr* EOF +expr = atom | list | comment +atom = integer | boolean | string | symbol +integer = '-'? 
DIGIT+ +boolean = '#t' | '#f' +string = '"' (char | escape)* '"' +escape = '\\' ('"' | '\\' | 'n' | 't') +symbol = sym_start sym_cont* +sym_start = ALPHA | '-' | '_' | '?' | '!' | '+' | '*' | '/' | '=' | '<' | '>' +sym_cont = sym_start | DIGIT +list = '(' expr* ')' +comment = ';' (not NEWLINE)* NEWLINE +``` + +Note: symbol must not match `-` followed by a digit (that is a negative integer). + +### Data Types + +**Integers.** 64-bit signed. Optional `-` followed by one or more decimal digits. Examples: `42`, `-7`, `0`. + +**Booleans.** `#t` (true) and `#f` (false). + +**Strings.** Double-quoted. Supported escapes: `\"`, `\\`, `\n`, `\t`. Example: `"hello, world\n"`. + +**Symbols.** Identifiers made of letters, digits, and `-_?!+*/=<>`. Must not begin with a digit. Must not begin with `-` followed by a digit. Examples: `x`, `my-var`, `factorial`, `zero?`, `+`. + +**Comments.** `;` to end of line. No block comments. + +### Special Forms + +For each form: syntax pattern, evaluation rule, and one or two examples. + +**`(define <name> <expr>)`** — Bind `<name>` to the value of `<expr>`. At top level, creates a global. Inside a function body, creates a local. + +**`(define (<name> <params>...) <body>...)`** — Shorthand for `(define <name> (lambda (<params>...) <body>...))`. Requires at least one body expression. + +**`(lambda (<params>...) <body>...)`** — Creates a function. Parameters and body are evaluated in a new scope. Returns the value of the last body expression. In MiniLisp, lambdas may only reference their own parameters and top-level names (no closures over enclosing function locals). + +**`(if <cond> <then> <else>)`** — Evaluates `<cond>`; returns `<then>` or `<else>` depending on truthiness. Both branches are required. + +**`(let ((<name> <expr>)...) <body>...)`** — Evaluates each `<expr>` in the current scope, then binds results to the corresponding `<name>` in a new inner scope, then evaluates `<body>...` in that scope. Returns the last body value. + +**`(begin <expr>...)`** — Evaluates each expression in order; returns the last value. Used when multiple side effects are needed inside an `if` branch. 
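The symbol rules from the grammar and the Data Types list above are easy to get subtly wrong, so it may help to see them as a predicate. The following is a minimal sketch, not the course's parser: the function names are hypothetical, and it assumes `ALPHA` means ASCII letters only.

```rust
/// Characters allowed at the start of a symbol (`sym_start` in the grammar).
/// Assumption: ALPHA is restricted to ASCII letters.
fn is_sym_start(c: char) -> bool {
    c.is_ascii_alphabetic() || "-_?!+*/=<>".contains(c)
}

/// Characters allowed after the first one (`sym_cont` in the grammar).
fn is_sym_cont(c: char) -> bool {
    is_sym_start(c) || c.is_ascii_digit()
}

/// Returns true when `s` is, in its entirety, a valid MiniLisp symbol.
fn is_symbol(s: &str) -> bool {
    let mut chars = s.chars();
    let first = match chars.next() {
        Some(c) => c,
        None => return false, // the empty string is not a symbol
    };
    if !is_sym_start(first) {
        return false;
    }
    let rest = chars.as_str();
    // `-` followed by a digit is a negative integer, not a symbol.
    if first == '-' && rest.chars().next().map_or(false, |c| c.is_ascii_digit()) {
        return false;
    }
    rest.chars().all(is_sym_cont)
}
```

A lone `-` passes the check while `-7` does not, which is the distinction the note under the grammar calls out.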
+ +### Built-in Operators + +These compile directly to C infix operators. All take exactly two arguments except `not` (one argument). + +| Form | C equivalent | Return type | +|---|---|---| +| `(+ a b)` | `a + b` | integer | +| `(- a b)` | `a - b` | integer | +| `(* a b)` | `a * b` | integer | +| `(/ a b)` | `a / b` | integer (truncating) | +| `(= a b)` | `a == b` | boolean | +| `(< a b)` | `a < b` | boolean | +| `(> a b)` | `a > b` | boolean | +| `(<= a b)` | `a <= b` | boolean | +| `(>= a b)` | `a >= b` | boolean | +| `(not a)` | `!a` | boolean | + +### Built-in Functions + +These compile to C function calls defined in the preamble. + +| Form | Behaviour | +|---|---| +| `(display expr)` | Print value to stdout; integers with `%ld`, booleans as `true`/`false`, strings with `%s` | +| `(newline)` | Print `\n` to stdout | +| `(error msg)` | Print `msg` to stderr and call `exit(1)` | + +### What Is NOT Supported + +Be explicit. This sets expectations and prevents confusion: + +- No floating-point numbers +- No pairs or lists (`cons`, `car`, `cdr`) +- No closures (lambdas cannot capture enclosing function locals) +- No tail-call optimisation +- No garbage collector +- No variadic functions +- No macros +- No `quote` or `quasiquote` + +Section 18 discusses how these could be added. + +### Complete Example Program + +End with a realistic multi-function program that exercises: `define` (function and variable), `if`, recursion, `let`, `begin`, `display`, `newline`, and arithmetic. A good choice: define `even?`, `odd?`, and `collatz-length`, then print results for a few inputs. 
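Returning to the operator table above: the mapping from MiniLisp operators to C is small enough to sketch as a lookup. This is an illustrative sketch only; the helper names (`c_binary_operator`, `gen_binary`, `gen_not`) are hypothetical, and the parenthesised output format is an assumption chosen so that nested expressions keep their grouping.

```rust
/// Map a MiniLisp binary operator to its C spelling, following the table.
/// Returns `None` for anything that is not a built-in binary operator.
fn c_binary_operator(op: &str) -> Option<&'static str> {
    match op {
        "+" => Some("+"),
        "-" => Some("-"),
        "*" => Some("*"),
        "/" => Some("/"),
        "=" => Some("=="), // the only operator whose spelling changes
        "<" => Some("<"),
        ">" => Some(">"),
        "<=" => Some("<="),
        ">=" => Some(">="),
        _ => None,
    }
}

/// Render a two-argument operator call as a parenthesised C infix expression.
fn gen_binary(op: &str, lhs: &str, rhs: &str) -> Option<String> {
    c_binary_operator(op).map(|c_op| format!("({} {} {})", lhs, c_op, rhs))
}

/// Render the one-argument `not` as C logical negation.
fn gen_not(operand: &str) -> String {
    format!("(!{})", operand)
}
```

Wrapping every result in parentheses sidesteps C operator precedence entirely, at the cost of slightly noisier output.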
+ +## Style notes + +- Grammar first (dense reference), then prose elaboration +- Tables for operators and built-ins +- Complete example is the last thing — a reward for reading the spec +- Be precise about symbol character set; ambiguity here causes parser bugs diff --git a/edu/.nbd/tickets/b6c9ad.md b/edu/.nbd/tickets/b6c9ad.md new file mode 100644 index 0000000..7834eaa --- /dev/null +++ b/edu/.nbd/tickets/b6c9ad.md @@ -0,0 +1,137 @@ ++++ +title = "§8 Parsing Atoms with nom" +priority = 5 +status = "todo" +ticket_type = "task" +dependencies = [] ++++ + +## §8 Parsing Atoms with nom — Stub to fill + +File: `edu/src/lisp-compiler.md`, section `### 8. Parsing Atoms with nom` + +Replace the stub line with full content. Target 600–800 words. This section takes the individual atom parsers from §6 and the AST from §7 and combines them into a single `parse_atom` function that returns `IResult<&str, Expr>`. Includes tests. + +## Learning objectives + +- Combine individual atom parsers into a single `alt` using `map` to produce `Expr` values +- Understand how to add `src/parser.rs` to the project properly +- Write comprehensive unit tests for atom parsing +- Handle the ordering constraint in `alt`: integers before symbols + +## Content to write + +### The `parse_atom` function + +In `src/parser.rs`, import the atom parsers from §6 and the `Expr` type from `src/ast.rs`, then combine them: + +```rust +use nom::{IResult, branch::alt, combinator::map}; +use crate::ast::Expr; + +/// Parse any MiniLisp atom: integer, boolean, string, or symbol. +pub fn parse_atom(input: &str) -> IResult<&str, Expr> { + alt(( + map(parse_integer, Expr::Int), + map(parse_bool, Expr::Bool), + map(parse_string, Expr::Str), + map(parse_symbol, |s: &str| Expr::Symbol(s.to_string())), + ))(input) +} +``` + +Explain the ordering: +1. **Integer before symbol**: `-7` must match as integer, not as a symbol starting with `-`. 
Because `parse_integer` consumes the full `-7` before `parse_symbol` is tried, the ordering ensures correct behavior. +2. **Boolean before symbol**: `#t` and `#f` are not valid symbols (since `#` is not a symbol-start character), so ordering here does not matter — but it is cleaner to try booleans first. +3. **String position is free**: strings start with `"`, which no other atom does, so there is no overlap and the string parser can sit anywhere in the `alt`. + +### Module organisation + +Show the complete `src/parser.rs` header at this point: + +```rust +//! Parser for MiniLisp source code. +//! +//! Entry point: [`parse`] which accepts a `&str` and returns `Vec<Expr>`. + +use nom::{ + IResult, + branch::alt, + bytes::complete::{escaped_transform, is_not, tag, take_while, take_while_m_n}, + character::complete::{char, digit1, multispace0, line_ending}, + combinator::{map, map_res, opt, recognize, value}, + sequence::{delimited, pair}, +}; + +use crate::ast::Expr; +``` + +### Whitespace-aware atom parser + +Wrap `parse_atom` in the `ws` combinator so callers do not have to think about surrounding whitespace: + +```rust +pub fn parse_atom_ws(input: &str) -> IResult<&str, Expr> { + ws(parse_atom)(input) +} +``` + +### Unit tests + +Write a `#[cfg(test)]` module in `src/parser.rs` testing every atom type with multiple cases: + +```rust +#[cfg(test)] +mod tests { + use super::*; + use crate::ast::Expr; + + #[test] + fn test_integer_atom() { + assert_eq!(parse_atom("42"), Ok(("", Expr::Int(42)))); + assert_eq!(parse_atom("-7 "), Ok((" ", Expr::Int(-7)))); + assert_eq!(parse_atom("0"), Ok(("", Expr::Int(0)))); + } + + #[test] + fn test_bool_atom() { + assert_eq!(parse_atom("#t"), Ok(("", Expr::Bool(true)))); + assert_eq!(parse_atom("#f"), Ok(("", Expr::Bool(false)))); + } + + #[test] + fn test_string_atom() { + assert_eq!(parse_atom(r#""hello""#), Ok(("", Expr::Str("hello".into())))); + assert_eq!(parse_atom(r#""a\nb""#), Ok(("", Expr::Str("a\nb".into())))); + } + + #[test] + fn test_symbol_atom() { + assert_eq!(parse_atom("my-var"), Ok(("", 
Expr::Symbol("my-var".into())))); + assert_eq!(parse_atom("+"), Ok(("", Expr::Symbol("+".into())))); + assert_eq!(parse_atom("factorial rest"), Ok((" rest", Expr::Symbol("factorial".into())))); + } + + #[test] + fn test_negative_integer_vs_symbol() { + // -7 must be an integer, not a symbol + assert_eq!(parse_atom("-7"), Ok(("", Expr::Int(-7)))); + // lone - is a symbol + assert_eq!(parse_atom("- "), Ok((" ", Expr::Symbol("-".into())))); + } +} +``` + +### Run the tests + +```sh +cargo test parser +``` + +All tests should pass before proceeding to §9. + +## Style notes + +- The ordering section is the most important teaching moment here — make it explicit +- Show how `map` is used to lift a primitive value into an `Expr` variant +- The test for `-7` vs `-` (lone minus) is critical — flag it as something to get right diff --git a/edu/.nbd/tickets/b7c95f.md b/edu/.nbd/tickets/b7c95f.md new file mode 100644 index 0000000..f97f41b --- /dev/null +++ b/edu/.nbd/tickets/b7c95f.md @@ -0,0 +1,50 @@ ++++ +title = "vector-db" +priority = 5 +status = "todo" +ticket_type = "project" +dependencies = ["21d9be", "584e0c", "99e1d9", "d9f850", "6ec5ff", "37cdd5", "081a55", "5674ce", "4c961f", "1ef9f4", "e8be9a", "5ed295"] ++++ +## Project: Vector Database Self-Guided Course + +This is the top-level project ticket for `edu/src/vector-db.md` — a self-guided mdbook course on vector databases in the **Vibed Learning** site (`edu/`). + +The course is modelled on `edu/src/markov.md`. It teaches vector databases through 12 sections across 4 parts, mixing reading lessons and hands-on Rust programming exercises using Turso (`libsql` crate) and sqlite-vec for local vector storage. + +## Course structure + +| # | Title | Status | +|---|---|---| +| §1 | What Is a Vector? | Written in full | +| §2 | Embeddings | Stub [nbd:584e0c] | +| §3 | Vector Similarity | Stub [nbd:99e1d9] | +| §4 | What Is a Vector Database? 
| Stub [nbd:d9f850] | +| §5 | Under the Hood: ANN Algorithms | Stub [nbd:6ec5ff] | +| §6 | Setting Up | Written in full | +| §7 | Exercise 1 — Storing and Retrieving Vectors | Stub [nbd:081a55] | +| §8 | Exercise 2 — K-Nearest Neighbor Search | Stub [nbd:5674ce] | +| §9 | Generating Embeddings in Rust | Stub [nbd:4c961f] | +| §10 | Exercise 3 — Semantic Document Search | Stub [nbd:1ef9f4] | +| §11 | Exercise 4 — Recommendation Engine | Stub [nbd:e8be9a] | +| §12 | Exercise 5 — Retrieval-Augmented Generation | Stub [nbd:5ed295] | + +## Filling a stub + +1. Open `edu/src/vector-db.md` +2. Find the section (e.g. `### 2. Embeddings`) +3. Replace the stub line (`🚧 Full content tracked in [nbd:...]`) with full content +4. Run `mdbook build` from `edu/` — must pass cleanly +5. Mark the section ticket done + +## Tech stack used in exercises + +- **Runtime:** Tokio async +- **DB crate:** `libsql = "0.9"` (Turso / libSQL Rust client) +- **Vector support:** sqlite-vec, built into libsql — no extra install +- **Embeddings:** `fastembed` crate (local) or OpenAI-compatible HTTP API +- **Local connection:** `Builder::new_local("vectors.db").build().await?` +- **Vector column type:** `F32_BLOB(d)` where d = embedding dimension +- **KNN query:** `vector_top_k('table', vector('[...]'), k)` table-valued function +- **Distance function:** `vector_distance_cos(a, b)` — 0 = identical, 2 = opposite + +This project ticket closes when all 12 section tickets are done and `mdbook build` passes. \ No newline at end of file diff --git a/edu/.nbd/tickets/cbc6e3.md b/edu/.nbd/tickets/cbc6e3.md new file mode 100644 index 0000000..2372866 --- /dev/null +++ b/edu/.nbd/tickets/cbc6e3.md @@ -0,0 +1,157 @@ ++++ +title = "§14 Generating C: Definitions and Functions" +priority = 5 +status = "todo" +ticket_type = "task" +dependencies = [] ++++ + +## §14 Generating C: Definitions and Functions — Stub to fill + +File: `edu/src/lisp-compiler.md`, section `### 14. 
Generating C: Definitions and Functions` + +Replace the stub line with full content. Target 700–900 words. Implement code generation for top-level `define` forms and `lambda` expressions, including forward declarations for mutual recursion. + +## Learning objectives + +- Emit forward declarations for all functions before their definitions +- Generate a correct C function signature from a `Lambda` with named parameters +- Handle variable `define` vs. function `define` +- Understand C's requirement for forward declarations and why MiniLisp needs them + +## Content to write + +### Why forward declarations? + +In C, a function must be declared before it is called. If `even?` calls `odd?` and `odd?` calls `even?`, whichever is defined first will try to call a symbol that has not yet been declared. Forward declarations — just the function signature with no body — solve this by telling the C compiler the signature exists before the definition appears. + +MiniLisp makes no guarantees about definition order, so we emit forward declarations for every top-level function before any definition. + +### Two-pass code generation + +The code generator uses two passes over the top-level `Vec<Expr>`: + +1. **Forward declaration pass**: emit `ml_int ml_name(ml_int param1, ...);` for every top-level `define` that wraps a `lambda`. +2. **Definition pass**: emit the full function body (or variable initializer) for every top-level `define`. + +### Type signatures + +MiniLisp has no type annotations. All values compile to `ml_int` (which is `int64_t`). This includes: +- Integers: trivially `ml_int` +- Booleans: stored as `ml_int` (0 or 1) +- Strings: a limitation — string-returning functions are declared as `ml_int` too, which is technically wrong but will compile for our simple programs. Acknowledge this simplification. + +A more honest approach would be to use `void*` or a tagged union — note this in the "What's Next" section. 
+ +### Generating a forward declaration + +```rust +fn gen_forward_decl(name: &str, lambda: &Expr) -> String { + if let Expr::Lambda { params, .. } = lambda { + let c_name = mangle(name); + let param_list: Vec<String> = params.iter() + .map(|p| format!("ml_int {}", mangle(p))) + .collect(); + format!("ml_int {}({});\n", c_name, param_list.join(", ")) + } else { + String::new() // variable define; no forward declaration needed + } +} +``` + +### Generating a function definition + +```rust +fn gen_function_def(name: &str, params: &[String], body: &[Expr]) -> String { + let c_name = mangle(name); + let param_list: Vec<String> = params.iter() + .map(|p| format!("ml_int {}", mangle(p))) + .collect(); + let mut out = format!("ml_int {}({}) {{\n", c_name, param_list.join(", ")); + + // All body expressions except the last are statements (side effects). + // `gen_stmt` already terminates its output with `;`, so none is added here. + for expr in &body[..body.len() - 1] { + out.push_str(&format!(" {}\n", gen_stmt(expr))); + } + // Last body expression is the return value + let last = body.last().unwrap(); + out.push_str(&format!(" return {};\n", gen_expr(last))); + out.push_str("}\n"); + out +} +``` + +Explain the idiom: all but the last body expression are evaluated as statements (for side effects like `display`); the last is used as the return value. This mirrors Lisp's implicit return of the last expression. + +### Generating a variable definition + +```rust +fn gen_variable_def(name: &str, value: &Expr) -> String { + format!("ml_int {} = {};\n", mangle(name), gen_expr(value)) +} +``` + +Variable definitions at top level become global C variables. 
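The generators above all call `mangle`, which this ticket does not define. As a point of reference, here is a minimal sketch of what such a helper might look like. The `ml_` prefix matches the expected output in this ticket's tests (`ml_square`, `ml_x`); the escaping scheme for characters such as `?` and `!` is purely an assumption for illustration and may differ from the course's actual implementation.

```rust
/// Turn a MiniLisp name into a valid C identifier.
/// Hypothetical scheme: prefix with `ml_`, keep ASCII letters, digits and `_`,
/// and spell out every other character.
fn mangle(name: &str) -> String {
    let mut out = String::from("ml_");
    for c in name.chars() {
        match c {
            'a'..='z' | 'A'..='Z' | '0'..='9' | '_' => out.push(c),
            '-' => out.push('_'),        // my-var -> ml_my_var
            '?' => out.push_str("_p"),   // even?  -> ml_even_p
            '!' => out.push_str("_bang"),
            // Anything else (operators and so on) becomes a hex escape.
            other => out.push_str(&format!("_u{:04x}", other as u32)),
        }
    }
    out
}
```

One limitation of this sketch is that it is not injective: `a-b` and `a_b` both mangle to `ml_a_b`. A real implementation would need to either escape `_` as well or reject such collisions during semantic analysis.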
+ +### The full `generate` function + +```rust +pub fn generate(exprs: Vec<Expr>) -> String { + let mut out = String::new(); + out.push_str(PREAMBLE); + + // Pass 1: forward declarations for all top-level functions + for expr in &exprs { + if let Expr::Define { name, value } = expr { + out.push_str(&gen_forward_decl(name, value)); + } + } + out.push('\n'); + + // Pass 2: definitions + for expr in &exprs { + match expr { + Expr::Define { name, value } => match value.as_ref() { + Expr::Lambda { params, body } => + out.push_str(&gen_function_def(name, params, body)), + _ => + out.push_str(&gen_variable_def(name, value)), + } + // Top-level non-define expressions: emit in main() + _ => {} // handled in §16 + } + } + + out +} +``` + +### Tests + +```rust +#[test] +fn test_simple_function() { + let src = "(define (square x) (* x x))"; + let exprs = parse(src).unwrap(); + let c = generate(exprs); + assert!(c.contains("ml_int ml_square(ml_int ml_x)")); + assert!(c.contains("return (ml_x * ml_x)")); +} + +#[test] +fn test_forward_decl_present() { + let src = "(define (f x) (g x))\n(define (g x) x)"; + let c = generate(parse(src).unwrap()); + // f's forward decl must appear before g's definition + let fwd_pos = c.find("ml_int ml_f(").unwrap(); + let def_pos = c.find("ml_int ml_g(ml_int ml_x) {").unwrap(); + assert!(fwd_pos < def_pos); +} +``` + +## Style notes + +- Lead with the forward declaration problem — it's the "aha" moment of this section +- The two-pass structure is conceptually important; diagram it clearly +- Acknowledge the "everything is ml_int" simplification explicitly; readers will notice it +- The `body[..body.len()-1]` slice for all-but-last is a small Rust trick worth calling out diff --git a/edu/.nbd/tickets/d0b9f8.md b/edu/.nbd/tickets/d0b9f8.md new file mode 100644 index 0000000..3fb9c91 --- /dev/null +++ b/edu/.nbd/tickets/d0b9f8.md @@ -0,0 +1,154 @@ ++++ +title = "§10 Symbol Tables and Scope" +priority = 5 +status = "todo" +ticket_type = "task" +dependencies 
= [] ++++ + +## §10 Symbol Tables and Scope — Stub to fill + +File: `edu/src/lisp-compiler.md`, section `### 10. Symbol Tables and Scope` + +Replace the stub line with full content. Target 700–900 words. Build the environment chain that represents lexical scope, then write the scope-checking traversal. Reading-heavy with moderate code. + +## Learning objectives + +- Understand what a symbol table is and why it is needed +- Implement an environment chain (linked scope structure) in Rust +- Write an AST traversal that resolves all symbol references +- Produce clear `SemanticError` messages for undefined names + +## Content to write + +### What is a symbol table? + +A symbol table maps names to information about them — where they are defined, their type (if we had types), and any other metadata. Our symbol table is simple: a set of names that are currently in scope. We just need to know *whether* a name is defined; we do not need to know *what* it is (no type information). + +### Lexical scope + +In MiniLisp, scope is lexical (also called static): a name's binding is determined by the syntactic structure of the program, not by the runtime call stack. When you write `(lambda (x) x)`, `x` is in scope inside the lambda body regardless of what `x` means in the surrounding context. + +### The environment chain + +Represent scope as a chain of `HashSet<String>`, one per scope level. Looking up a name means searching from innermost to outermost. + +```rust +use std::collections::HashSet; + +/// A chain of scopes representing the current lexical environment. 
+pub struct Env<'a> { + names: HashSet<String>, + parent: Option<&'a Env<'a>>, +} + +impl<'a> Env<'a> { + pub fn new() -> Self { + Env { names: HashSet::new(), parent: None } + } + + pub fn child(&'a self) -> Env<'a> { + Env { names: HashSet::new(), parent: Some(self) } + } + + pub fn define(&mut self, name: &str) { + self.names.insert(name.to_string()); + } + + pub fn is_defined(&self, name: &str) -> bool { + self.names.contains(name) || self.parent.map_or(false, |p| p.is_defined(name)) + } +} +``` + +Explain the lifetime `'a`: the child env borrows the parent env. Since children always have shorter lifetimes than parents (they go out of scope at the closing `)` of a `lambda` or `let`), this is safe. + +### Pre-populating the global environment + +Built-in operators and functions (`+`, `-`, `*`, `/`, `=`, `<`, `>`, `<=`, `>=`, `not`, `display`, `newline`, `error`) must be defined in the global env from the start — they are always available without a `define`. + +```rust +pub fn global_env() -> Env<'static> { + let mut env = Env::new(); + for name in ["+", "-", "*", "/", "=", "<", ">", "<=", ">=", + "not", "display", "newline", "error"] { + env.define(name); + } + env +} +``` + +### The scope-checking traversal + +Walk the `Vec<Expr>` and call `check_expr` on each. The `check_expr` function pattern-matches on each `Expr` variant: + +```rust +pub fn check_expr(expr: &Expr, env: &Env) -> Result<(), CompileError> { + match expr { + Expr::Symbol(name) => { + if !env.is_defined(name) { + return Err(CompileError::SemanticError( + format!("undefined symbol: `{}`", name) + )); + } + Ok(()) + } + Expr::Define { name, value } => { + check_expr(value, env)?; + // Note: we don't add `name` to env here because top-level defines + // are processed in a first pass (see below). 
+ Ok(()) + } + Expr::Lambda { params, body } => { + let mut child = env.child(); + for p in params { child.define(p); } + for e in body { check_expr(e, &child)?; } + Ok(()) + } + Expr::If { cond, then, else_ } => { + check_expr(cond, env)?; + check_expr(then, env)?; + check_expr(else_, env) + } + // ... Let, Begin, Call, atoms (atoms other than Symbol always pass) + _ => Ok(()) + } +} +``` + +### Two-pass processing for mutual recursion + +Top-level `define` forms can reference each other mutually (e.g., `even?` calling `odd?` and vice versa). A single left-to-right pass would reject the first function because the second, which it calls, is not yet defined. + +Solution: a two-pass approach. +1. First pass: scan all top-level `Expr::Define` forms and add their names to the global env. +2. Second pass: check every expression with the fully-populated global env. + +Show this in the `analyse` entry point: + +```rust +pub fn analyse(exprs: Vec<Expr>) -> Result<Vec<Expr>, CompileError> { + let mut env = global_env(); + // First pass: register all top-level names + for expr in &exprs { + if let Expr::Define { name, .. } = expr { + env.define(name); + } + } + // Second pass: check all expressions + for expr in &exprs { + check_expr(expr, &env)?; + } + Ok(exprs) +} +``` + +### Unit tests + +Test: undefined symbol rejected, mutually recursive defines accepted, lambda scope is isolated, let bindings are in scope inside body. + +## Style notes + +- Motivate the environment chain before defining it — readers who have not seen this technique before will find it conceptually elegant once explained +- The two-pass trick is a genuine insight — give it appropriate emphasis +- Note that we return `Ok(exprs)` unchanged — the analyser is purely a checker; it does not transform the AST diff --git a/edu/.nbd/tickets/d9f850.md b/edu/.nbd/tickets/d9f850.md new file mode 100644 index 0000000..0635c37 --- /dev/null +++ b/edu/.nbd/tickets/d9f850.md @@ -0,0 +1,46 @@ ++++ +title = "§4 What Is a Vector Database?" 
+priority = 5 +status = "todo" +ticket_type = "task" +dependencies = [] ++++ +## §4 What Is a Vector Database? — Stub to fill + +File: `edu/src/vector-db.md`, section `### 4. What Is a Vector Database?` + +Replace this stub line with full content: +> A vector database is a data store built around one core operation [...] 🚧 Full content tracked in [nbd:d9f850]. + +This is a **reading lesson** — no Rust code. Target 500–700 words. Bold lead phrases. + +## Learning objectives + +- Understand the core operation: approximate nearest-neighbour (ANN) search +- Know the primary use cases that motivate vector databases +- Understand how vector databases differ from relational DBs and full-text search +- Know the key performance metrics: recall@k, QPS, index build time, memory + +## Content to write + +**The core operation.** Given a query vector q and n stored vectors, find the k vectors most similar to q. Exact KNN is O(n·d) per query — at n=1M and d=768 this means 768M operations per query, too slow for interactive use. Vector databases use ANN algorithms (see §5) that trade a small accuracy loss for orders-of-magnitude speed gains. + +**Use cases.** Each described in one concrete sentence: +- Semantic search: find documents matching the *meaning* of a query, not just the words +- Recommendation: given an item, return the k most similar items (§11) or surface content preferred by similar users +- Retrieval-Augmented Generation (RAG): retrieve relevant passages before prompting an LLM, so the answer is grounded in facts (§12) +- Duplicate/near-duplicate detection: find items semantically identical or very close to a given item +- Anomaly detection: items far from all stored vectors are likely anomalous +- Multi-modal search: find images matching a text description, using CLIP-style joint embeddings + +**vs. relational databases.** SQL WHERE clauses do exact matches and range queries on scalar values. There is no built-in notion of "nearest" for float arrays. 
Extensions like pgvector (PostgreSQL) and sqlite-vec (SQLite / Turso) add vector search to existing databases — this course uses sqlite-vec via the `libsql` crate. + +**vs. full-text search (BM25/TF-IDF).** Keyword search cannot handle synonymy (car ≠ automobile without explicit expansion) or concept-level similarity. Vector search captures both. Hybrid search — combining BM25 and ANN scores — is a common production pattern that outperforms either alone. + +**Key metrics.** +- Recall@k: fraction of the true k nearest neighbours that the ANN algorithm returns. A recall@10 of 0.95 means 95% of correct results are found. +- QPS: queries per second the index can serve at a given recall target. +- Index build time: one-time cost paid before serving queries. +- Memory footprint: HNSW stores graph edges in RAM; this limits how large the index can grow on a single machine. + +**Where sqlite-vec / Turso fits.** sqlite-vec is appropriate for embedded applications, local development, and small-to-medium corpora (up to a few million vectors). Dedicated cloud vector databases (Pinecone, Qdrant, Weaviate) handle larger scale and add features like multi-tenancy, filtering, and distributed search. \ No newline at end of file diff --git a/edu/.nbd/tickets/de82f1.md b/edu/.nbd/tickets/de82f1.md new file mode 100644 index 0000000..b723daf --- /dev/null +++ b/edu/.nbd/tickets/de82f1.md @@ -0,0 +1,157 @@ ++++ +title = "§15 Generating C: Control Flow and Sequencing" +priority = 5 +status = "todo" +ticket_type = "task" +dependencies = [] ++++ + +## §15 Generating C: Control Flow and Sequencing — Stub to fill + +File: `edu/src/lisp-compiler.md`, section `### 15. Generating C: Control Flow and Sequencing` + +Replace the stub line with full content. Target 700–900 words. Handle the remaining forms: `let`, `begin`, and `display`/`newline`/`error` as statements. Introduces the expression-vs-statement distinction in code generation. 
+ +## Learning objectives + +- Understand when to emit C expressions vs. C statements +- Implement `gen_stmt` for side-effecting expressions +- Generate `let` as a C block with local variable declarations +- Generate `begin` as a sequence of statements with the last value forwarded +- Generate `display`, `newline`, `error` as C function calls + +## Content to write + +### The expression-vs-statement problem + +`gen_expr` from §13 generates C *expressions* — code that produces a value. But some MiniLisp constructs are used for their *side effects*: `display` prints something; `begin` sequences multiple expressions; `let` introduces a new scope. These map more naturally to C *statements*. + +The solution: introduce `gen_stmt(expr: &Expr) -> String` that generates a C statement (terminated with `;` or wrapped in `{}`) for forms that are used in statement position. `gen_expr` handles forms in expression position. Some forms (like `if`) can appear in either position and need both paths. + +### `gen_stmt` — the statement generator + +```rust +/// Generate a C statement from a MiniLisp expression. +/// +/// Used for: body expressions in functions, let bodies, begin sequences. +pub fn gen_stmt(expr: &Expr) -> String { + match expr { + // Side-effecting built-ins + Expr::Call { func, args } if is_builtin_stmt(func) => gen_display_stmt(func, args), + // Everything else: evaluate as an expression and discard the value + _ => format!("(void){};", gen_expr(expr)), + } +} + +fn is_builtin_stmt(func: &Expr) -> bool { + matches!(func, Expr::Symbol(s) if matches!(s.as_str(), "display" | "newline" | "error")) +} +``` + +### Generating `display`, `newline`, `error` + +```rust +fn gen_display_stmt(func: &Expr, args: &[Expr]) -> String { + match func { + Expr::Symbol(s) => match s.as_str() { + "display" => { + // We emit ml_display_int for all non-string arguments. + // A type-aware compiler would choose ml_display_str for string expressions. 
+ let arg = gen_expr(&args[0]); + match &args[0] { + Expr::Str(_) => format!("ml_display_str({});", arg), + Expr::Bool(_) => format!("ml_display_bool({});", arg), + _ => format!("ml_display_int({});", arg), + } + } + "newline" => "ml_newline();".to_string(), + "error" => format!("ml_error({});", gen_expr(&args[0])), + _ => unreachable!(), + } + _ => unreachable!(), + } +} +``` + +Note the simplification: `display` picks the C variant based on the *static* form of the argument. `(display x)` where `x` is a symbol always emits `ml_display_int(ml_x)`, even if `x` holds a boolean at runtime. For the programs in this course, this is acceptable. A production compiler would use a tagged union or a format string approach. + +### Generating `let` + +`let` compiles to a C block with local variable declarations: + +```lisp +(let ((x 1) (y 2)) (+ x y)) +``` + +→ + +```c +({ + ml_int ml_x = 1; + ml_int ml_y = 2; + (ml_x + ml_y); +}) +``` + +This uses GCC's *statement expression* extension: `({ ... })` is a block that returns the value of its last statement. This extension is supported by GCC and Clang but is not standard C99. Discuss the trade-off and the alternative (using a helper function per `let`). 
+ +```rust +fn gen_let(bindings: &[(String, Expr)], body: &[Expr]) -> String { + let mut out = String::from("({\n"); + for (name, val) in bindings { + out.push_str(&format!(" ml_int {} = {};\n", mangle(name), gen_expr(val))); + } + // `gen_stmt` already terminates its output with `;`, so none is added here. + for expr in &body[..body.len() - 1] { + out.push_str(&format!(" {}\n", gen_stmt(expr))); + } + out.push_str(&format!(" {};\n", gen_expr(body.last().unwrap()))); + out.push_str("})"); + out +} +``` + +### Generating `begin` + +`begin` in expression position uses the C comma operator; in statement position it is a sequence of statements: + +```rust +fn gen_begin_expr(exprs: &[Expr]) -> String { + // Comma operator: (e1, e2, ..., eN) evaluates all, returns eN + let parts: Vec<String> = exprs.iter().map(gen_expr).collect(); + format!("({})", parts.join(", ")) +} +``` + +In `gen_expr`, add: +```rust +Expr::Begin(exprs) => gen_begin_expr(exprs), +Expr::Let { bindings, body } => gen_let(bindings, body), +``` + +### Tests + +```rust +#[test] +fn test_gen_let() { + let src = "(define (f) (let ((x 1) (y 2)) (+ x y)))"; + let c = generate(parse(src).unwrap()); + assert!(c.contains("ml_int ml_x = 1")); + assert!(c.contains("ml_int ml_y = 2")); +} + +#[test] +fn test_gen_begin() { + let src = "(define (f) (begin (display 1) (display 2) 3))"; + let c = generate(parse(src).unwrap()); + assert!(c.contains("ml_display_int(1)")); + assert!(c.contains("ml_display_int(2)")); + // As written above, a tail-position `begin` goes through `gen_begin_expr`, so the + // output would read `return (..., 3)` rather than `return 3`; this assertion + // assumes a tail `begin` is flattened into statements instead. + assert!(c.contains("return 3")); +} +``` + +## Style notes + +- The expression-vs-statement distinction is the key concept here — explain it at the top before any code +- The statement expression `({...})` extension for `let` is a real trade-off — acknowledge it honestly +- The `display` type dispatch simplification should be called out clearly — readers will ask "what if I display a boolean stored in a variable?" 
+- End with a checkpoint: generate C for the complete factorial example; it should be correct and compilable diff --git a/edu/.nbd/tickets/e8be9a.md b/edu/.nbd/tickets/e8be9a.md new file mode 100644 index 0000000..9d5eb9b --- /dev/null +++ b/edu/.nbd/tickets/e8be9a.md @@ -0,0 +1,71 @@ ++++ +title = "§11 Exercise 4: Recommendation Engine" +priority = 5 +status = "todo" +ticket_type = "task" +dependencies = [] ++++ +## §11 Exercise 4 — Recommendation Engine — Stub to fill + +File: `edu/src/vector-db.md`, section `### 11. Exercise 4 — Recommendation Engine` + +Replace this stub line with the full exercise: +> **Goal:** Implement item-based collaborative filtering using vector similarity. [...] 🚧 Full content tracked in [nbd:e8be9a]. + +Follow the exercise format from `edu/src/markov.md`. + +## Goal + +Build an item-based recommendation engine. Store item feature vectors in Turso, then given a target item, find the k most similar items using KNN and exclude the query item from the results. + +## Approach + +Use hand-crafted 5-dimensional feature vectors for a product catalogue (no fastembed dependency needed — keeps focus on the recommendation logic). Dimensions represent affinity scores for: [electronics, clothing, sports, food, books]. 
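Because the vectors are hand-crafted, the expected clustering can be checked without a database. The sketch below computes cosine distance in plain Rust for three vectors from this ticket's catalogue. It assumes `vector_distance_cos` is the usual one minus cosine similarity, which is consistent with the 0-to-2 range given in the project ticket; the function and constant names are illustrative only.

```rust
/// Cosine distance: 1 - cos(theta). 0 means same direction, 2 means opposite.
/// Assumption: neither vector is all zeros (that would divide by zero).
fn cosine_distance(a: &[f32], b: &[f32]) -> f32 {
    assert_eq!(a.len(), b.len(), "vectors must have the same dimension");
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let norm_a = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let norm_b = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    1.0 - dot / (norm_a * norm_b)
}

// Dimensions: [electronics, clothing, sports, food, books]
const LAPTOP: [f32; 5] = [0.95, 0.0, 0.1, 0.0, 0.2];
const KEYBOARD: [f32; 5] = [0.85, 0.0, 0.0, 0.0, 0.1];
const RUNNING_SHOES: [f32; 5] = [0.0, 0.6, 0.9, 0.0, 0.0];
```

Calling `cosine_distance(&LAPTOP, &KEYBOARD)` yields a much smaller value than `cosine_distance(&LAPTOP, &RUNNING_SHOES)`, which is why a KNN query for "Laptop" is expected to surface the electronics cluster first.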
+ +## Catalogue (10 items) + +| id | name | embedding | +|---|---|---| +| 1 | "Laptop" | [0.95, 0.0, 0.1, 0.0, 0.2] | +| 2 | "Mechanical Keyboard" | [0.85, 0.0, 0.0, 0.0, 0.1] | +| 3 | "USB-C Hub" | [0.9, 0.0, 0.0, 0.0, 0.0] | +| 4 | "Running Shoes" | [0.0, 0.6, 0.9, 0.0, 0.0] | +| 5 | "Yoga Mat" | [0.0, 0.2, 0.95, 0.0, 0.0] | +| 6 | "Water Bottle" | [0.1, 0.1, 0.7, 0.0, 0.0] | +| 7 | "T-Shirt" | [0.0, 0.95, 0.1, 0.0, 0.0] | +| 8 | "Cookbook" | [0.0, 0.0, 0.0, 0.6, 0.9] | +| 9 | "Protein Bar" | [0.0, 0.0, 0.3, 0.95, 0.0] | +| 10 | "Novel" | [0.0, 0.0, 0.0, 0.1, 0.95] | + +## Steps to cover + +**Step 1 — Schema.** Table `products (id INTEGER PRIMARY KEY, name TEXT NOT NULL, embedding F32_BLOB(5) NOT NULL)` with a `libsql_vector_idx` vector index. + +**Step 2 — Insert items.** Same pattern as Exercise 1: format `Vec<f32>` as JSON, `INSERT OR IGNORE`. + +**Step 3 — Recommend function.** Write a helper: + +```rust +async fn recommend( + conn: &libsql::Connection, + item_id: i64, + k: usize, +) -> Result<Vec<(String, f64)>, Box<dyn std::error::Error>> +``` + +1. `SELECT vector_extract(embedding) FROM products WHERE id = ?` to get the query item's embedding as a JSON string +2. Pass that JSON string to `vector_top_k` with k+1 (to have room to exclude the query item) +3. JOIN to get product names and `vector_distance_cos` distances +4. Filter out `products.id = item_id` +5. Return the top k `(name, distance)` pairs + +**Step 4 — Print recommendations for three items.** +- "Laptop" → expect Mechanical Keyboard, USB-C Hub (electronics cluster) +- "Running Shoes" → expect Yoga Mat, Water Bottle (sports cluster) +- "Cookbook" → expect Novel, Protein Bar (food/books cluster) + +Output format: `"Customers who liked Laptop also liked: Mechanical Keyboard (0.023), USB-C Hub (0.041)"` + +## Reference solution + +Full `main.rs` inside `
`. The `recommend` function should be clearly separated from the setup boilerplate. The recommendation query pattern (SELECT embedding → feed as query to vector_top_k) is the key technique to highlight. \ No newline at end of file diff --git a/edu/.nbd/tickets/e8da8b.md b/edu/.nbd/tickets/e8da8b.md new file mode 100644 index 0000000..05b74aa --- /dev/null +++ b/edu/.nbd/tickets/e8da8b.md @@ -0,0 +1,77 @@ ++++ +title = "§1 Introduction: What We're Building" +priority = 5 +status = "todo" +ticket_type = "task" +dependencies = [] ++++ + +## §1 Introduction: What We're Building — Stub to fill + +File: `edu/src/lisp-compiler.md`, section `### 1. Introduction: What We're Building` + +Replace the stub line with full content. Target 600–900 words. Match the style of Section 1 in `markov.md`: motivating prose paragraphs that build genuine enthusiasm before introducing any technical detail. + +## Learning objectives + +- Understand what a compiler is and how it differs from an interpreter +- Know what MiniLisp looks like and what the compiler will produce +- Understand why Rust + nom is a good toolchain for this task +- Know what prerequisite Rust knowledge is assumed +- Have a concrete mental picture of the end goal before writing any code + +## Content to write + +**What is a compiler?** A compiler is a program that reads source code in one language and produces equivalent code in another. Unlike an interpreter (which executes source code directly), a compiler's output is a new program that can be run independently. Our compiler reads MiniLisp and writes C. That C can then be compiled by any standard C compiler (`cc`, `gcc`, `clang`) into a native binary. + +**Why Lisp?** Lisp is the ideal first compiler target. Its syntax is maximally regular — every expression is either an atom or a parenthesised list. There is no operator precedence to track, no statement/expression ambiguity, and no complex grammar rules. The AST almost directly mirrors the syntax. 
This regularity lets the course focus on the *concepts* of compilation rather than the incidental complexity of a messier language. + +**Why compile to C?** C is essentially portable assembly. It is available everywhere, compiles quickly, and produces fast native code. By targeting C rather than actual assembly, we get a working native compiler without managing registers, calling conventions, or instruction sets. The C compiler handles all of that. + +**A teaser: what the compiler produces.** Show a realistic MiniLisp program (recursive factorial + display call) alongside the C output the compiler will emit. This makes the goal concrete from the first page. + +MiniLisp source: +```lisp +; Compute n! +(define (factorial n) + (if (= n 0) + 1 + (* n (factorial (- n 1))))) + +(display (factorial 10)) +(newline) +``` + +Compiler output: +```c +#include <stdio.h> +#include <stdint.h> +#define TRUE 1 +#define FALSE 0 +typedef int64_t ml_int; + +/* forward declarations */ +ml_int ml_factorial(ml_int ml_n); + +ml_int ml_factorial(ml_int ml_n) { + return (ml_n == 0) ? 1 : (ml_n * ml_factorial((ml_n - 1))); +} + +int main(void) { + printf("%ld\n", ml_factorial(10)); + return 0; +} +``` + +**Why Rust and nom?** Rust's type system makes compiler writing unusually safe: exhaustive pattern matching means you cannot forget a case, `Result` enforces error handling at every stage, and the borrow checker prevents accidental aliasing of AST nodes. nom is a parser-combinator library — parsers are ordinary Rust functions that compose, test, and debug naturally. No grammar files, no code generation, no build scripts. + +**What this course assumes.** The reader should be comfortable with: ownership and borrowing, enums and pattern matching, `Result` and `Option`, basic trait usage (`impl Trait for Type`), and `#[test]`. No prior compiler or parsing experience is required. + +**How to follow along.** Reading sections (Parts 1, 3) can be read anywhere. 
Implementation sections build on each other in order — each section adds code to the same project. A reference solution appears in a collapsible block at the end of each exercise so you can verify your work without being spoiled. + +## Style notes + +- Open with a hook: writing a compiler is a rite of passage; it demystifies the tools programmers use every day +- Put the teaser code block early — visual motivation before prose motivation +- Keep the tone encouraging: this is achievable for any Rust programmer, no specialist knowledge required +- End by telling the reader exactly what they will have at the end: a binary that compiles MiniLisp programs to runnable C diff --git a/edu/TODO.md b/edu/TODO.md new file mode 100644 index 0000000..9c98b32 --- /dev/null +++ b/edu/TODO.md @@ -0,0 +1,17 @@ +# TODO + +- [ ] Host the mdbook on cloudflare pages + - [ ] Host on vibebooks.elijah.run + - [ ] Create an `infra` directory containing opentofu configs for the above + - [ ] Add a big ol' disclaimer about these being AI generated and not intended to be difinitive, trustworthy, or even good, just an experiment in generating tailored educational content about topics I am intersted in but not sure where to start, and with a practical focus on exercises with Rust since that is the language I use most often + +## Interactive Learning and Education +- [x] Git worktrees, how do use please +- [ ] How to structure a co-op profit sharing worker owned business +- [x] Hands-on: Markov Chains +- [~] Hands-on: Vector Databases +- [ ] Hands-on: Machine Learning; training a computer to play a game by playing against itself (a-la alpha go zero) +- [ ] Hands-on: Creating and training a simple LLM +- [~] Hands-on: Writing your own language (lisp, interpreted, compiled to C) +- [ ] Hands-on: Shader programming + diff --git a/edu/flake.lock b/edu/flake.lock index 79591bd..eb38968 100644 --- a/edu/flake.lock +++ b/edu/flake.lock @@ -51,11 +51,11 @@ }, "nixpkgs_2": { "locked": { - "lastModified": 
1771923393, - "narHash": "sha256-Fy0+UXELv9hOE8WjYhJt8fMDLYTU2Dqn3cX4BwoGBos=", + "lastModified": 1772082373, + "narHash": "sha256-wySf8a6hvuqgFdwvvzPPTARBCMLDz7WFAufGkllD1M4=", "owner": "NixOS", "repo": "nixpkgs", - "rev": "ea7f1f06811ce7fcc81d6c6fd4213150c23edcf2", + "rev": "26eaeac4e409d7b5a6bf6f90a2a2dc223c78d915", "type": "github" }, "original": { @@ -69,7 +69,8 @@ "inputs": { "flake-utils": "flake-utils", "nbd": "nbd", - "nixpkgs": "nixpkgs_2" + "nixpkgs": "nixpkgs_2", + "rust-overlay": "rust-overlay_2" } }, "rust-overlay": { @@ -93,6 +94,26 @@ "type": "github" } }, + "rust-overlay_2": { + "inputs": { + "nixpkgs": [ + "nixpkgs" + ] + }, + "locked": { + "lastModified": 1772247314, + "narHash": "sha256-x6IFQ9bL7YYfW2m2z8D3Em2YtAA3HE8kiCFwai2fwrw=", + "owner": "oxalica", + "repo": "rust-overlay", + "rev": "a1ab5e89ab12e1a37c0b264af6386a7472d68a15", + "type": "github" + }, + "original": { + "owner": "oxalica", + "repo": "rust-overlay", + "type": "github" + } + }, "systems": { "locked": { "lastModified": 1681028828, diff --git a/edu/flake.nix b/edu/flake.nix index a0064b6..a9b6a03 100644 --- a/edu/flake.nix +++ b/edu/flake.nix @@ -4,19 +4,37 @@ inputs = { nixpkgs.url = "github:NixOS/nixpkgs/nixpkgs-unstable"; flake-utils.url = "github:numtide/flake-utils"; + rust-overlay = { + url = "github:oxalica/rust-overlay"; + inputs.nixpkgs.follows = "nixpkgs"; + }; nbd.url = "path:../nbd"; }; - outputs = { self, nixpkgs, flake-utils, nbd }: + outputs = { self, nixpkgs, flake-utils, rust-overlay, nbd }: flake-utils.lib.eachDefaultSystem (system: let - pkgs = import nixpkgs { inherit system; }; + overlays = [ (import rust-overlay) ]; + pkgs = import nixpkgs { + inherit system overlays; + }; + + rustToolchain = pkgs.rust-bin.stable.latest.default.override { + extensions = [ "rust-src" "rust-analyzer" "clippy" "rustfmt" ]; + targets = [ "wasm32-unknown-unknown" ]; + }; in { devShells.default = pkgs.mkShell { buildInputs = [ + # Rust + rustToolchain + pkgs.mdbook + 
nbd.packages.${system}.nbd + + pkgs.jq ]; shellHook = '' diff --git a/edu/src/SUMMARY.md b/edu/src/SUMMARY.md index c919dc5..85b6e2a 100644 --- a/edu/src/SUMMARY.md +++ b/edu/src/SUMMARY.md @@ -4,6 +4,14 @@ - [Markov Chain Course](markov.md) +# Vector Databases + +- [Vector Database Course](vector-db.md) + # Git - [Git Worktrees](git-worktrees.md) + +# Compilers + +- [Writing a Lisp-to-C Compiler in Rust](lisp-compiler.md) diff --git a/edu/src/lisp-compiler.md b/edu/src/lisp-compiler.md new file mode 100644 index 0000000..79bcbfa --- /dev/null +++ b/edu/src/lisp-compiler.md @@ -0,0 +1,194 @@ +# Writing a Lisp-to-C Compiler in Rust + +This course walks you through building a complete, working compiler from scratch. You will write every component yourself — a lexer, a parser, a semantic analyser, and a code generator — ending with a program that reads **MiniLisp** source code and emits valid C. The compiler is written in Rust and uses the [nom](https://github.com/rust-bakery/nom) parser-combinator library for all parsing work. Sections marked 🚧 are stubs whose full content is tracked in an `nbd` ticket. + +--- + +## Table of Contents + +**Part 1 — Foundations** + +1. [Introduction: What We're Building](#1-introduction-what-were-building) +2. [MiniLisp Language Specification](#2-minilisp-language-specification) +3. [Compiler Architecture: The Pipeline](#3-compiler-architecture-the-pipeline) + +**Part 2 — Parsing with nom** + +4. [Introduction to nom: Parser Combinators](#4-introduction-to-nom-parser-combinators) +5. [Setting Up the Project](#5-setting-up-the-project) +6. [Recognizing Atoms: Integers, Booleans, Strings, Symbols](#6-recognizing-atoms-integers-booleans-strings-symbols) +7. [The Abstract Syntax Tree](#7-the-abstract-syntax-tree) +8. [Parsing Atoms with nom](#8-parsing-atoms-with-nom) +9. [Parsing S-Expressions and Special Forms](#9-parsing-s-expressions-and-special-forms) + +**Part 3 — Semantic Analysis** + +10. 
[Symbol Tables and Scope](#10-symbol-tables-and-scope) +11. [Checking Special Forms](#11-checking-special-forms) + +**Part 4 — Code Generation** + +12. [The C Runtime Preamble](#12-the-c-runtime-preamble) +13. [Generating C: Atoms and Expressions](#13-generating-c-atoms-and-expressions) +14. [Generating C: Definitions and Functions](#14-generating-c-definitions-and-functions) +15. [Generating C: Control Flow and Sequencing](#15-generating-c-control-flow-and-sequencing) + +**Part 5 — Putting It Together** + +16. [The Compilation Pipeline](#16-the-compilation-pipeline) +17. [Testing the Compiler](#17-testing-the-compiler) +18. [What's Next: Extensions and Further Reading](#18-whats-next-extensions-and-further-reading) + +--- + +## Part 1 — Foundations + +### 1. Introduction: What We're Building + +A compiler is a program that transforms source code written in one language into equivalent code in another. By the end of this course you will have written one that accepts MiniLisp — a small, clean dialect of Lisp — and produces human-readable C that you can compile and run with any standard C compiler. Along the way you will implement each classic compiler stage from scratch: lexical analysis, parsing, semantic analysis, and code generation. + +🚧 Full content tracked in [nbd:e8da8b]. + +--- + +### 2. MiniLisp Language Specification + +MiniLisp is the source language of our compiler. It is a minimal Lisp dialect with integers, booleans, strings, first-class functions, lexical scope, and a small set of built-in operators. This section defines every syntactic form precisely, gives the grammar in EBNF, and shows a complete example program so you know exactly what the compiler must handle before you write a single line of Rust. + +🚧 Full content tracked in [nbd:a93829]. + +--- + +### 3. 
Compiler Architecture: The Pipeline + +Our compiler is a classic multi-stage pipeline: source text passes through a parser, producing an AST; the AST passes through a semantic analyser, which validates scope and form usage; the validated AST passes through a code generator, which emits C. This section maps that pipeline onto the module structure you will build and explains how data and errors flow between stages. + +🚧 Full content tracked in [nbd:3aeb62]. + +--- + +## Part 2 — Parsing with nom + +### 4. Introduction to nom: Parser Combinators + +nom is a parser-combinator library: instead of writing a grammar file and running a generator, you write small Rust functions that each recognise a fragment of input, then combine them into larger parsers. This section introduces the core `IResult` type, walks through the essential combinators (`tag`, `char`, `alt`, `many0`, `map`, `tuple`, `delimited`, `preceded`), and shows how to write, compose, and test parsers before you apply any of this to MiniLisp. + +🚧 Full content tracked in [nbd:5835e9]. + +--- + +### 5. Setting Up the Project + +You will create a new Rust binary crate for the compiler, add nom and any other dependencies to `Cargo.toml`, and lay out the module structure that the rest of the course fills in. By the end of this section you will have a project that compiles, a `src/main.rs` that reads from stdin, and placeholder modules for each compiler stage. + +🚧 Full content tracked in [nbd:3dc36b]. + +--- + +### 6. Recognizing Atoms: Integers, Booleans, Strings, Symbols + +Before building the full parser, you need nom parsers for each atomic value in MiniLisp: signed integers, boolean literals `#t` and `#f`, double-quoted strings with escape sequences, and symbol identifiers. This section develops each atom parser in isolation, explains the nom combinators used, and provides exercises to test your understanding before the parts are assembled into the full parser. + +🚧 Full content tracked in [nbd:685f5e]. 
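As a rough preview of the four atom kinds named in section 6, here is a small standard-library-only sketch that classifies a single candidate token. It deliberately does not use nom, so it is not the parser the course builds. The `Atom` enum, the helper names, the set of symbol characters, and the supported escape sequences are all assumptions made for illustration, since the MiniLisp specification is still a stub.

```rust
/// The four atom kinds listed in section 6 (hypothetical type, not the course's `Expr`).
#[derive(Debug, PartialEq)]
enum Atom {
    Int(i64),
    Bool(bool),
    Str(String),
    Symbol(String),
}

/// Classify one candidate token. Returns None if it is not a well-formed atom.
fn classify_atom(token: &str) -> Option<Atom> {
    match token {
        "" => None,
        "#t" => Some(Atom::Bool(true)),
        "#f" => Some(Atom::Bool(false)),
        _ if token.starts_with('"') => unquote(token).map(Atom::Str),
        // Try a signed integer first; fall back to a symbol such as `+` or `my-var`.
        _ => match token.parse::<i64>() {
            Ok(n) => Some(Atom::Int(n)),
            Err(_) if is_symbol(token) => Some(Atom::Symbol(token.to_string())),
            Err(_) => None,
        },
    }
}

/// Strip the surrounding quotes and decode a small, assumed set of escapes.
fn unquote(token: &str) -> Option<String> {
    let inner = token.strip_prefix('"')?.strip_suffix('"')?;
    let mut out = String::new();
    let mut chars = inner.chars();
    while let Some(c) = chars.next() {
        match c {
            '\\' => match chars.next()? {
                'n' => out.push('\n'),
                't' => out.push('\t'),
                '\\' => out.push('\\'),
                '"' => out.push('"'),
                _ => return None, // unknown escape
            },
            '"' => return None, // unescaped quote inside the literal
            other => out.push(other),
        }
    }
    Some(out)
}

/// Assumed symbol rule: ASCII letters, digits, and a few operator characters,
/// not starting with a digit.
fn is_symbol(token: &str) -> bool {
    let allowed = |c: char| c.is_ascii_alphanumeric() || "+-*/<>=!?_".contains(c);
    !token.is_empty()
        && !token.starts_with(|c: char| c.is_ascii_digit())
        && token.chars().all(allowed)
}
```

Trying the integer parse before the symbol check is what lets `-7` become an integer while a bare `-` stays a symbol. A combinator-based parser has to make the same ordering decision when it lists its alternatives.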
+ +--- + +### 7. The Abstract Syntax Tree + +The parser's output is an **Abstract Syntax Tree** — a Rust data structure that captures the meaning of a MiniLisp program without the syntactic noise of parentheses and whitespace. This section defines the `Expr` enum and its variants, discusses why the tree is structured the way it is, and implements `Display` so you can inspect parse results during development. + +🚧 Full content tracked in [nbd:a1a827]. + +--- + +### 8. Parsing Atoms with nom + +With atom parsers and the AST defined, this section assembles them into a single `parse_atom` function that recognises any MiniLisp atom and returns the corresponding `Expr` variant. You will use `alt` to try each alternative in turn, learn how nom reports errors and how to interpret them, and write unit tests that verify correct parsing of every atom type. + +🚧 Full content tracked in [nbd:b6c9ad]. + +--- + +### 9. Parsing S-Expressions and Special Forms + +S-expressions are parenthesised lists: the heart of Lisp syntax. This section extends the parser to handle arbitrarily nested lists, whitespace between elements, and comments. It then lifts special forms — `define`, `if`, `lambda`, `let`, `begin` — out of the generic list parser so they become distinct AST variants, and covers how to handle recursive parsers in nom without running into borrow-checker problems. + +🚧 Full content tracked in [nbd:a4c9f8]. + +--- + +## Part 3 — Semantic Analysis + +### 10. Symbol Tables and Scope + +A symbol table maps names to their definitions. This section walks through a scope-aware traversal of the AST that builds a symbol table, resolves every symbol reference to its definition, and reports helpful errors for undefined names or names used outside their scope. You will implement a simple environment chain — the standard technique for representing nested lexical scopes. + +🚧 Full content tracked in [nbd:d0b9f8]. + +--- + +### 11. 
Checking Special Forms + +Special forms have fixed shapes: `if` needs exactly three sub-expressions; `define` needs a name and a body; `lambda` needs a parameter list and at least one body expression. This section adds arity and shape checks for each special form so that malformed programs produce clear error messages rather than mysterious C output. + +🚧 Full content tracked in [nbd:6d40a7]. + +--- + +## Part 4 — Code Generation + +### 12. The C Runtime Preamble + +Every MiniLisp program compiles to a C file that begins with a standard preamble: `#include` directives, type aliases, boolean constants, and thin wrappers for built-in operations like `display` and `newline`. This section designs the preamble, explains why each piece is there, and shows how the code generator emits it before any user-defined code. + +🚧 Full content tracked in [nbd:3e1250]. + +--- + +### 13. Generating C: Atoms and Expressions + +This section implements the expression code generator — the recursive function that turns an `Expr` into a C expression string. Integers become C integer literals; booleans become `TRUE` and `FALSE`; strings become string literals; arithmetic and comparison operations become C operators; function calls become C function-call syntax. You will also handle name-mangling: turning Lisp symbols like `my-var` into valid C identifiers. + +🚧 Full content tracked in [nbd:1eb794]. + +--- + +### 14. Generating C: Definitions and Functions + +Top-level `define` forms and `lambda` expressions compile to C function and variable declarations. This section covers how to emit forward declarations (so mutual recursion works), how to turn a MiniLisp parameter list into a C function signature, how `lambda` compiles to a named C function, and how top-level definitions are ordered in the output file. + +🚧 Full content tracked in [nbd:cbc6e3]. + +--- + +### 15. Generating C: Control Flow and Sequencing + +`if`, `begin`, and `let` each require their own code-generation strategy. 
`if` becomes a C ternary expression or an `if`/`else` statement depending on context; `begin` becomes a sequence of C statements with the last value forwarded; `let` introduces a C block with local variable declarations. This section works through each form and resolves the practical question of when to emit expressions versus statements. + +🚧 Full content tracked in [nbd:de82f1]. + +--- + +## Part 5 — Putting It Together + +### 16. The Compilation Pipeline + +With all stages implemented, this section wires them into a single `compile` function and builds a CLI entry point that reads MiniLisp from a file or stdin and writes C to stdout or a file. You will add basic error reporting that shows the source location of each failure and trace a complete example — a recursive factorial function — through every stage. + +🚧 Full content tracked in [nbd:58b37a]. + +--- + +### 17. Testing the Compiler + +Good tests are what turn a working prototype into a reliable tool. This section adds unit tests for each compiler stage and integration tests that compile MiniLisp programs, feed the C output to `cc`, run the binary, and assert on stdout. You will build a small test corpus of MiniLisp programs covering all language features and ensure the compiler handles both valid and invalid input gracefully. + +🚧 Full content tracked in [nbd:8fa47a]. + +--- + +### 18. What's Next: Extensions and Further Reading + +The compiler you have built is deliberately minimal — a solid foundation. This final section surveys the directions you can take it further: tail-call optimisation, closures and lambda lifting, a garbage collector, hygienic macros, a type system, an interactive REPL, and a self-hosting MiniLisp standard library. It closes with a curated reading list for going deeper into compiler theory and Lisp implementation. + +🚧 Full content tracked in [nbd:1d16da]. 
diff --git a/edu/src/vector-db.md b/edu/src/vector-db.md new file mode 100644 index 0000000..349a74c --- /dev/null +++ b/edu/src/vector-db.md @@ -0,0 +1,293 @@ +# Vector Database Self-Guided Course + +This document is a self-guided course on vector databases. It is organized into four parts: conceptual foundations, the internals of vector search systems, hands-on Rust exercises with Turso and sqlite-vec, and real-world application pipelines. Each section is either a reading lesson or a hands-on Rust programming exercise. Sections marked 🚧 are stubs whose full content is tracked in an `nbd` ticket — follow the ticket ID to find the detailed learning objectives and instructions. + +--- + +## Table of Contents + +**Part 1 — Foundations** + +1. [What Is a Vector?](#1-what-is-a-vector) +2. [Embeddings](#2-embeddings) +3. [Vector Similarity](#3-vector-similarity) + +**Part 2 — Vector Databases** + +4. [What Is a Vector Database?](#4-what-is-a-vector-database) +5. [Under the Hood: ANN Algorithms](#5-under-the-hood-ann-algorithms) + +**Part 3 — Turso + sqlite-vec Basics** + +6. [Setting Up](#6-setting-up) +7. [Exercise 1 — Storing and Retrieving Vectors](#7-exercise-1--storing-and-retrieving-vectors) +8. [Exercise 2 — K-Nearest Neighbor Search](#8-exercise-2--k-nearest-neighbor-search) + +**Part 4 — Real Applications** + +9. [Generating Embeddings in Rust](#9-generating-embeddings-in-rust) +10. [Exercise 3 — Semantic Document Search](#10-exercise-3--semantic-document-search) +11. [Exercise 4 — Recommendation Engine](#11-exercise-4--recommendation-engine) +12. [Exercise 5 — Retrieval-Augmented Generation](#12-exercise-5--retrieval-augmented-generation) + +--- + +## Part 1 — Foundations + +### 1. What Is a Vector? + +A **vector** is an ordered list of numbers. That is the entire definition — nothing more exotic than a list where position matters. A two-element list `[3.0, 4.0]` is a vector; so is a 1 536-element list of floating-point values produced by a language model. 
What makes vectors useful is that the numbers have a geometric interpretation: each element is a coordinate along one axis of a space, and the vector as a whole names a point (or an arrow from the origin to that point) in that space. + +**Geometric intuition in two and three dimensions.** Start with the familiar. A 2-dimensional vector `[x, y]` is a point in the plane — the kind you plot on graph paper. The vector `[3.0, 4.0]` sits three units to the right of the origin and four units up. An arrow drawn from `[0, 0]` to `[3, 4]` has a **magnitude** (length) of `√(3² + 4²) = 5` and points in a specific **direction**. Magnitude and direction together completely characterise the vector; change either one and you have a different vector. + +A 3-dimensional vector `[x, y, z]` extends this to physical space: three coordinates, three axes, one point. You can still compute a magnitude — `√(x² + y² + z²)` — and you can still talk about direction. Two 3D vectors point in the same direction if one is a positive scalar multiple of the other; they are **perpendicular** (orthogonal) if their dot product is zero. + +**High-dimensional spaces.** Nothing in the definition of a vector limits it to two or three elements. A *d*-dimensional vector `[x₁, x₂, …, x_d]` is a point in *d*-dimensional space. The geometry extends perfectly: magnitude is `√(x₁² + x₂² + … + x_d²)`, the dot product of two vectors is `Σᵢ aᵢ · bᵢ`, and you can compute angles and distances between points just as you would in 2D or 3D. + +High-dimensional geometry is counterintuitive in subtle ways that are worth knowing: + +- **The curse of dimensionality.** In high-dimensional spaces, most of the volume of a hypersphere is concentrated near its surface rather than its interior. Two randomly chosen high-dimensional vectors from a standard distribution tend to be nearly orthogonal — their dot product is close to zero — even when you have not deliberately constructed them that way. 
This means "nearest neighbour" in high dimensions is a harder problem than it sounds: there are exponentially many directions, and nearby points can seem far away using simple distance measures. + +- **Normalisation changes the geometry.** A **unit vector** has magnitude exactly 1. Dividing a vector by its magnitude — **normalisation** — projects all vectors onto the surface of the unit hypersphere. On that sphere, distance and angle are equivalent measures of similarity, which simplifies many computations. Embedding models often output unit-normalised vectors precisely to exploit this equivalence. + +- **Dimensions are not independent features.** When people say a language model embeds words into a 768-dimensional space, they do not mean "dimension 42 encodes the concept of colour." The axes of an embedding space are rarely interpretable on their own. Meaning is encoded in the *relative positions* of points — which vectors are close to which others — not in the values along any single axis. + +**Vectors as representations.** The key insight that makes vector databases useful is that real-world objects — documents, images, audio clips, products, users — can be represented as vectors such that *similarity in meaning or content corresponds to proximity in the vector space*. Two documents that discuss the same topic will, if embedded well, produce vectors that are close together. Two documents on unrelated topics will produce vectors that are far apart. + +This is not magic; it is the result of training a model to produce embeddings where similar inputs cluster near each other. Once you have such a model, every search or comparison problem reduces to a geometric problem: find the vectors closest to a query vector. The rest of this course is about how to do that efficiently at scale. + +**A note on notation.** Throughout this course, vectors are written in bold or with subscripts: **v**, **q**, or `v₁`. The *i*-th element of a vector **v** is written `v[i]` or `vᵢ`. 
The magnitude of **v** is written `|v|` or `‖v‖`. Dimension is written *d* and the number of stored vectors is written *n*. + +--- + +### 2. Embeddings + +Embeddings are the bridge between raw data and vector space. This section covers how language models, image encoders, and other neural networks learn to map heterogeneous inputs — words, sentences, images, products — into vectors where geometric proximity captures semantic similarity. 🚧 Full content tracked in [nbd:584e0c]. + +--- + +### 3. Vector Similarity + +Once you have two vectors, how do you measure how alike they are? This section covers the three most common similarity functions used in vector search: **cosine similarity**, **dot product**, and **Euclidean distance** — their formulas, geometric interpretations, when each is appropriate, and the trade-offs in choosing between them. 🚧 Full content tracked in [nbd:99e1d9]. + +--- + +## Part 2 — Vector Databases + +### 4. What Is a Vector Database? + +A vector database is a data store built around one core operation: given a query vector **q**, return the *k* stored vectors most similar to **q**. This section covers what that means in practice — approximate nearest-neighbour (ANN) search, the use cases that make vector databases essential (semantic search, recommendations, RAG), and how they differ from traditional relational or key-value databases. 🚧 Full content tracked in [nbd:d9f850]. + +--- + +### 5. Under the Hood: ANN Algorithms + +Exact nearest-neighbour search over millions of high-dimensional vectors is too slow for production use. This section explains the two dominant approximate methods — **HNSW** (Hierarchical Navigable Small World graphs) and **IVFFlat** (Inverted File with flat quantisation) — their index construction, query-time traversal, and the recall vs. latency trade-off each exposes. 🚧 Full content tracked in [nbd:6ec5ff]. + +--- + +## Part 3 — Turso + sqlite-vec Basics + +### 6. 
Setting Up + +This section walks through everything you need before writing a single SQL query: adding the right crates, opening a local Turso connection, and loading the `sqlite-vec` extension that gives SQLite vector-search superpowers. + +#### What You Are Building + +Turso is a SQLite-compatible database with built-in support for vector similarity search via the `sqlite-vec` extension. In local development you use a file-backed SQLite database; in production the same code points at a Turso cloud database. The `libsql` crate (the Rust client for Turso) speaks the Turso wire protocol and also handles local SQLite files transparently. + +#### Cargo.toml + +Create a new binary project and add the following dependencies: + +```sh +cargo new vec-demo +cd vec-demo +``` + +Replace the `[dependencies]` section of `Cargo.toml` with: + +```toml +[dependencies] +libsql = "0.9" +tokio = { version = "1", features = ["full"] } +``` + +`libsql` is the official Rust client for Turso / libSQL databases. It supports both local SQLite files and remote Turso connections with the same API, making it straightforward to develop locally and deploy to the cloud. `tokio` provides the async runtime — all `libsql` operations are `async`. + +Add the release-build optimisation profile from the project conventions: + +```toml +[profile.release] +opt-level = "z" +lto = true +strip = true +codegen-units = 1 +``` + +#### Opening a Local Connection + +Replace `src/main.rs` with the following: + +```rust +use libsql::{Builder, Database}; + +#[tokio::main] +async fn main() -> Result<(), Box<dyn std::error::Error>> { + let db: Database = Builder::new_local("vectors.db").build().await?; + let conn = db.connect()?; + + // Verify the connection works + let mut rows = conn.query("SELECT sqlite_version()", ()).await?; + if let Some(row) = rows.next().await? { + let version: String = row.get(0)?; + println!("SQLite version: {version}"); + } + + Ok(()) +} +``` + +Run it with `cargo run`. 
You should see output like: + +``` +SQLite version: 3.46.0 +``` + +A file named `vectors.db` will appear in the current directory. This is a standard SQLite database — you can open it with any SQLite client to inspect its contents. + +#### Enabling Vector Support with sqlite-vec + +The `libsql` crate ships with `sqlite-vec` built in. No separate installation is required. Vector functions become available automatically once you use the right column types and functions in your SQL. + +The key types and functions you will use throughout this course: + +| Construct | Purpose | +|---|---| +| `F32_BLOB(d)` | Column type for storing a *d*-dimensional float32 vector | +| `vector(json_array)` | Creates a vector from a JSON array literal | +| `vector_extract(blob)` | Converts a stored vector blob back to a JSON array | +| `vector_distance_cos(a, b)` | Cosine distance between two vectors (0 = identical, 2 = opposite) | +| `libsql_vector_idx(col)` | Index expression used in `CREATE INDEX` to build a fast approximate nearest-neighbour index over `col` | +| `vector_top_k(index_name, query, k)` | Table-valued function: searches the named vector index and returns the row IDs of the *k* rows nearest to a query vector | + +#### Creating a Vector Table + +Extend `main` to create a table that stores 3-dimensional float32 vectors: + +```rust +conn.execute( + "CREATE TABLE IF NOT EXISTS items ( + id INTEGER PRIMARY KEY, + label TEXT NOT NULL, + embedding F32_BLOB(3) NOT NULL + )", + (), +).await?; +``` + +`F32_BLOB(3)` declares a column that holds a 3-dimensional float32 vector stored as a binary blob. The `3` is the dimensionality — use the actual size of your embedding model's output (e.g., `F32_BLOB(768)` for a 768-dimensional model) in real projects. + +#### Creating a Vector Index + +Without an index, nearest-neighbour search performs a full table scan — computing the distance from the query to every stored vector. 
For small tables this is fine; at scale you need an index:
+
+```rust
+conn.execute(
+    "CREATE INDEX IF NOT EXISTS items_vec_idx
+     ON items (libsql_vector_idx(embedding))",
+    (),
+).await?;
+```
+
+This creates an approximate nearest-neighbour index over the `embedding` column (libSQL implements it with the DiskANN algorithm). Queries pass the index name, here `items_vec_idx`, to `vector_top_k` to search it. The index is updated incrementally as rows are inserted or deleted, so no manual rebuild is required.
+
+#### Putting It Together
+
+At this point your `main.rs` should look like this:
+
+```rust
+use libsql::{Builder, Database};
+
+#[tokio::main]
+async fn main() -> Result<(), Box<dyn std::error::Error>> {
+    let db: Database = Builder::new_local("vectors.db").build().await?;
+    let conn = db.connect()?;
+
+    // Verify connection
+    let mut rows = conn.query("SELECT sqlite_version()", ()).await?;
+    if let Some(row) = rows.next().await? {
+        let version: String = row.get(0)?;
+        println!("SQLite version: {version}");
+    }
+
+    // Create vector table
+    conn.execute(
+        "CREATE TABLE IF NOT EXISTS items (
+            id INTEGER PRIMARY KEY,
+            label TEXT NOT NULL,
+            embedding F32_BLOB(3) NOT NULL
+        )",
+        (),
+    ).await?;
+
+    // Create the vector index on the embedding column
+    conn.execute(
+        "CREATE INDEX IF NOT EXISTS items_vec_idx
+         ON items (libsql_vector_idx(embedding))",
+        (),
+    ).await?;
+
+    println!("Database ready.");
+    Ok(())
+}
+```
+
+`cargo run` should print (the version number may differ):
+
+```
+SQLite version: 3.46.0
+Database ready.
+```
+
+You now have a working local vector database. Exercises 1 through 5 build on this foundation, adding data, querying it, and connecting the full embedding-to-search pipeline.
+
+---
+
+### 7. Exercise 1 — Storing and Retrieving Vectors
+
+**Goal:** Insert a small set of labelled vectors into the `items` table created in §6, then retrieve them with a `SELECT` and deserialize the stored blob back into a Rust `Vec<f32>`. 🚧 Full content tracked in [nbd:081a55].
+
+---
+
+### 8.
Exercise 2 — K-Nearest Neighbor Search
+
+**Goal:** Use `vector_top_k` and `vector_distance_cos` to find the *k* vectors in the database most similar to a query vector, and display the results ranked by similarity score. 🚧 Full content tracked in [nbd:5674ce].
+
+---
+
+## Part 4 — Real Applications
+
+### 9. Generating Embeddings in Rust
+
+Before you can search by meaning, you need a way to convert text into vectors. This section covers two approaches available in Rust: running a local embedding model with `fastembed-rs` (no API key, works offline, suited for smaller models) and calling an HTTP embedding API such as the OpenAI Embeddings endpoint (larger, higher-quality models at the cost of latency and a network dependency). 🚧 Full content tracked in [nbd:4c961f].
+
+---
+
+### 10. Exercise 3 — Semantic Document Search
+
+**Goal:** Build a complete semantic search pipeline: embed a small corpus of text documents, store the embeddings in Turso, then accept a natural-language query, embed it, and return the top-*k* most relevant documents using vector similarity, all without any keyword matching. 🚧 Full content tracked in [nbd:1ef9f4].
+
+---
+
+### 11. Exercise 4 — Recommendation Engine
+
+**Goal:** Implement item-based collaborative filtering using vector similarity. Store item feature vectors (or learned item embeddings) in Turso, then given a target item, retrieve the *k* most similar items as recommendations. 🚧 Full content tracked in [nbd:e8be9a].
+
+---
+
+### 12. Exercise 5 — Retrieval-Augmented Generation
+
+**Goal:** Combine vector search with a language model to build a retrieval-augmented generation (RAG) pipeline: given a user question, retrieve the most relevant passages from a document store using semantic search, inject them into a prompt as context, and stream the language model's grounded answer back to the user. 🚧 Full content tracked in [nbd:5ed295].
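
The "inject them into a prompt as context" step of the pipeline above can be sketched without committing to any particular model or client. The following is a minimal illustration under stated assumptions, not the exercise's solution: the `Passage` struct, the `build_prompt` function, the character budget, and the prompt wording are all hypothetical names and choices introduced here, and both retrieval and the language-model call are left out.

```rust
/// A retrieved passage with its cosine distance to the query.
/// Hypothetical type for illustration; lower distance means more relevant.
#[derive(Debug, Clone)]
struct Passage {
    text: String,
    distance: f32,
}

/// Builds a grounded prompt: passages are ordered by ascending distance and
/// added as numbered context lines until `max_context_chars` would be exceeded.
fn build_prompt(question: &str, passages: &[Passage], max_context_chars: usize) -> String {
    let mut ranked: Vec<&Passage> = passages.iter().collect();
    // total_cmp gives a total order on f32, so the sort is well defined even for NaN.
    ranked.sort_by(|a, b| a.distance.total_cmp(&b.distance));

    let mut context = String::new();
    let mut used = 0usize; // characters of passage text included so far
    let mut n = 0usize; // running passage number
    for p in ranked {
        let len = p.text.chars().count();
        if used + len > max_context_chars {
            break; // budget spent: drop this and all less relevant passages
        }
        n += 1;
        context.push_str(&format!("[{n}] {}\n", p.text));
        used += len;
    }

    format!("Answer the question using only the context below.\n\nContext:\n{context}\nQuestion: {question}\nAnswer:")
}

fn main() {
    // Assumed sample data; in the exercise these would come from vector search.
    let passages = vec![
        Passage { text: "Turso is a SQLite-compatible database.".to_string(), distance: 0.42 },
        Passage { text: "F32_BLOB(3) stores a 3-dimensional float32 vector.".to_string(), distance: 0.10 },
    ];
    let prompt = build_prompt("How are vectors stored?", &passages, 200);
    println!("{prompt}");
}
```

Sorting by distance before truncating means the budget is spent on the most relevant passages first. A real pipeline would more likely budget in model tokens than in characters.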