docs(edu): add lisp-to-C compiler course with stubs and tickets [67e284]
18-section interactive course teaching compiler construction in Rust using nom. Covers MiniLisp parsing, AST design, semantic analysis, and C code generation. All sections stubbed; one nbd task ticket per section plus a project ticket (67e284) tracking completion.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

parent b8a3b9ddfe
commit 955cf029ab
@ -0,0 +1,74 @@
+++
title = "§7 Exercise 1: Storing and Retrieving Vectors"
priority = 5
status = "todo"
ticket_type = "task"
dependencies = []
+++
## §7 Exercise 1 — Storing and Retrieving Vectors — Stub to fill

File: `edu/src/vector-db.md`, section `### 7. Exercise 1 — Storing and Retrieving Vectors`

Replace this stub line with the full exercise:
> **Goal:** Insert a small set of labelled vectors [...] 🚧 Full content tracked in [nbd:081a55].

Follow the exercise format from `edu/src/markov.md`: Goal, Setup, Starter Code skeleton, numbered Steps, Reference Solution in `<details><summary>Show full solution</summary>`.

## Prerequisites (established in §6)

Reader has the `vec-demo` project with `libsql = "0.9"` and `tokio`. The `main` function opens a local connection via `Builder::new_local("vectors.db").build().await?` and has already created the `items` table (`id INTEGER PRIMARY KEY, label TEXT NOT NULL, embedding F32_BLOB(3) NOT NULL`) and the vector index on `embedding` (`libsql_vector_idx`).

## Goal

Insert 6 labelled 3-dimensional vectors, then SELECT all rows and print each label alongside its deserialized `Vec<f32>`.

## Vectors to use

| id | label | embedding |
|---|---|---|
| 1 | "cat" | [0.9, 0.1, 0.2] |
| 2 | "dog" | [0.8, 0.2, 0.3] |
| 3 | "car" | [0.1, 0.9, 0.1] |
| 4 | "truck" | [0.2, 0.8, 0.2] |
| 5 | "python" | [0.15, 0.1, 0.95] |
| 6 | "rust" | [0.1, 0.05, 0.9] |

Hand-crafted so animals cluster near [high, low, low], vehicles near [low, high, low], and programming languages near [low, low, high]. The §8 KNN exercise uses these clusters to verify correct nearest-neighbour results.

## Steps to cover

**Step 1 — Formatting a vector for INSERT.** Explain that `vector(?)` in SQL accepts a JSON array string. Show how to format a `Vec<f32>` in Rust:

```rust
fn vec_to_json(v: &[f32]) -> String {
    format!("[{}]", v.iter().map(|x| x.to_string()).collect::<Vec<_>>().join(","))
}
```

**Step 2 — Inserting rows.** Use `INSERT OR IGNORE` so re-running the program is idempotent:

```sql
INSERT OR IGNORE INTO items (id, label, embedding) VALUES (?, ?, vector(?))
```

Loop over a `Vec<(i64, &str, Vec<f32>)>` and call `conn.execute` for each row, passing id, label, and the JSON string as parameters.

**Step 3 — Selecting and deserialising.** Query all rows:

```sql
SELECT id, label, vector_extract(embedding) FROM items ORDER BY id
```

`vector_extract` returns a JSON array string (e.g. `"[0.9,0.1,0.2]"`). Add `serde_json = "1"` to Cargo.toml and parse it: `serde_json::from_str::<Vec<f32>>(&json_str)?`.
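
The round trip from Steps 1 and 3 can be sketched without a database. This is a minimal illustration, assuming a flat array of plain numbers; `json_to_vec` is a hypothetical std-only stand-in for the `serde_json::from_str::<Vec<f32>>` call the exercise actually uses, so the snippet runs with no extra crates.

```rust
/// Format a slice of f32 as a JSON array string, e.g. "[0.9,0.1,0.2]".
fn vec_to_json(v: &[f32]) -> String {
    let parts: Vec<String> = v.iter().map(|x| x.to_string()).collect();
    format!("[{}]", parts.join(","))
}

/// Hypothetical stand-in for `serde_json::from_str::<Vec<f32>>`.
/// Only handles a flat array of plain numbers; it is not a JSON parser.
fn json_to_vec(s: &str) -> Result<Vec<f32>, String> {
    let inner = s
        .trim()
        .strip_prefix('[')
        .and_then(|rest| rest.strip_suffix(']'))
        .ok_or_else(|| format!("not a JSON array: {s}"))?;
    if inner.trim().is_empty() {
        return Ok(Vec::new());
    }
    inner
        .split(',')
        .map(|tok| {
            tok.trim()
                .parse::<f32>()
                .map_err(|e| format!("bad number {tok:?}: {e}"))
        })
        .collect()
}

fn main() {
    let original = vec![0.9_f32, 0.1, 0.2];
    let json = vec_to_json(&original); // what would be bound to vector(?)
    let parsed = json_to_vec(&json).expect("round trip should parse");
    println!("{json} -> {parsed:?}");
}
```

In the real exercise the parsing half would be replaced by `serde_json`, which also handles whitespace, exponents, and malformed input more carefully.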

**Step 4 — Print results.** Format output as:

```
1 cat [0.9, 0.1, 0.2]
2 dog [0.8, 0.2, 0.3]
...
```

## Cargo.toml additions

Add `serde_json = "1"` for JSON array parsing.
@ -0,0 +1,80 @@
+++
title = "§10 Exercise 3: Semantic Document Search"
priority = 5
status = "todo"
ticket_type = "task"
dependencies = []
+++
## §10 Exercise 3 — Semantic Document Search — Stub to fill

File: `edu/src/vector-db.md`, section `### 10. Exercise 3 — Semantic Document Search`

Replace this stub line with the full exercise:
> **Goal:** Build a complete semantic search pipeline [...] 🚧 Full content tracked in [nbd:1ef9f4].

Follow the exercise format from `edu/src/markov.md`. This is the first exercise using real embeddings — it combines §6 (Turso setup), §8 (KNN search), and §9 (fastembed) into a complete pipeline.

## Goal

Embed a corpus of 15 short text passages with fastembed-rs, store the embeddings in Turso, then accept a natural-language query, embed it, and return the top-5 most semantically relevant passages — with no keyword matching.

## Setup

New project or extend vec-demo. Cargo.toml:
```toml
[dependencies]
libsql = "0.9"
fastembed = "4"
tokio = { version = "1", features = ["full"] }
```

Table schema uses `F32_BLOB(384)` (BGE-Small-EN-v1.5 output dimension):
```sql
CREATE TABLE IF NOT EXISTS docs (
    id INTEGER PRIMARY KEY,
    passage TEXT NOT NULL,
    embedding F32_BLOB(384) NOT NULL
)
```

`vector_top_k` (used in Step 3) searches a vector index, so also create a `libsql_vector_idx` index on `embedding`, as in §6.

## Corpus to use (15 passages across 3 topics)

**Rust programming (5):**
- "Rust uses an ownership system to guarantee memory safety without a garbage collector."
- "The borrow checker enforces that references do not outlive the data they point to."
- "Cargo is Rust's build system and package manager, used to manage dependencies and run tests."
- "Rust's trait system enables zero-cost abstractions and compile-time polymorphism."
- "Async Rust uses futures and the tokio runtime to handle concurrent I/O efficiently."

**Astronomy (5):**
- "A black hole is a region of spacetime where gravity is so strong that nothing can escape."
- "The Milky Way galaxy contains an estimated 100 to 400 billion stars."
- "Neutron stars are the collapsed cores of massive stars, with densities exceeding atomic nuclei."
- "The cosmic microwave background is the thermal radiation left over from the early universe."
- "Exoplanets are planets outside our solar system, detected via transit photometry or radial velocity."

**Cooking (5):**
- "Maillard reaction gives browned foods their distinctive flavour through amino acid and sugar reactions."
- "Sous vide cooking involves sealing food in vacuum bags and cooking at precise low temperatures."
- "Emulsification combines two immiscible liquids, such as oil and water, using an emulsifier like lecithin."
- "Fermentation converts sugars to acids or alcohol using microorganisms, used in bread, beer, and yogurt."
- "Knife skills — julienne, brunoise, chiffonade — determine the surface area and cooking time of vegetables."

## Steps to cover

**Step 1 — Embed the corpus.** Use fastembed to produce a `Vec<Vec<f32>>` for all 15 passages in one `model.embed()` call (batch is more efficient than one-at-a-time).

**Step 2 — Insert into Turso.** Loop over passages and embeddings together. Format `Vec<f32>` as JSON string for `vector(?)`. Use `INSERT OR IGNORE` so re-running is idempotent.
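
The pairing in Step 2 could look roughly like the sketch below, which prepares `(id, passage, json)` rows and rejects any embedding whose length does not match the `F32_BLOB(384)` column before an INSERT would run. `pair_rows` and `DIM` are hypothetical names for illustration, and the embeddings here are dummy values rather than fastembed output.

```rust
/// Expected embedding width; assumed to match the `F32_BLOB(384)` column.
const DIM: usize = 384;

/// Pair each passage with its embedding, assign ids starting at 1, and
/// format the embedding as the JSON string that `vector(?)` accepts.
fn pair_rows<'a>(
    passages: &[&'a str],
    embeddings: &[Vec<f32>],
) -> Result<Vec<(i64, &'a str, String)>, String> {
    if passages.len() != embeddings.len() {
        return Err(format!(
            "{} passages but {} embeddings",
            passages.len(),
            embeddings.len()
        ));
    }
    passages
        .iter()
        .zip(embeddings)
        .enumerate()
        .map(|(i, (passage, emb))| {
            if emb.len() != DIM {
                return Err(format!(
                    "row {}: expected {} dims, got {}",
                    i + 1,
                    DIM,
                    emb.len()
                ));
            }
            let parts: Vec<String> = emb.iter().map(|x| x.to_string()).collect();
            Ok((i as i64 + 1, *passage, format!("[{}]", parts.join(","))))
        })
        .collect()
}

fn main() {
    // Dummy embeddings standing in for the model output.
    let passages = ["first passage", "second passage"];
    let embeddings = vec![vec![0.25_f32; DIM]; passages.len()];
    match pair_rows(&passages, &embeddings) {
        Ok(rows) => println!("prepared {} rows, first id = {}", rows.len(), rows[0].0),
        Err(e) => eprintln!("cannot insert: {e}"),
    }
}
```

Each prepared tuple maps directly onto the three parameters of the `INSERT OR IGNORE` statement.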

**Step 3 — Embed the query and search.** Embed a single query string (same model, `model.embed(vec![query], None)?`), then run `vector_top_k` with k=5 and join to get passage text and cosine distance.

**Step 4 — Run three queries and verify results.** Verify the correct topic cluster surfaces:
- `"memory safety in systems programming"` → Rust passages
- `"stars and galaxies"` → astronomy passages
- `"fermentation and cooking techniques"` → cooking passages

Print results ranked by distance with the passage text.

## Reference solution

Full self-contained `main.rs` inside `<details>`: creates table, embeds and inserts all 15 passages, runs three queries, prints results.
@ -0,0 +1,18 @@
+++
title = "§1 What Is a Vector?"
priority = 5
status = "todo"
ticket_type = "task"
dependencies = []
+++
## §1 What Is a Vector? — ALREADY COMPLETE

This section is fully written in `edu/src/vector-db.md` under `### 1. What Is a Vector?`. No further content work is needed. Mark this ticket done.

## What was written

- Definition: a vector is an ordered list of numbers; each element is a coordinate along one axis of a space
- Geometric intuition in 2D and 3D: magnitude (`‖v‖ = √(Σ vᵢ²)`), direction, orthogonality via dot product
- High-dimensional spaces: the curse of dimensionality, normalisation onto the unit hypersphere, dimensions are not individually interpretable features
- Vectors as representations: the key insight that similarity in meaning corresponds to proximity in vector space — the basis of everything that follows
- Notation used throughout the course: **v**, `v[i]`, `‖v‖`, *d* for dimension, *n* for number of stored vectors
@ -0,0 +1,21 @@
+++
title = "§6 Setting Up Turso + sqlite-vec"
priority = 5
status = "todo"
ticket_type = "task"
dependencies = []
+++
## §6 Setting Up Turso + sqlite-vec — ALREADY COMPLETE

This section is fully written in `edu/src/vector-db.md` under `### 6. Setting Up`. No further content work is needed. Mark this ticket done.

## What was written

- Project setup: `cargo new vec-demo`, `libsql = "0.9"`, `tokio` with full features
- Release profile: `opt-level = "z"`, `lto = true`, `strip = true`, `codegen-units = 1`
- Opening a local connection: `Builder::new_local("vectors.db").build().await?`
- Verifying the connection with `SELECT sqlite_version()`
- Reference table of all libSQL vector constructs used in later exercises: `F32_BLOB(d)`, `vector()`, `vector_extract()`, `vector_distance_cos()`, `libsql_vector_idx`, `vector_top_k()`
- Creating a `F32_BLOB(3)` vector table
- Creating a vector index over `embedding` with `libsql_vector_idx(embedding)`
- Complete working `main.rs` listing — produces "SQLite version: x.y.z" and "Database ready."
@ -0,0 +1,65 @@
+++
title = "§8 Exercise 2: K-Nearest Neighbor Search"
priority = 5
status = "todo"
ticket_type = "task"
dependencies = []
+++
## §8 Exercise 2 — K-Nearest Neighbor Search — Stub to fill

File: `edu/src/vector-db.md`, section `### 8. Exercise 2 — K-Nearest Neighbor Search`

Replace this stub line with the full exercise:
> **Goal:** Use `vector_top_k` and `vector_distance_cos` [...] 🚧 Full content tracked in [nbd:5674ce].

Follow the exercise format from `edu/src/markov.md`.

## Prerequisites (established in §7)

Reader has the `vec-demo` project and has 6 rows in the `items` table: cat, dog, car, truck, python, rust with 3-dimensional embeddings.

## Goal

Given a query vector, use `vector_top_k` to find the 3 most similar items, join with the `items` table to retrieve labels and exact cosine distances, and display the results ranked by distance.

## Steps to cover

**Step 1 — Introduce `vector_top_k`.** Explain that this is a table-valued function (TVF) that returns row IDs of approximate nearest neighbours without a full table scan. Syntax:

```sql
SELECT knn.id FROM vector_top_k('items_idx', vector(?), ?) AS knn
```

The first argument is the name of the vector index created in §6, passed as a string literal (`items_idx` is a placeholder here; use whatever name §6 gave the index). The second is the query vector and the third is k. The function returns only the matching row IDs, in a column named `id`, so join back to `items` to get other columns.

**Step 2 — Full KNN query.** Show the complete query combining the TVF with a JOIN and exact distance computation:

```sql
SELECT items.id, items.label, vector_distance_cos(items.embedding, vector(?)) AS dist
FROM vector_top_k('items_idx', vector(?), ?) AS knn
JOIN items ON items.rowid = knn.id
ORDER BY dist ASC
```

Note: the query vector must be passed twice — once for `vector_top_k` (index traversal) and once for `vector_distance_cos` (exact distance). Both are the same JSON array string.

**Step 3 — Run three queries and print results.**

Query vectors to use:
- `[0.85, 0.15, 0.25]` → should be nearest cat and dog (animal cluster)
- `[0.15, 0.85, 0.15]` → should be nearest car and truck (vehicle cluster)
- `[0.1, 0.05, 0.92]` → should be nearest rust and python (language cluster)

Expected output format (distances are the cosine distances to the §7 vectors, rounded to four decimals; the last digit may differ slightly):
```
Query: [0.85, 0.15, 0.25]
1. cat dist=0.0040
2. dog dist=0.0045
3. truck dist=0.5541
```

**Step 4 — Explain ANN vs. exact search.** With only 6 rows the index is small enough that `vector_top_k` should return the same neighbours as an exact scan. At scale (millions of rows) it returns approximate results, and some true nearest neighbours may be missed. `vector_distance_cos` always gives the exact distance for any specific pair.
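
One way to check the index results against ground truth is an exact scan in plain Rust. The sketch below assumes the six §7 vectors and the three Step 3 query vectors; `cosine_distance` and `brute_force_knn` are illustrative helper names, not `libsql` functions, and the distance is computed as 1 minus cosine similarity.

```rust
/// Cosine distance: 1 - cosine similarity (0 = same direction, 2 = opposite).
fn cosine_distance(a: &[f32], b: &[f32]) -> f32 {
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let norm_a = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let norm_b = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    1.0 - dot / (norm_a * norm_b)
}

/// Exact k-nearest-neighbour search: score every item, sort, keep k.
fn brute_force_knn<'a>(
    items: &[(&'a str, [f32; 3])],
    query: &[f32],
    k: usize,
) -> Vec<(&'a str, f32)> {
    let mut scored: Vec<(&'a str, f32)> = items
        .iter()
        .map(|(label, emb)| (*label, cosine_distance(emb, query)))
        .collect();
    scored.sort_by(|a, b| a.1.total_cmp(&b.1));
    scored.truncate(k);
    scored
}

/// The six labelled vectors from §7.
const ITEMS: [(&str, [f32; 3]); 6] = [
    ("cat", [0.9, 0.1, 0.2]),
    ("dog", [0.8, 0.2, 0.3]),
    ("car", [0.1, 0.9, 0.1]),
    ("truck", [0.2, 0.8, 0.2]),
    ("python", [0.15, 0.1, 0.95]),
    ("rust", [0.1, 0.05, 0.9]),
];

fn main() {
    let queries = [[0.85_f32, 0.15, 0.25], [0.15, 0.85, 0.15], [0.1, 0.05, 0.92]];
    for q in &queries {
        println!("Query: {q:?}");
        for (rank, (label, dist)) in brute_force_knn(&ITEMS, q, 3).iter().enumerate() {
            println!("{}. {label} dist={dist:.4}", rank + 1);
        }
    }
}
```

If the labels from the SQL query and this scan disagree on six rows, the likelier culprit is the query or parameter order rather than index approximation.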

## Reference solution

Full `main.rs` inside `<details><summary>Show full solution</summary>`. The solution should re-run setup from §7 (create table, insert data) then run the three KNN queries.
@ -0,0 +1,54 @@
+++
title = "Course: Writing a Lisp-to-C Compiler in Rust"
priority = 5
status = "todo"
ticket_type = "project"
dependencies = ["e8da8b", "a93829", "3aeb62", "5835e9", "3dc36b", "685f5e", "a1a827", "b6c9ad", "a4c9f8", "d0b9f8", "6d40a7", "3e1250", "1eb794", "cbc6e3", "de82f1", "58b37a", "8fa47a", "1d16da"]
+++

## Course: Writing a Lisp-to-C Compiler in Rust

A complete self-guided interactive course teaching how to build a compiler from scratch in Rust using the nom parser-combinator library. The source language is MiniLisp — a minimal Lisp dialect. The compilation target is human-readable C.

## Course file

`edu/src/lisp-compiler.md`

## Section inventory

| § | Title | Ticket |
|---|---|---|
| 1 | Introduction: What We're Building | e8da8b |
| 2 | MiniLisp Language Specification | a93829 |
| 3 | Compiler Architecture: The Pipeline | 3aeb62 |
| 4 | Introduction to nom: Parser Combinators | 5835e9 |
| 5 | Setting Up the Project | 3dc36b |
| 6 | Recognizing Atoms: Integers, Booleans, Strings, Symbols | 685f5e |
| 7 | The Abstract Syntax Tree | a1a827 |
| 8 | Parsing Atoms with nom | b6c9ad |
| 9 | Parsing S-Expressions and Special Forms | a4c9f8 |
| 10 | Symbol Tables and Scope | d0b9f8 |
| 11 | Checking Special Forms | 6d40a7 |
| 12 | The C Runtime Preamble | 3e1250 |
| 13 | Generating C: Atoms and Expressions | 1eb794 |
| 14 | Generating C: Definitions and Functions | cbc6e3 |
| 15 | Generating C: Control Flow and Sequencing | de82f1 |
| 16 | The Compilation Pipeline | 58b37a |
| 17 | Testing the Compiler | 8fa47a |
| 18 | What's Next: Extensions and Further Reading | 1d16da |

## MiniLisp feature set

- **Types:** integers (`int64_t`), booleans (`#t`/`#f`), strings (`const char*`)
- **Special forms:** `define`, `lambda`, `if`, `let`, `begin`
- **Built-in operators:** `+`, `-`, `*`, `/`, `=`, `<`, `>`, `<=`, `>=`, `not`
- **Built-in functions:** `display`, `newline`, `error`
- **Comments:** `;` to end of line

## Explicitly out of scope

Closures, tail-call optimisation, pairs/lists, garbage collection, macros, variadic functions, floating-point. Each is discussed as a potential extension in §18.

## Definition of done

All 18 section tickets are `done` and `mdbook build` succeeds with no 🚧 stubs remaining in the output.
@ -0,0 +1,50 @@
+++
title = "vector-db"
priority = 5
status = "todo"
ticket_type = "project"
dependencies = ["21d9be", "584e0c", "99e1d9", "d9f850", "6ec5ff", "37cdd5", "081a55", "5674ce", "4c961f", "1ef9f4", "e8be9a", "5ed295"]
+++
## Project: Vector Database Self-Guided Course

This is the top-level project ticket for `edu/src/vector-db.md` — a self-guided mdbook course on vector databases in the **Vibed Learning** site (`edu/`).

The course is modelled on `edu/src/markov.md`. It teaches vector databases through 12 sections across 4 parts, mixing reading lessons and hands-on Rust programming exercises using Turso (`libsql` crate) and libSQL's built-in vector search for local vector storage.

## Course structure

| # | Title | Status |
|---|---|---|
| §1 | What Is a Vector? | Written in full |
| §2 | Embeddings | Stub [nbd:584e0c] |
| §3 | Vector Similarity | Stub [nbd:99e1d9] |
| §4 | What Is a Vector Database? | Stub [nbd:d9f850] |
| §5 | Under the Hood: ANN Algorithms | Stub [nbd:6ec5ff] |
| §6 | Setting Up | Written in full |
| §7 | Exercise 1 — Storing and Retrieving Vectors | Stub [nbd:081a55] |
| §8 | Exercise 2 — K-Nearest Neighbor Search | Stub [nbd:5674ce] |
| §9 | Generating Embeddings in Rust | Stub [nbd:4c961f] |
| §10 | Exercise 3 — Semantic Document Search | Stub [nbd:1ef9f4] |
| §11 | Exercise 4 — Recommendation Engine | Stub [nbd:e8be9a] |
| §12 | Exercise 5 — Retrieval-Augmented Generation | Stub [nbd:5ed295] |

## Filling a stub

1. Open `edu/src/vector-db.md`
2. Find the section (e.g. `### 2. Embeddings`)
3. Replace the stub line (`🚧 Full content tracked in [nbd:...]`) with full content
4. Run `mdbook build` from `edu/` — must pass cleanly
5. Mark the section ticket done

## Tech stack used in exercises

- **Runtime:** Tokio async
- **DB crate:** `libsql = "0.9"` (Turso / libSQL Rust client)
- **Vector support:** native libSQL vector search, built into `libsql` (no extra install)
- **Embeddings:** `fastembed` crate (local) or OpenAI-compatible HTTP API
- **Local connection:** `Builder::new_local("vectors.db").build().await?`
- **Vector column type:** `F32_BLOB(d)` where d = embedding dimension
- **KNN query:** `vector_top_k('index_name', vector('[...]'), k)` table-valued function, where the first argument names the vector index
- **Distance function:** `vector_distance_cos(a, b)` — 0 = identical, 2 = opposite

This project ticket closes when all 12 section tickets are done and `mdbook build` passes.
@ -0,0 +1,71 @@
+++
title = "§11 Exercise 4: Recommendation Engine"
priority = 5
status = "todo"
ticket_type = "task"
dependencies = []
+++
## §11 Exercise 4 — Recommendation Engine — Stub to fill

File: `edu/src/vector-db.md`, section `### 11. Exercise 4 — Recommendation Engine`

Replace this stub line with the full exercise:
> **Goal:** Implement item-based collaborative filtering using vector similarity. [...] 🚧 Full content tracked in [nbd:e8be9a].

Follow the exercise format from `edu/src/markov.md`.

## Goal

Build an item-based recommendation engine. Store item feature vectors in Turso, then given a target item, find the k most similar items using KNN and exclude the query item from the results.

## Approach

Use hand-crafted 5-dimensional feature vectors for a product catalogue (no fastembed dependency needed — keeps focus on the recommendation logic). Dimensions represent affinity scores for: [electronics, clothing, sports, food, books].

## Catalogue (10 items)

| id | name | embedding |
|---|---|---|
| 1 | "Laptop" | [0.95, 0.0, 0.1, 0.0, 0.2] |
| 2 | "Mechanical Keyboard" | [0.85, 0.0, 0.0, 0.0, 0.1] |
| 3 | "USB-C Hub" | [0.9, 0.0, 0.0, 0.0, 0.0] |
| 4 | "Running Shoes" | [0.0, 0.6, 0.9, 0.0, 0.0] |
| 5 | "Yoga Mat" | [0.0, 0.2, 0.95, 0.0, 0.0] |
| 6 | "Water Bottle" | [0.1, 0.1, 0.7, 0.0, 0.0] |
| 7 | "T-Shirt" | [0.0, 0.95, 0.1, 0.0, 0.0] |
| 8 | "Cookbook" | [0.0, 0.0, 0.0, 0.6, 0.9] |
| 9 | "Protein Bar" | [0.0, 0.0, 0.3, 0.95, 0.0] |
| 10 | "Novel" | [0.0, 0.0, 0.0, 0.1, 0.95] |

## Steps to cover

**Step 1 — Schema.** Table `products (id INTEGER PRIMARY KEY, name TEXT NOT NULL, embedding F32_BLOB(5) NOT NULL)` with a `libsql_vector_idx` vector index on `embedding`.

**Step 2 — Insert items.** Same pattern as Exercise 1: format `Vec<f32>` as JSON, `INSERT OR IGNORE`.

**Step 3 — Recommend function.** Write a helper:

```rust
async fn recommend(
    conn: &libsql::Connection,
    item_id: i64,
    k: usize,
) -> Result<Vec<(String, f64)>, Box<dyn std::error::Error>>
```

1. `SELECT vector_extract(embedding) FROM products WHERE id = ?` to get the query item's embedding as a JSON string
2. Pass that JSON string to `vector_top_k` with k+1 (to have room to exclude the query item)
3. JOIN to get product names and `vector_distance_cos` distances
4. Filter out `products.id = item_id`
5. Return the top k `(name, distance)` pairs
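
The last two steps above (drop the query item, keep the k closest) can be sketched independently of the database. In this illustration `top_k_excluding` and `format_recommendations` are hypothetical helper names, the candidate rows are hard-coded in place of the k+1 rows a KNN query would return, and the distances are rough hand calculations from the catalogue vectors rather than database output.

```rust
/// Drop the query item from the k+1 candidates and keep the k closest.
/// Each candidate is (product id, name, cosine distance), in any order.
fn top_k_excluding(
    mut candidates: Vec<(i64, String, f64)>,
    item_id: i64,
    k: usize,
) -> Vec<(String, f64)> {
    candidates.retain(|(id, _, _)| *id != item_id);
    candidates.sort_by(|a, b| a.2.total_cmp(&b.2));
    candidates
        .into_iter()
        .take(k)
        .map(|(_, name, dist)| (name, dist))
        .collect()
}

/// Render one line of output in the "Customers who liked ..." style.
fn format_recommendations(liked: &str, recs: &[(String, f64)]) -> String {
    let parts: Vec<String> = recs
        .iter()
        .map(|(name, dist)| format!("{name} ({dist:.3})"))
        .collect();
    format!("Customers who liked {liked} also liked: {}", parts.join(", "))
}

fn main() {
    // Stand-in for the rows returned when querying with the Laptop embedding
    // and k + 1 = 3; distances are approximate, worked out by hand.
    let candidates = vec![
        (3, "USB-C Hub".to_string(), 0.0266),
        (1, "Laptop".to_string(), 0.0),
        (2, "Mechanical Keyboard".to_string(), 0.0093),
    ];
    let recs = top_k_excluding(candidates, 1, 2);
    println!("{}", format_recommendations("Laptop", &recs));
}
```

Doing the exclusion in SQL with a `WHERE products.id != ?` clause is an equally reasonable choice; filtering in Rust just keeps the query identical to the one from Exercise 2.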

**Step 4 — Print recommendations for three items.**
- "Laptop" → expect Mechanical Keyboard, USB-C Hub (electronics cluster)
- "Running Shoes" → expect Yoga Mat, Water Bottle (sports cluster)
- "Cookbook" → expect Novel, Protein Bar (food/books cluster)

Output format: `"Customers who liked Laptop also liked: Mechanical Keyboard (0.023), USB-C Hub (0.041)"`

## Reference solution

Full `main.rs` inside `<details>`. The `recommend` function should be clearly separated from the setup boilerplate. The recommendation query pattern (SELECT embedding → feed as query to vector_top_k) is the key technique to highlight.
@ -0,0 +1,17 @@
# TODO

- [ ] Host the mdbook on Cloudflare Pages
- [ ] Host on vibebooks.elijah.run
- [ ] Create an `infra` directory containing OpenTofu configs for the above
- [ ] Add a big ol' disclaimer about these being AI generated and not intended to be definitive, trustworthy, or even good, just an experiment in generating tailored educational content about topics I am interested in but not sure where to start, and with a practical focus on exercises with Rust since that is the language I use most often

## Interactive Learning and Education
- [x] Git worktrees, how do use please
- [ ] How to structure a co-op profit sharing worker owned business
- [x] Hands-on: Markov Chains
- [~] Hands-on: Vector Databases
- [ ] Hands-on: Machine Learning; training a computer to play a game by playing against itself (à la AlphaGo Zero)
- [ ] Hands-on: Creating and training a simple LLM
- [~] Hands-on: Writing your own language (lisp, interpreted, compiled to C)
- [ ] Hands-on: Shader programming

@ -0,0 +1,194 @@
# Writing a Lisp-to-C Compiler in Rust

This course walks you through building a complete, working compiler from scratch. You will write every component yourself — a lexer, a parser, a semantic analyser, and a code generator — ending with a program that reads **MiniLisp** source code and emits valid C. The compiler is written in Rust and uses the [nom](https://github.com/rust-bakery/nom) parser-combinator library for all parsing work. Sections marked 🚧 are stubs whose full content is tracked in an `nbd` ticket.

---

## Table of Contents

**Part 1 — Foundations**

1. [Introduction: What We're Building](#1-introduction-what-were-building)
2. [MiniLisp Language Specification](#2-minilisp-language-specification)
3. [Compiler Architecture: The Pipeline](#3-compiler-architecture-the-pipeline)

**Part 2 — Parsing with nom**

4. [Introduction to nom: Parser Combinators](#4-introduction-to-nom-parser-combinators)
5. [Setting Up the Project](#5-setting-up-the-project)
6. [Recognizing Atoms: Integers, Booleans, Strings, Symbols](#6-recognizing-atoms-integers-booleans-strings-symbols)
7. [The Abstract Syntax Tree](#7-the-abstract-syntax-tree)
8. [Parsing Atoms with nom](#8-parsing-atoms-with-nom)
9. [Parsing S-Expressions and Special Forms](#9-parsing-s-expressions-and-special-forms)

**Part 3 — Semantic Analysis**

10. [Symbol Tables and Scope](#10-symbol-tables-and-scope)
11. [Checking Special Forms](#11-checking-special-forms)

**Part 4 — Code Generation**

12. [The C Runtime Preamble](#12-the-c-runtime-preamble)
13. [Generating C: Atoms and Expressions](#13-generating-c-atoms-and-expressions)
14. [Generating C: Definitions and Functions](#14-generating-c-definitions-and-functions)
15. [Generating C: Control Flow and Sequencing](#15-generating-c-control-flow-and-sequencing)

**Part 5 — Putting It Together**

16. [The Compilation Pipeline](#16-the-compilation-pipeline)
17. [Testing the Compiler](#17-testing-the-compiler)
18. [What's Next: Extensions and Further Reading](#18-whats-next-extensions-and-further-reading)

---

## Part 1 — Foundations

### 1. Introduction: What We're Building

A compiler is a program that transforms source code written in one language into equivalent code in another. By the end of this course you will have written one that accepts MiniLisp — a small, clean dialect of Lisp — and produces human-readable C that you can compile and run with any standard C compiler. Along the way you will implement each classic compiler stage from scratch: lexical analysis, parsing, semantic analysis, and code generation.

🚧 Full content tracked in [nbd:e8da8b].

---

### 2. MiniLisp Language Specification

MiniLisp is the source language of our compiler. It is a minimal Lisp dialect with integers, booleans, strings, first-class functions, lexical scope, and a small set of built-in operators. This section defines every syntactic form precisely, gives the grammar in EBNF, and shows a complete example program so you know exactly what the compiler must handle before you write a single line of Rust.

🚧 Full content tracked in [nbd:a93829].

---

### 3. Compiler Architecture: The Pipeline

Our compiler is a classic multi-stage pipeline: source text passes through a parser, producing an AST; the AST passes through a semantic analyser, which validates scope and form usage; the validated AST passes through a code generator, which emits C. This section maps that pipeline onto the module structure you will build and explains how data and errors flow between stages.

🚧 Full content tracked in [nbd:3aeb62].

---

## Part 2 — Parsing with nom

### 4. Introduction to nom: Parser Combinators

nom is a parser-combinator library: instead of writing a grammar file and running a generator, you write small Rust functions that each recognise a fragment of input, then combine them into larger parsers. This section introduces the core `IResult<I, O, E>` type, walks through the essential combinators (`tag`, `char`, `alt`, `many0`, `map`, `tuple`, `delimited`, `preceded`), and shows how to write, compose, and test parsers before you apply any of this to MiniLisp.
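
To make the combinator idea concrete without depending on nom, here is a hand-rolled sketch in plain Rust. It is not nom's API: `ParseResult` is a simplified stand-in for `IResult` (it uses `Option` and carries no error detail), and the local `tag`, `alt`, and `map` functions only loosely mirror the nom combinators of the same names.

```rust
/// A parser returns the remaining input plus a value, or None on failure.
/// (Simplified stand-in for nom's `IResult`, which also carries errors.)
type ParseResult<'a, T> = Option<(&'a str, T)>;

/// Recognise a literal prefix and return it.
fn tag<'a>(expected: &'static str) -> impl Fn(&'a str) -> ParseResult<'a, &'a str> {
    move |input: &'a str| {
        input
            .strip_prefix(expected)
            .map(|rest| (rest, &input[..expected.len()]))
    }
}

/// Try the first parser; if it fails, try the second on the same input.
fn alt<'a, T>(
    p1: impl Fn(&'a str) -> ParseResult<'a, T>,
    p2: impl Fn(&'a str) -> ParseResult<'a, T>,
) -> impl Fn(&'a str) -> ParseResult<'a, T> {
    move |input: &'a str| p1(input).or_else(|| p2(input))
}

/// Transform the value a parser produces, leaving the remaining input alone.
fn map<'a, A, B>(
    p: impl Fn(&'a str) -> ParseResult<'a, A>,
    f: impl Fn(A) -> B,
) -> impl Fn(&'a str) -> ParseResult<'a, B> {
    move |input: &'a str| p(input).map(|(rest, a)| (rest, f(a)))
}

/// A MiniLisp boolean literal, built purely by composing smaller parsers.
fn boolean(input: &str) -> ParseResult<'_, bool> {
    alt(map(tag("#t"), |_| true), map(tag("#f"), |_| false))(input)
}

fn main() {
    println!("{:?}", boolean("#t rest"));
    println!("{:?}", boolean("#f"));
    println!("{:?}", boolean("42"));
}
```

The point of the sketch is the shape: every parser has the same signature, so small parsers compose into larger ones by ordinary function calls.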

🚧 Full content tracked in [nbd:5835e9].

---

### 5. Setting Up the Project

You will create a new Rust binary crate for the compiler, add nom and any other dependencies to `Cargo.toml`, and lay out the module structure that the rest of the course fills in. By the end of this section you will have a project that compiles, a `src/main.rs` that reads from stdin, and placeholder modules for each compiler stage.

🚧 Full content tracked in [nbd:3dc36b].

---

### 6. Recognizing Atoms: Integers, Booleans, Strings, Symbols

Before building the full parser, you need nom parsers for each atomic value in MiniLisp: signed integers, boolean literals `#t` and `#f`, double-quoted strings with escape sequences, and symbol identifiers. This section develops each atom parser in isolation, explains the nom combinators used, and provides exercises to test your understanding before the parts are assembled into the full parser.

🚧 Full content tracked in [nbd:685f5e].

---

### 7. The Abstract Syntax Tree

The parser's output is an **Abstract Syntax Tree** — a Rust data structure that captures the meaning of a MiniLisp program without the syntactic noise of parentheses and whitespace. This section defines the `Expr` enum and its variants, discusses why the tree is structured the way it is, and implements `Display` so you can inspect parse results during development.
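
As a rough idea of what such a tree might look like, the sketch below defines a small `Expr` enum with a `Display` impl. The variant names and the choice to keep only `If` as a dedicated special form are assumptions made for illustration; the full section may well settle on a different shape.

```rust
use std::fmt;

/// Hypothetical AST for a subset of MiniLisp; variant names are assumptions.
#[derive(Debug, Clone, PartialEq)]
enum Expr {
    Int(i64),
    Bool(bool),
    Str(String),
    Symbol(String),
    /// (if cond then else)
    If(Box<Expr>, Box<Expr>, Box<Expr>),
    /// Any other parenthesised form, e.g. a function call.
    List(Vec<Expr>),
}

impl fmt::Display for Expr {
    /// Print the tree back out in Lisp-like syntax for debugging.
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            Expr::Int(n) => write!(f, "{n}"),
            Expr::Bool(true) => write!(f, "#t"),
            Expr::Bool(false) => write!(f, "#f"),
            Expr::Str(s) => write!(f, "{s:?}"), // Rust-style quoting and escapes
            Expr::Symbol(name) => write!(f, "{name}"),
            Expr::If(cond, then, otherwise) => write!(f, "(if {cond} {then} {otherwise})"),
            Expr::List(items) => {
                write!(f, "(")?;
                for (i, item) in items.iter().enumerate() {
                    if i > 0 {
                        write!(f, " ")?;
                    }
                    write!(f, "{item}")?;
                }
                write!(f, ")")
            }
        }
    }
}

fn main() {
    let expr = Expr::If(
        Box::new(Expr::List(vec![
            Expr::Symbol("<".into()),
            Expr::Symbol("x".into()),
            Expr::Int(0),
        ])),
        Box::new(Expr::Str("negative".into())),
        Box::new(Expr::Bool(false)),
    );
    println!("{expr}");
}
```

Boxing the children of `If` is what lets the enum refer to itself; without the indirection the type would have infinite size.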

🚧 Full content tracked in [nbd:a1a827].

---

### 8. Parsing Atoms with nom

With atom parsers and the AST defined, this section assembles them into a single `parse_atom` function that recognises any MiniLisp atom and returns the corresponding `Expr` variant. You will use `alt` to try each alternative in turn, learn how nom reports errors and how to interpret them, and write unit tests that verify correct parsing of every atom type.

🚧 Full content tracked in [nbd:b6c9ad].

---

### 9. Parsing S-Expressions and Special Forms

S-expressions are parenthesised lists: the heart of Lisp syntax. This section extends the parser to handle arbitrarily nested lists, whitespace between elements, and comments. It then lifts special forms — `define`, `if`, `lambda`, `let`, `begin` — out of the generic list parser so they become distinct AST variants, and covers how to handle recursive parsers in nom without running into borrow-checker problems.

🚧 Full content tracked in [nbd:a4c9f8].

---

## Part 3 — Semantic Analysis

### 10. Symbol Tables and Scope

A symbol table maps names to their definitions. This section walks through a scope-aware traversal of the AST that builds a symbol table, resolves every symbol reference to its definition, and reports helpful errors for undefined names or names used outside their scope. You will implement a simple environment chain — the standard technique for representing nested lexical scopes.
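
One possible shape for such an environment is sketched below, assuming a stack of hash maps in place of linked parent pointers (the two are equivalent for lookup order). The type `Env`, its methods, and the use of a plain `usize` as a definition ID are all hypothetical choices made for this illustration.

```rust
use std::collections::HashMap;

/// Hypothetical scope chain: the last map is the innermost scope.
#[derive(Debug)]
struct Env {
    scopes: Vec<HashMap<String, usize>>,
}

impl Env {
    /// Start with a single global scope.
    fn new() -> Self {
        Env { scopes: vec![HashMap::new()] }
    }

    /// Enter a nested scope (for example, a `lambda` or `let` body).
    fn push_scope(&mut self) {
        self.scopes.push(HashMap::new());
    }

    /// Leave the innermost scope; the global scope is never popped.
    fn pop_scope(&mut self) {
        if self.scopes.len() > 1 {
            self.scopes.pop();
        }
    }

    /// Bind `name` in the innermost scope, shadowing any outer binding.
    fn define(&mut self, name: &str, id: usize) {
        if let Some(scope) = self.scopes.last_mut() {
            scope.insert(name.to_string(), id);
        }
    }

    /// Look a name up from the innermost scope outwards.
    fn resolve(&self, name: &str) -> Option<usize> {
        self.scopes.iter().rev().find_map(|s| s.get(name).copied())
    }
}
```

A `None` from `resolve` is the point at which an "undefined name" error would be reported.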

🚧 Full content tracked in [nbd:d0b9f8].

---

### 11. Checking Special Forms

Special forms have fixed shapes: `if` needs exactly three sub-expressions; `define` needs a name and a body; `lambda` needs a parameter list and at least one body expression. This section adds arity and shape checks for each special form so that malformed programs produce clear error messages rather than mysterious C output.

🚧 Full content tracked in [nbd:6d40a7].

---

## Part 4 — Code Generation

### 12. The C Runtime Preamble

Every MiniLisp program compiles to a C file that begins with a standard preamble: `#include` directives, type aliases, boolean constants, and thin wrappers for built-in operations like `display` and `newline`. This section designs the preamble, explains why each piece is there, and shows how the code generator emits it before any user-defined code.

🚧 Full content tracked in [nbd:3e1250].

---

### 13. Generating C: Atoms and Expressions

This section implements the expression code generator — the recursive function that turns an `Expr` into a C expression string. Integers become C integer literals; booleans become `TRUE` and `FALSE`; strings become string literals; arithmetic and comparison operations become C operators; function calls become C function-call syntax. You will also handle name-mangling: turning Lisp symbols like `my-var` into valid C identifiers.
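
Name-mangling in particular is easy to sketch in isolation. The scheme below is an assumption for illustration, not necessarily the one the course adopts: ASCII letters, digits, and `_` pass through, `-` becomes `_`, and any other character is hex-escaped.

```rust
/// Hypothetical mangling scheme: keep ASCII letters, digits and `_`,
/// turn `-` into `_`, and hex-escape everything else.
fn mangle(symbol: &str) -> String {
    let mut out = String::new();
    for ch in symbol.chars() {
        match ch {
            'a'..='z' | 'A'..='Z' | '0'..='9' | '_' => out.push(ch),
            '-' => out.push('_'),
            other => out.push_str(&format!("_x{:x}_", other as u32)),
        }
    }
    // C identifiers cannot be empty or start with a digit.
    if out.is_empty() || out.starts_with(|c: char| c.is_ascii_digit()) {
        out.insert(0, '_');
    }
    out
}
```

Under this scheme `my-var` becomes `my_var` and `null?` becomes `null_x3f_`. Note that it is not collision-free (`my-var` and `my_var` map to the same identifier) and does not avoid C keywords; a fixed prefix on every user symbol is one common way to address both.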

🚧 Full content tracked in [nbd:1eb794].

---

### 14. Generating C: Definitions and Functions

Top-level `define` forms and `lambda` expressions compile to C function and variable declarations. This section covers how to emit forward declarations (so mutual recursion works), how to turn a MiniLisp parameter list into a C function signature, how `lambda` compiles to a named C function, and how top-level definitions are ordered in the output file.

🚧 Full content tracked in [nbd:cbc6e3].

---

### 15. Generating C: Control Flow and Sequencing

`if`, `begin`, and `let` each require their own code-generation strategy. `if` becomes a C ternary expression or an `if`/`else` statement depending on context; `begin` becomes a sequence of C statements with the last value forwarded; `let` introduces a C block with local variable declarations. This section works through each form and resolves the practical question of when to emit expressions versus statements.

🚧 Full content tracked in [nbd:de82f1].

---

## Part 5 — Putting It Together

### 16. The Compilation Pipeline

With all stages implemented, this section wires them into a single `compile` function and builds a CLI entry point that reads MiniLisp from a file or stdin and writes C to stdout or a file. You will add basic error reporting that shows the source location of each failure and trace a complete example — a recursive factorial function — through every stage.

🚧 Full content tracked in [nbd:58b37a].

---

### 17. Testing the Compiler

Good tests are what turn a working prototype into a reliable tool. This section adds unit tests for each compiler stage and integration tests that compile MiniLisp programs, feed the C output to `cc`, run the binary, and assert on stdout. You will build a small test corpus of MiniLisp programs covering all language features and ensure the compiler handles both valid and invalid input gracefully.

🚧 Full content tracked in [nbd:8fa47a].

---

### 18. What's Next: Extensions and Further Reading

The compiler you have built is deliberately minimal — a solid foundation. This final section surveys the directions you can take it further: tail-call optimisation, closures and lambda lifting, a garbage collector, hygienic macros, a type system, an interactive REPL, and a self-hosting MiniLisp standard library. It closes with a curated reading list for going deeper into compiler theory and Lisp implementation.

🚧 Full content tracked in [nbd:1d16da].
@ -0,0 +1,293 @@
# Vector Database Self-Guided Course

This document is a self-guided course on vector databases. It is organized into four parts: conceptual foundations, the internals of vector search systems, hands-on Rust exercises with Turso and sqlite-vec, and real-world application pipelines. Each section is either a reading lesson or a hands-on Rust programming exercise. Sections marked 🚧 are stubs whose full content is tracked in an `nbd` ticket — follow the ticket ID to find the detailed learning objectives and instructions.

---

## Table of Contents

**Part 1 — Foundations**

1. [What Is a Vector?](#1-what-is-a-vector)
2. [Embeddings](#2-embeddings)
3. [Vector Similarity](#3-vector-similarity)

**Part 2 — Vector Databases**

4. [What Is a Vector Database?](#4-what-is-a-vector-database)
5. [Under the Hood: ANN Algorithms](#5-under-the-hood-ann-algorithms)

**Part 3 — Turso + sqlite-vec Basics**

6. [Setting Up](#6-setting-up)
7. [Exercise 1 — Storing and Retrieving Vectors](#7-exercise-1--storing-and-retrieving-vectors)
8. [Exercise 2 — K-Nearest Neighbor Search](#8-exercise-2--k-nearest-neighbor-search)

**Part 4 — Real Applications**

9. [Generating Embeddings in Rust](#9-generating-embeddings-in-rust)
10. [Exercise 3 — Semantic Document Search](#10-exercise-3--semantic-document-search)
11. [Exercise 4 — Recommendation Engine](#11-exercise-4--recommendation-engine)
12. [Exercise 5 — Retrieval-Augmented Generation](#12-exercise-5--retrieval-augmented-generation)

---

## Part 1 — Foundations

### 1. What Is a Vector?

A **vector** is an ordered list of numbers. That is the entire definition — nothing more exotic than a list where position matters. A two-element list `[3.0, 4.0]` is a vector; so is a 1 536-element list of floating-point values produced by a language model. What makes vectors useful is that the numbers have a geometric interpretation: each element is a coordinate along one axis of a space, and the vector as a whole names a point (or an arrow from the origin to that point) in that space.

**Geometric intuition in two and three dimensions.** Start with the familiar. A 2-dimensional vector `[x, y]` is a point in the plane, the kind you plot on graph paper. The vector `[3.0, 4.0]` sits three units to the right of the origin and four units up. An arrow drawn from `[0, 0]` to `[3, 4]` has a **magnitude** (length) of `√(3² + 4²) = 5` and points in a specific **direction**. Magnitude and direction together completely characterise the vector; change either one and you have a different vector.

A 3-dimensional vector `[x, y, z]` extends this to physical space: three coordinates, three axes, one point. You can still compute a magnitude, `√(x² + y² + z²)`, and you can still talk about direction. Two 3D vectors point in the same direction if one is a positive scalar multiple of the other; they are **perpendicular** (orthogonal) if their dot product is zero.

**High-dimensional spaces.** Nothing in the definition of a vector limits it to two or three elements. A *d*-dimensional vector `[x₁, x₂, …, x_d]` is a point in *d*-dimensional space. The geometry extends perfectly: magnitude is `√(x₁² + x₂² + … + x_d²)`, the dot product of two vectors is `Σᵢ aᵢ · bᵢ`, and you can compute angles and distances between points just as you would in 2D or 3D.
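
The formulas above translate almost directly into code. The following is a minimal sketch using plain `f32` slices; the function names `dot`, `magnitude`, and `distance` are chosen here for illustration and work for any dimension *d*.

```rust
/// Dot product of two equal-length vectors: the sum of element-wise products.
fn dot(a: &[f32], b: &[f32]) -> f32 {
    assert_eq!(a.len(), b.len(), "vectors must have the same dimension");
    a.iter().zip(b).map(|(x, y)| x * y).sum()
}

/// Euclidean magnitude: the square root of a vector's dot product with itself.
fn magnitude(v: &[f32]) -> f32 {
    dot(v, v).sqrt()
}

/// Euclidean distance between two points: the magnitude of their difference.
fn distance(a: &[f32], b: &[f32]) -> f32 {
    assert_eq!(a.len(), b.len(), "vectors must have the same dimension");
    let diff: Vec<f32> = a.iter().zip(b).map(|(x, y)| x - y).collect();
    magnitude(&diff)
}
```

For the running example, `magnitude(&[3.0, 4.0])` evaluates to `5.0`, matching the hand calculation earlier in this section.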

High-dimensional geometry is counterintuitive in subtle ways that are worth knowing:

- **The curse of dimensionality.** In high-dimensional spaces, most of the volume of a ball lies in a thin shell near its surface rather than deep in its interior. Two vectors drawn at random from a standard distribution (for example, with independent Gaussian components) tend to be nearly orthogonal: their cosine similarity is close to zero, even though nothing was constructed to make them so. Distances also concentrate: the nearest and the farthest neighbour of a random query end up at almost the same distance, which makes "nearest neighbour" a harder and less clear-cut problem than it sounds.

- **Normalisation changes the geometry.** A **unit vector** has magnitude exactly 1. Dividing a vector by its magnitude (**normalisation**) projects it onto the surface of the unit hypersphere. On that sphere, Euclidean distance and cosine similarity rank neighbours identically, because `‖a − b‖² = 2 − 2·cos θ` for unit vectors and the dot product equals `cos θ`. Embedding models often output unit-normalised vectors precisely to exploit this equivalence.

- **Dimensions are not independent features.** When people say a language model embeds words into a 768-dimensional space, they do not mean "dimension 42 encodes the concept of colour." The axes of an embedding space are rarely interpretable on their own. Meaning is encoded in the *relative positions* of points — which vectors are close to which others — not in the values along any single axis.
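
The link between distance and angle on the unit sphere can be checked numerically. The sketch below (function names are illustrative) normalises two vectors and compares their squared Euclidean distance with `2 − 2·cos θ`, where `cos θ` is simply the dot product of the normalised vectors.

```rust
/// Dot product of two equal-length vectors.
fn dot(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| x * y).sum()
}

/// Scale a non-zero vector to unit length.
fn normalise(v: &[f32]) -> Vec<f32> {
    let norm = dot(v, v).sqrt();
    assert!(norm > 0.0, "cannot normalise the zero vector");
    v.iter().map(|x| x / norm).collect()
}

/// Squared Euclidean distance between two equal-length vectors.
fn squared_distance(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| (x - y) * (x - y)).sum()
}
```

For `a = normalise(&[3.0, 4.0])` and `b = normalise(&[4.0, 3.0])`, the cosine is 24/25 = 0.96 and the squared distance is 0.08 = 2 − 2·0.96, up to floating-point rounding. Since one quantity is a decreasing function of the other, sorting neighbours by either gives the same order.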

**Vectors as representations.** The key insight that makes vector databases useful is that real-world objects (documents, images, audio clips, products, users) can be represented as vectors such that *similarity in meaning or content corresponds to proximity in the vector space*. Two documents that discuss the same topic will, if embedded well, produce vectors that are close together. Two documents on unrelated topics will produce vectors that are far apart.

This is not magic; it is the result of training a model to produce embeddings where similar inputs cluster near each other. Once you have such a model, every search or comparison problem reduces to a geometric problem: find the vectors closest to a query vector. The rest of this course is about how to do that efficiently at scale.

**A note on notation.** Throughout this course, vectors are written in bold or with subscripts: **v**, **q**, or `v₁`. The *i*-th element of a vector **v** is written `v[i]` or `vᵢ`. The magnitude of **v** is written `|v|` or `‖v‖`. Dimension is written *d* and the number of stored vectors is written *n*.

---

### 2. Embeddings

Embeddings are the bridge between raw data and vector space. This section covers how language models, image encoders, and other neural networks learn to map heterogeneous inputs — words, sentences, images, products — into vectors where geometric proximity captures semantic similarity. 🚧 Full content tracked in [nbd:584e0c].

---

### 3. Vector Similarity

Once you have two vectors, how do you measure how alike they are? This section covers the three most common similarity functions used in vector search: **cosine similarity**, **dot product**, and **Euclidean distance** — their formulas, geometric interpretations, when each is appropriate, and the trade-offs in choosing between them. 🚧 Full content tracked in [nbd:99e1d9].

---

## Part 2 — Vector Databases

### 4. What Is a Vector Database?

A vector database is a data store built around one core operation: given a query vector **q**, return the *k* stored vectors most similar to **q**. This section covers what that means in practice — approximate nearest-neighbour (ANN) search, the use cases that make vector databases essential (semantic search, recommendations, RAG), and how they differ from traditional relational or key-value databases. 🚧 Full content tracked in [nbd:d9f850].
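
Stripped of indexing, the core operation is only a few lines long. The sketch below is an exact (brute-force) top-*k* search over in-memory vectors, using cosine distance as the similarity measure. It is an illustration of the query a vector database answers, not of how one is implemented; `cosine_distance` and `top_k` are names chosen here, and the code assumes non-zero vectors of equal dimension.

```rust
/// Cosine distance: 1 - cosine similarity (0 = same direction, 2 = opposite).
/// Assumes both vectors are non-zero and have the same length.
fn cosine_distance(a: &[f32], b: &[f32]) -> f32 {
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let na = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let nb = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    1.0 - dot / (na * nb)
}

/// Exact k-nearest-neighbour search by scanning every stored vector.
/// Returns (index, distance) pairs, closest first.
fn top_k(query: &[f32], stored: &[Vec<f32>], k: usize) -> Vec<(usize, f32)> {
    let mut scored: Vec<(usize, f32)> = stored
        .iter()
        .enumerate()
        .map(|(i, v)| (i, cosine_distance(query, v)))
        .collect();
    // `total_cmp` gives a total order on f32, so sorting cannot panic.
    scored.sort_by(|a, b| a.1.total_cmp(&b.1));
    scored.truncate(k);
    scored
}
```

Every query touches all *n* stored vectors and all *d* dimensions, which is exactly the cost that the approximate indexes in the next section are designed to avoid.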

---

### 5. Under the Hood: ANN Algorithms

Exact nearest-neighbour search compares the query against every stored vector, which costs O(*n*·*d*) per query and becomes too slow for interactive use once *n* reaches the millions. This section explains two widely used approximate methods, **HNSW** (Hierarchical Navigable Small World graphs) and **IVFFlat** (an inverted-file index whose lists store flat, uncompressed vectors): their index construction, query-time traversal, and the recall vs. latency trade-off each exposes. 🚧 Full content tracked in [nbd:6ec5ff].

---

## Part 3 — Turso + sqlite-vec Basics

### 6. Setting Up

This section walks through everything you need before writing a single SQL query: adding the right crates, opening a local Turso connection, and enabling the vector-search support that `libsql` bundles.

#### What You Are Building

Turso is a SQLite-compatible database with built-in support for vector similarity search. In local development you use a file-backed SQLite database; in production the same code points at a Turso cloud database. The `libsql` crate (the Rust client for Turso) speaks the Turso wire protocol and also handles local SQLite files transparently.

#### Cargo.toml

Create a new binary project and add the following dependencies:

```sh
cargo new vec-demo
cd vec-demo
```

Replace the `[dependencies]` section of `Cargo.toml` with:

```toml
[dependencies]
libsql = "0.9"
tokio = { version = "1", features = ["full"] }
```

`libsql` is the official Rust client for Turso / libSQL databases. It supports both local SQLite files and remote Turso connections with the same API, making it straightforward to develop locally and deploy to the cloud. `tokio` provides the async runtime — all `libsql` operations are `async`.

Add the release-build optimisation profile from the project conventions:

```toml
[profile.release]
opt-level = "z"
lto = true
strip = true
codegen-units = 1
```

#### Opening a Local Connection

Replace `src/main.rs` with the following:

```rust
use libsql::{Builder, Database};

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Open (or create) a file-backed local database.
    let db: Database = Builder::new_local("vectors.db").build().await?;
    let conn = db.connect()?;

    // Verify the connection works
    let mut rows = conn.query("SELECT sqlite_version()", ()).await?;
    if let Some(row) = rows.next().await? {
        let version: String = row.get(0)?;
        println!("SQLite version: {version}");
    }

    Ok(())
}
```

Run it with `cargo run`. You should see output like:

```
SQLite version: 3.46.0
```

A file named `vectors.db` will appear in the current directory. This is a standard SQLite database — you can open it with any SQLite client to inspect its contents.

#### Enabling Vector Support with sqlite-vec

Vector support is compiled into libSQL, which the `libsql` crate bundles, so no separate installation or extension-loading step is required. (The types and functions below are libSQL's native vector API; the standalone `sqlite-vec` extension exposes a different interface.) Vector functions become available automatically once you use the right column types and functions in your SQL.

The key types and functions you will use throughout this course:

| Construct | Purpose |
|---|---|
| `F32_BLOB(d)` | Column type for storing a *d*-dimensional float32 vector |
| `vector(json_array)` | Creates a vector from a JSON array literal |
| `vector_extract(blob)` | Converts a stored vector blob back to a JSON array |
| `vector_distance_cos(a, b)` | Cosine distance between two vectors (0 = identical direction, 2 = opposite) |
| `libsql_vector_idx(col)` | Index expression used in `CREATE INDEX` to build an approximate nearest-neighbour index over a vector column |
| `vector_top_k(index_name, query, k)` | Table-valued function: returns the row IDs of the *k* nearest rows to a query vector, using the named vector index |

#### Creating a Vector Table

Extend `main` to create a table that stores 3-dimensional float32 vectors:

```rust
conn.execute(
    "CREATE TABLE IF NOT EXISTS items (
        id        INTEGER PRIMARY KEY,
        label     TEXT NOT NULL,
        embedding F32_BLOB(3) NOT NULL
    )",
    (),
).await?;
```

`F32_BLOB(3)` declares a column that holds a 3-dimensional float32 vector stored as a binary blob. The `3` is the dimensionality — use the actual size of your embedding model's output (e.g., `F32_BLOB(768)` for a 768-dimensional model) in real projects.

#### Creating a Vector Index

Without an index, nearest-neighbour search performs a full table scan, computing the distance from the query to every stored vector. For small tables this is fine; at scale you need an index:

```rust
// libSQL declares a vector index by wrapping the column in
// `libsql_vector_idx(...)` inside an ordinary CREATE INDEX statement.
conn.execute(
    "CREATE INDEX IF NOT EXISTS items_vec_idx
     ON items (libsql_vector_idx(embedding))",
    (),
).await?;
```

This creates an approximate nearest-neighbour index over the `embedding` column (libSQL's vector index is based on the DiskANN algorithm). Queries that call `vector_top_k` with the index name `items_vec_idx` use this index. The index is updated incrementally as rows are inserted or deleted; no manual rebuild is required.

#### Putting It Together

At this point your `main.rs` should look like this:

```rust
use libsql::{Builder, Database};

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let db: Database = Builder::new_local("vectors.db").build().await?;
    let conn = db.connect()?;

    // Verify connection
    let mut rows = conn.query("SELECT sqlite_version()", ()).await?;
    if let Some(row) = rows.next().await? {
        let version: String = row.get(0)?;
        println!("SQLite version: {version}");
    }

    // Create vector table
    conn.execute(
        "CREATE TABLE IF NOT EXISTS items (
            id        INTEGER PRIMARY KEY,
            label     TEXT NOT NULL,
            embedding F32_BLOB(3) NOT NULL
        )",
        (),
    ).await?;

    // Create the vector index (libSQL syntax: wrap the column in
    // `libsql_vector_idx(...)`)
    conn.execute(
        "CREATE INDEX IF NOT EXISTS items_vec_idx
         ON items (libsql_vector_idx(embedding))",
        (),
    ).await?;

    println!("Database ready.");
    Ok(())
}
```

`cargo run` should print something like the following (the version number depends on the SQLite build bundled with your `libsql` release):

```
SQLite version: 3.46.0
Database ready.
```

You now have a working local vector database. Exercises 1 through 5 build on this foundation, adding data, querying it, and connecting the full embedding-to-search pipeline.

---

### 7. Exercise 1 — Storing and Retrieving Vectors

**Goal:** Insert a small set of labelled vectors into the `items` table created in §6, then retrieve them with a `SELECT` and deserialize the stored blob back into a Rust `Vec<f32>`. 🚧 Full content tracked in [nbd:081a55].
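
For the deserialization step, one approach is to read the raw blob and decode it by hand. The sketch below assumes that an `F32_BLOB` value is a packed sequence of little-endian IEEE 754 `f32` values, four bytes each; that layout is an assumption to verify against the libSQL documentation, and the helper names are hypothetical. An alternative that avoids the binary layout entirely is to select `vector_extract(embedding)` and parse the returned JSON array text.

```rust
/// Decode a packed little-endian f32 blob into a vector.
/// Returns `None` if the length is not a multiple of four bytes.
/// ASSUMPTION: the blob holds nothing but consecutive 4-byte floats.
fn blob_to_f32s(blob: &[u8]) -> Option<Vec<f32>> {
    if blob.len() % 4 != 0 {
        return None;
    }
    Some(
        blob.chunks_exact(4)
            .map(|c| f32::from_le_bytes([c[0], c[1], c[2], c[3]]))
            .collect(),
    )
}

/// Encode a slice of f32 values into the same packed little-endian layout.
fn f32s_to_blob(values: &[f32]) -> Vec<u8> {
    values.iter().flat_map(|v| v.to_le_bytes()).collect()
}
```

Round-tripping a vector such as `[0.9, 0.1, 0.2]` through `f32s_to_blob` and `blob_to_f32s` returns the original values bit for bit, which makes the pair easy to unit-test before wiring it to a real query.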

---

### 8. Exercise 2 — K-Nearest Neighbor Search

**Goal:** Use `vector_top_k` and `vector_distance_cos` to find the *k* vectors in the database most similar to a query vector, and display the results ranked by similarity score. 🚧 Full content tracked in [nbd:5674ce].

---

## Part 4 — Real Applications

### 9. Generating Embeddings in Rust

Before you can search by meaning, you need a way to convert text into vectors. This section covers two approaches available in Rust: running a local embedding model with `fastembed-rs` (no API key, works offline, suited for smaller models) and calling an HTTP embedding API such as the OpenAI Embeddings endpoint (larger, higher-quality models at the cost of latency and a network dependency). 🚧 Full content tracked in [nbd:4c961f].

---

### 10. Exercise 3 — Semantic Document Search

**Goal:** Build a complete semantic search pipeline: embed a small corpus of text documents, store the embeddings in Turso, then accept a natural-language query, embed it, and return the top-*k* most relevant documents using vector similarity — all without any keyword matching. 🚧 Full content tracked in [nbd:1ef9f4].

---

### 11. Exercise 4 — Recommendation Engine

**Goal:** Implement item-based collaborative filtering using vector similarity. Store item feature vectors (or learned item embeddings) in Turso, then given a target item, retrieve the *k* most similar items as recommendations. 🚧 Full content tracked in [nbd:e8be9a].

---

### 12. Exercise 5 — Retrieval-Augmented Generation

**Goal:** Combine vector search with a language model to build a retrieval-augmented generation (RAG) pipeline: given a user question, retrieve the most relevant passages from a document store using semantic search, inject them into a prompt as context, and stream the language model's grounded answer back to the user. 🚧 Full content tracked in [nbd:5ed295].