+++
title = "§10 Exercise 3: Semantic Document Search"
priority = 5
status = "done"
ticket_type = "task"
dependencies = []
+++

## §10 Exercise 3 — Semantic Document Search — Stub to fill

File: `edu/src/vector-db.md`, section `### 10. Exercise 3 — Semantic Document Search`

Replace this stub line with the full exercise:

> **Goal:** Build a complete semantic search pipeline [...] 🚧 Full content tracked in [nbd:1ef9f4].

Follow the exercise format from `edu/src/markov.md`. This is the first exercise using real embeddings: it combines §6 (Turso setup), §8 (KNN search), and §9 (fastembed) into a complete pipeline.

## Goal

Embed a corpus of 15 short text passages with fastembed-rs, store the embeddings in Turso, then accept a natural-language query, embed it, and return the top-5 most semantically relevant passages, with no keyword matching.

## Setup

New project or extend vec-demo. Cargo.toml:

```toml
[dependencies]
libsql = "0.9"
fastembed = "4"
tokio = { version = "1", features = ["full"] }
```

Table schema uses `F32_BLOB(384)` (BGE-Small-EN-v1.5 output dimension):

```sql
CREATE TABLE IF NOT EXISTS docs (
    id INTEGER PRIMARY KEY,
    passage TEXT NOT NULL,
    embedding F32_BLOB(384) NOT NULL
)
```

## Corpus to use (15 passages across 3 topics)

**Rust programming (5):**

- "Rust uses an ownership system to guarantee memory safety without a garbage collector."
- "The borrow checker enforces that references do not outlive the data they point to."
- "Cargo is Rust's build system and package manager, used to manage dependencies and run tests."
- "Rust's trait system enables zero-cost abstractions and compile-time polymorphism."
- "Async Rust uses futures and the tokio runtime to handle concurrent I/O efficiently."

**Astronomy (5):**

- "A black hole is a region of spacetime where gravity is so strong that nothing can escape."
- "The Milky Way galaxy contains an estimated 100 to 400 billion stars."
- "Neutron stars are the collapsed cores of massive stars, with densities exceeding atomic nuclei."
- "The cosmic microwave background is the thermal radiation left over from the early universe."
- "Exoplanets are planets outside our solar system, detected via transit photometry or radial velocity."

**Cooking (5):**

- "Maillard reaction gives browned foods their distinctive flavour through amino acid and sugar reactions."
- "Sous vide cooking involves sealing food in vacuum bags and cooking at precise low temperatures."
- "Emulsification combines two immiscible liquids, such as oil and water, using an emulsifier like lecithin."
- "Fermentation converts sugars to acids or alcohol using microorganisms, used in bread, beer, and yogurt."
- "Knife skills — julienne, brunoise, chiffonade — determine the surface area and cooking time of vegetables."

## Steps to cover

**Step 1 — Embed the corpus.** Use fastembed to produce a `Vec<Vec<f32>>` for all 15 passages in one `model.embed()` call (batching is more efficient than embedding one passage at a time).

**Step 2 — Insert into Turso.** Loop over passages and embeddings together. Format each `Vec<f32>` as a JSON string for `vector(?)`. Use `INSERT OR IGNORE` so re-running is idempotent.

**Step 3 — Embed the query and search.** Embed a single query string (same model, `model.embed(vec![query], None)?`), then run `vector_top_k` with k=5 and join to get passage text and cosine distance.

**Step 4 — Run three queries and verify results.** Verify that the correct topic cluster surfaces:

- `"memory safety in systems programming"` → Rust passages
- `"stars and galaxies"` → astronomy passages
- `"fermentation and cooking techniques"` → cooking passages

Print results ranked by distance with the passage text.

## Reference solution

Full self-contained `main.rs` inside `
`: creates table, embeds and inserts all 15 passages, runs three queries, prints results.
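
Two small pieces of the steps above can be sketched with the standard library alone, which may help when drafting the reference solution. The first formats an embedding as the JSON array string that Step 2 binds to `vector(?)`; the second computes cosine distance locally, which can serve as a sanity check on the distances printed in Step 4. The function names `embedding_to_json` and `cosine_distance` are hypothetical, and the sketch assumes the database reports cosine distance as 1 minus cosine similarity.

```rust
/// Serialise an embedding as a JSON array string such as "[1,0.5,-0.25]",
/// intended for binding to the `?` placeholder in `vector(?)`.
/// Assumption: every component is finite. NaN or infinity would not
/// produce valid JSON and should be rejected before calling this.
fn embedding_to_json(embedding: &[f32]) -> String {
    let parts: Vec<String> = embedding.iter().map(|x| x.to_string()).collect();
    format!("[{}]", parts.join(","))
}

/// Cosine distance, defined here as 1 - cosine similarity
/// (0 = same direction, 1 = orthogonal, 2 = opposite).
/// Returns None when the lengths differ or either vector has zero norm.
fn cosine_distance(a: &[f32], b: &[f32]) -> Option<f32> {
    if a.len() != b.len() {
        return None;
    }
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let norm_a = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let norm_b = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    if norm_a == 0.0 || norm_b == 0.0 {
        return None;
    }
    Some(1.0 - dot / (norm_a * norm_b))
}
```

In the insert loop, `embedding_to_json(&embedding)` would be passed as the bound parameter for each passage. `cosine_distance` can be applied to the query embedding and a stored embedding to check that the ranking returned by the database is plausible.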