+++
title = "§10 Exercise 3: Semantic Document Search"
priority = 5
status = "todo"
ticket_type = "task"
dependencies = []
+++
§10 Exercise 3 — Semantic Document Search — Stub to fill
File: edu/src/vector-db.md, section ### 10. Exercise 3 — Semantic Document Search
Replace this stub line with the full exercise:
Goal: Build a complete semantic search pipeline [...] 🚧 Full content tracked in [nbd:1ef9f4].
Follow the exercise format from edu/src/markov.md. This is the first exercise using real embeddings — it combines §6 (Turso setup), §8 (KNN search), and §9 (fastembed) into a complete pipeline.
Goal
Embed a corpus of 15 short text passages with fastembed-rs, store the embeddings in Turso, then accept a natural-language query, embed it, and return the top-5 most semantically relevant passages — with no keyword matching.
Setup
Create a new project or extend vec-demo. Add to Cargo.toml:
[dependencies]
libsql = "0.9"
fastembed = "4"
tokio = { version = "1", features = ["full"] }
Table schema uses F32_BLOB(384) (BGE-Small-EN-v1.5 output dimension):
CREATE TABLE IF NOT EXISTS docs (
id INTEGER PRIMARY KEY,
passage TEXT NOT NULL,
embedding F32_BLOB(384) NOT NULL
)
Corpus to use (15 passages across 3 topics)
Rust programming (5):
- "Rust uses an ownership system to guarantee memory safety without a garbage collector."
- "The borrow checker enforces that references do not outlive the data they point to."
- "Cargo is Rust's build system and package manager, used to manage dependencies and run tests."
- "Rust's trait system enables zero-cost abstractions and compile-time polymorphism."
- "Async Rust uses futures and the tokio runtime to handle concurrent I/O efficiently."
Astronomy (5):
- "A black hole is a region of spacetime where gravity is so strong that nothing can escape."
- "The Milky Way galaxy contains an estimated 100 to 400 billion stars."
- "Neutron stars are the collapsed cores of massive stars, with densities exceeding atomic nuclei."
- "The cosmic microwave background is the thermal radiation left over from the early universe."
- "Exoplanets are planets outside our solar system, detected via transit photometry or radial velocity."
Cooking (5):
- "Maillard reaction gives browned foods their distinctive flavour through amino acid and sugar reactions."
- "Sous vide cooking involves sealing food in vacuum bags and cooking at precise low temperatures."
- "Emulsification combines two immiscible liquids, such as oil and water, using an emulsifier like lecithin."
- "Fermentation converts sugars to acids or alcohol using microorganisms, used in bread, beer, and yogurt."
- "Knife skills — julienne, brunoise, chiffonade — determine the surface area and cooking time of vegetables."
Steps to cover
Step 1 — Embed the corpus. Use fastembed to produce a Vec<Vec<f32>> for all 15 passages in one model.embed() call (a single batched call is more efficient than embedding the passages one at a time).
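Before inserting anything, it may help to confirm that the batch has the shape the F32_BLOB(384) column expects. The sketch below uses only the standard library; `check_batch` and `EMBEDDING_DIM` are hypothetical names introduced for illustration and are not part of fastembed.

```rust
/// Output dimension of BGE-Small-EN-v1.5, matching F32_BLOB(384) in the schema.
const EMBEDDING_DIM: usize = 384;

/// Hypothetical helper (not a fastembed API): checks that there is one
/// embedding per passage and that every embedding has `dim` components.
fn check_batch(embeddings: &[Vec<f32>], passages: usize, dim: usize) -> Result<(), String> {
    if embeddings.len() != passages {
        return Err(format!(
            "expected {} embeddings, got {}",
            passages,
            embeddings.len()
        ));
    }
    for (i, e) in embeddings.iter().enumerate() {
        if e.len() != dim {
            return Err(format!(
                "embedding {} has dimension {}, expected {}",
                i,
                e.len(),
                dim
            ));
        }
    }
    Ok(())
}
```

In the exercise this would be called as `check_batch(&embeddings, 15, EMBEDDING_DIM)?` right after the embed call, assuming the surrounding function returns a compatible error type.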
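Step 2 — Insert into Turso. Loop over passages and embeddings together. Format each Vec<f32> as a JSON array string and bind it to vector(?). Insert an explicit id (for example, the passage's 1-based position in the corpus) and use INSERT OR IGNORE: the statement only skips a row when it conflicts on the primary key, so with auto-assigned ids a re-run would insert duplicates instead of being idempotent.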
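The string formatting in this step can be sketched with the standard library alone. `to_vector_literal` and `INSERT_SQL` are hypothetical names; the statement assumes the docs schema above and an explicitly supplied id, and the helper assumes every component is finite (NaN or infinity would not produce a valid array literal).

```rust
/// Assumed insert statement for the `docs` table from the schema above.
/// Supplying `id` explicitly is what lets INSERT OR IGNORE skip existing rows.
const INSERT_SQL: &str =
    "INSERT OR IGNORE INTO docs (id, passage, embedding) VALUES (?, ?, vector(?))";

/// Hypothetical helper: formats an embedding as a JSON-style array string
/// such as "[1,0.5,-2.25]" for binding to the `?` inside `vector(?)`.
/// Assumes all values are finite.
fn to_vector_literal(embedding: &[f32]) -> String {
    let parts: Vec<String> = embedding.iter().map(|x| x.to_string()).collect();
    format!("[{}]", parts.join(","))
}
```

Binding the string as a parameter, rather than splicing it into the SQL text, keeps the statement reusable across all 15 rows.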
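Step 3 — Embed the query and search. Embed a single query string with the same model (model.embed(vec![query], None)?), then run vector_top_k with k=5 and join back to docs to get the passage text. vector_top_k needs a vector index on docs.embedding and returns only row ids, so compute the cosine distance in the same query with vector_distance_cos.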
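With only 15 rows, a brute-force ranking in plain Rust is a cheap way to cross-check what the database returns. This is a stand-in for illustration, not how vector_top_k works internally; `cosine_distance` and `top_k` are hypothetical helpers, and the sketch assumes non-zero vectors of equal length.

```rust
/// Cosine distance, 1 - cos(a, b): 0 for the same direction, 2 for opposite.
/// Assumes both vectors are non-zero and have the same length.
fn cosine_distance(a: &[f32], b: &[f32]) -> f32 {
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let norm_a: f32 = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let norm_b: f32 = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    1.0 - dot / (norm_a * norm_b)
}

/// Returns (index into `docs`, distance) for the `k` nearest documents,
/// nearest first, by scanning every document.
fn top_k(query: &[f32], docs: &[Vec<f32>], k: usize) -> Vec<(usize, f32)> {
    let mut scored: Vec<(usize, f32)> = docs
        .iter()
        .enumerate()
        .map(|(i, d)| (i, cosine_distance(query, d)))
        .collect();
    // total_cmp gives a total order on f32, so the sort cannot panic.
    scored.sort_by(|a, b| a.1.total_cmp(&b.1));
    scored.truncate(k);
    scored
}
```

If the indices returned by `top_k(&query_embedding, &embeddings, 5)` disagree with the database's top five, the insert order or the vector formatting is the first place to look.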
Step 4 — Run three queries and verify results. Verify the correct topic cluster surfaces:
- "memory safety in systems programming" → Rust passages
- "stars and galaxies" → astronomy passages
- "fermentation and cooking techniques" → cooking passages
Print results ranked by distance with the passage text.
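The verification and printing could be sketched as follows. The id ranges are an assumption: they hold only if passages are inserted in corpus order with ids 1 to 15 (Rust 1-5, astronomy 6-10, cooking 11-15). `topic_of`, `count_topic`, and `format_hit` are hypothetical helpers, and how many of the five hits must match is left to the exercise author.

```rust
/// Maps a row id to its topic. ASSUMPTION: ids 1..=15 were assigned in
/// corpus order (Rust, then astronomy, then cooking).
fn topic_of(id: i64) -> Option<&'static str> {
    match id {
        1..=5 => Some("rust"),
        6..=10 => Some("astronomy"),
        11..=15 => Some("cooking"),
        _ => None,
    }
}

/// Counts how many of the returned ids belong to the expected topic.
fn count_topic(ids: &[i64], topic: &str) -> usize {
    ids.iter().filter(|&&id| topic_of(id) == Some(topic)).count()
}

/// Formats one result line: rank, distance to four decimals, passage text.
fn format_hit(rank: usize, distance: f64, passage: &str) -> String {
    format!("{rank}. [{distance:.4}] {passage}")
}
```

For the first query, one possible check is `assert!(count_topic(&ids, "rust") >= 3)`, where the threshold of three out of five is an illustrative choice.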
Reference solution
Full self-contained main.rs inside <details>: creates table, embeds and inserts all 15 passages, runs three queries, prints results.