---
# edu-ga52
title: '§10 Exercise 3: Semantic Document Search'
status: completed
type: task
priority: normal
created_at: 2026-03-10T23:30:00Z
updated_at: 2026-03-10T23:30:00Z
---
## §10 Exercise 3 — Semantic Document Search — Stub to fill
File: `edu/src/vector-db.md`, section `### 10. Exercise 3 — Semantic Document Search`
Replace this stub line with the full exercise:
> **Goal:** Build a complete semantic search pipeline [...] 🚧 Full content tracked in [nbd:1ef9f4].
Follow the exercise format from `edu/src/markov.md`. This is the first exercise using real embeddings — it combines §6 (Turso setup), §8 (KNN search), and §9 (fastembed) into a complete pipeline.
## Goal
Embed a corpus of 15 short text passages with fastembed-rs, store the embeddings in Turso, then accept a natural-language query, embed it, and return the top-5 most semantically relevant passages — with no keyword matching.
## Setup
Create a new project or extend `vec-demo`. Add these dependencies to `Cargo.toml`:
```toml
[dependencies]
libsql = "0.9"
fastembed = "4"
tokio = { version = "1", features = ["full"] }
```
The table schema stores embeddings as `F32_BLOB(384)`, matching the 384-dimensional output of BGE-Small-EN-v1.5:
```sql
CREATE TABLE IF NOT EXISTS docs (
    id        INTEGER PRIMARY KEY,
    passage   TEXT NOT NULL,
    embedding F32_BLOB(384) NOT NULL
);
```
## Corpus to use (15 passages across 3 topics)
**Rust programming (5):**
- "Rust uses an ownership system to guarantee memory safety without a garbage collector."
- "The borrow checker enforces that references do not outlive the data they point to."
- "Cargo is Rust's build system and package manager, used to manage dependencies and run tests."
- "Rust's trait system enables zero-cost abstractions and compile-time polymorphism."
- "Async Rust uses futures and the tokio runtime to handle concurrent I/O efficiently."
**Astronomy (5):**
- "A black hole is a region of spacetime where gravity is so strong that nothing can escape."
- "The Milky Way galaxy contains an estimated 100 to 400 billion stars."
- "Neutron stars are the collapsed cores of massive stars, with densities exceeding atomic nuclei."
- "The cosmic microwave background is the thermal radiation left over from the early universe."
- "Exoplanets are planets outside our solar system, detected via transit photometry or radial velocity."
**Cooking (5):**
- "Maillard reaction gives browned foods their distinctive flavour through amino acid and sugar reactions."
- "Sous vide cooking involves sealing food in vacuum bags and cooking at precise low temperatures."
- "Emulsification combines two immiscible liquids, such as oil and water, using an emulsifier like lecithin."
- "Fermentation converts sugars to acids or alcohol using microorganisms, used in bread, beer, and yogurt."
- "Knife skills — julienne, brunoise, chiffonade — determine the surface area and cooking time of vegetables."
## Steps to cover
**Step 1 — Embed the corpus.** Use fastembed to produce a `Vec<Vec<f32>>` for all 15 passages in a single `model.embed()` call; batching is more efficient than embedding one passage at a time.
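**Step 2 — Insert into Turso.** Loop over passages and embeddings together. Format each `Vec<f32>` as a JSON array string and bind it to `vector(?)`. Use `INSERT OR IGNORE` with an explicit `id` for each passage so that re-running is idempotent. The table has no other unique constraint, so without a fixed `id` every run would insert 15 fresh rows.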
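The formatting half of Step 2 can be sketched with the standard library alone. The helper names `to_vector_literal` and `prepare_rows` are hypothetical, as is the choice of 1-based ids in corpus order. The sketch assumes every embedding value is finite, since `NaN` or infinity would not render as a valid JSON number.

```rust
/// Render an embedding as a JSON-style array such as "[0.5,-1,2.25]".
/// This is the text form meant to be bound to the `?` in `vector(?)`.
/// Assumption: all values are finite.
fn to_vector_literal(embedding: &[f32]) -> String {
    let parts: Vec<String> = embedding.iter().map(|x| x.to_string()).collect();
    format!("[{}]", parts.join(","))
}

/// Pair each passage with an explicit 1-based id and its vector literal.
/// A fixed id per passage is what lets `INSERT OR IGNORE` skip existing rows.
fn prepare_rows(passages: &[&str], embeddings: &[Vec<f32>]) -> Vec<(i64, String, String)> {
    assert_eq!(passages.len(), embeddings.len(), "one embedding per passage");
    passages
        .iter()
        .zip(embeddings)
        .enumerate()
        .map(|(i, (p, e))| (i as i64 + 1, p.to_string(), to_vector_literal(e)))
        .collect()
}

fn main() {
    // Tiny stand-in data; the real exercise uses 15 passages of 384 floats each.
    let passages = ["first passage", "second passage"];
    let embeddings = vec![vec![0.5_f32, -1.0, 2.25], vec![0.0, 0.125, 3.0]];
    for (id, passage, literal) in prepare_rows(&passages, &embeddings) {
        println!("{id}\t{passage}\t{literal}");
    }
}
```

Each tuple maps onto one parameterised insert of the form `INSERT OR IGNORE INTO docs (id, passage, embedding) VALUES (?, ?, vector(?))`.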
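**Step 3 — Embed the query and search.** Embed a single query string with the same model (`model.embed(vec![query], None)?`), then run `vector_top_k` with k=5 and join back to `docs` to get the passage text, computing the cosine distance for each row with `vector_distance_cos`. `vector_top_k` queries a vector index rather than the table itself, so `docs.embedding` needs an index created with `libsql_vector_idx` before this step.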
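To make the ranking in Step 3 concrete, here is a minimal brute-force sketch of cosine-distance top-k in plain Rust. It is not how Turso executes `vector_top_k`; it is an exact cross-check that can be run against the same embeddings in memory. The function names `cosine_distance` and `top_k` are hypothetical, and the sketch assumes non-zero vectors (a zero vector would produce `NaN`).

```rust
/// Cosine distance, defined here as 1 - cosine similarity.
/// 0 means the vectors point the same way, 2 means opposite directions.
/// Assumption: neither vector has zero length.
fn cosine_distance(a: &[f32], b: &[f32]) -> f32 {
    assert_eq!(a.len(), b.len(), "vectors must have the same dimension");
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let norm_a = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let norm_b = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    1.0 - dot / (norm_a * norm_b)
}

/// Exact top-k: score every document, sort ascending by distance, keep k.
/// Returns (document index, distance) pairs.
fn top_k(query: &[f32], docs: &[Vec<f32>], k: usize) -> Vec<(usize, f32)> {
    let mut scored: Vec<(usize, f32)> = docs
        .iter()
        .enumerate()
        .map(|(i, d)| (i, cosine_distance(query, d)))
        .collect();
    scored.sort_by(|a, b| a.1.total_cmp(&b.1));
    scored.truncate(k);
    scored
}

fn main() {
    // Toy 2-dimensional vectors standing in for 384-dimensional embeddings.
    let docs = vec![
        vec![1.0_f32, 0.0],
        vec![0.0, 1.0],
        vec![1.0, 1.0],
        vec![-1.0, 0.0],
    ];
    for (index, distance) in top_k(&[1.0, 0.0], &docs, 2) {
        println!("doc {index}: distance {distance:.4}");
    }
}
```

With only 15 rows, an exact scan like this is cheap, so comparing its ordering with the database results is a quick way to spot a formatting or dimension mistake.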
**Step 4 — Run three queries and verify results.** For each query, check that the expected topic cluster dominates the top 5:
- `"memory safety in systems programming"` → Rust passages
- `"stars and galaxies"` → astronomy passages
- `"fermentation and cooking techniques"` → cooking passages
Print results ranked by distance with the passage text.
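The verification in Step 4 can also be automated. The sketch below assumes the passages were inserted with ids 1 to 15 in the corpus order listed above (Rust 1 to 5, astronomy 6 to 10, cooking 11 to 15). That id scheme, the `Topic` enum, and the `topic_of` and `hits` helpers are all hypothetical.

```rust
#[derive(Debug, PartialEq, Clone, Copy)]
enum Topic {
    Rust,
    Astronomy,
    Cooking,
}

/// Map a row id to its topic.
/// Assumption: ids 1..=15 were assigned in corpus order.
fn topic_of(id: i64) -> Option<Topic> {
    match id {
        1..=5 => Some(Topic::Rust),
        6..=10 => Some(Topic::Astronomy),
        11..=15 => Some(Topic::Cooking),
        _ => None,
    }
}

/// Count how many returned ids belong to the expected topic.
fn hits(result_ids: &[i64], expected: Topic) -> usize {
    result_ids
        .iter()
        .filter(|&&id| topic_of(id) == Some(expected))
        .count()
}

fn main() {
    // Made-up result ids for the query "stars and galaxies", for illustration only.
    let ids = [7, 6, 8, 10, 9];
    let n = hits(&ids, Topic::Astronomy);
    println!("{n}/{} results in the expected topic", ids.len());
}
```

A strict check would require 5 of 5 hits per query. A looser threshold such as 4 of 5 may be more realistic for the third query, which mentions both fermentation and general cooking techniques.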
## Reference solution
Provide a full, self-contained `main.rs` inside a `<details>` block. It creates the table, embeds and inserts all 15 passages, runs the three queries, and prints the results.