+++
title = "§10 Exercise 3: Semantic Document Search"
priority = 5
status = "todo"
ticket_type = "task"
dependencies = []
+++

## §10 Exercise 3 — Semantic Document Search — Stub to fill

File: `edu/src/vector-db.md`, section `### 10. Exercise 3 — Semantic Document Search`

Replace this stub line with the full exercise:

> **Goal:** Build a complete semantic search pipeline [...] 🚧 Full content tracked in [nbd:1ef9f4].

Follow the exercise format from `edu/src/markov.md`. This is the first exercise that uses real embeddings: it combines §6 (Turso setup), §8 (KNN search), and §9 (fastembed) into a complete pipeline.

## Goal

Embed a corpus of 15 short text passages with fastembed-rs and store the embeddings in Turso. Then accept a natural-language query, embed it with the same model, and return the 5 most semantically relevant passages, ranked by vector distance alone with no keyword matching.

## Setup

Start a new project or extend `vec-demo`. Add these dependencies to `Cargo.toml`:
```toml
[dependencies]
libsql = "0.9"
fastembed = "4"
tokio = { version = "1", features = ["full"] }
```
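
The table schema uses `F32_BLOB(384)`, matching the 384-dimensional output of BGE-Small-EN-v1.5:
```sql
CREATE TABLE IF NOT EXISTS docs (
    id INTEGER PRIMARY KEY,
    passage TEXT NOT NULL,
    embedding F32_BLOB(384) NOT NULL
);

-- Assumption: vector_top_k (Step 3) searches a vector index, so one is created here.
-- The index name docs_idx is illustrative.
CREATE INDEX IF NOT EXISTS docs_idx ON docs (libsql_vector_idx(embedding));
```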

## Corpus to use (15 passages across 3 topics)

**Rust programming (5):**
- "Rust uses an ownership system to guarantee memory safety without a garbage collector."
- "The borrow checker enforces that references do not outlive the data they point to."
- "Cargo is Rust's build system and package manager, used to manage dependencies and run tests."
- "Rust's trait system enables zero-cost abstractions and compile-time polymorphism."
- "Async Rust uses futures and the tokio runtime to handle concurrent I/O efficiently."

**Astronomy (5):**
- "A black hole is a region of spacetime where gravity is so strong that nothing can escape."
- "The Milky Way galaxy contains an estimated 100 to 400 billion stars."
- "Neutron stars are the collapsed cores of massive stars, with densities exceeding that of atomic nuclei."
- "The cosmic microwave background is the thermal radiation left over from the early universe."
- "Exoplanets are planets outside our solar system, detected via transit photometry or radial velocity."

**Cooking (5):**
- "The Maillard reaction gives browned foods their distinctive flavour through amino acid and sugar reactions."
- "Sous vide cooking involves sealing food in vacuum bags and cooking at precise low temperatures."
- "Emulsification combines two immiscible liquids, such as oil and water, using an emulsifier like lecithin."
- "Fermentation converts sugars to acids or alcohol using microorganisms, used in bread, beer, and yogurt."
- "Knife skills — julienne, brunoise, chiffonade — determine the surface area and cooking time of vegetables."
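
## Steps to cover

**Step 1 — Embed the corpus.** Use fastembed to embed all 15 passages in a single `model.embed()` call, which returns a `Vec<Vec<f32>>` with one 384-dimensional vector per passage. Batching is more efficient than embedding the passages one at a time.

**Step 2 — Insert into Turso.** Loop over the passages and embeddings together. Format each `Vec<f32>` as a JSON array string and bind it to `vector(?)`. Use `INSERT OR IGNORE` with an explicit `id` for each passage, so that re-running the program hits the primary-key conflict and skips existing rows instead of adding duplicates.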
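
The formatting in Step 2 can be sketched with the standard library alone. This is a minimal sketch, not the reference solution: `to_vector_json` and `rows_for_insert` are hypothetical helper names, and assigning each `id` from the passage's position (starting at 1) is an assumption made so that `INSERT OR IGNORE` has a primary-key conflict to detect on re-runs.

```rust
/// Format an embedding as a JSON array string, e.g. "[0.25,-1,3.5]",
/// suitable for binding to the `?` in `vector(?)`.
/// Hypothetical helper; assumes every value is finite, since NaN and
/// infinity are not valid JSON numbers.
fn to_vector_json(embedding: &[f32]) -> String {
    let parts: Vec<String> = embedding.iter().map(|x| x.to_string()).collect();
    format!("[{}]", parts.join(","))
}

/// Pair each passage with its embedding and a stable 1-based id.
/// Returns (id, passage, embedding-as-JSON) tuples, one per row, in the
/// shape assumed for a statement such as
/// `INSERT OR IGNORE INTO docs (id, passage, embedding) VALUES (?, ?, vector(?))`.
fn rows_for_insert<'a>(
    passages: &[&'a str],
    embeddings: &[Vec<f32>],
) -> Vec<(i64, &'a str, String)> {
    // zip would silently drop extras, so check the lengths up front.
    assert_eq!(passages.len(), embeddings.len(), "one embedding per passage");
    passages
        .iter()
        .zip(embeddings)
        .enumerate()
        .map(|(i, (passage, embedding))| (i as i64 + 1, *passage, to_vector_json(embedding)))
        .collect()
}
```

Each tuple maps onto the three bound parameters of one insert. Keeping the ids tied to corpus order also makes it easy to tell later which topic a returned row belongs to.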
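
**Step 3 — Embed the query and search.** Embed the query string with the same model (`model.embed(vec![query], None)?`) and take the single vector it returns. Run `vector_top_k` with k=5 against the vector index on `docs.embedding`, then join the matches back to `docs` to fetch each passage's text and its cosine distance to the query.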
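
To sanity-check the distances that come back from Step 3, the same metric can be computed in plain Rust. The sketch below assumes the reported cosine distance is defined as 1 minus cosine similarity (0 for vectors pointing the same way, 1 for orthogonal ones). `cosine_distance` is a hypothetical helper, not part of libsql or fastembed.

```rust
/// Cosine distance between two equal-length vectors: 1 - cos(angle).
/// Hypothetical helper for checking search results by hand.
/// Returns None if the lengths differ or either vector has zero norm.
fn cosine_distance(a: &[f32], b: &[f32]) -> Option<f32> {
    if a.len() != b.len() {
        return None;
    }
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let norm_a = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let norm_b = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    if norm_a == 0.0 || norm_b == 0.0 {
        return None;
    }
    Some(1.0 - dot / (norm_a * norm_b))
}
```

Under that assumption, passages from the query's own topic should show noticeably smaller distances than passages from the other two topics.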

**Step 4 — Run three queries and verify results.** For each query, check that the expected topic cluster dominates the top 5:
- `"memory safety in systems programming"` → Rust passages
- `"stars and galaxies"` → astronomy passages
- `"fermentation and cooking techniques"` → cooking passages

Print each result set ranked by ascending distance, showing the distance alongside the passage text.
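
One way to make the Step 4 check mechanical is to map each returned `id` back to its topic and count the hits. This sketch assumes ids 1 to 15 were assigned in the order the corpus is listed (1-5 Rust, 6-10 astronomy, 11-15 cooking). `Topic`, `topic_of`, and `dominant_topic` are hypothetical names introduced for illustration.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Topic {
    Rust,
    Astronomy,
    Cooking,
}

/// Map a passage id to its topic, assuming ids follow corpus order.
fn topic_of(id: i64) -> Option<Topic> {
    match id {
        1..=5 => Some(Topic::Rust),
        6..=10 => Some(Topic::Astronomy),
        11..=15 => Some(Topic::Cooking),
        _ => None,
    }
}

/// Return the topic that appears most often among the result ids,
/// together with its hit count. Unknown ids are ignored, and ties go to
/// the topic listed first in the corpus.
fn dominant_topic(result_ids: &[i64]) -> Option<(Topic, usize)> {
    let topics = [Topic::Rust, Topic::Astronomy, Topic::Cooking];
    let mut best: Option<(Topic, usize)> = None;
    for topic in topics {
        let hits = result_ids
            .iter()
            .filter(|&&id| topic_of(id) == Some(topic))
            .count();
        // Strictly greater keeps the earlier topic on a tie.
        if hits > 0 && best.map_or(true, |(_, n)| hits > n) {
            best = Some((topic, hits));
        }
    }
    best
}
```

For the query `"stars and galaxies"`, the exercise would then pass if `dominant_topic` on the five returned ids yields `Topic::Astronomy`. How many of the five must match is a judgment call left to the exercise text.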

## Reference solution

Provide a full, self-contained `main.rs` inside a `<details>` block. It creates the table, embeds and inserts all 15 passages, runs the three queries, and prints the results.