title: §10 Exercise 3: Semantic Document Search
status: completed
type: task
priority: normal
created_at: 2026-03-10T23:30:00Z
updated_at: 2026-03-10T23:30:00Z

§10 Exercise 3 — Semantic Document Search — Stub to fill

File: edu/src/vector-db.md, section ### 10. Exercise 3 — Semantic Document Search

Replace this stub line with the full exercise:

Goal: Build a complete semantic search pipeline [...] 🚧 Full content tracked in [nbd:1ef9f4].

Follow the exercise format from edu/src/markov.md. This is the first exercise using real embeddings — it combines §6 (Turso setup), §8 (KNN search), and §9 (fastembed) into a complete pipeline.

Goal

Embed a corpus of 15 short text passages with fastembed-rs, store the embeddings in Turso, then accept a natural-language query, embed it, and return the top-5 most semantically relevant passages — with no keyword matching.

Setup

Start a new project or extend vec-demo. Add these dependencies to Cargo.toml:

[dependencies]
libsql = "0.9"
fastembed = "4"
tokio = { version = "1", features = ["full"] }

Table schema uses F32_BLOB(384) (BGE-Small-EN-v1.5 output dimension):

CREATE TABLE IF NOT EXISTS docs (
    id        INTEGER PRIMARY KEY,
    passage   TEXT NOT NULL,
    embedding F32_BLOB(384) NOT NULL
)
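
One way to keep the column width and the model dimension from drifting apart is to build the DDL from a single constant. The sketch below is only a suggestion for how the solution might organise its SQL: `EMBEDDING_DIM`, `INDEX_NAME`, `create_table_sql`, and `create_index_sql` are hypothetical names, and the index name `docs_idx` is an assumption. The index statement uses libSQL's `libsql_vector_idx`; skip it if the table from §8 already has a vector index.

```rust
/// Output dimension of BGE-Small-EN-v1.5, as stated in the schema note.
const EMBEDDING_DIM: usize = 384;

/// Hypothetical index name (an assumption, not from the ticket).
const INDEX_NAME: &str = "docs_idx";

/// Builds the CREATE TABLE statement so the F32_BLOB width follows EMBEDDING_DIM.
fn create_table_sql() -> String {
    format!(
        "CREATE TABLE IF NOT EXISTS docs (id INTEGER PRIMARY KEY, passage TEXT NOT NULL, embedding F32_BLOB({}) NOT NULL)",
        EMBEDDING_DIM
    )
}

/// Builds the vector index statement that vector_top_k needs.
fn create_index_sql() -> String {
    format!(
        "CREATE INDEX IF NOT EXISTS {} ON docs (libsql_vector_idx(embedding))",
        INDEX_NAME
    )
}

fn main() {
    // Print both statements; in the exercise they would be passed to the database connection.
    println!("{};", create_table_sql());
    println!("{};", create_index_sql());
}
```

Deriving both statements from constants means a later switch to a model with a different output dimension touches one line.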

Corpus to use (15 passages across 3 topics)

Rust programming (5):

  • "Rust uses an ownership system to guarantee memory safety without a garbage collector."
  • "The borrow checker enforces that references do not outlive the data they point to."
  • "Cargo is Rust's build system and package manager, used to manage dependencies and run tests."
  • "Rust's trait system enables zero-cost abstractions and compile-time polymorphism."
  • "Async Rust uses futures and the tokio runtime to handle concurrent I/O efficiently."

Astronomy (5):

  • "A black hole is a region of spacetime where gravity is so strong that nothing can escape."
  • "The Milky Way galaxy contains an estimated 100 to 400 billion stars."
  • "Neutron stars are the collapsed cores of massive stars, with densities exceeding atomic nuclei."
  • "The cosmic microwave background is the thermal radiation left over from the early universe."
  • "Exoplanets are planets outside our solar system, detected via transit photometry or radial velocity."

Cooking (5):

  • "Maillard reaction gives browned foods their distinctive flavour through amino acid and sugar reactions."
  • "Sous vide cooking involves sealing food in vacuum bags and cooking at precise low temperatures."
  • "Emulsification combines two immiscible liquids, such as oil and water, using an emulsifier like lecithin."
  • "Fermentation converts sugars to acids or alcohol using microorganisms, used in bread, beer, and yogurt."
  • "Knife skills — julienne, brunoise, chiffonade — determine the surface area and cooking time of vegetables."

Steps to cover

Step 1 — Embed the corpus. Use fastembed to produce a Vec<Vec<f32>> for all 15 passages in one model.embed() call (batch is more efficient than one-at-a-time).

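Step 2 — Insert into Turso. Loop over passages and embeddings together. Format each Vec<f32> as a JSON array string and bind it to vector(?). Use INSERT OR IGNORE with an explicit id per passage so that re-running is idempotent; with an auto-assigned id there is no conflict to ignore, and each run would insert duplicate rows.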
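
The formatting part of Step 2 can be sketched with the standard library alone. The helper below is a hypothetical illustration, not the reference solution: `to_vector_literal`, its `expected_dim` parameter, and `INSERT_SQL` are names introduced here. It also rejects vectors of the wrong length or with non-finite values, which would otherwise produce text that is not a valid JSON array.

```rust
/// Insert statement with an explicit id, so INSERT OR IGNORE has a key to conflict on.
const INSERT_SQL: &str =
    "INSERT OR IGNORE INTO docs (id, passage, embedding) VALUES (?1, ?2, vector(?3))";

/// Formats an embedding as a JSON array string such as "[0.5,-1,2]".
/// Returns an error if the length is unexpected or a value is NaN or infinite.
fn to_vector_literal(embedding: &[f32], expected_dim: usize) -> Result<String, String> {
    if embedding.len() != expected_dim {
        return Err(format!(
            "expected {} dimensions, got {}",
            expected_dim,
            embedding.len()
        ));
    }
    if let Some(pos) = embedding.iter().position(|x| !x.is_finite()) {
        return Err(format!("non-finite value at index {}", pos));
    }
    // f32's Display never uses exponent notation, so each part is a plain JSON number.
    let parts: Vec<String> = embedding.iter().map(|x| x.to_string()).collect();
    Ok(format!("[{}]", parts.join(",")))
}

fn main() {
    // Toy 3-dimensional vector for illustration; real embeddings here have 384 values.
    match to_vector_literal(&[0.5, -1.0, 2.0], 3) {
        Ok(json) => println!("{}\n  ?3 = {}", INSERT_SQL, json),
        Err(e) => eprintln!("skipping row: {}", e),
    }
}
```

In the exercise, `?1` would be the passage's position in the corpus (1 to 15), `?2` the passage text, and `?3` the string returned by the helper with `expected_dim` set to 384.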

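Step 3 — Embed the query and search. Embed a single query string with the same model (model.embed(vec![query], None)?) and take the first vector of the result. Then run vector_top_k with k=5 and join back to docs to get the passage text and cosine distance. vector_top_k searches a vector index rather than the table itself, so docs.embedding needs an index created with libsql_vector_idx, and the distance comes from vector_distance_cos because vector_top_k returns only row ids.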
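
The query in Step 3 might look like the string built below. This is a sketch under assumptions: the index name `docs_idx` and the helper `knn_query_sql` are hypothetical, and the SQL follows the form in libSQL's vector search documentation, where `vector_top_k(index_name, vector, k)` yields matching rowids in a column named `id`. Check it against §8 before relying on it.

```rust
/// Hypothetical index name; it must match the vector index on docs.embedding.
const INDEX_NAME: &str = "docs_idx";

/// Builds the top-k query. ?1 is the query embedding as a JSON array string,
/// bound once and used both for the index lookup and the distance column.
fn knn_query_sql(k: usize) -> String {
    format!(
        "SELECT d.id, d.passage, vector_distance_cos(d.embedding, vector(?1)) AS distance \
         FROM vector_top_k('{}', vector(?1), {}) AS v \
         JOIN docs AS d ON d.rowid = v.id \
         ORDER BY distance",
        INDEX_NAME, k
    )
}

fn main() {
    // Print the statement for k = 5, the value the exercise asks for.
    println!("{}", knn_query_sql(5));
}
```

The explicit ORDER BY keeps the printed ranking tied to the distance column instead of depending on the order in which the index returns rows.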

Step 4 — Run three queries and verify results. For each query, check that the expected topic cluster surfaces:

  • "memory safety in systems programming" → Rust passages
  • "stars and galaxies" → astronomy passages
  • "fermentation and cooking techniques" → cooking passages
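
The check above can be automated if ids follow corpus order. The sketch below assumes ids 1 to 5 are the Rust passages, 6 to 10 astronomy, and 11 to 15 cooking; that assignment, the names `Topic`, `topic_of`, `hits_for`, and `cluster_surfaces`, and the pass criterion (top hit plus a majority of results on topic) are all assumptions made for illustration.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Topic {
    Rust,
    Astronomy,
    Cooking,
}

/// Maps a row id to its topic, assuming ids 1..=15 were assigned in corpus order.
fn topic_of(id: i64) -> Option<Topic> {
    match id {
        1..=5 => Some(Topic::Rust),
        6..=10 => Some(Topic::Astronomy),
        11..=15 => Some(Topic::Cooking),
        _ => None,
    }
}

/// Counts how many of the returned ids belong to the expected topic.
fn hits_for(expected: Topic, ids: &[i64]) -> usize {
    ids.iter()
        .filter(|&&id| topic_of(id) == Some(expected))
        .count()
}

/// True when the closest result and a strict majority of all results are on topic.
fn cluster_surfaces(expected: Topic, ids: &[i64]) -> bool {
    let top_ok = ids
        .first()
        .map_or(false, |&id| topic_of(id) == Some(expected));
    top_ok && hits_for(expected, ids) * 2 > ids.len()
}

fn main() {
    // Made-up ids in ranked order, purely to exercise the check; not real query output.
    let ranked_ids = [1, 2, 4, 7, 3];
    println!(
        "on-topic hits: {}, passes: {}",
        hits_for(Topic::Rust, &ranked_ids),
        cluster_surfaces(Topic::Rust, &ranked_ids)
    );
}
```

A majority threshold tolerates one or two off-topic results in the top 5, which seems reasonable for a small corpus; a stricter exercise could require all five.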

Print results ranked by distance with the passage text.
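
As an index-free cross-check on the ranking, cosine distance can be computed directly in Rust. This is a brute-force scan over in-memory vectors, not the indexed vector_top_k search the exercise asks for; the names `cosine_distance` and `rank` and the toy 2-dimensional vectors are hypothetical. It uses the usual definition, 1 minus cosine similarity, so 0 means the same direction.

```rust
/// Cosine distance: 1 - (a . b) / (|a| * |b|).
/// Returns NaN if either vector has zero length; embeddings are not expected to.
fn cosine_distance(a: &[f32], b: &[f32]) -> f32 {
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let norm_a = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let norm_b = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    1.0 - dot / (norm_a * norm_b)
}

/// Scores every passage against the query and keeps the k closest, nearest first.
fn rank<'a>(query: &[f32], docs: &'a [(&'a str, Vec<f32>)], k: usize) -> Vec<(f32, &'a str)> {
    let mut scored: Vec<(f32, &str)> = docs
        .iter()
        .map(|(passage, embedding)| (cosine_distance(query, embedding), *passage))
        .collect();
    scored.sort_by(|x, y| x.0.total_cmp(&y.0));
    scored.truncate(k);
    scored
}

fn main() {
    // Toy 2-dimensional vectors stand in for real 384-dimensional embeddings.
    let docs = vec![
        ("passage a", vec![0.0_f32, 1.0]),
        ("passage b", vec![1.0, 0.0]),
        ("passage c", vec![1.0, 1.0]),
    ];
    // Print rank, distance, and passage text, closest first.
    for (i, (distance, passage)) in rank(&[1.0, 0.0], &docs, 2).iter().enumerate() {
        println!("{:>2}. {:.4}  {}", i + 1, distance, passage);
    }
}
```

With 15 passages a full scan is cheap, so comparing its order against the database results is a quick way to spot a formatting or binding mistake in the insert step.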

Reference solution

Full self-contained main.rs inside <details>: creates table, embeds and inserts all 15 passages, runs three queries, prints results.