**Goal:** Build a retrieval-augmented generation (RAG) pipeline that:
1. Stores the 15-passage corpus from §10 in Turso
2. Accepts a natural-language question
3. Retrieves the top-3 most relevant passages using vector KNN
4. Injects the passages into a prompt as context
5. Sends the prompt to an OpenAI-compatible LLM API
6. Prints the grounded answer
**Setup:**
```toml
[dependencies]
libsql = "0.9"
fastembed = "4"
reqwest = { version = "0.12", features = ["json"] }
serde = { version = "1", features = ["derive"] }
serde_json = "1"
tokio = { version = "1", features = ["full"] }
```
You will need an API key stored in the `OPENAI_API_KEY` environment variable. This exercise works with any OpenAI-compatible provider — OpenAI itself, Groq, Together AI, or a local Ollama instance (base URL `http://localhost:11434/v1`, model `llama3.2`). Adjust the base URL and model name accordingly if you are not using OpenAI.
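One way to keep the provider swappable is to gather the base URL, model name, and key into a small config value. The sketch below is a minimal illustration, not part of the original exercise: `LlmConfig`, `LLM_BASE_URL`, and `LLM_MODEL` are hypothetical names, and the `gpt-4o-mini` default is an assumption. It relies only on the fact that OpenAI-compatible servers expose chat completions under `{base}/chat/completions`.

```rust
/// Connection settings for an OpenAI-compatible chat API.
/// `LlmConfig` is a hypothetical helper introduced for this sketch.
#[derive(Debug, PartialEq)]
struct LlmConfig {
    base_url: String,
    model: String,
    api_key: String,
}

impl LlmConfig {
    /// Build a config from a key lookup (normally `std::env::var`).
    /// `LLM_BASE_URL` and `LLM_MODEL` are assumed variable names, not a standard.
    fn from_lookup(get: impl Fn(&str) -> Option<String>) -> Result<Self, String> {
        let api_key = get("OPENAI_API_KEY")
            .ok_or_else(|| "OPENAI_API_KEY is not set".to_string())?;
        Ok(Self {
            base_url: get("LLM_BASE_URL")
                .unwrap_or_else(|| "https://api.openai.com/v1".to_string()),
            // Assumed default model; change it to match your provider.
            model: get("LLM_MODEL").unwrap_or_else(|| "gpt-4o-mini".to_string()),
            api_key,
        })
    }

    /// Read the same keys from the process environment.
    fn from_env() -> Result<Self, String> {
        Self::from_lookup(|key| std::env::var(key).ok())
    }

    /// Chat completions endpoint for this provider.
    fn chat_url(&self) -> String {
        format!("{}/chat/completions", self.base_url.trim_end_matches('/'))
    }
}
```

With `LLM_BASE_URL=http://localhost:11434/v1` and `LLM_MODEL=llama3.2`, `LlmConfig::from_env()?.chat_url()` would point at the local Ollama instance mentioned above.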
#### Step 1 — Retrieval function
Reuse the semantic search logic from §10: write an async function `retrieve(&conn, &model, question, k)` that embeds the query, runs a KNN search, and returns the top-k passage texts.
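As a hedged illustration of what the KNN step computes, the sketch below ranks passages by cosine similarity in memory. It is a brute-force stand-in, not the Turso query from §10: it assumes the query and passage embeddings have already been produced, and `cosine_similarity` and `top_k` are names invented for this example.

```rust
/// Cosine similarity between two vectors; 0.0 if either vector has zero norm.
fn cosine_similarity(a: &[f32], b: &[f32]) -> f32 {
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let norm_a = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let norm_b = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    if norm_a == 0.0 || norm_b == 0.0 {
        0.0
    } else {
        dot / (norm_a * norm_b)
    }
}

/// Return the texts of the `k` passages whose embeddings are most similar to `query`.
/// `corpus` pairs each passage text with its precomputed embedding.
fn top_k<'a>(query: &[f32], corpus: &'a [(String, Vec<f32>)], k: usize) -> Vec<&'a str> {
    let mut scored: Vec<(f32, &'a str)> = corpus
        .iter()
        .map(|(text, emb)| (cosine_similarity(query, emb), text.as_str()))
        .collect();
    // Highest similarity first.
    scored.sort_by(|a, b| b.0.total_cmp(&a.0));
    scored.into_iter().take(k).map(|(_, text)| text).collect()
}
```

A vector index in the database replaces this linear scan with an approximate search, but the contract is the same: an embedded query goes in, and the k closest passage texts come out.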
Set up the database and corpus exactly as in §10, then run three example questions that exercise each topic cluster:
```rust
// One question per topic cluster in the §10 corpus.
let questions = vec![
    "How does Rust ensure memory safety?",
    "What is a black hole?",
    "What is the Maillard reaction?",
];

let client = reqwest::Client::new();
let api_key = std::env::var("OPENAI_API_KEY")?;

for question in &questions {
    println!("=== Question: \"{question}\" ===\n");

    // Retrieve the top-3 passages for this question.
    let passages = retrieve(&conn, &model, question, 3).await?;
    println!("Retrieved passages:");
    for (i, p) in passages.iter().enumerate() {
        println!(" {}: {p}", i + 1);
    }
    println!();

    // Inject the passages into the prompt and ask the LLM.
    let prompt = build_prompt(&passages, question);
    let answer = call_llm(&client, &api_key, &prompt).await?;
    println!("Answer: {answer}\n");
}
```
Each question should pull passages from the matching cluster — Rust passages for the first, astronomy for the second, and cooking for the third. The LLM's answer will be grounded in those passages rather than relying on its own parametric knowledge.
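Grounding comes from how the prompt is assembled. The driver above calls `build_prompt(&passages, question)`; one plausible shape for it is sketched below. The instruction wording and the numbered `[n]` layout are assumptions made for illustration, not the exercise's reference solution.

```rust
/// Assemble a grounded prompt: instructions, numbered context passages, then the question.
/// The exact wording is an assumption; adjust it to suit your model.
fn build_prompt(passages: &[String], question: &str) -> String {
    let mut prompt = String::from(
        "Answer the question using only the context below. \
         If the context is insufficient, say you do not know.\n\nContext:\n",
    );
    // Number the passages so the model (and the reader) can refer to them.
    for (i, passage) in passages.iter().enumerate() {
        prompt.push_str(&format!("[{}] {}\n", i + 1, passage));
    }
    prompt.push_str(&format!("\nQuestion: {question}\nAnswer:"));
    prompt
}
```

Telling the model to admit when the context is insufficient is a common way to discourage it from falling back on parametric knowledge.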
#### Step 5 — Discussion: RAG patterns
**Chunk size and overlap.** The 15-passage corpus used here is already conveniently pre-chunked into single sentences, but real documents are rarely so tidy. In practice, long documents are split into overlapping chunks — typically 200–500 tokens with a 50–100 token overlap between consecutive chunks. The overlap ensures that sentences near a chunk boundary are not orphaned from their surrounding context, which would hurt retrieval quality. Choosing the right chunk size is a trade-off: smaller chunks yield more precise retrieval but lose broader context, while larger chunks retain context at the cost of noisier matches.
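The sliding-window idea can be sketched in a few lines. This is a minimal illustration under a simplifying assumption: whitespace-separated words stand in for tokens, whereas a real pipeline would count tokens with the model's tokenizer. `chunk_words` is a name made up for this example.

```rust
/// Split `text` into chunks of at most `size` words, where each chunk repeats the
/// last `overlap` words of the previous one. Words are a stand-in for tokens.
fn chunk_words(text: &str, size: usize, overlap: usize) -> Vec<String> {
    assert!(size > 0 && overlap < size, "overlap must be smaller than size");
    let words: Vec<&str> = text.split_whitespace().collect();
    let step = size - overlap; // how far the window advances each time
    let mut chunks = Vec::new();
    let mut start = 0;
    while start < words.len() {
        let end = (start + size).min(words.len());
        chunks.push(words[start..end].join(" "));
        if end == words.len() {
            break; // the final window reached the end of the text
        }
        start += step;
    }
    chunks
}
```

For example, `chunk_words("a b c d e f g h", 4, 1)` yields `"a b c d"`, `"d e f g"`, and `"g h"`: the boundary words `d` and `g` appear in two chunks each, so neither is cut off from its neighbors.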
**Re-ranking.** The ANN index returns approximate nearest neighbors quickly, but the ranking is based on a single embedding similarity score. A cross-encoder re-ranker — a model that takes (query, passage) pairs as input and produces a relevance score — can re-order the top-k candidates for significantly better precision. The typical pattern is to retrieve a larger set (e.g., top-20) with ANN and then re-rank to the final top-3 or top-5 with the cross-encoder.
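The retrieve-then-re-rank pattern is easy to express once the scorer is treated as a pluggable function. The sketch below is an illustration only: `rerank` is a hypothetical helper, and `word_overlap_score` is a toy stand-in for a cross-encoder, used here so the example runs without a model.

```rust
/// Re-order `candidates` by `score(query, passage)`, highest first, and keep `top_n`.
/// In a real pipeline `score` would call a cross-encoder model.
fn rerank<'a, F>(query: &str, candidates: &[&'a str], top_n: usize, score: F) -> Vec<&'a str>
where
    F: Fn(&str, &str) -> f32,
{
    let mut scored: Vec<(f32, &'a str)> = candidates
        .iter()
        .map(|&passage| (score(query, passage), passage))
        .collect();
    // Stable sort: ties keep their original (ANN) order.
    scored.sort_by(|a, b| b.0.total_cmp(&a.0));
    scored.into_iter().take(top_n).map(|(_, p)| p).collect()
}

/// Toy scorer (NOT a cross-encoder): fraction of query words found in the passage.
fn word_overlap_score(query: &str, passage: &str) -> f32 {
    let passage = passage.to_lowercase();
    let words: Vec<String> = query.split_whitespace().map(|w| w.to_lowercase()).collect();
    if words.is_empty() {
        return 0.0;
    }
    let hits = words.iter().filter(|w| passage.contains(w.as_str())).count();
    hits as f32 / words.len() as f32
}
```

In use, the candidate list would be the larger ANN result set (for example the top-20), and `top_n` the final 3 or 5 passages that go into the prompt.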
**Hybrid search.** Semantic (ANN) search excels at matching meaning but can miss exact keywords, while keyword-based search (BM25) is great at exact term matching but blind to synonyms. Combining both, often called hybrid search, frequently outperforms either approach alone. A common fusion strategy is Reciprocal Rank Fusion (RRF), which merges the two ranked lists by summing, for each result, the reciprocal of its rank in each list. In practice a smoothing constant k (commonly 60) is added to the rank, so each list contributes 1/(k + rank) to a result's score.
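RRF is small enough to sketch directly. In the usual formulation each list contributes 1/(k + rank) to a document's score, with ranks starting at 1 and k a smoothing constant (60 is a common choice). The function name `rrf` and the use of string ids are assumptions for this example.

```rust
use std::collections::HashMap;

/// Fuse several ranked lists of document ids with Reciprocal Rank Fusion.
/// Each list adds 1 / (k + rank) to a document's score; ranks start at 1.
/// Returns (id, score) pairs sorted by score, highest first.
fn rrf(rankings: &[Vec<&str>], k: f64) -> Vec<(String, f64)> {
    let mut scores: HashMap<&str, f64> = HashMap::new();
    for ranking in rankings {
        for (i, id) in ranking.iter().enumerate() {
            *scores.entry(*id).or_insert(0.0) += 1.0 / (k + (i + 1) as f64);
        }
    }
    let mut fused: Vec<(String, f64)> = scores
        .into_iter()
        .map(|(id, score)| (id.to_string(), score))
        .collect();
    // Sort by score descending; break ties by id so the output is deterministic.
    fused.sort_by(|a, b| b.1.total_cmp(&a.1).then_with(|| a.0.cmp(&b.0)));
    fused
}
```

Because RRF uses only ranks, it sidesteps the problem that cosine similarities and BM25 scores live on different scales and cannot be added directly.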
**Context window limits.** The number of passages you can inject depends on the model's context length and the average passage length. GPT-4o-mini supports 128k tokens, but stuffing the entire context window with retrieved passages introduces noise and increases latency and cost. A good heuristic is to inject only enough passages to cover the question — typically 3 to 5 short passages or 1 to 2 longer chunks — and to place the most relevant passages first.
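A token budget can enforce this heuristic mechanically. The sketch below is illustrative only: `estimate_tokens` uses a rough four-characters-per-token approximation for English text rather than a real tokenizer, and `fit_to_budget` is a hypothetical helper name.

```rust
/// Rough token estimate: about four characters per token for English text.
/// This is an approximation; use the model's tokenizer for exact counts.
fn estimate_tokens(text: &str) -> usize {
    (text.chars().count() + 3) / 4
}

/// Keep passages in ranked order (most relevant first) until the next one
/// would push the estimated total over `budget` tokens.
fn fit_to_budget<'a>(ranked: &[&'a str], budget: usize) -> Vec<&'a str> {
    let mut used = 0;
    let mut kept = Vec::new();
    for &passage in ranked {
        let cost = estimate_tokens(passage);
        if used + cost > budget {
            break; // stop at the first passage that does not fit
        }
        used += cost;
        kept.push(passage);
    }
    kept
}
```

Stopping at the first passage that does not fit, rather than skipping it and trying shorter ones, keeps the injected context a strict prefix of the ranking, so the most relevant passages always come first.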