+++
title = "§4 What Is a Vector Database?"
priority = 5
status = "done"
ticket_type = "task"
dependencies = []
+++

## §4 What Is a Vector Database? — Stub to fill

File: `edu/src/vector-db.md`, section `### 4. What Is a Vector Database?`

Replace this stub line with full content:

> A vector database is a data store built around one core operation [...] 🚧 Full content tracked in [nbd:d9f850].

This is a **reading lesson** with no Rust code. Target 500–700 words. Open each paragraph with a bold lead phrase.

## Learning objectives

- Understand the core operation: approximate nearest-neighbour (ANN) search
- Know the primary use cases that motivate vector databases
- Understand how vector databases differ from relational DBs and full-text search
- Know the key performance metrics: recall@k, QPS, index build time, memory

## Content to write

**The core operation.** Given a query vector q and n stored vectors of dimension d, find the k stored vectors most similar to q. Exact k-nearest-neighbour (KNN) search costs O(n·d) per query: at n = 1M and d = 768 that is about 768M multiply-add operations, and roughly 3 GB of `f32` data scanned, for every single query, which is too slow for interactive use. Vector databases use ANN algorithms (see §5) that trade a small loss in accuracy for orders-of-magnitude gains in speed.
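
As a reference point for whoever drafts this paragraph (the lesson text itself stays code-free), the exact search that ANN replaces can be sketched in a few lines of standard-library Rust. The function names `squared_l2` and `exact_knn` are illustrative, not part of any library, and squared Euclidean distance is assumed as the similarity measure. The scan touches every component of every stored vector, which is where the O(n·d) cost comes from.

```rust
/// Squared Euclidean distance between two equal-length vectors: O(d) work.
fn squared_l2(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| (x - y) * (x - y)).sum()
}

/// Exact k-nearest-neighbour search by scanning every stored vector.
/// Returns (index, squared distance) pairs, closest first.
/// Cost: n distance computations of O(d) each, plus an O(n log n) sort.
fn exact_knn(query: &[f32], stored: &[Vec<f32>], k: usize) -> Vec<(usize, f32)> {
    let mut scored: Vec<(usize, f32)> = stored
        .iter()
        .enumerate()
        .map(|(i, v)| (i, squared_l2(query, v)))
        .collect();
    // total_cmp gives a total order on f32, so the sort cannot panic on NaN.
    scored.sort_by(|a, b| a.1.total_cmp(&b.1));
    scored.truncate(k);
    scored
}
```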

**Use cases.** Describe each in one concrete sentence:

- Semantic search: find documents matching the *meaning* of a query, not just the words
- Recommendation: given an item, return the k most similar items (§11) or surface content preferred by similar users
- Retrieval-Augmented Generation (RAG): retrieve relevant passages before prompting an LLM, so the answer is grounded in facts (§12)
- Duplicate/near-duplicate detection: find items semantically identical or very close to a given item
- Anomaly detection: items far from all stored vectors are likely anomalous
- Multi-modal search: find images matching a text description, using CLIP-style joint embeddings
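
Two of the use cases above, near-duplicate detection and anomaly detection, reduce to the same question: how similar is the closest stored vector? The sketch below is an author aid rather than lesson content. It assumes cosine similarity, and the names `classify`, `dup_threshold` and `anomaly_threshold` are hypothetical; real thresholds would have to be tuned for the embedding model in use.

```rust
/// Cosine similarity of two equal-length, non-zero vectors.
fn cosine_similarity(a: &[f32], b: &[f32]) -> f32 {
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let norm_a = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let norm_b = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    dot / (norm_a * norm_b)
}

#[derive(Debug, PartialEq)]
enum Verdict {
    NearDuplicate,
    Ordinary,
    Anomaly,
}

/// Classify an item by its best similarity to any stored vector.
/// Both thresholds are hypothetical tuning parameters.
/// An empty store yields `Anomaly`, since nothing is close to the item.
fn classify(
    item: &[f32],
    stored: &[Vec<f32>],
    dup_threshold: f32,
    anomaly_threshold: f32,
) -> Verdict {
    let best = stored
        .iter()
        .map(|v| cosine_similarity(item, v))
        .fold(f32::NEG_INFINITY, f32::max);
    if best >= dup_threshold {
        Verdict::NearDuplicate
    } else if best < anomaly_threshold {
        Verdict::Anomaly
    } else {
        Verdict::Ordinary
    }
}
```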

**vs. relational databases.** SQL `WHERE` clauses perform exact matches and range queries on scalar values; standard SQL has no built-in notion of "nearest" for arrays of floats. Extensions such as pgvector (PostgreSQL) and sqlite-vec (SQLite / Turso) add vector search to existing databases. This course uses sqlite-vec via the `libsql` crate.

**vs. full-text search (BM25/TF-IDF).** Keyword search cannot handle synonymy (car ≠ automobile without explicit expansion) or concept-level similarity; vector search captures both. Hybrid search, which combines BM25 and ANN scores, is a common production pattern that often outperforms either method alone.
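
One simple way to combine the two score types is a weighted sum after rescaling each to a common range. The sketch below is a reference for the author, not lesson content, and it shows only one of several possible fusion schemes. It assumes both score lists cover the same candidate documents in the same order and that higher is better for both; `hybrid_rank` and the weight `alpha` are hypothetical names.

```rust
/// Rescale scores to [0, 1]; if all scores are equal, map them all to 0.0.
fn min_max_normalize(scores: &[f32]) -> Vec<f32> {
    let min = scores.iter().copied().fold(f32::INFINITY, f32::min);
    let max = scores.iter().copied().fold(f32::NEG_INFINITY, f32::max);
    let range = max - min;
    scores
        .iter()
        .map(|s| if range > 0.0 { (s - min) / range } else { 0.0 })
        .collect()
}

/// Blend keyword and vector scores for the same candidate documents.
/// `alpha` is the weight on the vector score (0.0 = keyword only, 1.0 = vector only).
/// Returns candidate indices ordered best first.
fn hybrid_rank(bm25: &[f32], vector_sim: &[f32], alpha: f32) -> Vec<usize> {
    let kw = min_max_normalize(bm25);
    let vs = min_max_normalize(vector_sim);
    let fused: Vec<f32> = kw
        .iter()
        .zip(&vs)
        .map(|(k, v)| (1.0 - alpha) * k + alpha * v)
        .collect();
    let mut order: Vec<usize> = (0..fused.len()).collect();
    // Sort indices by descending fused score.
    order.sort_by(|&a, &b| fused[b].total_cmp(&fused[a]));
    order
}
```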

**Key metrics.**

- Recall@k: the fraction of the true k nearest neighbours that the ANN algorithm returns. A recall@10 of 0.95 means that, on average, 9.5 of the 10 true nearest neighbours appear in the returned top 10.
- QPS: queries per second the index can serve at a given recall target.
- Index build time: one-time cost paid before serving queries.
- Memory footprint: HNSW, for example, stores its graph edges in RAM; this limits how large the index can grow on a single machine.
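
For a single query, recall@k is the size of the overlap between the approximate and exact top-k result sets, divided by k. A minimal sketch of that calculation follows, again as an author aid rather than lesson content. It assumes results are identified by `u32` ids, that both lists hold at least k entries, and that k > 0; `recall_at_k` is an illustrative name.

```rust
use std::collections::HashSet;

/// recall@k for one query: |approx top-k ∩ exact top-k| / k.
/// Assumes k > 0 and that both slices contain at least k ids.
fn recall_at_k(approx_ids: &[u32], exact_ids: &[u32], k: usize) -> f32 {
    let truth: HashSet<u32> = exact_ids.iter().take(k).copied().collect();
    let hits = approx_ids
        .iter()
        .take(k)
        .filter(|id| truth.contains(*id))
        .count();
    hits as f32 / k as f32
}
```

Averaging this value over a set of test queries gives the recall figure usually quoted for an index.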

**Where sqlite-vec / Turso fits.** sqlite-vec is appropriate for embedded applications, local development, and small-to-medium corpora (up to a few million vectors). Dedicated cloud vector databases (Pinecone, Qdrant, Weaviate) handle larger scale and add features like multi-tenancy, filtering, and distributed search.