docs(edu): write §4 what is a vector database for vector-db course [d9f850]

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
4 months ago · 515bc2b6e5
parent 4ab1e85024
commit 515bc2b6e5
1 changed files with 25 additions and 1 deletions
--- a/edu/src/vector-db.md
+++ b/edu/src/vector-db.md
@ -108,7 +108,31 @@ These three numbers — dot product = 1, cosine similarity = 0.5, L2 distance
 ### 4. What Is a Vector Database?
-A vector database is a data store built around one core operation: given a query vector **q**, return the *k* stored vectors most similar to **q**. This section covers what that means in practice — approximate nearest-neighbour (ANN) search, the use cases that make vector databases essential (semantic search, recommendations, RAG), and how they differ from traditional relational or key-value databases. 🚧 Full content tracked in [nbd:d9f850].
+A vector database is a data store built around one core operation: given a query vector **q**, return the *k* stored vectors most similar to **q**. Every other feature — indexing, filtering, replication, APIs — exists to make that single operation fast, accurate, and convenient at scale. This section explains why that operation is hard, what problems it solves, and how vector databases compare to the data systems you already know.
 **The core operation.** Given a query vector **q** and *n* stored vectors, find the *k* vectors most similar to **q**. This is the *k*-nearest-neighbour (KNN) problem. Exact KNN computes the distance from **q** to every stored vector, which is O(*n* · *d*) work per query. At *n* = 1 000 000 and *d* = 768, that is 768 million multiply-add steps (about 1.5 billion floating-point operations) and, assuming 32-bit floats, a scan over roughly 3 GB of vector data for a single query. That cost is too high for interactive use under sustained query load, and it grows linearly with *n*. Vector databases address this with approximate nearest-neighbour (ANN) algorithms (covered in §5) that trade a small accuracy loss for orders-of-magnitude speed gains. An ANN index can typically answer the same query in milliseconds by examining only a small fraction of the stored vectors.
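To make the cost of exact search concrete, here is a minimal brute-force sketch in plain Python. The function names (`l2_distance`, `exact_knn`) and the toy vectors are illustrative assumptions, not part of any library used in this course; the point is only that every stored vector is visited once per query.

```python
import heapq
import math


def l2_distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def exact_knn(query, vectors, k):
    """Return the k (distance, index) pairs closest to `query`.

    Every stored vector is visited once, so the cost is O(n * d).
    """
    scored = ((l2_distance(query, v), i) for i, v in enumerate(vectors))
    return heapq.nsmallest(k, scored)


# Hypothetical toy data: four 3-dimensional vectors.
stored = [
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]
top2 = exact_knn([1.0, 0.0, 0.0], stored, k=2)
print([i for _, i in top2])
```

With these four vectors the two nearest indices are 0 and 1. An ANN index aims to return the same answer most of the time without running the loop over all *n* vectors.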
 **Use cases.** The ability to find "semantically similar" items powers a wide range of applications:
 - **Semantic search:** find documents that match the *meaning* of a query, not just its keywords — a search for "how to fix a flat tyre" retrieves results about "changing a punctured wheel" even though no words overlap.
 - **Recommendation:** given an item a user just viewed or purchased, return the *k* most similar items from the catalogue (§11), or surface content preferred by users with similar taste profiles.
 - **Retrieval-Augmented Generation (RAG):** retrieve the most relevant passages from a knowledge base before prompting a large language model, so the model's answer is grounded in real documents rather than its training data alone (§12).
 - **Duplicate and near-duplicate detection:** identify items that are semantically identical or extremely close to a given item — useful for deduplicating support tickets, detecting plagiarism, or clustering similar product listings.
 - **Anomaly detection:** items whose vectors are far from all stored vectors are likely anomalous, enabling outlier detection without hand-crafted rules.
 - **Multi-modal search:** find images matching a text description, or vice versa, by storing CLIP-style joint embeddings where text and image vectors share the same space.
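As one illustration of the list above, the anomaly-detection case can be sketched as a nearest-neighbour distance check. This is a simplified example under stated assumptions: the helper names, the 2-D toy embeddings, and the threshold value are all hypothetical, and a real system would use an ANN index instead of an exact scan.

```python
import math


def nearest_distance(query, vectors):
    """Distance from `query` to its closest stored vector (exact scan)."""
    return min(math.dist(query, v) for v in vectors)


def is_anomalous(query, vectors, threshold):
    """Flag `query` when even its nearest neighbour is farther than `threshold`."""
    return nearest_distance(query, vectors) > threshold


# Hypothetical 2-D embeddings clustered near the origin.
known = [[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [0.1, 0.1]]
THRESHOLD = 0.5  # assumed value; in practice tuned on held-out data

print(is_anomalous([0.05, 0.05], known, THRESHOLD))
print(is_anomalous([3.0, 3.0], known, THRESHOLD))
```

The first query sits inside the cluster and is not flagged; the second is far from every stored vector and is. Near-duplicate detection is the same check with the comparison reversed (distance below a small threshold).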
 **vs. relational databases.** SQL `WHERE` clauses filter on scalar values with exact matches, range comparisons, and pattern matches: equality, greater-than, `IN`, `LIKE`. There is no built-in notion of "nearest" for an array of floats. Standard SQL defines no distance or similarity function over arrays of floats, so `ORDER BY similarity(embedding, ?)` is not something you can write out of the box, and ordinary B-tree indexes cannot accelerate such an ordering even if you supply the function yourself. Extensions like **pgvector** (PostgreSQL) and **sqlite-vec** (SQLite / Turso) add vector column types, distance functions, and (depending on the extension and version) ANN indexes to existing relational databases, letting you combine vector search with traditional filtering in a single query. This course uses sqlite-vec via the `libsql` crate, which means you get vector search without leaving the SQLite ecosystem you may already know.
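The gap that these extensions fill can be seen with nothing but Python's built-in `sqlite3` module. The sketch below is not sqlite-vec's API: it registers a hypothetical user-defined function, `l2_distance`, and stores embeddings as JSON text purely for illustration. The query works, but SQLite has to call the function on every row, which is the full scan that a vector extension's column types and indexes are meant to avoid.

```python
import json
import math
import sqlite3


def l2_distance(a_json, b_json):
    """Distance between two vectors stored as JSON text."""
    return math.dist(json.loads(a_json), json.loads(b_json))


conn = sqlite3.connect(":memory:")
# Register the Python function so SQL can call it as l2_distance(x, y).
conn.create_function("l2_distance", 2, l2_distance)

conn.execute("CREATE TABLE docs (id INTEGER PRIMARY KEY, title TEXT, embedding TEXT)")
conn.executemany(
    "INSERT INTO docs (id, title, embedding) VALUES (?, ?, ?)",
    [
        (1, "flat tyre repair", json.dumps([0.9, 0.1])),
        (2, "changing a wheel", json.dumps([0.8, 0.2])),
        (3, "baking bread", json.dumps([0.1, 0.9])),
    ],
)

query = json.dumps([1.0, 0.0])
# No index can help here: the function runs once per row, then rows are sorted.
rows = conn.execute(
    "SELECT title FROM docs ORDER BY l2_distance(embedding, ?) LIMIT 2",
    (query,),
).fetchall()
print([title for (title,) in rows])
```

The two tyre-related rows come back first. The toy titles and 2-D embeddings are made up; real embeddings would come from a model and have hundreds of dimensions.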
 **vs. full-text search (BM25 / TF-IDF).** Traditional keyword search scores documents by how often query terms appear, weighted by rarity across the corpus. It works well when users know the exact vocabulary of the documents they want, but it cannot handle synonymy: "car" and "automobile" are unrelated tokens unless you maintain an explicit synonym list. It also has no concept of sentence-level meaning. Vector search captures both synonymy and broader conceptual similarity because the embedding model learns those relationships from data. In practice, **hybrid search**, which combines a BM25 keyword score with an ANN vector score, often outperforms either method alone and is a common pattern in production systems.
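One simple way to combine the two result lists is reciprocal rank fusion (RRF). Note that RRF merges *ranks* instead of raw scores, so it is a stand-in for the score combination described above, not necessarily the method any particular engine uses. The function name, the document ids, and the constant `k=60` are assumptions made for this sketch.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document ids into one ranking.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1. k=60 is a conventional default, assumed here.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken by doc id for determinism.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical results from a keyword index and a vector index.
bm25_ranking = ["doc_a", "doc_b", "doc_c"]
vector_ranking = ["doc_c", "doc_a", "doc_d"]

fused = reciprocal_rank_fusion([bm25_ranking, vector_ranking])
print(fused)
```

Documents that appear near the top of both lists (`doc_a`, `doc_c`) rise above those found by only one method. Working on ranks sidesteps the problem that BM25 scores and vector distances live on different scales.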
 **Key metrics.** When evaluating a vector database or an ANN index, four numbers matter:
 - **Recall@k:** the fraction of the true *k* nearest neighbours that the ANN algorithm actually returns, usually averaged over a set of test queries. A recall@10 of 0.95 means that, on average, 95% of the true top-10 results are found; the remaining 5% are replaced by slightly less similar vectors.
 - **QPS (queries per second):** how many queries the index can serve per second at a given recall target. Higher is better; this is the throughput you care about in production.
 - **Index build time:** the one-time cost paid to construct the search index from raw vectors. HNSW indexes, for example, require inserting each vector into a multi-layer graph, which can take minutes to hours for large datasets.
 - **Memory footprint:** the RAM needed to hold the index while serving queries. HNSW, for example, stores graph edges in RAM alongside the vectors themselves, which limits how large the index can grow on a single machine. Quantisation and disk-backed indexes reduce memory at the cost of recall or latency.
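Two of the metrics above are easy to compute by hand. The sketch below measures recall@k by comparing an ANN result against the exact result for the same query, and estimates the memory taken by the raw vectors alone. The function names and the id lists are hypothetical, and the memory estimate assumes 32-bit floats and ignores any index overhead such as graph edges.

```python
def recall_at_k(true_neighbours, returned_neighbours, k):
    """Fraction of the true top-k ids that appear in the returned top-k."""
    truth = set(true_neighbours[:k])
    found = set(returned_neighbours[:k])
    return len(truth & found) / k


def raw_vector_bytes(n, d, bytes_per_component=4):
    """Bytes needed for the vectors alone, assuming float32 by default."""
    return n * d * bytes_per_component


# Hypothetical ids: exact search result vs. what an ANN index returned.
exact_top5 = [11, 42, 7, 93, 5]
ann_top5 = [11, 7, 42, 5, 60]

print(recall_at_k(exact_top5, ann_top5, k=5))
print(raw_vector_bytes(1_000_000, 768))
```

Here the ANN result misses one of the five true neighbours, so recall@5 is 0.8. Order within the top *k* does not affect recall. One million 768-dimensional float32 vectors occupy 3 072 000 000 bytes (about 3 GB) before any index structure is added.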
 **Where sqlite-vec and Turso fit.** sqlite-vec is an excellent choice for embedded applications, local development, prototyping, and small-to-medium corpora — up to a few million vectors. It runs inside your application process with no separate server, and Turso adds cloud hosting, replication, and edge caching on top of the same SQLite foundation. For larger-scale deployments — tens of millions of vectors, multi-tenancy, complex filtered search, or distributed indexing — dedicated vector databases such as Pinecone, Qdrant, or Weaviate provide additional infrastructure. The concepts you learn in this course transfer directly: the same embeddings, distance functions, and query patterns apply regardless of which engine you choose.
 ---