+++
title = "§5 Under the Hood: ANN Algorithms"
priority = 5
status = "done"
ticket_type = "task"
dependencies = []
+++

## §5 Under the Hood: ANN Algorithms (stub to fill)

File: `edu/src/vector-db.md`, section `### 5. Under the Hood: ANN Algorithms`

Replace this stub line with full content:

> Exact nearest-neighbour search over millions of high-dimensional vectors is too slow [...] 🚧 Full content tracked in [nbd:6ec5ff].

This is a **reading lesson**: no Rust code. Target 600–800 words. Include the summary table below.

## Learning objectives

- Understand why exact KNN is impractical at scale (O(n·d) per query)
- Understand how HNSW works conceptually (multi-level navigable graph, greedy search)
- Understand how IVFFlat works conceptually (k-means clustering, inverted index)
- Know the key tuning parameters for each and what they control
- Understand the recall vs. latency trade-off
- Know that sqlite-vec uses HNSW via `libsql_vector_idx`

## Content to write

**Why not exact search?** Brute-force KNN computes the distance from the query to every stored vector: O(n·d) per query. At n = 1M vectors, d = 768 dimensions, and 1000 QPS this is roughly 768 billion (7.68 × 10^11) multiply-add operations per second, which is impractical to sustain on a CPU. ANN algorithms find approximate results in O(log n) or otherwise sub-linear time, at the cost of occasionally missing a few true nearest neighbours.

**HNSW: Hierarchical Navigable Small World.** The dominant algorithm for in-memory ANN, used by sqlite-vec. Intuition: imagine a multi-level skip list where each level is a proximity graph. The top level is sparse with long-range connections (fast coarse navigation). The bottom level is dense with short-range connections (precise local search). A query starts at the top, greedily moves to whichever neighbour is closest to the query, descends one level when no neighbour is closer than the current node, and repeats down to the bottom level, where the k nearest candidates are collected.

Key parameters:

- `M`: number of bidirectional connections per node. Higher M → better recall, more memory, slower inserts. Typical: 16.
- `ef_construction`: candidate list size during index build. Higher → better index quality, slower build. Typical: 200.
- `ef_search`: candidate list size during query. Higher → better recall, slower query. Often defaults to k.

HNSW supports incremental inserts with no full rebuild. The graph adds roughly n·M neighbour ids at 4 bytes each, i.e. O(n·M) memory on top of the vectors themselves.

**IVFFlat: inverted file index with flat (uncompressed) vector storage.** A widely used approach for disk-based or GPU-accelerated ANN (offered by Faiss and pgvector). Intuition: cluster the dataset into `nlist` Voronoi cells using k-means. At query time, find the `nprobe` nearest cell centroids, then do exact search within those cells only, skipping the rest of the dataset.

Key parameters:

- `nlist`: number of clusters. Typical: √n.
- `nprobe`: number of clusters searched at query time. Higher → better recall, slower query.

IVFFlat requires a training step (running k-means to fix the centroids) before data can be inserted. New vectors are assigned to the nearest existing centroid, so the index needs periodic retraining as the data distribution drifts. It uses less memory than HNSW for the same n because it stores no graph edges.

**sqlite-vec uses HNSW.** The `libsql_vector_idx` index type creates an HNSW index, which is why §6 can insert rows incrementally without a training step. The current API does not expose M or ef parameters; defaults are chosen for broad applicability.

**Summary table.**

| Property | HNSW | IVFFlat |
|---|---|---|
| Query time | O(log n) | O(nprobe · n/nlist) |
| Insert | Incremental | Batch (requires training) |
| Memory | Higher (graph edges) | Lower |
| Recall@10 at defaults | ~0.95+ | ~0.90+ (depends on nprobe) |
| Used by | sqlite-vec, Qdrant, Weaviate | pgvector, Faiss |
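
**Optional illustration: brute-force cost.** The arithmetic in "Why not exact search?" can be checked with a few lines of Python. This is only a sketch for the lesson author: the function names `brute_force_ops_per_second` and `exact_knn` are made up for illustration, and counting one operation per vector component is a simplifying assumption. It requires Python 3.8+ for `math.dist`.

```python
import math

def brute_force_ops_per_second(n: int, d: int, qps: int) -> int:
    """Exact KNN touches all n*d vector components once per query."""
    return n * d * qps

def exact_knn(query, vectors, k):
    """Return the indices of the k vectors closest to `query` (Euclidean),
    found by scanning every stored vector: O(n*d) work per query."""
    scored = sorted((math.dist(query, v), i) for i, v in enumerate(vectors))
    return [i for _, i in scored[:k]]

# The figures used in the text: 1M vectors, 768 dimensions, 1000 queries/second.
ops = brute_force_ops_per_second(n=1_000_000, d=768, qps=1000)
print(f"{ops:.3e} operations per second")  # 7.680e+11

# Tiny usage example of the exhaustive scan itself.
print(exact_knn((0.0, 0.0), [(1.0, 1.0), (0.1, 0.0), (5.0, 5.0)], k=2))  # [1, 0]
```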
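
**Optional illustration: greedy descent through layers.** The HNSW intuition above (walk greedily on a sparse top layer, then descend to denser layers) can be sketched on a toy graph. This is a deliberately simplified assumption-laden model, not the real algorithm: the layers are built by hand rather than by HNSW's randomised insertion, and the bottom layer keeps a single best node instead of an `ef_search`-sized candidate list. The names `greedy_walk` and `layered_search` are hypothetical.

```python
import math

def greedy_walk(graph, points, start, query):
    """On one layer, keep hopping to the neighbour closest to `query`
    until no neighbour is closer than the current node."""
    current = start
    while True:
        neighbours = graph.get(current, [])
        nxt = min(neighbours, key=lambda j: math.dist(points[j], query), default=current)
        if math.dist(points[nxt], query) >= math.dist(points[current], query):
            return current  # local minimum on this layer
        current = nxt

def layered_search(layers, points, entry, query):
    """Run a greedy walk on each layer, top (sparse) to bottom (dense),
    using the result of one layer as the entry point of the next."""
    current = entry
    for graph in layers:
        current = greedy_walk(graph, points, current, query)
    return current

# Toy data: eight points on a line, 0.0 .. 7.0.
points = [(float(i),) for i in range(8)]
# Bottom layer: every node linked to its immediate neighbours (short-range edges).
bottom = {i: [j for j in (i - 1, i + 1) if 0 <= j < 8] for i in range(8)}
# Top layer: only nodes 0 and 4, joined by one long-range edge.
top = {0: [4], 4: [0]}

nearest = layered_search([top, bottom], points, entry=0, query=(5.2,))
print(nearest)  # 5: one long hop 0 -> 4 on top, then one short hop 4 -> 5 below
```

In this toy case the top layer replaces four short hops with one long hop, which is the effect the skip-list analogy describes.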
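
**Optional illustration: IVFFlat in miniature.** The train, assign, and probe steps described for IVFFlat can be sketched in plain Python. This is a minimal sketch under stated assumptions, not how Faiss or pgvector implement it: the k-means here is a naive fixed-iteration Lloyd loop, the data is random 2-D points, and the names `kmeans`, `build_ivf`, and `ivf_search` are invented for this example. It requires Python 3.8+ for `math.dist`.

```python
import math
import random

def nearest_centroid(v, centroids):
    """Index of the centroid closest to vector v."""
    return min(range(len(centroids)), key=lambda j: math.dist(v, centroids[j]))

def kmeans(vectors, nlist, iters=10, seed=0):
    """Training step: naive Lloyd's k-means returning nlist centroids."""
    centroids = random.Random(seed).sample(vectors, nlist)
    for _ in range(iters):
        cells = [[] for _ in range(nlist)]
        for v in vectors:
            cells[nearest_centroid(v, centroids)].append(v)
        # Move each centroid to the mean of its cell; keep it if the cell is empty.
        centroids = [
            tuple(sum(xs) / len(cell) for xs in zip(*cell)) if cell else centroids[j]
            for j, cell in enumerate(cells)
        ]
    return centroids

def build_ivf(vectors, centroids):
    """Inverted lists: for each cell, the ids of the vectors assigned to it."""
    lists = [[] for _ in centroids]
    for i, v in enumerate(vectors):
        lists[nearest_centroid(v, centroids)].append(i)
    return lists

def ivf_search(query, vectors, centroids, lists, k, nprobe):
    """Exact search restricted to the nprobe cells nearest the query."""
    order = sorted(range(len(centroids)), key=lambda j: math.dist(query, centroids[j]))
    candidates = [i for c in order[:nprobe] for i in lists[c]]
    candidates.sort(key=lambda i: math.dist(query, vectors[i]))
    return candidates[:k]

rng = random.Random(1)
vectors = [(rng.random(), rng.random()) for _ in range(200)]
nlist = 14  # roughly sqrt(200), following the rule of thumb in the text
centroids = kmeans(vectors, nlist)
lists = build_ivf(vectors, centroids)

query = (0.5, 0.5)
print(ivf_search(query, vectors, centroids, lists, k=5, nprobe=2))      # approximate
print(ivf_search(query, vectors, centroids, lists, k=5, nprobe=nlist))  # same as exact search
```

Raising `nprobe` scans more cells, so recall approaches that of exact search while the per-query work grows towards a full scan; setting `nprobe = nlist` scans everything.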