docs(edu): write §5 ANN algorithms for vector-db course [6ec5ff]

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
4 months ago · 297f2d6d2f
parent 515bc2b6e5
commit 297f2d6d2f
1 changed files with 36 additions and 1 deletions
--- a/edu/src/vector-db.md
+++ b/edu/src/vector-db.md
@@ -138,7 +138,42 @@ A vector database is a data store built around one core operation: given a query

 ### 5. Under the Hood: ANN Algorithms

-Exact nearest-neighbour search over millions of high-dimensional vectors is too slow for production use. This section explains the two dominant approximate methods — **HNSW** (Hierarchical Navigable Small World graphs) and **IVFFlat** (Inverted File with flat quantisation) — their index construction, query-time traversal, and the recall vs. latency trade-off each exposes. 🚧 Full content tracked in [nbd:6ec5ff].
+**Why not exact search?** Brute-force KNN computes the distance from the query vector to every stored vector, which is O(*n* · *d*) work per query. At *n* = 1 000 000 vectors, *d* = 768 dimensions, and 1 000 queries per second, that is roughly 768 billion multiply-add operations per second and, for float32 vectors, about 3 TB/s of memory reads, well beyond what a commodity CPU can sustain. Approximate nearest-neighbour (ANN) algorithms answer queries in sub-linear time (roughly logarithmic for HNSW) at the cost of occasionally missing a few true nearest neighbours. The two dominant families are HNSW and IVFFlat.
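As a point of reference, the exact search described above can be sketched in a few lines of standard-library Python. The function name `brute_force_knn` and the four-vector dataset are made up for illustration; the cost figures reuse the numbers from the text and assume 4-byte float32 components.

```python
import heapq
import math

def brute_force_knn(query, vectors, k):
    """Return the k (distance, index) pairs closest to `query`, nearest first."""
    # One distance computation per stored vector: O(n * d) work per query.
    scored = ((math.dist(query, v), i) for i, v in enumerate(vectors))
    return heapq.nsmallest(k, scored)

# Tiny illustrative dataset (hypothetical values).
vectors = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0), (5.0, 5.0)]
nearest = brute_force_knn((0.9, 0.1), vectors, k=2)
nearest_ids = [i for _, i in nearest]

# Cost estimate using the figures from the text.
n, d, qps = 1_000_000, 768, 1_000
multiply_adds_per_second = n * d * qps            # 768 billion
bytes_read_per_second = multiply_adds_per_second * 4  # assumes float32
```

Every ANN method in this section is a way of avoiding the full scan inside `brute_force_knn`.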
+
+**HNSW — Hierarchical Navigable Small World.** HNSW is the dominant algorithm for in-memory ANN and is the algorithm used by sqlite-vec.
+
+Imagine a multi-level skip list where each level is a proximity graph. The top level is sparse, containing only a small subset of nodes connected by long-range edges that enable fast coarse navigation across the dataset. Each subsequent level adds more nodes and shorter-range edges, increasing density. The bottom level contains every vector, connected to its nearest neighbours by short-range edges that enable precise local search. When a query arrives, the algorithm starts at an entry point on the top level and greedily moves to whichever neighbour is closest to the query vector. When no neighbour on the current level is closer than the current node, the algorithm descends one level and repeats the greedy walk on the denser graph. At the bottom level the search widens: it keeps a candidate list of up to `ef_search` nodes, repeatedly expands the closest unexplored candidate, and returns the *k* nearest vectors found.
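The traversal above can be sketched as follows. This is a simplified illustration, not the code of any particular library: the layers are hand-built adjacency dictionaries over eight 1-D points, and the names `greedy_step`, `search_layer`, and `hnsw_search` are hypothetical. A real index builds the layers itself during insertion.

```python
import heapq
import math

def greedy_step(query, vectors, graph, start):
    """Upper layers: move to a closer neighbour until none improves."""
    current, current_dist = start, math.dist(query, vectors[start])
    improved = True
    while improved:
        improved = False
        for nb in graph.get(current, ()):
            d = math.dist(query, vectors[nb])
            if d < current_dist:
                current, current_dist, improved = nb, d, True
    return current

def search_layer(query, vectors, graph, entry, ef):
    """Bottom layer: best-first search keeping at most `ef` results."""
    d0 = math.dist(query, vectors[entry])
    visited = {entry}
    candidates = [(d0, entry)]   # min-heap: closest unexplored node first
    results = [(-d0, entry)]     # max-heap (negated): worst kept result on top
    while candidates:
        d, node = heapq.heappop(candidates)
        if len(results) >= ef and d > -results[0][0]:
            break  # nothing left can beat the worst kept result
        for nb in graph.get(node, ()):
            if nb in visited:
                continue
            visited.add(nb)
            d_nb = math.dist(query, vectors[nb])
            if len(results) < ef or d_nb < -results[0][0]:
                heapq.heappush(candidates, (d_nb, nb))
                heapq.heappush(results, (-d_nb, nb))
                if len(results) > ef:
                    heapq.heappop(results)
    return sorted((-neg_d, node) for neg_d, node in results)

def hnsw_search(query, vectors, layers, entry, k, ef_search):
    node = entry
    for graph in layers[:-1]:    # sparse upper layers, top first
        node = greedy_step(query, vectors, graph, node)
    found = search_layer(query, vectors, layers[-1], node, max(ef_search, k))
    return [idx for _, idx in found[:k]]

# Eight points on a line; layers are listed from top (sparse) to bottom (dense).
vectors = [(float(i),) for i in range(8)]
layers = [
    {0: [4], 4: [0]},
    {0: [2], 2: [0, 4], 4: [2, 6], 6: [4]},
    {i: [j for j in (i - 1, i + 1) if 0 <= j < 8] for i in range(8)},
]
top2 = hnsw_search((5.2,), vectors, layers, entry=0, k=2, ef_search=4)
```

With the query at 5.2, the walk goes 0 → 4 on the top layer, 4 → 6 on the middle layer, and the bottom-layer search then finds points 5 and 6.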
+
+**HNSW key parameters:**
+
+- **`M`** — the maximum number of bidirectional connections each node maintains per layer (many implementations allow up to 2 · *M* on the bottom layer). Higher *M* improves recall (the algorithm has more paths to explore) but increases memory consumption and slows down inserts because more edges must be evaluated and updated. A typical default is 16.
+- **`ef_construction`** — the size of the dynamic candidate list used when inserting a new vector into the graph. Higher values produce a higher-quality index (better-connected graph) at the cost of slower index construction. A typical default is 200.
+- **`ef_search`** — the size of the candidate list used during query-time traversal. Higher values improve recall at the cost of higher query latency. It must be at least *k*; defaults vary by implementation, and increasing it is the easiest way to trade latency for accuracy at query time.
+
+HNSW supports incremental inserts with no full rebuild: each new vector is linked into the existing graph structure, which is why the `CREATE INDEX ... USING libsql_vector_idx` in §6 requires no separate training step. The graph adds roughly *n* · *M* · 4 bytes on top of the vectors themselves, assuming 32-bit neighbour ids; implementations that allow more links on the bottom layer use proportionally more.
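To put the memory estimate in perspective, here is a rough back-of-the-envelope calculation. It assumes float32 vector components, 32-bit neighbour ids, and exactly *M* links per node; the helper name `hnsw_memory_bytes` is hypothetical, and real implementations add per-node overhead and upper-layer links that are ignored here.

```python
def hnsw_memory_bytes(n, d, m, id_bytes=4, float_bytes=4):
    """Rough estimate: raw vector storage and bottom-layer edge storage."""
    vector_bytes = n * d * float_bytes   # the vectors themselves
    edge_bytes = n * m * id_bytes        # M neighbour ids per node
    return vector_bytes, edge_bytes

vector_bytes, edge_bytes = hnsw_memory_bytes(n=1_000_000, d=768, m=16)
overhead_ratio = edge_bytes / vector_bytes
```

For one million 768-dimensional vectors with *M* = 16, this gives about 3.07 GB of vector data and about 64 MB of edges under these assumptions, so the relative overhead of the graph grows as the vectors get shorter.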
+
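+**IVFFlat — Inverted File with flat (uncompressed) storage.** "Flat" means the vectors inside each cell are stored as-is, with no quantisation or compression. IVFFlat is a common choice for disk-based or GPU-accelerated ANN and is available in libraries and systems such as Faiss (`IndexIVFFlat`) and pgvector (`ivfflat`).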
+
+The idea is to partition the dataset into `nlist` Voronoi cells using k-means clustering during a one-time training step. Each cell is defined by a centroid vector, and every stored vector is assigned to the cell whose centroid is closest. At query time, the algorithm computes the distance from the query to all `nlist` centroids, selects the `nprobe` nearest centroids, and then performs exact brute-force search only within those cells — skipping the vast majority of the dataset entirely.
+
+**IVFFlat key parameters:**
+
+- **`nlist`** — the number of clusters (Voronoi cells). A common heuristic is to set `nlist` ≈ √*n*. More clusters mean each cell is smaller, so query-time search within a cell is faster, but training takes longer and very small cells increase the risk of a query's true neighbours falling in an unsearched cell.
+- **`nprobe`** — the number of clusters examined at query time. Higher `nprobe` improves recall at the cost of higher latency. Setting `nprobe` = `nlist` degenerates to exact search; setting `nprobe` = 1 checks only the single most likely cluster.
+
+Unlike HNSW, IVFFlat requires a training step (the k-means clustering) before any data can be inserted. Incremental inserts require assigning each new vector to an existing cluster, which can degrade quality over time as the data distribution drifts from the original centroids — periodic retraining is recommended for heavily updated datasets. IVFFlat uses less memory than HNSW for the same *n* because it does not store graph edges.
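The train, assign, and probe steps described above can be sketched in plain Python. This is a minimal illustration under simplifying assumptions: k-means is seeded with evenly spaced samples rather than a proper initialisation, and the function names (`train_ivf`, `build_index`, `ivf_search`) and the nine-point dataset are hypothetical.

```python
import math

def mean(points):
    return tuple(sum(c) / len(points) for c in zip(*points))

def nearest_centroid(v, centroids):
    return min(range(len(centroids)), key=lambda c: math.dist(v, centroids[c]))

def train_ivf(vectors, nlist, iterations=10):
    """Plain Lloyd k-means; seeds are evenly spaced samples (a simplification)."""
    step = len(vectors) // nlist
    centroids = [vectors[i * step] for i in range(nlist)]
    for _ in range(iterations):
        cells = [[] for _ in range(nlist)]
        for v in vectors:
            cells[nearest_centroid(v, centroids)].append(v)
        # Keep the old centroid if a cell ends up empty.
        centroids = [mean(cell) if cell else centroids[c]
                     for c, cell in enumerate(cells)]
    return centroids

def build_index(vectors, centroids):
    """Inverted lists: for each cell, the ids of the vectors assigned to it."""
    lists = [[] for _ in centroids]
    for i, v in enumerate(vectors):
        lists[nearest_centroid(v, centroids)].append(i)
    return lists

def ivf_search(query, vectors, centroids, lists, k, nprobe):
    # Rank all nlist centroids, then scan only the nprobe nearest cells exactly.
    order = sorted(range(len(centroids)),
                   key=lambda c: math.dist(query, centroids[c]))
    candidates = [i for c in order[:nprobe] for i in lists[c]]
    candidates.sort(key=lambda i: math.dist(query, vectors[i]))
    return candidates[:k]

# Three well-separated groups of three points each.
vectors = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
           (10.0, 10.0), (10.0, 11.0), (11.0, 10.0),
           (20.0, 0.0), (20.0, 1.0), (21.0, 0.0)]
centroids = train_ivf(vectors, nlist=3)
lists = build_index(vectors, centroids)
hits = ivf_search((9.0, 9.0), vectors, centroids, lists, k=2, nprobe=1)
exact = ivf_search((9.0, 9.0), vectors, centroids, lists, k=2, nprobe=3)
```

With `nprobe=1` only three of the nine vectors are scanned; with `nprobe` equal to `nlist` every cell is scanned and the result matches exact search.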
+
+**sqlite-vec uses HNSW.** The `libsql_vector_idx` index type you created in §6 builds an HNSW index — which is why rows can be inserted incrementally with no training step. The current sqlite-vec API does not expose *M* or *ef* parameters directly; sensible defaults are chosen for broad applicability.
+
+**Summary table.**
+
+| Property | HNSW | IVFFlat |
+|---|---|---|
+| Query time | O(log *n*) (typical) | O(*nlist* + *nprobe* · *n* / *nlist*) |
+| Insert | Incremental | Incremental after a one-time training step; periodic retraining advised |
+| Memory | Higher (graph edges) | Lower |
+| Recall@10 at defaults | ~0.95+ (dataset-dependent) | ~0.90+ (depends on *nprobe* and dataset) |
+| Used by | sqlite-vec, Qdrant, Weaviate | pgvector, Faiss |

 ---