+++
title = "§5 Under the Hood: ANN Algorithms"
priority = 5
status = "done"
ticket_type = "task"
dependencies = []
+++
## §5 Under the Hood: ANN Algorithms — Stub to fill

File: `edu/src/vector-db.md`, section `### 5. Under the Hood: ANN Algorithms`

Replace this stub line with full content:
> Exact nearest-neighbour search over millions of high-dimensional vectors is too slow [...] 🚧 Full content tracked in [nbd:6ec5ff].

This is a **reading lesson**: no Rust code. Target 600–800 words. Include the summary table below.

## Learning objectives

- Understand why exact KNN is impractical at scale (O(n·d) per query)
- Understand how HNSW works conceptually (multi-level navigable graph, greedy search)
- Understand how IVFFlat works conceptually (k-means clustering, inverted index)
- Know the key tuning parameters for each and what they control
- Understand the recall vs. latency trade-off
- Know that sqlite-vec uses HNSW via `libsql_vector_idx`

## Content to write

**Why not exact search?** Brute-force KNN computes the distance from the query to every stored vector: O(n·d) work per query. At n = 1M vectors, d = 768 dimensions, and 1000 QPS, that is about 7.68 × 10¹¹ (768 billion) per-dimension operations per second, which is impractical to sustain on a CPU. ANN algorithms return approximate results in O(log n) or otherwise sub-linear time, at the cost of occasionally missing a few true nearest neighbours.
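
The cost argument above can be checked with a few lines of Python. This is a minimal sketch using only the standard library; the function names `brute_force_knn` and `ops_per_second` are illustrative, and "operation" here means one per-dimension step of a distance computation, which is an assumption about how the figure in the text is counted.

```python
import heapq
import math


def brute_force_knn(query, vectors, k):
    """Exact k-NN: one distance computation per stored vector, O(n*d)."""
    scored = ((math.dist(query, v), i) for i, v in enumerate(vectors))
    return [i for _, i in heapq.nsmallest(k, scored)]


def ops_per_second(n, d, qps):
    """Rough count of per-dimension operations needed each second."""
    return n * d * qps


# Four toy 2-D vectors; the query coincides with vector 1.
vectors = [(0.0, 0.0), (1.0, 1.0), (5.0, 5.0), (0.9, 1.2)]
nearest = brute_force_knn((1.0, 1.0), vectors, k=2)

# The workload from the text: 1M vectors, 768 dimensions, 1000 QPS.
workload = ops_per_second(n=1_000_000, d=768, qps=1000)
print(nearest, f"{workload:.2e}")
```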

**HNSW (Hierarchical Navigable Small World).** The dominant algorithm for in-memory ANN, used by sqlite-vec.

Intuition: imagine a multi-level skip list in which each level is a proximity graph. The top level is sparse, with long-range connections for fast coarse navigation. The bottom level is dense, with short-range connections for precise local search. A query enters at the top level and greedily moves to whichever neighbour is closest to the query. When no neighbour is closer than the current node, it descends one level and repeats. At the bottom level, the search widens to a candidate list and returns the k nearest candidates.
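
The descent described above can be sketched in plain Python. This is a toy under stated assumptions, not how any particular library implements HNSW: `build_layers` builds each layer by brute force from a random subset (real HNSW inserts nodes incrementally and assigns levels probabilistically), and every function name here is hypothetical. The search half follows the idea in the text: greedy steps on the sparse layers, then a best-first search with a candidate list of size `ef` on the bottom layer. The points sit on a line so the answer is easy to verify by eye.

```python
import heapq
import math
import random


def build_layers(points, m=4, levels=3, seed=0):
    """Toy layered graph. layers[0] holds every point; each higher layer is a
    random subset of the one below. Built by brute force for illustration."""
    rng = random.Random(seed)
    members = list(range(len(points)))
    layers = []
    for _ in range(levels):
        graph = {}
        for i in members:
            others = [j for j in members if j != i]
            graph[i] = heapq.nsmallest(
                m, others, key=lambda j: math.dist(points[i], points[j]))
        layers.append(graph)
        members = rng.sample(members, max(1, len(members) // 4))
    return layers


def greedy_step(graph, points, query, start):
    """Move to the neighbour closest to the query until none is closer."""
    current = start
    while True:
        best = min(graph[current],
                   key=lambda j: math.dist(points[j], query), default=current)
        if math.dist(points[best], query) >= math.dist(points[current], query):
            return current  # local minimum on this layer: descend
        current = best


def search_layer(graph, points, query, start, ef):
    """Best-first search that keeps the ef closest nodes seen so far."""
    d0 = math.dist(points[start], query)
    visited = {start}
    candidates = [(d0, start)]  # min-heap: closest unexpanded node first
    results = [(-d0, start)]    # max-heap via negated distance
    while candidates:
        dist, node = heapq.heappop(candidates)
        if len(results) >= ef and dist > -results[0][0]:
            break  # no remaining candidate can improve the result set
        for nb in graph[node]:
            if nb in visited:
                continue
            visited.add(nb)
            d = math.dist(points[nb], query)
            if len(results) < ef or d < -results[0][0]:
                heapq.heappush(candidates, (d, nb))
                heapq.heappush(results, (-d, nb))
                if len(results) > ef:
                    heapq.heappop(results)
    return sorted((-nd, i) for nd, i in results)


def hnsw_style_search(layers, points, query, k, ef=16):
    entry = next(iter(layers[-1]))        # any node of the sparse top layer
    for graph in reversed(layers[1:]):    # coarse navigation, top down
        entry = greedy_step(graph, points, query, entry)
    found = search_layer(layers[0], points, query, entry, max(ef, k))
    return [i for _, i in found[:k]]


points = [(float(i),) for i in range(32)]  # 32 points on a line
layers = build_layers(points)
result = hnsw_style_search(layers, points, (20.3,), k=3)
print(result)
```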

Key parameters:
- `M`: maximum number of bidirectional connections per node. Higher M → better recall, more memory, slower inserts. Typical: 16.
- `ef_construction`: candidate list size during index build. Higher → better index quality, slower build. Typical: 200.
- `ef_search`: candidate list size during query; should be at least k. Higher → better recall, slower query. Defaults vary by library.

HNSW supports incremental inserts with no full rebuild. The graph edges add memory on the order of n·M·4 bytes (4-byte neighbour IDs), on top of the vectors themselves.
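
To put the memory figure in context, here is a back-of-envelope estimate. It is a rough sketch that assumes float32 vectors and 4-byte neighbour IDs, and it ignores the upper layers and the larger fan-out some implementations use on the bottom layer; `hnsw_memory_bytes` is an illustrative helper, not a library function.

```python
def hnsw_memory_bytes(n, d, m, bytes_per_float=4, bytes_per_link=4):
    """Split an HNSW index's footprint into raw vectors and graph edges."""
    vector_bytes = n * d * bytes_per_float
    link_bytes = n * m * bytes_per_link  # the n*M*4 term from the text
    return vector_bytes, link_bytes


vector_bytes, link_bytes = hnsw_memory_bytes(n=1_000_000, d=768, m=16)
print(f"vectors: {vector_bytes / 1e9:.2f} GB, links: {link_bytes / 1e6:.0f} MB")
```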

**IVFFlat (Inverted File, flat storage).** The dominant approach for disk-based or GPU-accelerated ANN, available in Faiss and pgvector. "Flat" means the vectors inside each cell are stored uncompressed rather than quantised.

Intuition: cluster the dataset into `nlist` Voronoi cells using k-means. At query time, find the `nprobe` nearest cell centroids, then do exact search within those cells only, skipping the rest of the dataset.
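
The two phases above (train and assign, then probe) can be sketched with the standard library alone. This is an illustrative toy, assuming plain Lloyd's k-means with a fixed iteration count; the helper names are hypothetical and it is not the Faiss or pgvector implementation. Setting `nprobe` equal to `nlist` degenerates to exact search, which is a handy sanity check.

```python
import heapq
import math
import random


def nearest(centroids, v):
    """Index of the centroid closest to v."""
    return min(range(len(centroids)), key=lambda c: math.dist(centroids[c], v))


def train_centroids(vectors, nlist, iters=10, seed=0):
    """Plain Lloyd's k-means; returns nlist centroids."""
    rng = random.Random(seed)
    centroids = rng.sample(vectors, nlist)
    for _ in range(iters):
        cells = [[] for _ in range(nlist)]
        for v in vectors:
            cells[nearest(centroids, v)].append(v)
        for c, cell in enumerate(cells):
            if cell:  # keep the old centroid if a cell went empty
                centroids[c] = tuple(sum(xs) / len(cell) for xs in zip(*cell))
    return centroids


def build_inverted_lists(vectors, centroids):
    """Inverted index: cell id -> ids of the vectors assigned to that cell."""
    lists = {c: [] for c in range(len(centroids))}
    for i, v in enumerate(vectors):
        lists[nearest(centroids, v)].append(i)
    return lists


def ivf_search(query, vectors, centroids, lists, k, nprobe):
    """Exact search restricted to the nprobe cells nearest the query."""
    probe = heapq.nsmallest(nprobe, range(len(centroids)),
                            key=lambda c: math.dist(centroids[c], query))
    candidates = (i for c in probe for i in lists[c])
    return heapq.nsmallest(k, candidates,
                           key=lambda i: math.dist(vectors[i], query))


rng = random.Random(1)
vectors = [(rng.random(), rng.random()) for _ in range(400)]
centroids = train_centroids(vectors, nlist=20)  # 20 = sqrt(400)
lists = build_inverted_lists(vectors, centroids)
hits = ivf_search((0.5, 0.5), vectors, centroids, lists, k=5, nprobe=3)
print(hits)
```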

Key parameters:
- `nlist`: number of clusters. Typical: √n.
- `nprobe`: number of clusters searched at query time. Higher → better recall, slower query.
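
IVFFlat requires a training step (k-means over a representative sample) before data can be inserted. New vectors are then assigned to the nearest existing centroid, so recall can degrade as the data distribution drifts until the index is retrained. Memory use is lower than HNSW for the same n, since there is no edge list to store.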

**sqlite-vec uses HNSW.** The `libsql_vector_idx` index type creates an HNSW index, which is why §6 can insert rows incrementally without a training step. The current API does not expose M or ef parameters; defaults are chosen for broad applicability.

**Summary table.**

| Property | HNSW | IVFFlat |
|---|---|---|
| Query time | O(log n) | O(nprobe · n/nlist) |
| Insert | Incremental | Batch (requires training) |
| Memory | Higher (graph edges) | Lower |
| Recall@10 at defaults | ~0.95+ | ~0.90+ (depends on nprobe) |
| Used by | sqlite-vec, Qdrant, Weaviate | pgvector, Faiss |
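
The Recall@10 row compares an approximate result against the exact top 10. A minimal sketch of that metric follows; the ID lists are made-up illustrative data, not measurements from any index.

```python
def recall_at_k(approx_ids, exact_ids, k):
    """Fraction of the true top-k that the approximate search returned."""
    return len(set(approx_ids[:k]) & set(exact_ids[:k])) / k


exact = [7, 2, 9, 4, 1, 8, 3, 0, 5, 6]    # ground truth from brute force
approx = [7, 2, 9, 4, 1, 8, 3, 0, 5, 11]  # hypothetical ANN result, one miss
score = recall_at_k(approx, exact, k=10)
print(score)
```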