+++
title = "§5 Under the Hood: ANN Algorithms"
priority = 5
status = "todo"
ticket_type = "task"
dependencies = []
+++
## §5 Under the Hood: ANN Algorithms — Stub to fill
File: `edu/src/vector-db.md`, section `### 5. Under the Hood: ANN Algorithms`
Replace this stub line with full content:
> Exact nearest-neighbour search over millions of high-dimensional vectors is too slow [...] 🚧 Full content tracked in [nbd:6ec5ff].

This is a **reading lesson** with no Rust code. Target 600–800 words. Include the summary table below.
## Learning objectives
- Understand why exact KNN is impractical at scale (O(n·d) per query)
- Understand how HNSW works conceptually (multi-level navigable graph, greedy search)
- Understand how IVFFlat works conceptually (k-means clustering, inverted index)
- Know the key tuning parameters for each and what they control
- Understand the recall vs. latency trade-off
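- Know which index §6 relies on: `libsql_vector_idx` is libSQL's vector index function (a DiskANN-style graph index per the libSQL documentation), not part of sqlite-vec and not HNSW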
## Content to write
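**Why not exact search?** Brute-force KNN computes the distance from the query to every stored vector, which costs O(n·d) per query. At n = 1M vectors, d = 768 dimensions, and 1,000 QPS, that is about 7.68 × 10¹¹ (768 billion) multiply-adds per second. This is impractical on a single CPU, where memory bandwidth alone becomes the bottleneck (about 3 GB of vectors scanned per query, assuming float32). ANN algorithms answer in sub-linear time, often close to O(log n), at the cost of occasionally missing a few true nearest neighbours.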
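
An optional illustration of the cost argument above, as a minimal Python sketch (the lesson text itself stays code-free). The function names `brute_force_knn` and `ops_per_second` and the toy vectors are hypothetical, chosen only for this example.

```python
import math

def brute_force_knn(query, vectors, k):
    """Exact k-NN: measure the distance to every stored vector, then sort."""
    scored = []
    for idx, vec in enumerate(vectors):
        # One pass over d dimensions per vector, so n * d work for the scan.
        dist = math.sqrt(sum((q - v) ** 2 for q, v in zip(query, vec)))
        scored.append((dist, idx))
    scored.sort()
    return [idx for _, idx in scored[:k]]

def ops_per_second(n, d, qps):
    """Multiply-adds per second needed to serve qps brute-force queries."""
    return n * d * qps

vectors = [[0.0, 0.0], [1.0, 1.0], [5.0, 5.0], [0.9, 1.2]]
nearest = brute_force_knn([1.0, 1.0], vectors, k=2)
total_ops = ops_per_second(n=1_000_000, d=768, qps=1000)
```

Here `nearest` holds the indices of the two closest toy vectors and `total_ops` reproduces the 768 billion figure quoted above.
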
**HNSW: Hierarchical Navigable Small World.** The dominant algorithm for in-memory ANN, used by hnswlib, Qdrant, Weaviate, and pgvector's `hnsw` index type.
Intuition: imagine a multi-level skip list where each level is a proximity graph. The top level is sparse with long-range connections (fast coarse navigation). The bottom level is dense with short-range connections (precise local search). A query starts at the top, greedily moves to whichever neighbour is closest to the query, descends when stuck, and repeats down to the bottom level where the k nearest candidates are collected.
Key parameters:
- `M`: number of bidirectional connections per node. Higher M → better recall, more memory, slower inserts. Typical: 16.
- `ef_construction`: candidate list size during index build. Higher → better index quality, slower build. Typical: 200.
- `ef_search`: candidate list size during query. Higher → better recall, slower query. Must be at least k; defaults vary by library (for example, pgvector's `hnsw.ef_search` defaults to 40).
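HNSW supports incremental inserts with no full rebuild. The graph edges cost on the order of n·M·4 bytes (assuming 4-byte neighbour IDs), on top of the vectors themselves: at n = 1M and M = 16 that is roughly 64 MB of edges, next to about 3 GB for 768-dimensional float32 vectors.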
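
The layered descent described in the intuition can be sketched as follows. This is a deliberately simplified toy, not a real HNSW implementation: the graph is hand-built rather than constructed with `M` and `ef_construction`, and every layer uses a beam width of 1, whereas real HNSW keeps `ef_search` candidates on the bottom layer. The names `greedy_search_layer` and `layered_search` are hypothetical.

```python
import math

def greedy_search_layer(points, neighbours, entry, query):
    """Walk one proximity graph: hop to the closest neighbour until no hop improves."""
    current = entry
    while True:
        best = min(
            neighbours[current],
            key=lambda n: math.dist(points[n], query),
            default=current,
        )
        if math.dist(points[best], query) < math.dist(points[current], query):
            current = best  # a neighbour is closer, keep moving
        else:
            return current  # stuck: local minimum on this layer

def layered_search(points, layers, entry, query):
    """layers[0] is the sparse top layer; layers[-1] is the dense bottom layer."""
    current = entry
    for neighbours in layers:
        # The result on one layer becomes the entry point for the next one down.
        current = greedy_search_layer(points, neighbours, current, query)
    return current

# Nine points on a line. The top layer links only nodes 0, 4, 8 (long hops);
# the bottom layer links each node to its immediate neighbours (short hops).
points = [(float(i), 0.0) for i in range(9)]
top = {0: [4], 4: [0, 8], 8: [4]}
bottom = {i: [j for j in (i - 1, i + 1) if 0 <= j <= 8] for i in range(9)}

found = layered_search(points, [top, bottom], entry=0, query=(6.2, 0.0))
```

Starting from node 0, the search takes two long hops on the top layer (0 → 4 → 8) and two short hops on the bottom layer (8 → 7 → 6), instead of walking six steps along the bottom chain.
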
**IVFFlat: Inverted File index with flat (uncompressed) vector storage.** A common choice for disk-based or GPU-accelerated ANN, available in Faiss and as pgvector's `ivfflat` index type.
Intuition: cluster the dataset into `nlist` Voronoi cells using k-means. At query time, find the `nprobe` nearest cell centroids, then do exact search within those cells only — skipping the rest of the dataset.
Key parameters:
- `nlist`: number of clusters. Typical: √n.
- `nprobe`: number of clusters searched at query time. Higher → better recall, slower query.
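IVFFlat requires a training step (running k-means on a representative sample) before the index can be used. Later inserts are simply assigned to the nearest existing centroid, so no rebuild is needed, but recall degrades if the data distribution drifts away from the trained centroids, and periodic retraining fixes that. It uses less memory than HNSW for the same n because it stores no graph edges.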
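
The train, assign, and probe steps above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not how Faiss or pgvector implement it: k-means is plain Lloyd's algorithm seeded with the first `nlist` vectors for determinism, and the helper names (`train_centroids`, `build_inverted_lists`, `ivf_search`) are hypothetical.

```python
import math

def nearest_centroid(vec, centroids):
    """Index of the centroid closest to vec."""
    return min(range(len(centroids)), key=lambda c: math.dist(vec, centroids[c]))

def train_centroids(vectors, nlist, iterations=10):
    """Training step: Lloyd's k-means, seeded with the first nlist vectors."""
    centroids = [list(v) for v in vectors[:nlist]]
    for _ in range(iterations):
        cells = [[] for _ in range(nlist)]
        for vec in vectors:
            cells[nearest_centroid(vec, centroids)].append(vec)
        for c, members in enumerate(cells):
            if members:  # keep the old centroid if a cell went empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return centroids

def build_inverted_lists(vectors, centroids):
    """Inverted index: for each cell, the ids of the vectors assigned to it."""
    lists = [[] for _ in centroids]
    for idx, vec in enumerate(vectors):
        lists[nearest_centroid(vec, centroids)].append(idx)
    return lists

def ivf_search(query, vectors, centroids, lists, k, nprobe):
    """Exact search restricted to the nprobe cells nearest to the query."""
    order = sorted(range(len(centroids)), key=lambda c: math.dist(query, centroids[c]))
    candidates = [idx for c in order[:nprobe] for idx in lists[c]]
    candidates.sort(key=lambda idx: math.dist(query, vectors[idx]))
    return candidates[:k]

vectors = [[0.0, 0.0], [10.0, 10.0], [0.0, 1.0], [1.0, 0.0], [10.0, 11.0], [11.0, 10.0]]
centroids = train_centroids(vectors, nlist=2)
lists = build_inverted_lists(vectors, centroids)
result = ivf_search([9.5, 10.2], vectors, centroids, lists, k=2, nprobe=1)
```

With `nprobe=1` only the three vectors in the nearer cell are compared against the query; raising `nprobe` to `nlist` degenerates into brute-force search.
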
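**Which index §6 uses.** `libsql_vector_idx` is libSQL's vector index function, not part of sqlite-vec. Per the libSQL documentation it builds a DiskANN-style graph index rather than HNSW, while sqlite-vec itself has so far shipped brute-force search. Because the index is graph-based, like HNSW, §6 can insert rows incrementally without a training step. HNSW's M and ef parameters do not apply to it; check the libSQL documentation for the index settings it exposes before writing this paragraph.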
**Summary table.**
| Property | HNSW | IVFFlat |
|---|---|---|
| Query time | ~O(log n) (empirical) | O(nlist + nprobe · n/nlist) distance computations |
| Insert | Incremental | Incremental after an initial training step; retrain if data drifts |
| Memory | Higher (graph edges) | Lower |
| Recall@10 (indicative; varies with dataset and parameters) | ~0.95+ | ~0.90+ (depends on nprobe) |
| Used by | hnswlib, Qdrant, Weaviate, pgvector (`hnsw`) | Faiss, pgvector (`ivfflat`) |
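
The Recall@10 row can be made concrete with a small helper: recall@k is the fraction of the true k nearest neighbours that the approximate search also returned. A minimal sketch, where the name `recall_at_k` and the id lists are hypothetical:

```python
def recall_at_k(approx_ids, exact_ids, k):
    """Fraction of the exact top-k ids that also appear in the approximate top-k."""
    exact = set(exact_ids[:k])
    return len(exact.intersection(approx_ids[:k])) / k

# The approximate search returned id 9 where the exact answer had id 4.
recall = recall_at_k(approx_ids=[1, 2, 3, 9], exact_ids=[1, 2, 3, 4], k=4)
```

Measuring this against a brute-force baseline while varying `ef_search` or `nprobe` is the usual way to chart the recall vs. latency trade-off.
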