2.7 KiB
| title | status | type | priority | created_at | updated_at |
|---|---|---|---|---|---|
| §3 Vector Similarity | completed | task | normal | 2026-03-10T23:30:01Z | 2026-03-10T23:30:01Z |
§3 Vector Similarity — Stub to fill
File: edu/src/vector-db.md, section ### 3. Vector Similarity
Replace this stub line with full content:
Once you have two vectors, how do you measure how alike they are? [...] 🚧 Full content tracked in [nbd:99e1d9].
This is a reading lesson with inline math — no Rust code. Target 400–600 words. Bold lead phrases, inline math using Unicode (not LaTeX). Include a small worked example with concrete 3D numbers.
Learning objectives
- Know the three main similarity/distance functions: cosine similarity, dot product, Euclidean distance
- Understand the formula and geometric meaning of each
- Know the relationship between cosine similarity and cosine distance (what
vector_distance_cosactually returns) - Know when each metric is appropriate
- Understand why normalised vectors simplify the choice
Content to write
Cosine similarity. Formula: cos(θ) = (a · b) / (‖a‖ · ‖b‖). Range −1 to 1 (1 = same direction, 0 = orthogonal, −1 = opposite). Measures the angle between vectors, ignoring magnitude. Ideal for text embeddings: a short and long document on the same topic produce vectors that differ in magnitude but not direction.
Cosine distance. 1 − cosine_similarity. Range 0 to 2. This is what sqlite-vec's vector_distance_cos returns (0 = identical, 2 = fully opposite). Clarify the naming: the function name says "cos" but returns a distance, not a similarity — smaller is more similar.
Dot product. Formula: a · b = Σᵢ aᵢbᵢ. For unit-normalised vectors, dot product equals cosine similarity (since ‖a‖ = ‖b‖ = 1 cancels out). For unnormalised vectors, it conflates magnitude and angle. Some models are trained specifically for maximum inner product search (MIPS) — their documentation will say so.
Euclidean (L2) distance. Formula: ‖a − b‖ = √(Σᵢ (aᵢ − bᵢ)²). Range 0 to ∞. Sensitive to vector magnitude. Appropriate for low-dimensional geometric/tabular data where absolute coordinate values carry meaning.
When to use each. Text and sentence embeddings: cosine (or dot product if model outputs unit vectors, which many do). Follow the model card's recommendation when specified. Low-dimensional geometric features: L2.
Worked example. Use vectors a = [1, 0, 1] and b = [1, 1, 0]. Compute all three by hand and show the arithmetic step by step. Cosine similarity = 0.5, L2 distance ≈ 1.414, dot product = 1. This concretises the formulas before the reader sees them in SQL queries.