---
# edu-twtl
title: §3 Vector Similarity
status: completed
type: task
priority: normal
created_at: 2026-03-10T23:30:01Z
updated_at: 2026-03-10T23:30:01Z
---
## §3 Vector Similarity — Stub to fill
File: `edu/src/vector-db.md`, section `### 3. Vector Similarity`
Replace this stub line with full content:
> Once you have two vectors, how do you measure how alike they are? [...] 🚧 Full content tracked in [nbd:99e1d9].
This is a **reading lesson with inline math**, with no Rust code. Target 400–600 words. Use bold lead phrases and inline math in Unicode (not LaTeX). Include a small worked example with concrete 3D numbers.
## Learning objectives
- Know the three main similarity/distance functions: cosine similarity, dot product, Euclidean distance
- Understand the formula and geometric meaning of each
- Know the relationship between cosine similarity and cosine distance (what `vector_distance_cos` actually returns)
- Know when each metric is appropriate
- Understand why normalised vectors simplify the choice
## Content to write
**Cosine similarity.** Formula: cos(θ) = (a · b) / (‖a‖ · ‖b‖). Range −1 to 1 (1 = same direction, 0 = orthogonal, −1 = opposite). Measures the angle between vectors, ignoring magnitude. Ideal for text embeddings: a short document and a long document on the same topic produce vectors that differ in magnitude but not direction.
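A quick check of the range and of the "ignores magnitude" claim may help whoever drafts this paragraph. The lesson itself stays code-free. This is a minimal sketch in plain Python, and `cosine_similarity` and the sample vectors are hypothetical names chosen for illustration, not part of any library.

```python
import math

def cosine_similarity(a, b):
    """cos(θ) = (a · b) / (‖a‖ · ‖b‖). Hypothetical helper for checking the prose."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

short_doc = [1.0, 2.0, 3.0]
long_doc = [10.0, 20.0, 30.0]    # same direction, ten times the magnitude
opposite = [-1.0, -2.0, -3.0]    # same line, opposite direction

same_direction = cosine_similarity(short_doc, long_doc)        # close to 1
orthogonal = cosine_similarity([1.0, 0.0, 0.0], [0.0, 1.0, 0.0])  # 0
opposite_direction = cosine_similarity(short_doc, opposite)    # close to -1

print(same_direction, orthogonal, opposite_direction)
```

Scaling `short_doc` by 10 leaves the result at roughly 1, which is the point the paragraph should make about short and long documents.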
**Cosine distance.** Defined as 1 − cosine_similarity. Range 0 to 2. This is what sqlite-vec's `vector_distance_cos` returns (0 = same direction, 2 = fully opposite). Clarify the naming: the function name says "cos" but it returns a *distance*, not a similarity, so smaller is more similar.
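The "smaller is more similar" point is easy to get backwards, so a small ranking check may be useful while drafting. The sketch below does not call the database function. `cosine_distance` is a hypothetical helper that implements the definition of one minus cosine similarity, and the candidate vectors are invented for illustration.

```python
import math

def cosine_distance(a, b):
    """1 minus cosine similarity. Hypothetical helper, not the SQL function."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

query = [1.0, 0.0]
candidates = {
    "opposite": [-1.0, 0.0],
    "same": [2.0, 0.0],
    "orthogonal": [0.0, 3.0],
}

distances = {name: cosine_distance(query, vec) for name, vec in candidates.items()}

# Nearest neighbours come first when sorting by distance in ASCENDING order.
ranked = sorted(distances, key=distances.get)
print(ranked, distances)
```

Sorting ascending puts the same-direction vector first and the opposite one last, which mirrors how a query ordered by a distance column would behave.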
**Dot product.** Formula: a · b = Σᵢ aᵢbᵢ. For unit-normalised vectors, dot product equals cosine similarity (since ‖a‖ = ‖b‖ = 1, the denominator of the cosine formula is 1). For unnormalised vectors, it conflates magnitude and angle. Some models are trained specifically for maximum inner product search (MIPS); their documentation will say so.
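The equivalence for unit vectors can be confirmed numerically before it goes into the lesson. This is a minimal sketch with hypothetical helpers (`dot`, `normalise`, `cosine_similarity`) and made-up vectors.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def normalise(v):
    """Scale v to unit length (‖v‖ = 1)."""
    length = math.sqrt(dot(v, v))
    return [x / length for x in v]

def cosine_similarity(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

a = [3.0, 4.0, 0.0]
b = [4.0, 3.0, 0.0]

raw_dot = dot(a, b)                         # mixes magnitude and angle
unit_dot = dot(normalise(a), normalise(b))  # angle only
cos_sim = cosine_similarity(a, b)

print(raw_dot, unit_dot, cos_sim)
```

The raw dot product is 24, while the dot product of the normalised vectors matches the cosine similarity of the originals (about 0.96).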
**Euclidean (L2) distance.** Formula: ‖a − b‖ = √(Σᵢ (aᵢ − bᵢ)²). Range 0 to ∞. Sensitive to vector magnitude. Appropriate for low-dimensional geometric/tabular data where absolute coordinate values carry meaning.
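To illustrate the magnitude sensitivity, the sketch below measures the L2 distance between a vector and a scaled copy of itself. `euclidean_distance` is a hypothetical helper and the vectors are invented for the check.

```python
import math

def euclidean_distance(a, b):
    """‖a − b‖ = sqrt(sum of squared coordinate differences)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

v = [1.0, 2.0, 3.0]
scaled = [10.0, 20.0, 30.0]   # same direction as v, ten times longer

to_self = euclidean_distance(v, v)         # 0.0
to_scaled = euclidean_distance(v, scaled)  # sqrt(81 + 324 + 729)

print(to_self, to_scaled)
```

The two vectors point the same way, yet their L2 distance is roughly 33.7, because the metric compares coordinates rather than angles.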
**When to use each.** Text and sentence embeddings: cosine (or dot product if the model outputs unit vectors, which many do). Follow the model card's recommendation when specified. Low-dimensional geometric features: L2.
**Worked example.** Use vectors a = [1, 0, 1] and b = [1, 1, 0]. Compute all three by hand and show the arithmetic step by step. Cosine similarity = 0.5, L2 distance ≈ 1.414, dot product = 1. This concretises the formulas before the reader sees them in SQL queries.
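Before writing out the hand arithmetic, the target numbers can be double-checked with a short script. This is a sketch in plain Python, used only to verify the figures in this ticket. The variable names are illustrative.

```python
import math

a = [1, 0, 1]
b = [1, 1, 0]

# Dot product: 1*1 + 0*1 + 1*0
dot = sum(x * y for x, y in zip(a, b))

# Norms: each is sqrt(1 + 0 + 1) = sqrt(2)
norm_a = math.sqrt(sum(x * x for x in a))
norm_b = math.sqrt(sum(y * y for y in b))

cos_sim = dot / (norm_a * norm_b)   # 1 / (sqrt(2) * sqrt(2)) = 1/2
cos_dist = 1 - cos_sim              # distance form: 1 minus similarity

# a - b = [0, -1, 1], so the squared differences sum to 2
l2 = math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

print(f"dot product       = {dot}")
print(f"cosine similarity = {cos_sim:.3f}")
print(f"cosine distance   = {cos_dist:.3f}")
print(f"L2 distance       = {l2:.3f}")
```

The printed values agree with the figures above (dot product 1, cosine similarity 0.500, L2 distance 1.414). The cosine distance also comes out at 0.500, which may be worth showing alongside the similarity in the lesson.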