You cannot select more than 25 topics Topics must start with a letter or number, can include dashes ('-') and can be up to 35 characters long.

2.6 KiB

+++ title = "§3 Vector Similarity" priority = 5 status = "done" ticket_type = "task" dependencies = [] +++

§3 Vector Similarity — Stub to fill

File: edu/src/vector-db.md, section ### 3. Vector Similarity

Replace this stub line with full content:

Once you have two vectors, how do you measure how alike they are? [...] 🚧 Full content tracked in [nbd:99e1d9].

This is a reading lesson with inline math — no Rust code. Target 400600 words. Bold lead phrases, inline math using Unicode (not LaTeX). Include a small worked example with concrete 3D numbers.

Learning objectives

  • Know the three main similarity/distance functions: cosine similarity, dot product, Euclidean distance
  • Understand the formula and geometric meaning of each
  • Know the relationship between cosine similarity and cosine distance (what vector_distance_cos actually returns)
  • Know when each metric is appropriate
  • Understand why normalised vectors simplify the choice

Content to write

Cosine similarity. Formula: cos(θ) = (a · b) / (‖a‖ · ‖b‖). Range 1 to 1 (1 = same direction, 0 = orthogonal, 1 = opposite). Measures the angle between vectors, ignoring magnitude. Ideal for text embeddings: a short and long document on the same topic produce vectors that differ in magnitude but not direction.

Cosine distance. 1 cosine_similarity. Range 0 to 2. This is what sqlite-vec's vector_distance_cos returns (0 = identical, 2 = fully opposite). Clarify the naming: the function name says "cos" but returns a distance, not a similarity — smaller is more similar.

Dot product. Formula: a · b = Σᵢ aᵢbᵢ. For unit-normalised vectors, dot product equals cosine similarity (since ‖a‖ = ‖b‖ = 1 cancels out). For unnormalised vectors, it conflates magnitude and angle. Some models are trained specifically for maximum inner product search (MIPS) — their documentation will say so.

Euclidean (L2) distance. Formula: ‖a b‖ = √(Σᵢ (aᵢ bᵢ)²). Range 0 to ∞. Sensitive to vector magnitude. Appropriate for low-dimensional geometric/tabular data where absolute coordinate values carry meaning.

When to use each. Text and sentence embeddings: cosine (or dot product if model outputs unit vectors, which many do). Follow the model card's recommendation when specified. Low-dimensional geometric features: L2.

Worked example. Use vectors a = [1, 0, 1] and b = [1, 1, 0]. Compute all three by hand and show the arithmetic step by step. Cosine similarity = 0.5, L2 distance ≈ 1.414, dot product = 1. This concretises the formulas before the reader sees them in SQL queries.