diff --git a/edu/src/vector-db.md b/edu/src/vector-db.md
index 130f6a0..4ce7607 100644
--- a/edu/src/vector-db.md
+++ b/edu/src/vector-db.md
@@ -78,7 +78,29 @@ This is not magic; it is the result of training a model to produce embeddings wh
 
 ### 3. Vector Similarity
 
-Once you have two vectors, how do you measure how alike they are? This section covers the three most common similarity functions used in vector search: **cosine similarity**, **dot product**, and **Euclidean distance** — their formulas, geometric interpretations, when each is appropriate, and the trade-offs in choosing between them. 🚧 Full content tracked in [nbd:99e1d9].
+Once you have two vectors, how do you measure how alike they are? This section covers the three most common similarity and distance functions used in vector search — their formulas, geometric interpretations, and trade-offs — then works through a concrete example so the arithmetic is familiar before you encounter these functions in SQL.
+
+**Cosine similarity.** The cosine similarity of two vectors **a** and **b** is defined as cos(θ) = (a · b) / (‖a‖ · ‖b‖), where a · b is the dot product and ‖a‖ is the magnitude of **a**. The result ranges from −1 to 1: a value of 1 means the vectors point in exactly the same direction, 0 means they are orthogonal (perpendicular), and −1 means they point in exactly opposite directions. The critical property of cosine similarity is that it measures only the *angle* between vectors, ignoring their magnitudes entirely. This makes it ideal for text embeddings: a short document and a long document on the same topic may produce vectors that differ in magnitude but point in nearly the same direction, and cosine similarity correctly identifies them as similar.
+
+**Cosine distance.** Cosine distance is simply 1 − cosine_similarity. Its range is 0 to 2, where 0 means the vectors are identical in direction and 2 means they are fully opposite. This is what sqlite-vec's `vector_distance_cos` function returns. Pay attention to the naming: the function name contains "cos" but it returns a *distance*, not a similarity — smaller values mean more similar vectors, not less. This is a common source of confusion when writing queries for the first time.
+
+**Dot product.** The dot product of two vectors **a** and **b** is a · b = Σᵢ aᵢbᵢ — multiply corresponding elements and sum the results. For **unit-normalised** vectors (vectors whose magnitude is exactly 1), the dot product equals cosine similarity, because the denominator ‖a‖ · ‖b‖ = 1 · 1 = 1 and cancels out. For unnormalised vectors, the dot product conflates magnitude and direction: a longer vector will produce a larger dot product even if the angle is the same. Some embedding models are trained specifically for maximum inner product search (MIPS), meaning their vectors are *not* unit-normalised and the raw dot product is the intended similarity metric. The model's documentation or model card will say so when this is the case.
+
+**Euclidean (L2) distance.** The Euclidean distance between two vectors is ‖a − b‖ = √(Σᵢ (aᵢ − bᵢ)²) — the straight-line distance between two points in *d*-dimensional space. Its range is 0 to ∞, with 0 meaning the vectors are identical. Unlike cosine similarity, L2 distance is sensitive to vector magnitude: two vectors pointing in the same direction but with different lengths will have a non-zero L2 distance. L2 is most appropriate for low-dimensional geometric or tabular data where absolute coordinate values carry meaning — for example, geographic coordinates or sensor readings.
+
+**When to use each.** For text and sentence embeddings, use cosine similarity (or equivalently, dot product if your model outputs unit-normalised vectors, which many do). When in doubt, follow the recommendation on the model card. For low-dimensional geometric features where absolute position matters, use L2 distance.
+
+**Worked example.** Let **a** = [1, 0, 1] and **b** = [1, 1, 0]. Compute all three metrics by hand:
+
+*Dot product:* a · b = (1)(1) + (0)(1) + (1)(0) = 1 + 0 + 0 = 1.
+
+*Magnitudes:* ‖a‖ = √(1² + 0² + 1²) = √2 ≈ 1.414. ‖b‖ = √(1² + 1² + 0²) = √2 ≈ 1.414.
+
+*Cosine similarity:* cos(θ) = 1 / (√2 · √2) = 1 / 2 = 0.5. The cosine distance is 1 − 0.5 = 0.5, which is what `vector_distance_cos` would return.
+
+*Euclidean distance:* ‖a − b‖ = √((1−1)² + (0−1)² + (1−0)²) = √(0 + 1 + 1) = √2 ≈ 1.414.
+
+These three numbers — dot product = 1, cosine similarity = 0.5, L2 distance ≈ 1.414 — describe different aspects of the relationship between **a** and **b**. In the exercises that follow, you will see these same computations expressed as SQL function calls over stored vectors.
 
 ---
 
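
The hand arithmetic in the worked example can be cross-checked with a few lines of code. The following is a minimal Python sketch using only the standard library. The helper names (`dot`, `norm`, `cosine_similarity`, `cosine_distance`, `l2_distance`) are hypothetical, chosen for illustration. They are not the SQL functions the text refers to, and results are rounded because floating-point division does not land on exactly 0.5.

```python
import math


def dot(a, b):
    """Dot product: multiply corresponding elements and sum the results."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return sum(x * y for x, y in zip(a, b))


def norm(a):
    """Magnitude of a vector, i.e. the square root of its dot product with itself."""
    return math.sqrt(dot(a, a))


def cosine_similarity(a, b):
    """cos(theta) = (a . b) / (|a| * |b|); undefined if either vector is all zeros."""
    denom = norm(a) * norm(b)
    if denom == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot(a, b) / denom


def cosine_distance(a, b):
    """1 - cosine similarity: 0 for identical direction, 2 for opposite direction."""
    return 1.0 - cosine_similarity(a, b)


def l2_distance(a, b):
    """Straight-line (Euclidean) distance between two points."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# The vectors from the worked example.
a = [1.0, 0.0, 1.0]
b = [1.0, 1.0, 0.0]

print(dot(a, b))                           # 1.0
print(round(cosine_similarity(a, b), 3))   # 0.5
print(round(cosine_distance(a, b), 3))     # 0.5
print(round(l2_distance(a, b), 3))         # 1.414
```

Scaling **a** by any positive factor leaves `cosine_similarity(a, b)` unchanged (up to floating-point rounding) but changes both `dot(a, b)` and `l2_distance(a, b)`, which is the magnitude sensitivity described in the dot product and L2 paragraphs.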