@ -78,7 +78,29 @@ This is not magic; it is the result of training a model to produce embeddings wh
### 3. Vector Similarity
Once you have two vectors, how do you measure how alike they are? This section covers the three most common similarity and distance functions used in vector search — their formulas, geometric interpretations, and trade-offs — then works through a concrete example so the arithmetic is familiar before you encounter these functions in SQL.
**Cosine similarity.** The cosine similarity of two vectors **a** and **b** is defined as cos(θ) = (a · b) / (‖a‖ · ‖b‖), where a · b is the dot product and ‖a‖ is the magnitude of **a**. The result ranges from −1 to 1: a value of 1 means the vectors point in exactly the same direction, 0 means they are orthogonal (perpendicular), and −1 means they point in exactly opposite directions. The critical property of cosine similarity is that it measures only the *angle* between vectors, ignoring their magnitudes entirely. This makes it ideal for text embeddings: a short document and a long document on the same topic may produce vectors that differ in magnitude but point in nearly the same direction, and cosine similarity correctly identifies them as similar.
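The formula above can be sketched in plain Python using only the standard library. The function name `cosine_similarity` and the sample vectors are illustrative, not part of any library discussed here. The zero-vector check is an added assumption: the formula divides by the magnitudes, so it is undefined when either vector has magnitude 0.

```python
import math

def cosine_similarity(a, b):
    """Return cos(theta) between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)

# Scaling a vector changes its magnitude but not its direction,
# so the similarity stays at 1 (up to floating-point rounding).
short_doc = [0.2, 0.1, 0.4]               # hypothetical embedding
long_doc = [x * 5 for x in short_doc]     # same direction, 5x the magnitude
print(cosine_similarity(short_doc, long_doc))
```

The two hypothetical vectors stand in for the short and long documents on the same topic: their magnitudes differ by a factor of 5, yet the result is 1 to within rounding error.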
**Cosine distance.** Cosine distance is simply 1 − cosine_similarity. Its range is 0 to 2, where 0 means the vectors point in exactly the same direction and 2 means they point in exactly opposite directions. This is what sqlite-vec's `vec_distance_cosine` function returns. Pay attention to the naming: the function name contains "cosine" but it returns a *distance*, not a similarity, so smaller values mean more similar vectors, not less. This is a common source of confusion when writing queries for the first time.
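To make the "smaller means more similar" convention concrete, here is a small sketch in plain Python. It is not the SQL function itself; `cosine_distance`, `query`, and `candidates` are hypothetical names chosen for illustration, and the function assumes non-zero vectors of equal length.

```python
import math

def cosine_distance(a, b):
    """Return 1 - cosine similarity; smaller means more similar."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

query = [1.0, 0.0]
candidates = {
    "same direction": [2.0, 0.0],   # distance 0.0
    "orthogonal": [0.0, 3.0],       # distance 1.0
    "opposite": [-1.0, 0.0],        # distance 2.0
}

# Sort ascending: the nearest (most similar) candidate comes first.
ranked = sorted(candidates, key=lambda name: cosine_distance(query, candidates[name]))
print(ranked)  # ['same direction', 'orthogonal', 'opposite']
```

Sorting in ascending order is the key point: with a distance, the best match is at the top of an ascending sort, whereas with a similarity it would be at the top of a descending one.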
**Dot product.** The dot product of two vectors **a** and **b** is a · b = Σᵢ aᵢbᵢ — multiply corresponding elements and sum the results. For **unit-normalised** vectors (vectors whose magnitude is exactly 1), the dot product equals cosine similarity, because the denominator ‖a‖ · ‖b‖ = 1 · 1 = 1 and cancels out. For unnormalised vectors, the dot product conflates magnitude and direction: a longer vector will produce a larger dot product even if the angle is the same. Some embedding models are trained specifically for maximum inner product search (MIPS), meaning their vectors are *not* unit-normalised and the raw dot product is the intended similarity metric. The model's documentation or model card will say so when this is the case.
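The effect of magnitude on the dot product can be seen in a short Python sketch. The helper names `dot` and `normalise` and the sample vectors are illustrative assumptions; `normalise` assumes its input is not the zero vector.

```python
import math

def dot(a, b):
    """Multiply corresponding elements and sum the results."""
    return sum(x * y for x, y in zip(a, b))

def normalise(v):
    """Scale v to magnitude 1 (assumes v is not the zero vector)."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]

a = [3.0, 4.0]   # magnitude 5
b = [6.0, 8.0]   # same direction as a, magnitude 10

print(dot(a, a))  # 25.0
print(dot(a, b))  # 50.0: same angle, larger magnitude, larger score

# After normalisation both comparisons give approximately 1.0,
# which is the cosine similarity of two parallel vectors.
a_hat, b_hat = normalise(a), normalise(b)
print(dot(a_hat, a_hat))
print(dot(a_hat, b_hat))
```

Before normalisation, **b** scores higher against **a** than **a** does against itself, purely because it is longer. After normalisation the two scores agree, which is why the dot product and cosine similarity are interchangeable for unit-normalised vectors.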
**Euclidean (L2) distance.** The Euclidean distance between two vectors is ‖a − b‖ = √(Σᵢ (aᵢ − bᵢ)²) — the straight-line distance between two points in *d*-dimensional space. Its range is 0 to ∞, with 0 meaning the vectors are identical. Unlike cosine similarity, L2 distance is sensitive to vector magnitude: two vectors pointing in the same direction but with different lengths will have a non-zero L2 distance. L2 is most appropriate for low-dimensional geometric or tabular data where absolute coordinate values carry meaning — for example, geographic coordinates or sensor readings.
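A minimal Python sketch of the L2 formula, showing its sensitivity to magnitude. The function name `l2_distance` and the sample vectors are illustrative assumptions.

```python
import math

def l2_distance(a, b):
    """Straight-line distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

a = [3.0, 4.0]
b = [6.0, 8.0]   # same direction as a, twice the length

print(l2_distance(a, a))  # 0.0: identical vectors
print(l2_distance(a, b))  # 5.0: same direction, yet a non-zero distance
```

The angle between **a** and **b** is zero, so their cosine distance would be 0, but L2 reports a distance of 5 because the points sit at different positions in space. On Python 3.8 or later, the standard library's `math.dist(a, b)` computes the same quantity.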
**When to use each.** For text and sentence embeddings, use cosine similarity (or equivalently, dot product if your model outputs unit-normalised vectors, which many do). When in doubt, follow the recommendation on the model card. For low-dimensional geometric features where absolute position matters, use L2 distance.
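If the model card does not say whether a model's output is unit-normalised, one way to check is to measure the magnitude of a few sample embeddings. The sketch below is a hypothetical helper, not part of any library; the tolerance `tol=1e-6` is an assumed value that allows for floating-point rounding in stored vectors.

```python
import math

def is_unit_normalised(v, tol=1e-6):
    """Return True if the magnitude of v is within tol of 1."""
    norm = math.sqrt(sum(x * x for x in v))
    return math.isclose(norm, 1.0, abs_tol=tol)

print(is_unit_normalised([0.6, 0.8]))  # True: magnitude is 1
print(is_unit_normalised([3.0, 4.0]))  # False: magnitude is 5
```

If every sampled embedding passes, dot product and cosine similarity should rank results the same way. If some fail, the two metrics can disagree, and the model card's recommendation decides which one to use.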
**Worked example.** Let **a** = [1, 0, 1] and **b** = [1, 1, 0]. Compute all three metrics by hand:
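*Dot product:* a · b = (1)(1) + (0)(1) + (1)(0) = 1 + 0 + 0 = 1.

*Cosine similarity:* ‖a‖ = √(1² + 0² + 1²) = √2 and ‖b‖ = √(1² + 1² + 0²) = √2, so cos(θ) = 1 / (√2 · √2) = 1/2 = 0.5.

*L2 distance:* a − b = [0, −1, 1], so ‖a − b‖ = √(0² + (−1)² + 1²) = √2 ≈ 1.414.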
These three numbers — dot product = 1, cosine similarity = 0.5, L2 distance ≈ 1.414 — describe different aspects of the relationship between **a** and **b**. In the exercises that follow, you will see these same computations expressed as SQL function calls over stored vectors.
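The three results can be checked with a few lines of standard-library Python. This is only a verification sketch; the variable names are illustrative.

```python
import math

a = [1, 0, 1]
b = [1, 1, 0]

dot = sum(x * y for x, y in zip(a, b))
norm_a = math.sqrt(sum(x * x for x in a))
norm_b = math.sqrt(sum(y * y for y in b))
cos_sim = dot / (norm_a * norm_b)
l2 = math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

print(dot)                # 1
print(round(cos_sim, 3))  # 0.5
print(round(l2, 3))       # 1.414
```

The rounding is there because √2 · √2 is not exactly 2 in floating-point arithmetic, so the unrounded cosine similarity comes out a hair under 0.5.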