+++
title = "§2 Embeddings"
priority = 5
status = "done"
ticket_type = "task"
dependencies = []
+++
## §2 Embeddings — Stub to fill

File: `edu/src/vector-db.md`, section `### 2. Embeddings`

Replace this stub line with full content:
> Embeddings are the bridge between raw data and vector space. [...] 🚧 Full content tracked in [nbd:584e0c].

This is a **reading lesson** — no Rust code. Target 400–700 words. Follow the style of §1 in the same file: prose paragraphs with bold lead phrases.

## Learning objectives

- Understand what an embedding is: a learned function E: X → ℝᵈ that maps each input x to a d-dimensional vector
- Know why geometric proximity in embedding space corresponds to semantic similarity (training objective)
- Understand how contextual encoder models (BERT-style) produce sentence embeddings
- Know typical output dimensionalities (384, 768, 1536) and what influences them
- Understand that embedding axes are not individually interpretable — meaning lives in relative positions

## Content to write

**What an embedding is.** A function learned from data that maps an input (word, sentence, image, product) to a fixed-size float vector. The function is trained so that similar inputs produce nearby vectors — semantically related sentences end up close in vector space; unrelated sentences end up far apart.
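
The lesson itself stays code-free, but a small sketch may help the writer sanity-check the "nearby vectors" claim. It assumes cosine similarity as the proximity measure (the ticket does not fix one) and uses hand-written 3-dimensional toy vectors in place of real model output; `cosine_similarity` and `toy_embeddings` are hypothetical names.

```rust
/// Cosine similarity between two equal-length vectors.
/// Returns None if the lengths differ or either vector has zero norm.
fn cosine_similarity(a: &[f32], b: &[f32]) -> Option<f32> {
    if a.len() != b.len() {
        return None;
    }
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let norm_a: f32 = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let norm_b: f32 = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    if norm_a == 0.0 || norm_b == 0.0 {
        return None;
    }
    Some(dot / (norm_a * norm_b))
}

/// Hand-written stand-ins for model output (illustrative values only,
/// not produced by any real embedding model).
fn toy_embeddings() -> [(&'static str, [f32; 3]); 3] {
    [
        ("a cat sat on the mat", [0.9, 0.1, 0.0]),
        ("a kitten rested on the rug", [0.8, 0.2, 0.1]),
        ("quarterly revenue rose", [0.0, 0.1, 0.9]),
    ]
}
```

With these toy values the two pet sentences score close to 1 and the finance sentence scores close to 0 against either of them, which is the geometric picture the paragraph should leave the reader with.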

**Word embeddings (brief history).** Word2Vec (2013) showed that word meaning could be encoded as static vectors on which vector arithmetic captures analogies (king − man + woman ≈ queen). Static embeddings assign one vector per word regardless of context, so "bank" gets the same vector in "river bank" and "bank account".

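**Contextual embeddings from encoder models.** Modern embedding models, such as those distributed through the sentence-transformers library or OpenAI's text-embedding-3-small, produce one vector for the entire input. Open encoder models typically get there by mean-pooling the per-token vectors or by taking the vector of the [CLS] token. The reader does not need to understand transformer internals, just that the input is a string and the output is a `Vec<f32>` of fixed length. Because each token's representation depends on its neighbours, the same word in different contexts contributes different vectors, unlike the static embeddings above.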
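
For the writer's reference, mean-pooling can be sketched in a few lines. This is a simplified illustration, not any library's implementation: it assumes the encoder has already produced one vector per token, and it averages every row, whereas real pipelines usually also skip padding tokens via an attention mask. `mean_pool` is a hypothetical helper.

```rust
/// Average per-token vectors into one fixed-length sentence vector.
/// Returns None if there are no tokens or the rows differ in length.
fn mean_pool(token_vectors: &[Vec<f32>]) -> Option<Vec<f32>> {
    let first = token_vectors.first()?;
    let dim = first.len();
    let mut sum = vec![0.0f32; dim];
    for row in token_vectors {
        if row.len() != dim {
            return None;
        }
        // Accumulate this token's contribution component by component.
        for (acc, value) in sum.iter_mut().zip(row) {
            *acc += *value;
        }
    }
    let count = token_vectors.len() as f32;
    Some(sum.into_iter().map(|s| s / count).collect())
}
```

The output length equals the per-token dimension no matter how many tokens went in, which is why a sentence of any length yields a `Vec<f32>` of the same size.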

**What makes a good embedding model.** Training typically uses contrastive learning: pull similar pairs together, push dissimilar pairs apart. Models are evaluated on MTEB (Massive Text Embedding Benchmark). Larger models generally score higher, at the price of more latency, memory, and compute.
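
The "pull together, push apart" idea can be made concrete with a triplet margin loss, one simple member of the contrastive family. This is an illustrative choice only: the ticket does not say which loss any particular model uses, and production models often train with batch-based variants instead. The function names and the squared Euclidean distance are assumptions of this sketch.

```rust
/// Squared Euclidean distance between two equal-length slices.
fn squared_distance(a: &[f32], b: &[f32]) -> f32 {
    assert_eq!(a.len(), b.len(), "vectors must have the same length");
    a.iter().zip(b).map(|(x, y)| (x - y) * (x - y)).sum()
}

/// Triplet margin loss: zero once the negative is at least `margin`
/// farther from the anchor than the positive is, positive otherwise.
fn triplet_loss(anchor: &[f32], positive: &[f32], negative: &[f32], margin: f32) -> f32 {
    let pos = squared_distance(anchor, positive);
    let neg = squared_distance(anchor, negative);
    (pos - neg + margin).max(0.0)
}
```

A well-placed triplet contributes no loss, while a triplet whose negative sits closer than its positive produces a loss that training would reduce by moving the vectors.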

**Practical dimensionalities.** 384 (MiniLM, fast, ~130MB), 768 (BERT-base and BERT-sized sentence-transformers models such as all-mpnet-base-v2), 1536 (OpenAI text-embedding-3-small), 3072 (text-embedding-3-large). Larger is not always better: the right size depends on the task, the dataset, and the storage and latency budget.
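
One cost that scales directly with dimensionality is raw storage. As a back-of-the-envelope sketch, assuming each component is stored as an `f32` (4 bytes) and ignoring index structures and metadata, the footprint is count × dims × 4 bytes; `raw_vector_bytes` is a hypothetical helper.

```rust
/// Bytes needed to store `count` vectors of `dims` f32 components each,
/// ignoring index structures, metadata, and any compression.
fn raw_vector_bytes(count: u64, dims: u64) -> u64 {
    count * dims * std::mem::size_of::<f32>() as u64
}
```

For one million vectors this works out to about 1.5 GB at 384 dimensions, 6.1 GB at 1536, and 12.3 GB at 3072, which is one reason a smaller model can be the better fit.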

**Embeddings for non-text data.** Brief mention: CLIP produces image embeddings comparable to text embeddings in the same space (enabling text-to-image search). Product embeddings can be learned from purchase co-occurrence. The vector database stores float arrays regardless of modality.