+++
title = "§2 Embeddings"
priority = 5
status = "done"
ticket_type = "task"
dependencies = []
+++
§2 Embeddings — Stub to fill
File: edu/src/vector-db.md, section ### 2. Embeddings
Replace this stub line with full content:
Embeddings are the bridge between raw data and vector space. [...] 🚧 Full content tracked in [nbd:584e0c].
This is a reading lesson — no Rust code. Target 400–700 words. Follow the style of §1 in the same file: prose paragraphs with bold lead phrases.
Learning objectives
- Understand what an embedding is: a learned function E that maps each input x to a vector E(x) ∈ ℝᵈ
- Know why geometric proximity in embedding space corresponds to semantic similarity (training objective)
- Understand how contextual encoder models (BERT-style) produce sentence embeddings
- Know typical output dimensionalities (384, 768, 1536) and what influences them
- Understand that embedding axes are not individually interpretable — meaning lives in relative positions
Content to write
What an embedding is. A function learned from data that maps an input (word, sentence, image, product) to a fixed-size float vector. The function is trained so that similar inputs produce nearby vectors — semantically related sentences end up close in vector space; unrelated sentences end up far apart.
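For the writer's reference only (the lesson text itself stays code-free): "nearby" is usually measured with cosine similarity. The sketch below is a minimal illustration under stated assumptions. The function name `cosine_similarity` and the three 4-dimensional vectors are hypothetical and made up for this ticket; they are not the output of any real model.

```rust
/// Cosine similarity between two vectors of equal length.
/// Returns None when the lengths differ or either vector has zero norm.
fn cosine_similarity(a: &[f32], b: &[f32]) -> Option<f32> {
    if a.len() != b.len() {
        return None;
    }
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let norm_a: f32 = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let norm_b: f32 = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    if norm_a == 0.0 || norm_b == 0.0 {
        return None;
    }
    Some(dot / (norm_a * norm_b))
}

fn main() {
    // Hypothetical toy embeddings, invented for illustration only.
    let cat = [0.9_f32, 0.1, 0.0, 0.2];
    let kitten = [0.8_f32, 0.2, 0.1, 0.3];
    let invoice = [0.0_f32, 0.9, 0.8, 0.1];

    // Related inputs should score higher than unrelated ones.
    println!("cat vs kitten:  {:?}", cosine_similarity(&cat, &kitten));
    println!("cat vs invoice: {:?}", cosine_similarity(&cat, &invoice));
}
```

With these toy values the first pair scores close to 1 and the second close to 0, which is the behaviour the training objective is meant to produce for semantically related and unrelated inputs.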
Word embeddings (brief history). Word2Vec (2013) showed that word meaning could be encoded as static vectors on which arithmetic worked (king − man + woman ≈ queen). Static embeddings assign one vector per word regardless of context, so "bank" gets the same vector in "river bank" and "bank account".
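Contextual embeddings from encoder models. Modern embedding models (for example, those shipped with the sentence-transformers library, or OpenAI text-embedding-3-small) produce one vector for the entire input. In open encoder models this is typically done by mean-pooling the per-token vectors or by taking the vector of the [CLS] token. The reader does not need to understand transformer internals, just that the input is a string and the output is a Vec<f32> of fixed length. Unlike static word embeddings, the same word in different contexts contributes different vectors, because each token's vector depends on the surrounding words.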
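As background for the writer, mean-pooling can be sketched in a few lines. This is an illustrative sketch, not how any particular library implements it: `mean_pool` is a hypothetical name, the per-token vectors are assumed to have already been produced by an encoder, and real implementations usually also mask out padding tokens, which is omitted here.

```rust
/// Mean-pool per-token vectors into one fixed-length sentence vector.
/// Returns None for empty input or when the token vectors differ in length.
fn mean_pool(token_vectors: &[Vec<f32>]) -> Option<Vec<f32>> {
    let dim = token_vectors.first()?.len();
    let mut pooled = vec![0.0_f32; dim];

    // Sum each dimension across all tokens.
    for tv in token_vectors {
        if tv.len() != dim {
            return None;
        }
        for (acc, x) in pooled.iter_mut().zip(tv) {
            *acc += *x;
        }
    }

    // Divide by the token count to get the average.
    let n = token_vectors.len() as f32;
    for acc in pooled.iter_mut() {
        *acc /= n;
    }
    Some(pooled)
}

fn main() {
    // Two hypothetical 3-dimensional token vectors (made-up values).
    let tokens = vec![vec![1.0_f32, 2.0, 3.0], vec![3.0_f32, 4.0, 5.0]];
    println!("{:?}", mean_pool(&tokens));
}
```

Whatever the number of input tokens, the output length equals the per-token dimension, which is why the sentence vector has a fixed size.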
What makes a good embedding model. Training commonly uses contrastive learning: pull the vectors of similar pairs together and push those of dissimilar pairs apart. Models are commonly compared on MTEB (Massive Text Embedding Benchmark). Larger models generally produce better embeddings, at higher compute and latency cost.
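To make "pull together, push apart" concrete for the writer, here is a sketch of a triplet margin loss, one common contrastive-style objective. The ticket does not say which loss any given model uses, so treat this as an assumption chosen for illustration. The names `squared_distance` and `triplet_loss` and the margin value are hypothetical, and all vectors are assumed to have the same length.

```rust
/// Squared Euclidean distance; assumes both slices have the same length.
fn squared_distance(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| (x - y) * (x - y)).sum()
}

/// Triplet margin loss: zero once the positive is closer to the anchor
/// than the negative by at least `margin`, positive otherwise.
fn triplet_loss(anchor: &[f32], positive: &[f32], negative: &[f32], margin: f32) -> f32 {
    let d_pos = squared_distance(anchor, positive);
    let d_neg = squared_distance(anchor, negative);
    (d_pos - d_neg + margin).max(0.0)
}

fn main() {
    // Made-up 2-dimensional vectors for illustration.
    let anchor = [0.0_f32, 0.0];
    let near = [0.1_f32, 0.0];
    let far = [1.0_f32, 0.0];

    // Well-separated triplet: nothing left to learn, so the loss is zero.
    println!("good triplet: {}", triplet_loss(&anchor, &near, &far, 0.5));
    // Swapped roles: the "similar" item is far away, so the loss is positive.
    println!("bad triplet:  {}", triplet_loss(&anchor, &far, &near, 0.5));
}
```

Minimising a loss of this shape over many triplets is what moves similar inputs closer and dissimilar inputs further apart in the embedding space.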
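Practical dimensionalities. 384 (MiniLM, fast, ~130MB), 768 (BERT-base-sized models, common in sentence-transformers), 1536 (OpenAI text-embedding-3-small), 3072 (text-embedding-3-large). The dimensionality is set by the model's architecture (typically its hidden size). Larger is not always better: retrieval quality depends on the task and dataset, while storage and search cost grow with dimension.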
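One way to make the cost side of these dimensionalities tangible is to work out raw storage. The sketch below assumes uncompressed f32 vectors (4 bytes per component) and ignores index overhead and metadata. The function name `raw_vector_bytes` and the one-million-vector corpus size are hypothetical choices for illustration.

```rust
/// Bytes needed to store `num_vectors` uncompressed f32 vectors of length `dim`.
fn raw_vector_bytes(num_vectors: u64, dim: u64) -> u64 {
    num_vectors * dim * std::mem::size_of::<f32>() as u64
}

fn main() {
    // Hypothetical corpus size, chosen only to make the arithmetic concrete.
    let num_vectors = 1_000_000_u64;
    for dim in [384_u64, 768, 1536, 3072] {
        let bytes = raw_vector_bytes(num_vectors, dim);
        println!("{:>4} dims: {:.2} GB", dim, bytes as f64 / 1e9);
    }
}
```

Storage scales linearly with dimension, so moving from 384 to 3072 dimensions multiplies the raw footprint by eight for the same number of vectors.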
Embeddings for non-text data. Brief mention: CLIP embeds images and text into a shared space, so an image vector can be compared directly with a text vector (enabling text-to-image search). Product embeddings can be learned from purchase co-occurrence. The vector database stores float arrays regardless of modality.