+++
title = "§2 Embeddings"
priority = 5
status = "done"
ticket_type = "task"
dependencies = []
+++

## §2 Embeddings: Stub to fill

File: `edu/src/vector-db.md`, section `### 2. Embeddings`

Replace this stub line with full content:

> Embeddings are the bridge between raw data and vector space. [...] 🚧 Full content tracked in [nbd:584e0c].

This is a **reading lesson**: no Rust code. Target 400–700 words. Follow the style of §1 in the same file: prose paragraphs with bold lead phrases.

## Learning objectives

- Understand what an embedding is: a learned function E(x) → ℝᵈ mapping inputs to vectors
- Know why geometric proximity in embedding space corresponds to semantic similarity (training objective)
- Understand how contextual encoder models (BERT-style) produce sentence embeddings
- Know typical output dimensionalities (384, 768, 1536) and what influences them
- Understand that embedding axes are not individually interpretable; meaning lives in relative positions

## Content to write

**What an embedding is.** A function learned from data that maps an input (word, sentence, image, product) to a fixed-size float vector. The function is trained so that similar inputs produce nearby vectors: semantically related sentences end up close in vector space, and unrelated sentences end up far apart.

**Word embeddings (brief history).** Word2Vec (2013) showed that word meaning could be encoded as static vectors where arithmetic worked (king − man + woman ≈ queen). These assign one vector per word regardless of context.

**Contextual embeddings from encoder models.** Modern models (sentence-transformers, OpenAI text-embedding-3-small) produce one vector for the entire input sentence via mean-pooling or a [CLS] token. The reader does not need to understand transformer internals, only that the input is a string and the output is a `Vec` of fixed length. The same word in different contexts produces different vectors.

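
For whoever drafts the section, the mean-pooling step mentioned above can be sketched as follows. This is reference material only, since the lesson itself stays code-free. The sketch assumes an encoder has already produced one vector per token; the function name `mean_pool` and the choice of `f32` are illustrative assumptions and do not come from any particular library.

```rust
/// Average per-token vectors into one fixed-length sentence vector.
/// Hypothetical helper for illustration; real encoders usually also
/// apply an attention mask so padding tokens are ignored.
/// Returns `None` if there are no tokens or the token vectors differ in length.
fn mean_pool(token_vectors: &[Vec<f32>]) -> Option<Vec<f32>> {
    let dim = token_vectors.first()?.len();
    if token_vectors.iter().any(|v| v.len() != dim) {
        return None;
    }

    // Sum each dimension across all tokens.
    let mut pooled = vec![0.0f32; dim];
    for v in token_vectors {
        for (acc, x) in pooled.iter_mut().zip(v) {
            *acc += *x;
        }
    }

    // Divide by the token count to get the mean.
    let n = token_vectors.len() as f32;
    for acc in pooled.iter_mut() {
        *acc /= n;
    }
    Some(pooled)
}
```

Whatever the number of tokens, the output has the same length as a single token vector, which is why the sentence embedding has a fixed size.
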
**What makes a good embedding model.** Training uses contrastive learning: pull similar pairs together, push dissimilar pairs apart. Models are evaluated on MTEB (Massive Text Embedding Benchmark). Larger models generally produce better embeddings at higher cost.

**Practical dimensionalities.** 384 (MiniLM, fast, ~130MB), 768 (BERT-base, sentence-transformers default), 1536 (OpenAI text-embedding-3-small), 3072 (text-embedding-3-large). Larger is not always better; it depends on the task and dataset.

**Embeddings for non-text data.** Brief mention: CLIP produces image embeddings comparable to text embeddings in the same space (enabling text-to-image search). Product embeddings can be learned from purchase co-occurrence. The vector database stores float arrays regardless of modality.
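
Because every modality ends up as a plain float array, "nearby" can be measured the same way for text, images, or products. The ticket does not name a distance metric, so the sketch below assumes cosine similarity, one common choice. The function name `cosine_similarity` is hypothetical, and the code is reference material for the author, not lesson content.

```rust
/// Cosine similarity between two embeddings of equal length.
/// Hypothetical helper for illustration.
/// Returns `None` for empty, mismatched, or zero-length-norm inputs.
fn cosine_similarity(a: &[f32], b: &[f32]) -> Option<f32> {
    if a.is_empty() || a.len() != b.len() {
        return None;
    }

    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let norm_a = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let norm_b = b.iter().map(|x| x * x).sum::<f32>().sqrt();

    // A zero vector has no direction, so the similarity is undefined.
    if norm_a == 0.0 || norm_b == 0.0 {
        return None;
    }
    Some(dot / (norm_a * norm_b))
}
```

Values near 1 mean the two vectors point in almost the same direction, values near 0 mean they are roughly orthogonal, and negative values mean they point in opposing directions.
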