+++
title = "§2 Embeddings"
priority = 5
status = "done"
ticket_type = "task"
dependencies = []
+++

## §2 Embeddings — Stub to fill


File: `edu/src/vector-db.md`, section `### 2. Embeddings`

Replace this stub line with full content:

> Embeddings are the bridge between raw data and vector space. [...] 🚧 Full content tracked in [nbd:584e0c].

This is a **reading lesson** with no Rust code. Target 400–700 words. Follow the style of §1 in the same file: prose paragraphs with bold lead phrases.

## Learning objectives

- Understand what an embedding is: a learned function E that maps an input x to a vector E(x) ∈ ℝᵈ
- Know why geometric proximity in embedding space corresponds to semantic similarity (the training objective rewards it)
- Understand how contextual encoder models (BERT-style) produce sentence embeddings
- Know typical output dimensionalities (384, 768, 1536) and what influences them
- Understand that embedding axes are not individually interpretable; meaning lives in relative positions


## Content to write

**What an embedding is.** A function learned from data that maps an input (word, sentence, image, product) to a fixed-size vector of floats. The function is trained so that similar inputs produce nearby vectors: semantically related sentences end up close together in vector space, and unrelated sentences end up far apart.
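Drafting aid only, since the lesson itself stays code-free: "nearby" is usually measured with cosine similarity (or Euclidean distance). The sketch below shows one way to compute it. `cosine_similarity` is a hypothetical helper written for this ticket, not an API from any library named here.

```rust
/// Cosine similarity between two equal-length vectors.
/// Returns None when the lengths differ, the input is empty,
/// or either vector has zero norm (the angle is undefined).
fn cosine_similarity(a: &[f32], b: &[f32]) -> Option<f32> {
    if a.len() != b.len() || a.is_empty() {
        return None;
    }
    // Dot product and the two Euclidean norms.
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let norm_a: f32 = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let norm_b: f32 = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    if norm_a == 0.0 || norm_b == 0.0 {
        return None;
    }
    Some(dot / (norm_a * norm_b))
}
```

With made-up toy vectors (not real model output) such as `[0.9, 0.1, 0.0]` for "cat", `[0.8, 0.2, 0.1]` for "kitten", and `[0.0, 0.1, 0.9]` for "invoice", the cat/kitten pair scores about 0.98 and the cat/invoice pair about 0.01.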

**Word embeddings (brief history).** Word2Vec (2013) showed that word meaning could be encoded as static vectors on which arithmetic roughly works (king − man + woman ≈ queen). Static embeddings assign one vector per word regardless of context, so "bank" gets the same vector in "river bank" and "bank account".

**Contextual embeddings from encoder models.** Modern models (sentence-transformers, OpenAI text-embedding-3-small) produce one vector for the entire input. BERT-style encoders compute a vector per token and then reduce those to a single vector, typically by mean-pooling or by taking the [CLS] token's output. The reader does not need to understand transformer internals, only that the input is a string and the output is a `Vec<f32>` of fixed length. Because each token's vector depends on its neighbors, the same word in different contexts produces different vectors.
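Drafting aid only (the lesson stays code-free): a minimal sketch of mean-pooling, assuming the encoder has already produced one vector per token. `mean_pool` is a hypothetical helper. Real pipelines usually also skip padding tokens via an attention mask, which this sketch leaves out.

```rust
/// Mean-pools per-token vectors into one fixed-length sentence vector.
/// Returns None for empty input, zero-length rows, or rows of unequal length.
fn mean_pool(token_vectors: &[Vec<f32>]) -> Option<Vec<f32>> {
    let dim = token_vectors.first()?.len();
    if dim == 0 || token_vectors.iter().any(|v| v.len() != dim) {
        return None;
    }
    // Sum the token vectors component by component.
    let mut pooled = vec![0.0_f32; dim];
    for v in token_vectors {
        for (acc, x) in pooled.iter_mut().zip(v) {
            *acc += *x;
        }
    }
    // Divide by the token count to get the mean.
    let n = token_vectors.len() as f32;
    for acc in pooled.iter_mut() {
        *acc /= n;
    }
    Some(pooled)
}
```

The output length equals the per-token dimension no matter how many tokens went in, which is why a sentence of any length maps to a `Vec<f32>` of the same fixed size.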

**What makes a good embedding model.** Training typically uses contrastive learning: pull similar pairs together, push dissimilar pairs apart. Models are commonly compared on MTEB (Massive Text Embedding Benchmark). Larger models generally produce better embeddings, at higher compute and latency cost.
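Drafting aid only: one common way to write the "pull together, push apart" objective is an InfoNCE-style loss. This ticket does not name a specific loss, so the form and all symbols below are assumptions: $q$ is the anchor embedding, $p$ its positive (similar) partner, $n_1, \dots, n_N$ are negatives, $\mathrm{sim}$ is cosine similarity, and $\tau > 0$ is a temperature.

$$
\mathcal{L}(q) = -\log \frac{\exp\big(\mathrm{sim}(q, p)/\tau\big)}{\exp\big(\mathrm{sim}(q, p)/\tau\big) + \sum_{j=1}^{N} \exp\big(\mathrm{sim}(q, n_j)/\tau\big)}
$$

The loss shrinks as $\mathrm{sim}(q, p)$ grows relative to every $\mathrm{sim}(q, n_j)$, which is what makes geometric proximity track semantic similarity.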
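
**Practical dimensionalities.** 384 (MiniLM, fast, ~130MB), 768 (BERT-base-sized models, common in sentence-transformers), 1536 (OpenAI text-embedding-3-small), 3072 (text-embedding-3-large). Dimensionality is set by the model architecture (the encoder's hidden size or a final projection layer). Larger is not always better: the right choice depends on the task and dataset, and storage and search cost grow with dimension.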
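Drafting aid only (the lesson stays code-free): a back-of-the-envelope storage estimate, assuming vectors are stored as raw `f32` values and ignoring index and metadata overhead. `raw_vector_bytes` is a hypothetical helper.

```rust
/// Bytes needed to store `count` vectors of `dim` f32 components each,
/// ignoring any index or metadata overhead.
fn raw_vector_bytes(count: u64, dim: u64) -> u64 {
    count * dim * std::mem::size_of::<f32>() as u64
}
```

For one million vectors this gives roughly 1.5 GB at 384 dimensions, 3.1 GB at 768, 6.1 GB at 1536, and 12.3 GB at 3072.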

**Embeddings for non-text data.** Brief mention: CLIP embeds images and text into the same vector space, so an image vector can be compared directly with a text vector (enabling text-to-image search). Product embeddings can be learned from purchase co-occurrence. The vector database stores float arrays regardless of modality.