docs(edu): write §2 embeddings for vector-db course [584e0c]

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
main
Elijah Voigt 3 months ago
parent b6017c1e58
commit 3a1860b5b1

@@ -62,7 +62,17 @@ This is not magic; it is the result of training a model to produce embeddings whi
### 2. Embeddings
Embeddings are the bridge between raw data and vector space. This section covers how language models, image encoders, and other neural networks learn to map heterogeneous inputs (words, sentences, images, products) into vectors where geometric proximity captures semantic similarity. 🚧 Full content tracked in [nbd:584e0c].

**What an embedding is.** An embedding is a function, learned from data rather than hand-crafted, that maps an input to a fixed-size vector of floating-point numbers. The input can be a word, a sentence, an image, a product listing, or anything else that can be fed into a neural network. The output is always the same shape: a `Vec<f32>` of some predetermined length *d*. The function is trained so that inputs with similar meaning produce vectors that are close together in the *d*-dimensional space, while unrelated inputs produce vectors that are far apart. Once you have such a function, comparing the meaning of two inputs reduces to comparing their vectors, which is exactly the geometric problem that vector databases are built to solve.
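
To make "comparing their vectors" concrete, here is a minimal sketch of one common comparison, cosine similarity. The section does not commit to a particular metric, so the choice of cosine and the function name `cosine_similarity` are assumptions made for illustration.

```rust
/// Cosine similarity between two embeddings.
/// Returns None when the inputs are empty, differ in length,
/// or either vector has zero norm (the angle is undefined).
fn cosine_similarity(a: &[f32], b: &[f32]) -> Option<f32> {
    if a.is_empty() || a.len() != b.len() {
        return None;
    }
    // Dot product of the two vectors.
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    // Euclidean norms of each vector.
    let norm_a: f32 = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let norm_b: f32 = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    if norm_a == 0.0 || norm_b == 0.0 {
        return None;
    }
    Some(dot / (norm_a * norm_b))
}
```

Two vectors pointing in the same direction score 1.0 and orthogonal vectors score 0.0, regardless of their lengths.
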
**Word embeddings and a brief history.** The idea that meaning can live in a vector took hold in 2013 when Mikolov et al. published Word2Vec. Word2Vec trains on raw text and assigns every word in its vocabulary a single static vector, typically of 100 to 300 dimensions. The striking result was that vector arithmetic captured semantic relationships: the vector for *king* minus the vector for *man* plus the vector for *woman* produced a vector closest to *queen*. GloVe (2014) and fastText (2016) refined the approach, but the core limitation remained — each word gets exactly one vector regardless of context. The word *bank* has the same embedding whether it refers to a riverbank or a financial institution. Static word embeddings are largely a historical curiosity today, but they established the foundational principle: meaning can be encoded as geometry.
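
The *king* − *man* + *woman* arithmetic can be sketched with toy vectors. The helper names `analogy` and `nearest_word` are hypothetical, and the sketch assumes a tiny hand-built vocabulary rather than real Word2Vec output. It uses squared Euclidean distance for simplicity and skips the query words themselves when searching.

```rust
/// Computes a - b + c element-wise.
/// Assumes the three slices have the same length.
fn analogy(a: &[f32], b: &[f32], c: &[f32]) -> Vec<f32> {
    a.iter()
        .zip(b)
        .zip(c)
        .map(|((x, y), z)| x - y + z)
        .collect()
}

/// Returns the vocabulary word whose vector is nearest to `target`
/// by squared Euclidean distance, ignoring any word listed in `exclude`.
fn nearest_word<'a>(
    vocab: &[(&'a str, Vec<f32>)],
    target: &[f32],
    exclude: &[&str],
) -> Option<&'a str> {
    vocab
        .iter()
        .filter(|entry| !exclude.iter().any(|e| *e == entry.0))
        .map(|(word, vector)| {
            let dist: f32 = vector
                .iter()
                .zip(target)
                .map(|(x, y)| (x - y) * (x - y))
                .sum();
            (*word, dist)
        })
        .min_by(|l, r| l.1.total_cmp(&r.1))
        .map(|(word, _)| word)
}
```

With a hand-picked three-dimensional vocabulary whose axes stand for "royalty", "male", and "female", the result of `analogy(king, man, woman)` lands exactly on the toy *queen* vector. Real embeddings only approximate this.
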
**Contextual embeddings from encoder models.** Modern embedding models solve the polysemy problem by reading the entire input before producing a vector. Models such as those in the sentence-transformers library or OpenAI's text-embedding-3-small take a full sentence (or paragraph) as input, process it through a transformer encoder, and output one vector that represents the whole input. Internally these models produce a vector for every token; the single sentence-level vector is obtained either by averaging all token vectors (mean-pooling) or by taking the vector at a special `[CLS]` token position. You do not need to understand transformer internals to use these models — the interface is simple: input is a string, output is a `Vec<f32>` of fixed length. Because the model sees the full context, the same word in different sentences yields different final embeddings, correctly distinguishing *river bank* from *investment bank*.
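
The two pooling strategies described above can be sketched as follows. This assumes the per-token vectors are already available as a slice of `Vec<f32>` and that the `[CLS]` token sits at position 0. Both function names are hypothetical. Production implementations usually also mask out padding tokens before averaging, which this sketch omits.

```rust
/// Mean-pooling: averages all token vectors into one sentence vector.
/// Returns None if there are no tokens or the token vectors differ in length.
fn mean_pool(token_vectors: &[Vec<f32>]) -> Option<Vec<f32>> {
    let first = token_vectors.first()?;
    let dim = first.len();
    let mut pooled = vec![0.0f32; dim];
    for tv in token_vectors {
        if tv.len() != dim {
            return None;
        }
        // Accumulate each component.
        for (acc, x) in pooled.iter_mut().zip(tv) {
            *acc += x;
        }
    }
    // Divide by the number of tokens to get the average.
    let n = token_vectors.len() as f32;
    for acc in pooled.iter_mut() {
        *acc /= n;
    }
    Some(pooled)
}

/// CLS pooling: takes the vector at the first position,
/// assuming that is where the special [CLS] token lives.
fn cls_pool(token_vectors: &[Vec<f32>]) -> Option<Vec<f32>> {
    token_vectors.first().cloned()
}
```

Either way, the caller receives a single vector of the same length as each token vector, so the choice of pooling does not change the storage layout downstream.
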
**What makes a good embedding model.** Embedding models are trained with a **contrastive objective**: given a pair of inputs known to be similar (a question and its answer, two paraphrases, a caption and its image), the loss function pulls their vectors closer together; given a dissimilar pair, it pushes them apart. The quality of the training data — how many pairs, how diverse, how accurately labelled — matters as much as model size. Models are evaluated on the **MTEB** (Massive Text Embedding Benchmark), which measures performance across retrieval, classification, clustering, and semantic similarity tasks. In general, larger models produce better embeddings but cost more compute per input and return higher-dimensional vectors that consume more storage.
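
As a rough illustration of the pull-together, push-apart idea, here is a sketch of one classic pairwise formulation: a margin-based contrastive loss over Euclidean distance. The text does not say which loss any particular model uses, and many current models train with in-batch softmax variants instead, so treat the formula, the `margin` parameter, and the function names as assumptions.

```rust
/// Euclidean distance between two vectors of equal length.
fn euclidean_distance(a: &[f32], b: &[f32]) -> f32 {
    a.iter()
        .zip(b)
        .map(|(x, y)| (x - y) * (x - y))
        .sum::<f32>()
        .sqrt()
}

/// Margin-based contrastive loss for a single pair.
/// Similar pairs are penalised by their squared distance (pulling them together).
/// Dissimilar pairs are penalised only while they are closer than `margin`
/// (pushing them apart until the margin is reached).
fn contrastive_loss(a: &[f32], b: &[f32], similar: bool, margin: f32) -> f32 {
    let d = euclidean_distance(a, b);
    if similar {
        d * d
    } else {
        let gap = (margin - d).max(0.0);
        gap * gap
    }
}
```

A dissimilar pair that is already farther apart than the margin contributes zero loss, so training effort concentrates on pairs the model still confuses.
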
**Practical dimensionalities.** Different models produce different vector sizes, and the choice affects speed, memory, and quality. Common dimensions include **384** (MiniLM: fast inference, model size around 80 to 130 MB, a good default for prototyping), **768** (BERT-base and many sentence-transformers models: the most common open-source default), **1 536** (OpenAI text-embedding-3-small: a strong hosted option balancing quality and cost), and **3 072** (OpenAI text-embedding-3-large: highest quality from OpenAI, at double the storage per vector and a higher per-token price). Higher dimensionality is not always better: on small datasets or narrow domains, a 384-dimensional model may match or outperform a 1 536-dimensional one while using a quarter of the storage and running faster at query time. Choose based on your task, your latency budget, and empirical evaluation, not on the assumption that bigger is automatically better.
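
The storage side of this trade-off is simple arithmetic: each component is a 4-byte `f32`, so raw storage is the number of vectors times the dimension times 4. The sketch below uses a hypothetical helper `raw_storage_bytes` and deliberately ignores index structures, metadata, and any compression or quantization a real database might apply.

```rust
/// Bytes needed to hold `num_vectors` uncompressed f32 embeddings
/// of dimension `dim`, excluding index overhead and metadata.
fn raw_storage_bytes(num_vectors: u64, dim: u64) -> u64 {
    num_vectors * dim * std::mem::size_of::<f32>() as u64
}
```

For one million vectors this gives 1 536 000 000 bytes (about 1.5 GB) at 384 dimensions and 6 144 000 000 bytes (about 6.1 GB) at 1 536 dimensions, which is the factor of four mentioned above.
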
**Embeddings for non-text data.** Vectors are not limited to language. **CLIP** (Contrastive Language-Image Pretraining) trains a text encoder and an image encoder jointly so that their output vectors inhabit the same space — a photo of a dog and the sentence "a photograph of a dog" end up near each other, enabling text-to-image and image-to-text search with no modality-specific logic. Product embeddings can be learned from purchase co-occurrence: items frequently bought together are trained to have nearby vectors, powering recommendation engines. Audio, code, and molecular structures have their own embedding models. The vector database does not care what produced the floats — it stores arrays of `f32` and computes distances. This modality-agnostic storage is one of the reasons vector databases have become a general-purpose building block in modern AI systems.
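
The modality-agnostic point can be illustrated with a toy in-memory store that only ever sees ids and `f32` arrays. Everything here is hypothetical: the `VectorStore` type, its methods, and the brute-force scan are a sketch for illustration, not how any particular vector database is implemented. Real systems use approximate indexes rather than a linear scan.

```rust
/// A toy store: it keeps ids and raw f32 vectors and knows nothing
/// about whether a vector came from text, an image, or anything else.
struct VectorStore {
    dim: usize,
    ids: Vec<String>,
    vectors: Vec<Vec<f32>>,
}

impl VectorStore {
    fn new(dim: usize) -> Self {
        VectorStore { dim, ids: Vec::new(), vectors: Vec::new() }
    }

    /// Adds a vector under `id`, rejecting vectors of the wrong dimension.
    fn insert(&mut self, id: &str, vector: Vec<f32>) -> Result<(), String> {
        if vector.len() != self.dim {
            return Err(format!(
                "expected {} dimensions, got {}",
                self.dim,
                vector.len()
            ));
        }
        self.ids.push(id.to_string());
        self.vectors.push(vector);
        Ok(())
    }

    /// Brute-force nearest neighbour by squared Euclidean distance.
    /// Returns None for an empty store or a query of the wrong dimension.
    fn nearest(&self, query: &[f32]) -> Option<&str> {
        if query.len() != self.dim {
            return None;
        }
        self.vectors
            .iter()
            .map(|v| {
                v.iter()
                    .zip(query)
                    .map(|(x, y)| (x - y) * (x - y))
                    .sum::<f32>()
            })
            .enumerate()
            .min_by(|l, r| l.1.total_cmp(&r.1))
            .map(|(i, _)| self.ids[i].as_str())
    }
}
```

An image vector and a text vector from a jointly trained model such as CLIP could be inserted side by side under ids like `"image:dog.jpg"` and `"text:a photograph of a dog"`, and a single `nearest` call would search across both.
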
---
