+++
title = "§2 Embeddings"
priority = 5
status = "done"
ticket_type = "task"
dependencies = []
+++

§2 Embeddings — Stub to fill

File: edu/src/vector-db.md, section ### 2. Embeddings

Replace this stub line with full content:

Embeddings are the bridge between raw data and vector space. [...] 🚧 Full content tracked in [nbd:584e0c].

This is a reading lesson — no Rust code. Target 400–700 words. Follow the style of §1 in the same file: prose paragraphs with bold lead phrases.

Learning objectives

Understand what an embedding is: a learned function E : X → ℝᵈ that maps each input x to a d-dimensional vector E(x)
Know why geometric proximity in embedding space corresponds to semantic similarity: the training objective explicitly rewards it
Understand how contextual encoder models (BERT-style) produce sentence embeddings
Know typical output dimensionalities (384, 768, 1536) and what influences them
Understand that embedding axes are not individually interpretable — meaning lives in relative positions

Content to write

What an embedding is. A function learned from data that maps an input (word, sentence, image, product) to a fixed-size float vector. The function is trained so that similar inputs produce nearby vectors — semantically related sentences end up close in vector space; unrelated sentences end up far apart.
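For whoever drafts this paragraph: "nearby" is most often measured with cosine similarity. The sketch below is reference material only (the lesson itself stays code-free). The function name and the 4-dimensional vectors are made up for illustration and are not the output of any real model.

```rust
/// Cosine similarity: dot(a, b) / (|a| * |b|).
/// Returns None for mismatched lengths or a zero-norm input.
fn cosine_similarity(a: &[f32], b: &[f32]) -> Option<f32> {
    if a.len() != b.len() {
        return None;
    }
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let norm_a = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let norm_b = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    if norm_a == 0.0 || norm_b == 0.0 {
        return None;
    }
    Some(dot / (norm_a * norm_b))
}

fn main() {
    // Hypothetical toy vectors; a real model would produce hundreds of dimensions.
    let cat = [0.9_f32, 0.1, 0.3, 0.0];
    let kitten = [0.8_f32, 0.2, 0.4, 0.1];
    let invoice = [0.0_f32, 0.9, 0.0, 0.7];

    let near = cosine_similarity(&cat, &kitten).unwrap();
    let far = cosine_similarity(&cat, &invoice).unwrap();
    println!("cat vs kitten:  {near:.3}");
    println!("cat vs invoice: {far:.3}");
}
```

Related inputs score close to 1, unrelated inputs score close to 0. The score depends only on the angle between the vectors, not on their lengths.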

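Word embeddings (brief history). Word2Vec (2013) showed that word meaning could be encoded as static vectors on which simple arithmetic approximately captures analogies (king − man + woman ≈ queen). Static embeddings assign one vector per word regardless of context, so "bank" gets the same vector in "river bank" and "bank account".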
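As background for the drafter, the analogy arithmetic can be sketched with a toy vocabulary. Everything here is hypothetical: the 3-dimensional vectors are hand-picked so the example works, whereas real Word2Vec vectors are learned and have a few hundred dimensions with no hand-assigned meaning per axis. This is reference material only, not lesson content.

```rust
/// Cosine similarity; assumes equal lengths and non-zero norms (toy data only).
fn cosine(a: &[f32], b: &[f32]) -> f32 {
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let na = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let nb = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    dot / (na * nb)
}

/// Looks up a word's vector in the toy vocabulary.
fn lookup<'v>(vocab: &'v [(&str, Vec<f32>)], word: &str) -> Option<&'v [f32]> {
    vocab.iter().find(|(name, _)| *name == word).map(|(_, v)| v.as_slice())
}

/// Returns the word whose vector is closest (by cosine) to a - b + c,
/// skipping the three query words themselves.
fn analogy<'a>(vocab: &[(&'a str, Vec<f32>)], a: &str, b: &str, c: &str) -> Option<&'a str> {
    let (va, vb, vc) = (lookup(vocab, a)?, lookup(vocab, b)?, lookup(vocab, c)?);
    let target: Vec<f32> = va.iter().zip(vb).zip(vc).map(|((x, y), z)| x - y + z).collect();
    vocab
        .iter()
        .filter(|(name, _)| *name != a && *name != b && *name != c)
        .max_by(|(_, u), (_, v)| cosine(u, &target).total_cmp(&cosine(v, &target)))
        .map(|(name, _)| *name)
}

/// Hand-picked toy vectors, not learned from data.
fn toy_vocab() -> Vec<(&'static str, Vec<f32>)> {
    vec![
        ("king", vec![0.9, 0.8, 0.1]),
        ("queen", vec![0.9, 0.1, 0.8]),
        ("man", vec![0.1, 0.9, 0.1]),
        ("woman", vec![0.1, 0.1, 0.9]),
        ("prince", vec![0.8, 0.9, 0.0]),
        ("apple", vec![0.0, 0.2, 0.2]),
    ]
}

fn main() {
    let vocab = toy_vocab();
    println!("{:?}", analogy(&vocab, "king", "man", "woman"));
}
```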

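Contextual embeddings from encoder models. Modern embedding models (the sentence-transformers family, OpenAI text-embedding-3-small) return one vector for the entire input. In open encoder models such as those in sentence-transformers, this is typically done by mean-pooling the per-token vectors or by taking the vector of the [CLS] token. The reader does not need to understand transformer internals, just that the input is a string and the output is a Vec<f32> of fixed length. Unlike static word embeddings, the encoder reads each token in context, so the same word in different sentences produces different vectors.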
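For reference when writing this paragraph, mean-pooling amounts to averaging the per-token vectors while ignoring padding. The sketch below assumes the encoder has already produced one vector per token. `mean_pool` and its `mask` parameter are illustrative names, not the API of any particular library, and the lesson itself should stay code-free.

```rust
/// Mean-pools per-token vectors into one sentence vector.
/// `mask[i]` is true for real tokens and false for padding.
/// Returns None on shape mismatch or when no token is kept.
fn mean_pool(token_vectors: &[Vec<f32>], mask: &[bool]) -> Option<Vec<f32>> {
    if token_vectors.len() != mask.len() {
        return None;
    }
    let dim = token_vectors.first()?.len();
    let mut sum = vec![0.0_f32; dim];
    let mut count = 0usize;
    for (vector, &keep) in token_vectors.iter().zip(mask) {
        if !keep {
            continue; // padding tokens do not contribute
        }
        if vector.len() != dim {
            return None;
        }
        for (s, x) in sum.iter_mut().zip(vector) {
            *s += x;
        }
        count += 1;
    }
    if count == 0 {
        return None;
    }
    Some(sum.into_iter().map(|s| s / count as f32).collect())
}

fn main() {
    // Three real tokens plus one padding slot, each a made-up 2-dimensional vector.
    let tokens = vec![
        vec![1.0, 2.0],
        vec![3.0, 4.0],
        vec![5.0, 6.0],
        vec![0.0, 0.0],
    ];
    let mask = [true, true, true, false];
    println!("{:?}", mean_pool(&tokens, &mask));
}
```

Whatever the number of input tokens, the result has the same length as a single token vector, which is why the output dimensionality is fixed.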

What makes a good embedding model. Training uses contrastive learning: the objective pulls the vectors of similar pairs together and pushes the vectors of dissimilar pairs apart. Models are commonly compared on MTEB (Massive Text Embedding Benchmark). Larger models tend to produce better embeddings, at the price of higher inference cost and latency.
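As a reference for the "pull together, push apart" wording, the classic pairwise contrastive loss can be sketched as follows. This is one simple member of the family, chosen for clarity; it is an assumption that it illustrates the idea well enough, and production embedding models more often train with in-batch softmax variants. Function names and the margin values are hypothetical, and none of this belongs in the lesson text.

```rust
/// Euclidean distance between two equal-length vectors.
fn euclidean(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| (x - y).powi(2)).sum::<f32>().sqrt()
}

/// Pairwise contrastive loss with distance d = |a - b|:
///   similar pair:    d^2                  (any distance is penalised)
///   dissimilar pair: max(0, margin - d)^2 (penalised only inside the margin)
fn contrastive_loss(a: &[f32], b: &[f32], similar: bool, margin: f32) -> f32 {
    let d = euclidean(a, b);
    if similar {
        d * d
    } else {
        (margin - d).max(0.0).powi(2)
    }
}

fn main() {
    let a = [0.0_f32, 0.0];
    let b = [3.0_f32, 4.0]; // distance 5 from `a`

    println!("similar pair:               {}", contrastive_loss(&a, &b, true, 1.0));
    println!("dissimilar, outside margin: {}", contrastive_loss(&a, &b, false, 1.0));
    println!("dissimilar, inside margin:  {}", contrastive_loss(&a, &b, false, 6.0));
}
```

Minimising this loss over many labelled pairs moves similar inputs closer and pushes dissimilar inputs at least `margin` apart, which is what makes distance in the resulting space meaningful.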

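Practical dimensionalities. 384 (MiniLM-family models such as all-MiniLM-L6-v2; small and fast, on the order of 100 MB), 768 (BERT-base-sized models such as all-mpnet-base-v2), 1536 (OpenAI text-embedding-3-small), 3072 (OpenAI text-embedding-3-large). Dimensionality follows from the model's hidden size or a final projection layer. More dimensions are not always better: quality depends on the task and dataset, while storage and search cost grow with d.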
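To make the cost side of these numbers concrete for the drafter, the raw storage for f32 vectors is simply count × dimensions × 4 bytes. The sketch below uses a hypothetical collection of one million vectors and ignores index structures and metadata, which add overhead on top. It is background arithmetic, not lesson content.

```rust
/// Raw bytes needed to store `num_vectors` embeddings of `dims` f32 values each.
/// Excludes index structures, IDs, and metadata.
fn raw_vector_bytes(num_vectors: u64, dims: u64) -> u64 {
    num_vectors * dims * std::mem::size_of::<f32>() as u64
}

fn main() {
    // One million vectors is an illustrative collection size.
    for dims in [384u64, 768, 1536, 3072] {
        let bytes = raw_vector_bytes(1_000_000, dims);
        let gib = bytes as f64 / (1u64 << 30) as f64;
        println!("{dims:>4} dims: {gib:.2} GiB");
    }
}
```

Doubling the dimensionality doubles both the storage and the work per distance computation, which is the practical reason a smaller model can be the better choice.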

Embeddings for non-text data. Brief mention: CLIP embeds images and text into one shared space, so an image vector can be compared directly with a text vector (enabling text-to-image search). Product embeddings can be learned from purchase co-occurrence. The vector database stores float arrays regardless of modality.
