vibed/edu/.beans/edu-hvic--2-embeddings.md

---
# edu-hvic
title: §2 Embeddings
status: completed
type: task
priority: normal
created_at: 2026-03-10T23:30:01Z
updated_at: 2026-03-10T23:30:01Z
---
## §2 Embeddings — Stub to fill
File: `edu/src/vector-db.md`, section `### 2. Embeddings`

Replace this stub line with full content:

> Embeddings are the bridge between raw data and vector space. [...] 🚧 Full content tracked in [nbd:584e0c].

This is a **reading lesson** with no Rust code. Target 400-700 words. Follow the style of §1 in the same file: prose paragraphs with bold lead phrases.
## Learning objectives
- Understand what an embedding is: a learned function E(x) → ℝᵈ mapping inputs to vectors
- Know why geometric proximity in embedding space corresponds to semantic similarity (training objective)
- Understand how contextual encoder models (BERT-style) produce sentence embeddings
- Know typical output dimensionalities (384, 768, 1536) and what influences them
- Understand that embedding axes are not individually interpretable — meaning lives in relative positions
## Content to write
**What an embedding is.** A function learned from data that maps an input (word, sentence, image, product) to a fixed-size float vector. The function is trained so that similar inputs produce nearby vectors — semantically related sentences end up close in vector space; unrelated sentences end up far apart.
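
For the writer's reference only (the lesson itself stays code-free), "nearby vectors" can be made concrete with cosine similarity, one common closeness measure. The sketch below is illustrative: the function name and the 4-dimensional vectors are made up and do not come from any real model.

```rust
/// Cosine similarity between two equal-length vectors.
/// Returns None if the lengths differ or either vector has zero norm.
fn cosine_similarity(a: &[f32], b: &[f32]) -> Option<f32> {
    if a.len() != b.len() {
        return None;
    }
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let norm_a: f32 = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let norm_b: f32 = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    if norm_a == 0.0 || norm_b == 0.0 {
        return None;
    }
    Some(dot / (norm_a * norm_b))
}

fn main() {
    // Hand-written toy vectors standing in for real embedding output.
    let cat: [f32; 4] = [0.9, 0.1, 0.3, 0.0];
    let kitten: [f32; 4] = [0.8, 0.2, 0.4, 0.1];
    let invoice: [f32; 4] = [0.0, 0.9, 0.0, 0.7];

    let near = cosine_similarity(&cat, &kitten).unwrap();
    let far = cosine_similarity(&cat, &invoice).unwrap();
    println!("cat vs kitten:  {near:.3}");
    println!("cat vs invoice: {far:.3}");

    // The related pair should score higher than the unrelated pair.
    assert!(near > far);
}
```
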
**Word embeddings (brief history).** Word2Vec (2013) showed that word meaning could be encoded as static vectors on which arithmetic worked (king - man + woman ≈ queen). These models assign one vector per word regardless of context.
**Contextual embeddings from encoder models.** Modern models (sentence-transformers, OpenAI text-embedding-3-small) produce one vector for the entire input sentence via mean-pooling or a [CLS] token. The reader does not need to understand transformer internals — just: input is a string, output is a `Vec<f32>` of fixed length. The same word in different contexts produces different vectors.
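
As background for the mean-pooling remark, a minimal sketch of how per-token vectors could be averaged into one fixed-length `Vec<f32>`. The function name `mean_pool` and the input values are hypothetical. Real encoders typically also skip padding tokens using an attention mask, which this sketch leaves out.

```rust
/// Average per-token vectors into one fixed-length sentence vector.
/// Returns None for empty input or when token vectors differ in length.
fn mean_pool(token_vectors: &[Vec<f32>]) -> Option<Vec<f32>> {
    let dim = token_vectors.first()?.len();
    if token_vectors.iter().any(|t| t.len() != dim) {
        return None;
    }
    // Sum each dimension across all tokens.
    let mut pooled = vec![0.0_f32; dim];
    for token in token_vectors {
        for (acc, value) in pooled.iter_mut().zip(token) {
            *acc += value;
        }
    }
    // Divide by the token count to get the mean.
    let count = token_vectors.len() as f32;
    for acc in pooled.iter_mut() {
        *acc /= count;
    }
    Some(pooled)
}

fn main() {
    // Three made-up token vectors of dimension 2.
    let tokens = vec![vec![1.0, 2.0], vec![3.0, 4.0], vec![5.0, 6.0]];
    let sentence = mean_pool(&tokens).expect("non-empty, equal-length input");
    println!("{sentence:?}");
}
```

The output length depends only on the vector dimension, not on the number of tokens, which is why inputs of any length yield vectors of the same size.
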
**What makes a good embedding model.** Training uses contrastive learning: pull similar pairs together, push dissimilar pairs apart. Models are evaluated on MTEB (Massive Text Embedding Benchmark). Larger models generally produce better embeddings at higher cost.
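
To make "pull together, push apart" concrete for the writer, here is a sketch of a triplet margin loss, one common contrastive-style objective. It is an assumption chosen for illustration, not a claim about how any particular model named in this brief was trained. The function names and the margin value are hypothetical.

```rust
/// Euclidean distance between two equal-length vectors.
fn euclidean_distance(a: &[f32], b: &[f32]) -> f32 {
    assert_eq!(a.len(), b.len(), "vectors must have the same length");
    a.iter()
        .zip(b)
        .map(|(x, y)| (x - y) * (x - y))
        .sum::<f32>()
        .sqrt()
}

/// Triplet margin loss: zero once the positive is closer to the anchor
/// than the negative by at least `margin`, positive otherwise.
fn triplet_loss(anchor: &[f32], positive: &[f32], negative: &[f32], margin: f32) -> f32 {
    let d_pos = euclidean_distance(anchor, positive);
    let d_neg = euclidean_distance(anchor, negative);
    (d_pos - d_neg + margin).max(0.0)
}

fn main() {
    let anchor: [f32; 2] = [0.0, 0.0];
    let close: [f32; 2] = [0.1, 0.0];
    let distant: [f32; 2] = [1.0, 0.0];

    // Similar item already close, dissimilar item already far: nothing to fix.
    let good = triplet_loss(&anchor, &close, &distant, 0.5);
    // Roles swapped: the loss is positive, so training would adjust the vectors.
    let bad = triplet_loss(&anchor, &distant, &close, 0.5);
    println!("well separated: {good}, badly separated: {bad}");
    assert!(bad > good);
}
```
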
**Practical dimensionalities.** 384 (MiniLM, fast, ~130MB), 768 (BERT-base, sentence-transformers default), 1536 (OpenAI text-embedding-3-small), 3072 (text-embedding-3-large). Larger is not always better — depends on task and dataset.
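
One cost of higher dimensionality that the writer may want to mention is raw storage, which grows linearly with the vector length. The sketch below assumes uncompressed `f32` components and ignores index overhead and metadata. The function name and the corpus size of one million vectors are made up for illustration.

```rust
/// Bytes needed to store `num_vectors` raw f32 vectors of length `dim`.
/// Ignores index structures, metadata, and any compression or quantization.
fn raw_vector_bytes(num_vectors: u64, dim: u64) -> u64 {
    num_vectors * dim * std::mem::size_of::<f32>() as u64
}

fn main() {
    const GIB: f64 = 1024.0 * 1024.0 * 1024.0;
    // The dimensionalities listed in this brief, for a hypothetical corpus.
    for dim in [384_u64, 768, 1536, 3072] {
        let bytes = raw_vector_bytes(1_000_000, dim);
        println!("{dim:>4} dims: {:.2} GiB", bytes as f64 / GIB);
    }
}
```
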
**Embeddings for non-text data.** Brief mention: CLIP produces image embeddings comparable to text embeddings in the same space (enabling text-to-image search). Product embeddings can be learned from purchase co-occurrence. The vector database stores float arrays regardless of modality.