---
# edu-hvmi
title: §4 What Is a Vector Database?
status: completed
type: task
priority: normal
created_at: 2026-03-10T23:30:02Z
updated_at: 2026-03-10T23:30:02Z
---
## §4 What Is a Vector Database? — Stub to fill
File: `edu/src/vector-db.md`, section `### 4. What Is a Vector Database?`
Replace this stub line with full content:
> A vector database is a data store built around one core operation [...] 🚧 Full content tracked in [nbd:d9f850].
This is a **reading lesson**: no Rust code. Target 500 to 700 words. Bold lead phrases.
## Learning objectives
- Understand the core operation: approximate nearest-neighbour (ANN) search
- Know the primary use cases that motivate vector databases
- Understand how vector databases differ from relational DBs and full-text search
- Know the key performance metrics: recall@k, QPS, index build time, memory
## Content to write
**The core operation.** Given a query vector q and n stored vectors of dimension d, find the k vectors most similar to q. Exact k-nearest-neighbour (KNN) search compares q against every stored vector, so it costs O(n·d) per query: at n = 1M and d = 768 that is roughly 768M multiply-add operations per query, too slow for interactive use, particularly under concurrent load. Vector databases use ANN algorithms (see §5) that trade a small loss in accuracy for orders-of-magnitude speed gains.
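To make the O(n·d) cost concrete, here is a minimal sketch of exact KNN as a brute-force scan. It assumes cosine similarity as the similarity measure (the text above does not fix one), and the names `cosine_similarity`, `exact_knn`, and the sample data are illustrative only.

```python
import heapq
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_a * norm_b)


def exact_knn(query, vectors, k):
    """Return indices of the k stored vectors most similar to query.

    Every stored vector is scored, so the cost is O(n * d) per query.
    """
    scored = ((cosine_similarity(query, v), i) for i, v in enumerate(vectors))
    top = heapq.nlargest(k, scored)  # best score first
    return [i for _, i in top]


# Tiny made-up corpus: four 2-dimensional vectors.
stored = [
    [1.0, 0.0],
    [0.9, 0.1],
    [0.0, 1.0],
    [-1.0, 0.0],
]
nearest = exact_knn([1.0, 0.05], stored, k=2)
print(nearest)
```

The loop touches all n vectors and all d components of each, which is exactly the work an ANN index tries to avoid.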
**Use cases.** Each described in one concrete sentence:
- Semantic search: find documents matching the *meaning* of a query, not just the words
- Recommendation: given an item, return the k most similar items (§11) or surface content preferred by similar users
- Retrieval-Augmented Generation (RAG): retrieve relevant passages before prompting an LLM, so the answer is grounded in facts (§12)
- Duplicate/near-duplicate detection: find items semantically identical or very close to a given item
- Anomaly detection: items far from all stored vectors are likely anomalous
- Multi-modal search: find images matching a text description, using CLIP-style joint embeddings
**vs. relational databases.** SQL WHERE clauses do exact matches and range queries on scalar values. There is no built-in notion of "nearest" for float arrays. Extensions like pgvector (PostgreSQL) and sqlite-vec (SQLite / Turso) add vector search to existing databases — this course uses sqlite-vec via the `libsql` crate.
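The contrast can be illustrated with plain SQLite through Python's standard `sqlite3` module. This sketch deliberately does not use sqlite-vec or `libsql`; the table layout, the JSON-encoded `embedding` column, and the Euclidean distance helper are assumptions made for the example.

```python
import json
import math
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE items (id INTEGER PRIMARY KEY, price REAL, embedding TEXT)"
)
rows = [
    (1, 9.99, [1.0, 0.0]),
    (2, 24.50, [0.8, 0.6]),
    (3, 19.00, [0.0, 1.0]),
]
conn.executemany(
    "INSERT INTO items VALUES (?, ?, ?)",
    [(item_id, price, json.dumps(emb)) for item_id, price, emb in rows],
)

# A scalar range query: the SQL engine answers this directly.
in_range = [
    r[0]
    for r in conn.execute(
        "SELECT id FROM items WHERE price BETWEEN 15 AND 30 ORDER BY id"
    )
]


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# "Nearest to a query vector" has no plain-SQL equivalent here, so every
# row is fetched and compared in application code.
query = [0.9, 0.5]
nearest_id = min(
    conn.execute("SELECT id, embedding FROM items"),
    key=lambda r: euclidean(query, json.loads(r[1])),
)[0]
print(in_range, nearest_id)
```

A vector extension moves that second step into the database so the full table does not have to be pulled into the application for each query.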
**vs. full-text search (BM25/TF-IDF).** Keyword search cannot handle synonymy (car ≠ automobile without explicit expansion) or concept-level similarity. Vector search captures both. Hybrid search, which combines BM25 and ANN scores, is a common production pattern that often outperforms either method alone.
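One simple way to combine the two score sets is a weighted sum after rescaling each to a common range. The sketch below assumes min-max normalization and a mixing weight `alpha`. These are illustrative choices, not a method the lesson prescribes, and rank-based schemes such as reciprocal rank fusion are a common alternative. All scores are made up.

```python
def min_max_normalize(scores):
    """Rescale a {doc_id: score} mapping to the range [0, 1]."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc: 1.0 for doc in scores}
    return {doc: (s - lo) / (hi - lo) for doc, s in scores.items()}


def hybrid_rank(bm25_scores, vector_scores, alpha=0.5):
    """Rank documents by alpha * keyword score + (1 - alpha) * vector score.

    A document missing from one result list contributes 0 for that part.
    """
    bm25 = min_max_normalize(bm25_scores)
    vec = min_max_normalize(vector_scores)
    docs = set(bm25) | set(vec)
    combined = {
        doc: alpha * bm25.get(doc, 0.0) + (1 - alpha) * vec.get(doc, 0.0)
        for doc in docs
    }
    # Highest combined score first; ties broken by document id.
    return sorted(combined, key=lambda doc: (-combined[doc], doc))


# Hypothetical results for the query "car".
bm25_scores = {"doc_car": 7.2, "doc_engine": 3.1, "doc_banana": 0.4}
vector_scores = {"doc_car": 0.82, "doc_automobile": 0.80, "doc_engine": 0.35}
ranking = hybrid_rank(bm25_scores, vector_scores, alpha=0.5)
print(ranking)
```

Here `doc_automobile` never matches the keyword, yet it ranks second because the vector component carries it, which is the synonymy case described above.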
**Key metrics.**
- Recall@k: fraction of the true k nearest neighbours that the ANN algorithm returns in its top k. A recall@10 of 0.95 means that, averaged over queries, 9.5 of the 10 true nearest neighbours appear in the results.
- QPS: queries per second the index can serve at a given recall target.
- Index build time: one-time cost paid before serving queries.
- Memory footprint: HNSW, a graph-based ANN index, keeps its graph edges in RAM, typically alongside the vectors themselves; this limits how large the index can grow on a single machine.
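Recall@k is straightforward to compute once the exact neighbours are known for a set of test queries. A minimal sketch for a single query, with a hypothetical `recall_at_k` helper and made-up result ids:

```python
def recall_at_k(true_neighbours, ann_results, k):
    """Fraction of the true top-k ids that appear in the ANN top-k."""
    if k <= 0:
        raise ValueError("k must be positive")
    truth = set(true_neighbours[:k])
    if not truth:
        raise ValueError("true_neighbours must not be empty")
    found = set(ann_results[:k])
    return len(truth & found) / len(truth)


# Ids from an exact scan versus ids from an ANN index (illustrative).
exact_top10 = [3, 17, 42, 8, 99, 23, 5, 61, 70, 12]
ann_top10 = [3, 42, 17, 8, 99, 23, 5, 61, 70, 88]
recall = recall_at_k(exact_top10, ann_top10, k=10)
print(recall)
```

Order within the top k does not matter for this metric, only membership; in practice the value is averaged over many queries before it is reported alongside QPS.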
**Where sqlite-vec / Turso fits.** sqlite-vec is appropriate for embedded applications, local development, and small-to-medium corpora (up to a few million vectors). Dedicated cloud vector databases (Pinecone, Qdrant, Weaviate) handle larger scale and add features like multi-tenancy, filtering, and distributed search.