Vector Database Self-Guided Course

This document is a self-guided course on vector databases. It is organized into four parts: conceptual foundations, the internals of vector search systems, hands-on Rust exercises with Turso and sqlite-vec, and real-world application pipelines. Each section is either a reading lesson or a hands-on Rust programming exercise. Sections marked 🚧 are stubs whose full content is tracked in an nbd ticket — follow the ticket ID to find the detailed learning objectives and instructions.


Table of Contents

Part 1 — Foundations

  1. What Is a Vector?
  2. Embeddings
  3. Vector Similarity

Part 2 — Vector Databases

  4. What Is a Vector Database?
  5. Under the Hood: ANN Algorithms

Part 3 — Turso + sqlite-vec Basics

  6. Setting Up
  7. Exercise 1 — Storing and Retrieving Vectors
  8. Exercise 2 — K-Nearest Neighbor Search

Part 4 — Real Applications

  9. Generating Embeddings in Rust
  10. Exercise 3 — Semantic Document Search
  11. Exercise 4 — Recommendation Engine
  12. Exercise 5 — Retrieval-Augmented Generation

Part 1 — Foundations

1. What Is a Vector?

A vector is an ordered list of numbers. That is the entire definition: nothing more exotic than a list where position matters. A two-element list [3.0, 4.0] is a vector; so is a 1536-element list of floating-point values produced by a language model. What makes vectors useful is that the numbers have a geometric interpretation: each element is a coordinate along one axis of a space, and the vector as a whole names a point (or an arrow from the origin to that point) in that space.

Geometric intuition in two and three dimensions. Start with the familiar. A 2-dimensional vector [x, y] is a point in the plane — the kind you plot on graph paper. The vector [3.0, 4.0] sits three units to the right of the origin and four units up. An arrow drawn from [0, 0] to [3, 4] has a magnitude (length) of √(3² + 4²) = 5 and points in a specific direction. Magnitude and direction together completely characterise the vector; change either one and you have a different vector.

A 3-dimensional vector [x, y, z] extends this to physical space: three coordinates, three axes, one point. You can still compute a magnitude — √(x² + y² + z²) — and you can still talk about direction. Two 3D vectors point in the same direction if one is a positive scalar multiple of the other; they are perpendicular (orthogonal) if their dot product is zero.

High-dimensional spaces. Nothing in the definition of a vector limits it to two or three elements. A d-dimensional vector [x₁, x₂, …, x_d] is a point in d-dimensional space. The geometry extends perfectly: magnitude is √(x₁² + x₂² + … + x_d²), the dot product of two vectors is Σᵢ aᵢ · bᵢ, and you can compute angles and distances between points just as you would in 2D or 3D.
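The formulas above translate almost directly into code. The following is a minimal sketch using only the standard library; the function names (magnitude, dot, angle_degrees) are illustrative choices, not part of any library used later in the course.

```rust
/// Magnitude (Euclidean norm): sqrt(x1^2 + x2^2 + ... + xd^2).
fn magnitude(v: &[f32]) -> f32 {
    v.iter().map(|x| x * x).sum::<f32>().sqrt()
}

/// Dot product: sum over i of a[i] * b[i]. Panics if the dimensions differ.
fn dot(a: &[f32], b: &[f32]) -> f32 {
    assert_eq!(a.len(), b.len(), "vectors must have the same dimension");
    a.iter().zip(b).map(|(x, y)| x * y).sum()
}

/// Angle between two non-zero vectors, in degrees.
fn angle_degrees(a: &[f32], b: &[f32]) -> f32 {
    let cos = dot(a, b) / (magnitude(a) * magnitude(b));
    // Clamp to guard against rounding pushing the value just outside [-1, 1].
    cos.clamp(-1.0, 1.0).acos().to_degrees()
}

fn main() {
    // The 2D example from the text: an arrow from [0, 0] to [3, 4].
    let v = [3.0_f32, 4.0];
    println!("|v| = {}", magnitude(&v));

    // The same functions work unchanged in any dimension.
    let a = [1.0_f32, 0.0, 0.0, 0.0, 0.0];
    let b = [0.0_f32, 2.0, 0.0, 0.0, 0.0];
    println!("a . b = {}", dot(&a, &b));
    println!("angle = {} degrees", angle_degrees(&a, &b));
}
```

Nothing in these functions depends on the dimension d; the slices can hold 2 elements or 1536.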

High-dimensional geometry is counterintuitive in subtle ways that are worth knowing:

  • The curse of dimensionality. In high-dimensional spaces, most of the volume of a ball is concentrated near its surface rather than its interior. Two randomly chosen high-dimensional vectors from a standard distribution tend to be nearly orthogonal (the cosine of the angle between them is close to zero), even when you have not deliberately constructed them that way. This makes "nearest neighbour" in high dimensions a harder problem than it sounds: as the dimension grows, the distances from a query to its nearest and farthest points tend to become similar, so simple distance measures separate near from far less sharply.

  • Normalisation changes the geometry. A unit vector has magnitude exactly 1. Dividing a vector by its magnitude — normalisation — projects all vectors onto the surface of the unit hypersphere. On that sphere, distance and angle are equivalent measures of similarity, which simplifies many computations. Embedding models often output unit-normalised vectors precisely to exploit this equivalence.

  • Dimensions are not independent features. When people say a language model embeds words into a 768-dimensional space, they do not mean "dimension 42 encodes the concept of colour." The axes of an embedding space are rarely interpretable on their own. Meaning is encoded in the relative positions of points — which vectors are close to which others — not in the values along any single axis.
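The claim that distance and angle are equivalent on the unit sphere can be checked numerically. For unit vectors a and b, the squared Euclidean distance equals 2 − 2·cos θ, where cos θ is their dot product. The sketch below verifies this for one pair; the helper names are illustrative.

```rust
/// Dot product of two equal-length vectors.
fn dot(a: &[f32], b: &[f32]) -> f32 {
    assert_eq!(a.len(), b.len(), "vectors must have the same dimension");
    a.iter().zip(b).map(|(x, y)| x * y).sum()
}

/// Scales a non-zero vector to magnitude 1.
fn normalise(v: &[f32]) -> Vec<f32> {
    let norm = dot(v, v).sqrt();
    assert!(norm > 0.0, "cannot normalise the zero vector");
    v.iter().map(|x| x / norm).collect()
}

/// Squared Euclidean distance between two points.
fn squared_distance(a: &[f32], b: &[f32]) -> f32 {
    assert_eq!(a.len(), b.len(), "vectors must have the same dimension");
    a.iter().zip(b).map(|(x, y)| (x - y) * (x - y)).sum()
}

fn main() {
    let a = normalise(&[3.0, 4.0]);
    let b = normalise(&[1.0, 0.0]);

    // For unit vectors the dot product is the cosine of the angle.
    let cos = dot(&a, &b);
    let d2 = squared_distance(&a, &b);

    // The two printed quantities should agree up to rounding.
    println!("squared distance = {d2}");
    println!("2 - 2 * cos      = {}", 2.0 - 2.0 * cos);
}
```

Because the squared distance is a decreasing function of the cosine, ranking unit vectors by smallest distance or by largest cosine similarity gives the same order.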

Vectors as representations. The key insight that makes vector databases useful is that real-world objects — documents, images, audio clips, products, users — can be represented as vectors such that similarity in meaning or content corresponds to proximity in the vector space. Two documents that discuss the same topic will, if embedded well, produce vectors that are close together. Two documents on unrelated topics will produce vectors that are far apart.

This is not magic; it is the result of training a model to produce embeddings where similar inputs cluster near each other. Once you have such a model, every search or comparison problem reduces to a geometric problem: find the vectors closest to a query vector. The rest of this course is about how to do that efficiently at scale.
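To make "find the vectors closest to a query vector" concrete, here is a minimal brute-force sketch that scores every stored vector against the query and keeps the best k. It assumes cosine similarity as the measure and non-zero vectors; top_k and the toy data are hypothetical and exist only for illustration.

```rust
/// Cosine similarity of two non-zero, equal-length vectors.
fn cosine_similarity(a: &[f32], b: &[f32]) -> f32 {
    assert_eq!(a.len(), b.len(), "vectors must have the same dimension");
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let norm_a = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let norm_b = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    dot / (norm_a * norm_b)
}

/// Exact search: returns (index, similarity) for the k stored vectors
/// most similar to `query`, best match first. Cost grows with n * d.
fn top_k(query: &[f32], stored: &[Vec<f32>], k: usize) -> Vec<(usize, f32)> {
    let mut scored: Vec<(usize, f32)> = stored
        .iter()
        .enumerate()
        .map(|(i, v)| (i, cosine_similarity(query, v)))
        .collect();
    // Sort by similarity, highest first.
    scored.sort_by(|a, b| b.1.total_cmp(&a.1));
    scored.truncate(k);
    scored
}

fn main() {
    // Toy 2D "embeddings"; real ones have hundreds of dimensions.
    let stored: Vec<Vec<f32>> = vec![
        vec![1.0, 0.0],
        vec![0.0, 1.0],
        vec![0.9, 0.1],
    ];
    let query = [1.0_f32, 0.0];

    for (index, similarity) in top_k(&query, &stored, 2) {
        println!("stored[{index}] similarity = {similarity}");
    }
}
```

This scan touches all n vectors on every query, which is exactly the cost that the indexes discussed later in the course are designed to avoid.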

A note on notation. Throughout this course, vectors are written in bold or with subscripts: v, q, or v₁. The i-th element of a vector v is written v[i] or vᵢ. The magnitude of v is written |v| or ‖v‖. Dimension is written d and the number of stored vectors is written n.


2. Embeddings

Embeddings are the bridge between raw data and vector space. This section covers how language models, image encoders, and other neural networks learn to map heterogeneous inputs — words, sentences, images, products — into vectors where geometric proximity captures semantic similarity. 🚧 Full content tracked in [nbd:584e0c].


3. Vector Similarity

Once you have two vectors, how do you measure how alike they are? This section covers the three most common similarity functions used in vector search: cosine similarity, dot product, and Euclidean distance — their formulas, geometric interpretations, when each is appropriate, and the trade-offs in choosing between them. 🚧 Full content tracked in [nbd:99e1d9].
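As a preview of the three functions named above, the sketch below computes each one for a pair of vectors that point in the same direction but differ in length. The function names are illustrative, and the full treatment remains in the tracked ticket.

```rust
/// Dot product: grows with both alignment and magnitude.
fn dot_product(a: &[f32], b: &[f32]) -> f32 {
    assert_eq!(a.len(), b.len(), "vectors must have the same dimension");
    a.iter().zip(b).map(|(x, y)| x * y).sum()
}

/// Cosine similarity: the dot product divided by both magnitudes,
/// so it depends only on direction. Inputs must be non-zero.
fn cosine_similarity(a: &[f32], b: &[f32]) -> f32 {
    dot_product(a, b) / (dot_product(a, a).sqrt() * dot_product(b, b).sqrt())
}

/// Euclidean distance: straight-line distance between the two points.
fn euclidean_distance(a: &[f32], b: &[f32]) -> f32 {
    assert_eq!(a.len(), b.len(), "vectors must have the same dimension");
    a.iter()
        .zip(b)
        .map(|(x, y)| (x - y) * (x - y))
        .sum::<f32>()
        .sqrt()
}

fn main() {
    let a = [1.0_f32, 2.0, 3.0];
    let b = [2.0_f32, 4.0, 6.0]; // same direction as a, twice the length

    println!("dot product        = {}", dot_product(&a, &b));
    println!("cosine similarity  = {}", cosine_similarity(&a, &b));
    println!("euclidean distance = {}", euclidean_distance(&a, &b));
}
```

Cosine similarity treats a and b as a perfect match because it ignores length, while the Euclidean distance between them is clearly non-zero. That difference is one of the main considerations when choosing a measure.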


Part 2 — Vector Databases

4. What Is a Vector Database?

A vector database is a data store built around one core operation: given a query vector q, return the k stored vectors most similar to q. This section covers what that means in practice — approximate nearest-neighbour (ANN) search, the use cases that make vector databases essential (semantic search, recommendations, RAG), and how they differ from traditional relational or key-value databases. 🚧 Full content tracked in [nbd:d9f850].


5. Under the Hood: ANN Algorithms

Exact nearest-neighbour search over millions of high-dimensional vectors is too slow for production use. This section explains the two dominant approximate methods, HNSW (Hierarchical Navigable Small World graphs) and IVFFlat (an inverted file index whose lists store the vectors uncompressed, or "flat"), covering their index construction, query-time traversal, and the recall vs. latency trade-off each exposes. 🚧 Full content tracked in [nbd:6ec5ff].
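The recall vs. latency trade-off can be seen in a toy version of the inverted-file idea. The sketch below is a simplification under stated assumptions: the centroids are supplied by hand (a real IVF index would learn them, typically with k-means), and IvfIndex, closest, and nprobe are names chosen for this illustration rather than any library's API.

```rust
/// Squared Euclidean distance between two points.
fn squared_distance(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| (x - y) * (x - y)).sum()
}

/// Among `candidates` (indices into `points`), the one closest to `target`.
fn closest(
    target: &[f32],
    points: &[Vec<f32>],
    candidates: impl Iterator<Item = usize>,
) -> Option<usize> {
    candidates.min_by(|&a, &b| {
        squared_distance(target, &points[a]).total_cmp(&squared_distance(target, &points[b]))
    })
}

/// Toy inverted-file index: one list of vector indices per centroid.
struct IvfIndex {
    centroids: Vec<Vec<f32>>,
    lists: Vec<Vec<usize>>,
}

impl IvfIndex {
    /// Assigns every vector to the list of its nearest centroid.
    fn build(centroids: Vec<Vec<f32>>, data: &[Vec<f32>]) -> Self {
        let mut lists: Vec<Vec<usize>> = vec![Vec::new(); centroids.len()];
        for (i, v) in data.iter().enumerate() {
            let c = closest(v, &centroids, 0..centroids.len())
                .expect("at least one centroid is required");
            lists[c].push(i);
        }
        IvfIndex { centroids, lists }
    }

    /// Scans only the `nprobe` lists whose centroids are nearest the query.
    fn search(&self, data: &[Vec<f32>], query: &[f32], nprobe: usize) -> Option<usize> {
        let mut order: Vec<usize> = (0..self.centroids.len()).collect();
        order.sort_by(|&a, &b| {
            squared_distance(query, &self.centroids[a])
                .total_cmp(&squared_distance(query, &self.centroids[b]))
        });
        let candidates = order
            .into_iter()
            .take(nprobe)
            .flat_map(|c| self.lists[c].iter().copied());
        closest(query, data, candidates)
    }
}

fn main() {
    let data: Vec<Vec<f32>> = vec![
        vec![1.0, 0.0],
        vec![4.99, 0.0],
        vec![6.0, 0.0],
        vec![9.0, 0.0],
    ];
    let index = IvfIndex::build(vec![vec![0.0, 0.0], vec![10.0, 0.0]], &data);
    let query = [5.1_f32, 0.0];

    // Probing one list is cheaper but can miss the true nearest neighbour.
    println!("nprobe = 1 -> {:?}", index.search(&data, &query, 1));
    // Probing every list is equivalent to an exact scan.
    println!("nprobe = 2 -> {:?}", index.search(&data, &query, 2));
}
```

In this data set the true nearest neighbour of the query sits just across the boundary between the two lists, so a single probe returns a different, farther vector. Raising nprobe recovers it at the cost of scanning more vectors.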


Part 3 — Turso + sqlite-vec Basics

6. Setting Up

This section walks through everything you need before writing a single SQL query: adding the right crates, opening a local Turso connection, and meeting the built-in vector types and functions that give SQLite vector-search capability.

What You Are Building

Turso is a SQLite-compatible database with built-in support for vector similarity search. That support comes from libSQL, the SQLite fork Turso is built on, rather than from a separately loaded extension. In local development you use a file-backed SQLite database; in production the same code points at a Turso cloud database. The libsql crate (the Rust client for Turso) speaks the Turso wire protocol and also handles local SQLite files transparently.

Cargo.toml

Create a new binary project and add the following dependencies:

cargo new vec-demo
cd vec-demo

Replace the [dependencies] section of Cargo.toml with:

[dependencies]
libsql = "0.9"
tokio = { version = "1", features = ["full"] }

libsql is the official Rust client for Turso / libSQL databases. It supports both local SQLite files and remote Turso connections with the same API, making it straightforward to develop locally and deploy to the cloud. tokio provides the async runtime — all libsql operations are async.

Add the release-build optimisation profile from the project conventions:

[profile.release]
opt-level = "z"
lto = true
strip = true
codegen-units = 1

Opening a Local Connection

Replace src/main.rs with the following:

use libsql::{Builder, Database};

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let db: Database = Builder::new_local("vectors.db").build().await?;
    let conn = db.connect()?;

    // Verify the connection works
    let mut rows = conn.query("SELECT sqlite_version()", ()).await?;
    if let Some(row) = rows.next().await? {
        let version: String = row.get(0)?;
        println!("SQLite version: {version}");
    }

    Ok(())
}

Run it with cargo run. You should see output like:

SQLite version: 3.46.0

A file named vectors.db will appear in the current directory. This is a standard SQLite database — you can open it with any SQLite client to inspect its contents.

Enabling Vector Support

The libsql crate ships with vector search built in. The types and functions below are native to libSQL, so no separate extension such as sqlite-vec needs to be installed or loaded. Vector functions become available automatically once you use the right column types and functions in your SQL.

The key types and functions you will use throughout this course:

| Construct | Purpose |
|---|---|
| F32_BLOB(d) | Column type for storing a d-dimensional float32 vector |
| vector(json_array) | Creates a vector from a JSON array given as text, e.g. '[0.1, 0.2, 0.3]' |
| vector_extract(blob) | Converts a stored vector blob back to a JSON array |
| vector_distance_cos(a, b) | Cosine distance between two vectors (0 = same direction, 2 = opposite direction) |
| libsql_vector_idx(col) | Expression used in CREATE INDEX to build an approximate nearest-neighbour index on a vector column |
| vector_top_k(index_name, query, k) | Table-valued function: searches the named vector index and returns the row IDs of the k rows nearest to a query vector |
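Since vector() takes its input as JSON-array text, Rust code needs a way to turn a slice of floats into that text. The helper below is one possible sketch; to_vector_literal is a hypothetical name, and the code assumes all values are finite, because NaN and infinity have no JSON representation.

```rust
/// Formats a slice of f32 values as JSON-array text such as "[1,0.5,-2.25]",
/// suitable for binding to a `vector(?)` parameter.
/// Assumes every value is finite.
fn to_vector_literal(v: &[f32]) -> String {
    let parts: Vec<String> = v.iter().map(|x| x.to_string()).collect();
    format!("[{}]", parts.join(","))
}

fn main() {
    let embedding = [1.0_f32, 0.5, -2.25];
    let literal = to_vector_literal(&embedding);
    println!("{literal}");

    // The text would then be bound as a query parameter, for example in
    //   INSERT INTO items (label, embedding) VALUES (?1, vector(?2))
    // rather than spliced into the SQL string by hand.
}
```

Binding the text as a parameter (for instance with the libsql::params! macro) keeps the SQL statement constant and avoids quoting problems.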

Creating a Vector Table

Extend main to create a table that stores 3-dimensional float32 vectors:

conn.execute(
    "CREATE TABLE IF NOT EXISTS items (
         id      INTEGER PRIMARY KEY,
         label   TEXT NOT NULL,
         embedding F32_BLOB(3) NOT NULL
     )",
    (),
).await?;

F32_BLOB(3) declares a column that holds a 3-dimensional float32 vector stored as a binary blob. The 3 is the dimensionality — use the actual size of your embedding model's output (e.g., F32_BLOB(768) for a 768-dimensional model) in real projects.

Creating a Vector Index

Without an index, nearest-neighbour search performs a full table scan — computing the distance from the query to every stored vector. For small tables this is fine; at scale you need an index:

conn.execute(
    "CREATE INDEX IF NOT EXISTS items_vec_idx
         ON items (libsql_vector_idx(embedding))",
    (),
).await?;

This creates an approximate nearest-neighbour index over the embedding column. libSQL's vector index is based on the DiskANN algorithm rather than HNSW. Queries search it by passing the index name to vector_top_k. The index is updated incrementally as rows are inserted or deleted, so no manual rebuild is required.

Putting It Together

At this point your main.rs should look like this:

use libsql::{Builder, Database};

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let db: Database = Builder::new_local("vectors.db").build().await?;
    let conn = db.connect()?;

    // Verify connection
    let mut rows = conn.query("SELECT sqlite_version()", ()).await?;
    if let Some(row) = rows.next().await? {
        let version: String = row.get(0)?;
        println!("SQLite version: {version}");
    }

    // Create vector table
    conn.execute(
        "CREATE TABLE IF NOT EXISTS items (
             id        INTEGER PRIMARY KEY,
             label     TEXT NOT NULL,
             embedding F32_BLOB(3) NOT NULL
         )",
        (),
    ).await?;

    // Create the vector (approximate nearest-neighbour) index
    conn.execute(
        "CREATE INDEX IF NOT EXISTS items_vec_idx
             ON items (libsql_vector_idx(embedding))",
        (),
    ).await?;

    println!("Database ready.");
    Ok(())
}

cargo run should print:

SQLite version: 3.46.0
Database ready.

You now have a working local vector database. Exercises 1 through 5 build on this foundation, adding data, querying it, and connecting the full embedding-to-search pipeline.


7. Exercise 1 — Storing and Retrieving Vectors

Goal: Insert a small set of labelled vectors into the items table created in §6, then retrieve them with a SELECT and deserialize the stored blob back into a Rust Vec<f32>. 🚧 Full content tracked in [nbd:081a55].
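The deserialization step can be sketched independently of the database. The code below assumes the stored blob is a plain sequence of 4-byte little-endian IEEE 754 floats with no header or trailer; that layout is an assumption to verify against the libSQL documentation before relying on it. The function names are illustrative.

```rust
/// Decodes a blob into f32 values.
/// ASSUMPTION: the blob is packed little-endian f32s, 4 bytes each.
/// Returns None if the length is not a multiple of 4.
fn blob_to_f32s(blob: &[u8]) -> Option<Vec<f32>> {
    if blob.len() % 4 != 0 {
        return None;
    }
    Some(
        blob.chunks_exact(4)
            .map(|c| f32::from_le_bytes([c[0], c[1], c[2], c[3]]))
            .collect(),
    )
}

/// Encodes f32 values using the same assumed layout.
fn f32s_to_blob(values: &[f32]) -> Vec<u8> {
    values.iter().flat_map(|x| x.to_le_bytes()).collect()
}

fn main() {
    let original = [1.0_f32, -2.5, 0.25];
    let blob = f32s_to_blob(&original);
    println!("{} floats -> {} bytes", original.len(), blob.len());

    // Round-trip back to floats.
    let decoded = blob_to_f32s(&blob);
    println!("decoded = {decoded:?}");
}
```

If the layout assumption turns out not to hold, selecting vector_extract(embedding) and parsing the returned JSON text is an alternative that does not depend on the binary format.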


8. Exercise 2 — K-Nearest Neighbor Search

Goal: Use vector_top_k and vector_distance_cos to find the k vectors in the database most similar to a query vector, and display the results ranked by similarity score. 🚧 Full content tracked in [nbd:5674ce].
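As a starting point, the query for this exercise might take the shape below. This is a sketch under assumptions: it takes vector_top_k to accept the index name as its first argument and to expose the matching rowid in a column named id, as described in the libSQL documentation, and it uses the items table and items_vec_idx index from §6. The helper top_k_sql is hypothetical; it only builds the SQL text so the structure is easy to inspect.

```rust
/// Builds the text of a top-k query.
/// ASSUMPTIONS: `vector_top_k` takes the index name first and returns a
/// column named `id` holding the rowid of each match.
/// The identifiers are spliced in directly, so they must come from trusted
/// code, never from user input. The query vector is bound as parameter ?1.
fn top_k_sql(table: &str, column: &str, index: &str, k: usize) -> String {
    format!(
        "SELECT t.id, t.label, \
         vector_distance_cos(t.{column}, vector(?1)) AS distance \
         FROM vector_top_k('{index}', vector(?1), {k}) AS v \
         JOIN {table} AS t ON t.rowid = v.id \
         ORDER BY distance"
    )
}

fn main() {
    let sql = top_k_sql("items", "embedding", "items_vec_idx", 3);
    println!("{sql}");

    // With a libsql connection, the text would be passed to conn.query
    // together with the query vector as JSON-array text, e.g. "[0.1,0.2,0.3]".
}
```

The index lookup returns candidate rows quickly, and the join back to the table lets the query compute and sort by the exact cosine distance for just those k rows.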


Part 4 — Real Applications

9. Generating Embeddings in Rust

Before you can search by meaning, you need a way to convert text into vectors. This section covers two approaches available in Rust: running a local embedding model with fastembed-rs (no API key, works offline, suited for smaller models) and calling an HTTP embedding API such as the OpenAI Embeddings endpoint (larger, higher-quality models at the cost of latency and a network dependency). 🚧 Full content tracked in [nbd:4c961f].


10. Exercise 3 — Semantic Document Search

Goal: Build a complete semantic search pipeline: embed a small corpus of text documents, store the embeddings in Turso, then accept a natural-language query, embed it, and return the top-k most relevant documents using vector similarity, all without any keyword matching. 🚧 Full content tracked in [nbd:1ef9f4].


11. Exercise 4 — Recommendation Engine

Goal: Implement item-to-item recommendations using vector similarity. Store item feature vectors (or item embeddings learned from user interactions, as in item-based collaborative filtering) in Turso, then, given a target item, retrieve the k most similar items as recommendations. 🚧 Full content tracked in [nbd:e8be9a].


12. Exercise 5 — Retrieval-Augmented Generation

Goal: Combine vector search with a language model to build a retrieval-augmented generation (RAG) pipeline: given a user question, retrieve the most relevant passages from a document store using semantic search, inject them into a prompt as context, and stream the language model's grounded answer back to the user. 🚧 Full content tracked in [nbd:5ed295].
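The "inject them into a prompt as context" step is plain string assembly and can be sketched without a database or a model. The prompt wording and the build_prompt name below are illustrative assumptions; the exercise itself may settle on a different template.

```rust
/// Assembles a grounded prompt from retrieved passages and a user question.
/// Passages are numbered so the model's answer can refer back to them.
fn build_prompt(question: &str, passages: &[&str]) -> String {
    let mut prompt =
        String::from("Answer the question using only the context below.\n\nContext:\n");
    for (i, passage) in passages.iter().enumerate() {
        prompt.push_str(&format!("[{}] {}\n", i + 1, passage.trim()));
    }
    prompt.push_str(&format!("\nQuestion: {}\nAnswer:", question.trim()));
    prompt
}

fn main() {
    // In the full pipeline these passages would come from a vector search.
    let passages = [
        "HNSW is a graph-based index for approximate nearest-neighbour search.",
        "IVF partitions vectors into lists around centroids.",
    ];
    let prompt = build_prompt("What is HNSW?", &passages);
    println!("{prompt}");
}
```

Keeping retrieval and prompt assembly as separate functions makes each half testable on its own before a language model is wired in.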