# Vector Database Self-Guided Course

This document is a self-guided course on vector databases. It is organized into four parts: conceptual foundations, the internals of vector search systems, hands-on Rust exercises with Turso and sqlite-vec, and real-world application pipelines.

Each section is either a reading lesson or a hands-on Rust programming exercise. Sections marked 🚧 are stubs whose full content is tracked in an `nbd` ticket; follow the ticket ID to find the detailed learning objectives and instructions.

---

## Table of Contents

**Part 1 – Foundations**

1. [What Is a Vector?](#1-what-is-a-vector)
2. [Embeddings](#2-embeddings)
3. [Vector Similarity](#3-vector-similarity)

**Part 2 – Vector Databases**

4. [What Is a Vector Database?](#4-what-is-a-vector-database)
5. [Under the Hood: ANN Algorithms](#5-under-the-hood-ann-algorithms)

**Part 3 – Turso + sqlite-vec Basics**

6. [Setting Up](#6-setting-up)
7. [Exercise 1 – Storing and Retrieving Vectors](#7-exercise-1--storing-and-retrieving-vectors)
8. [Exercise 2 – K-Nearest Neighbor Search](#8-exercise-2--k-nearest-neighbor-search)

**Part 4 – Real Applications**

9. [Generating Embeddings in Rust](#9-generating-embeddings-in-rust)
10. [Exercise 3 – Semantic Document Search](#10-exercise-3--semantic-document-search)
11. [Exercise 4 – Recommendation Engine](#11-exercise-4--recommendation-engine)
12. [Exercise 5 – Retrieval-Augmented Generation](#12-exercise-5--retrieval-augmented-generation)

---

## Part 1 – Foundations

### 1. What Is a Vector?

A **vector** is an ordered list of numbers. That is the entire definition: nothing more exotic than a list where position matters. A two-element list `[3.0, 4.0]` is a vector; so is a 1 536-element list of floating-point values produced by a language model.
What makes vectors useful is that the numbers have a geometric interpretation: each element is a coordinate along one axis of a space, and the vector as a whole names a point (or an arrow from the origin to that point) in that space.

**Geometric intuition in two and three dimensions.** Start with the familiar. A 2-dimensional vector `[x, y]` is a point in the plane, the kind you plot on graph paper. The vector `[3.0, 4.0]` sits three units to the right of the origin and four units up. An arrow drawn from `[0, 0]` to `[3, 4]` has a **magnitude** (length) of `√(3² + 4²) = 5` and points in a specific **direction**. Magnitude and direction together completely characterise the vector; change either one and you have a different vector.

A 3-dimensional vector `[x, y, z]` extends this to physical space: three coordinates, three axes, one point. You can still compute a magnitude, `√(x² + y² + z²)`, and you can still talk about direction. Two 3D vectors point in the same direction if one is a positive scalar multiple of the other; they are **perpendicular** (orthogonal) if their dot product is zero.

**High-dimensional spaces.** Nothing in the definition of a vector limits it to two or three elements. A *d*-dimensional vector `[x₁, x₂, …, x_d]` is a point in *d*-dimensional space. The geometry extends perfectly: magnitude is `√(x₁² + x₂² + … + x_d²)`, the dot product of two vectors is `Σᵢ aᵢ · bᵢ`, and you can compute angles and distances between points just as you would in 2D or 3D.

High-dimensional geometry is counterintuitive in subtle ways that are worth knowing:

- **The curse of dimensionality.** In high-dimensional spaces, most of the volume of a hypersphere is concentrated near its surface rather than its interior. Two randomly chosen high-dimensional vectors from a standard distribution tend to be nearly orthogonal (their dot product is close to zero) even when you have not deliberately constructed them that way.
  This means "nearest neighbour" in high dimensions is a harder problem than it sounds: there are exponentially many directions, and the distances from a query to its nearest and farthest neighbours tend to become similar, so simple distance measures discriminate less sharply than intuition from 2D or 3D suggests.
- **Normalisation changes the geometry.** A **unit vector** has magnitude exactly 1. Dividing a vector by its magnitude (**normalisation**) projects all vectors onto the surface of the unit hypersphere. On that sphere, distance and angle are equivalent measures of similarity: for unit vectors, ‖a − b‖² = 2 − 2 cos(θ), so ranking by Euclidean distance and ranking by angle give the same order. This simplifies many computations. Embedding models often output unit-normalised vectors precisely to exploit this equivalence.
- **Dimensions are not independent features.** When people say a language model embeds words into a 768-dimensional space, they do not mean "dimension 42 encodes the concept of colour." The axes of an embedding space are rarely interpretable on their own. Meaning is encoded in the *relative positions* of points (which vectors are close to which others), not in the values along any single axis.

**Vectors as representations.** The key insight that makes vector databases useful is that real-world objects (documents, images, audio clips, products, users) can be represented as vectors such that *similarity in meaning or content corresponds to proximity in the vector space*. Two documents that discuss the same topic will, if embedded well, produce vectors that are close together. Two documents on unrelated topics will produce vectors that are far apart. This is not magic; it is the result of training a model to produce embeddings where similar inputs cluster near each other. Once you have such a model, every search or comparison problem reduces to a geometric problem: find the vectors closest to a query vector. The rest of this course is about how to do that efficiently at scale.

**A note on notation.** Throughout this course, vectors are written in bold or with subscripts: **v**, **q**, or `v₁`. The *i*-th element of a vector **v** is written `v[i]` or `vᵢ`.
The magnitude of **v** is written `|v|` or `‖v‖`. Dimension is written *d* and the number of stored vectors is written *n*.

---

### 2. Embeddings

**What an embedding is.** An embedding is a function, learned from data rather than hand-crafted, that maps an input to a fixed-size vector of floating-point numbers. The input can be a word, a sentence, an image, a product listing, or anything else that can be fed into a neural network. The output is always the same shape: a `Vec<f32>` of some predetermined length *d*. The function is trained so that inputs with similar meaning produce vectors that are close together in the *d*-dimensional space, while unrelated inputs produce vectors that are far apart. Once you have such a function, comparing the meaning of two inputs reduces to comparing their vectors, which is exactly the geometric problem that vector databases are built to solve.

**Word embeddings and a brief history.** The idea that meaning can live in a vector took hold in 2013 when Mikolov et al. published Word2Vec. Word2Vec trains on raw text and assigns every word in its vocabulary a single static vector, typically of 100 to 300 dimensions. The striking result was that vector arithmetic captured semantic relationships: the vector for *king* minus the vector for *man* plus the vector for *woman* produced a vector closest to *queen*. GloVe (2014) and fastText (2016) refined the approach, but the core limitation remained: each word gets exactly one vector regardless of context. The word *bank* has the same embedding whether it refers to a riverbank or a financial institution. Static word embeddings are largely a historical curiosity today, but they established the foundational principle: meaning can be encoded as geometry.

**Contextual embeddings from encoder models.** Modern embedding models solve the polysemy problem by reading the entire input before producing a vector.
Models such as those in the sentence-transformers library or OpenAI's text-embedding-3-small take a full sentence (or paragraph) as input, process it through a transformer encoder, and output one vector that represents the whole input. Internally these models produce a vector for every token; the single sentence-level vector is obtained either by averaging all token vectors (mean-pooling) or by taking the vector at a special `[CLS]` token position. You do not need to understand transformer internals to use these models. The interface is simple: input is a string, output is a `Vec<f32>` of fixed length. Because the model sees the full context, the same word in different sentences yields different final embeddings, correctly distinguishing *river bank* from *investment bank*.

**What makes a good embedding model.** Embedding models are trained with a **contrastive objective**: given a pair of inputs known to be similar (a question and its answer, two paraphrases, a caption and its image), the loss function pulls their vectors closer together; given a dissimilar pair, it pushes them apart. The quality of the training data (how many pairs, how diverse, how accurately labelled) matters as much as model size. Models are commonly evaluated on the **MTEB** (Massive Text Embedding Benchmark), which measures performance across retrieval, classification, clustering, and semantic similarity tasks. In general, larger models produce better embeddings but cost more compute per input and often return higher-dimensional vectors that consume more storage.

**Practical dimensionalities.** Different models produce different vector sizes, and the choice affects speed, memory, and quality.
Common dimensions include:

- **384** (MiniLM): fast inference, model size around 80–130 MB, a good default for prototyping.
- **768** (BERT-base and many sentence-transformers models): the most common open-source default.
- **1 536** (OpenAI text-embedding-3-small): a strong hosted option balancing quality and cost.
- **3 072** (OpenAI text-embedding-3-large): the larger, higher-quality OpenAI option, at a higher price per token.

Higher dimensionality is not always better: on small datasets or narrow domains, a 384-dimensional model may match or outperform a 1 536-dimensional one while using a quarter of the storage and running faster at query time. Choose based on your task, your latency budget, and empirical evaluation, not on the assumption that bigger is automatically better.

**Embeddings for non-text data.** Vectors are not limited to language. **CLIP** (Contrastive Language-Image Pretraining) trains a text encoder and an image encoder jointly so that their output vectors inhabit the same space. A photo of a dog and the sentence "a photograph of a dog" end up near each other, enabling text-to-image and image-to-text search with no modality-specific logic. Product embeddings can be learned from purchase co-occurrence: items frequently bought together are trained to have nearby vectors, powering recommendation engines. Audio, code, and molecular structures have their own embedding models. The vector database does not care what produced the floats; it stores arrays of `f32` and computes distances. This modality-agnostic storage is one of the reasons vector databases have become a general-purpose building block in modern AI systems.

---

### 3. Vector Similarity

Once you have two vectors, how do you measure how alike they are?
This section covers the three most common similarity and distance functions used in vector search (their formulas, geometric interpretations, and trade-offs), then works through a concrete example so the arithmetic is familiar before you encounter these functions in SQL.

**Cosine similarity.** The cosine similarity of two vectors **a** and **b** is defined as cos(θ) = (a · b) / (‖a‖ · ‖b‖), where a · b is the dot product and ‖a‖ is the magnitude of **a**. The result ranges from −1 to 1: a value of 1 means the vectors point in exactly the same direction, 0 means they are orthogonal (perpendicular), and −1 means they point in exactly opposite directions. The critical property of cosine similarity is that it measures only the *angle* between vectors, ignoring their magnitudes entirely. This makes it ideal for text embeddings: a short document and a long document on the same topic may produce vectors that differ in magnitude but point in nearly the same direction, and cosine similarity correctly identifies them as similar.

**Cosine distance.** Cosine distance is simply 1 − cosine_similarity. Its range is 0 to 2, where 0 means the vectors are identical in direction and 2 means they are fully opposite. This is what the `vector_distance_cos` SQL function used in this course returns. Pay attention to the naming: the function name contains "cos" but it returns a *distance*, not a similarity, so smaller values mean more similar vectors. This is a common source of confusion when writing queries for the first time.

**Dot product.** The dot product of two vectors **a** and **b** is a · b = Σᵢ aᵢbᵢ: multiply corresponding elements and sum the results. For **unit-normalised** vectors (vectors whose magnitude is exactly 1), the dot product equals cosine similarity, because the denominator ‖a‖ · ‖b‖ = 1 · 1 = 1 and cancels out.
For unnormalised vectors, the dot product conflates magnitude and direction: a longer vector will produce a larger dot product even if the angle is the same. Some embedding models are trained specifically for maximum inner product search (MIPS), meaning their vectors are *not* unit-normalised and the raw dot product is the intended similarity metric. The model's documentation or model card will say so when this is the case.

**Euclidean (L2) distance.** The Euclidean distance between two vectors is ‖a − b‖ = √(Σᵢ (aᵢ − bᵢ)²), the straight-line distance between two points in *d*-dimensional space. Its range is 0 to ∞, with 0 meaning the vectors are identical. Unlike cosine similarity, L2 distance is sensitive to vector magnitude: two vectors pointing in the same direction but with different lengths will have a non-zero L2 distance. L2 is most appropriate for low-dimensional geometric or tabular data where absolute coordinate values carry meaning, for example geographic coordinates or sensor readings.

**When to use each.** For text and sentence embeddings, use cosine similarity (or equivalently, dot product if your model outputs unit-normalised vectors, which many do). When in doubt, follow the recommendation on the model card. For low-dimensional geometric features where absolute position matters, use L2 distance.

**Worked example.** Let **a** = [1, 0, 1] and **b** = [1, 1, 0]. Compute all three metrics by hand:

*Dot product:* a · b = (1)(1) + (0)(1) + (1)(0) = 1 + 0 + 0 = 1.

*Magnitudes:* ‖a‖ = √(1² + 0² + 1²) = √2 ≈ 1.414. ‖b‖ = √(1² + 1² + 0²) = √2 ≈ 1.414.

*Cosine similarity:* cos(θ) = 1 / (√2 · √2) = 1 / 2 = 0.5. The cosine distance is 1 − 0.5 = 0.5, which is what `vector_distance_cos` would return.

*Euclidean distance:* ‖a − b‖ = √((1−1)² + (0−1)² + (1−0)²) = √(0 + 1 + 1) = √2 ≈ 1.414.
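The hand calculation above can be checked with a few lines of plain Rust. This is a minimal sketch using only the standard library; the function names (`dot`, `norm`, `cosine_similarity`, `cosine_distance`, `l2_distance`) are illustrative choices for this course, not part of any database API, and the cosine functions assume neither vector is all zeros.

```rust
/// Dot product: multiply corresponding elements and sum the results.
fn dot(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| x * y).sum()
}

/// Magnitude (Euclidean norm) of a vector.
fn norm(a: &[f32]) -> f32 {
    dot(a, a).sqrt()
}

/// Cosine similarity; assumes neither vector is all zeros.
fn cosine_similarity(a: &[f32], b: &[f32]) -> f32 {
    dot(a, b) / (norm(a) * norm(b))
}

/// Cosine distance = 1 - cosine similarity (smaller means more similar).
fn cosine_distance(a: &[f32], b: &[f32]) -> f32 {
    1.0 - cosine_similarity(a, b)
}

/// Euclidean (L2) distance between two points.
fn l2_distance(a: &[f32], b: &[f32]) -> f32 {
    a.iter()
        .zip(b)
        .map(|(x, y)| (x - y).powi(2))
        .sum::<f32>()
        .sqrt()
}

fn main() {
    // The same vectors as in the worked example.
    let a = [1.0_f32, 0.0, 1.0];
    let b = [1.0_f32, 1.0, 0.0];

    println!("dot product       = {:.3}", dot(&a, &b));
    println!("cosine similarity = {:.3}", cosine_similarity(&a, &b));
    println!("cosine distance   = {:.3}", cosine_distance(&a, &b));
    println!("L2 distance       = {:.3}", l2_distance(&a, &b));
}
```

Rounded to three decimal places, the printed values should match the hand-computed 1, 0.5, 0.5, and 1.414. Small floating-point differences beyond that precision are expected with `f32`.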
These three numbers (dot product = 1, cosine similarity = 0.5, L2 distance ≈ 1.414) describe different aspects of the relationship between **a** and **b**. In the exercises that follow, you will see these same computations expressed as SQL function calls over stored vectors.

---

## Part 2 – Vector Databases

### 4. What Is a Vector Database?

A vector database is a data store built around one core operation: given a query vector **q**, return the *k* stored vectors most similar to **q**. Every other feature (indexing, filtering, replication, APIs) exists to make that single operation fast, accurate, and convenient at scale. This section explains why that operation is hard, what problems it solves, and how vector databases compare to the data systems you already know.

**The core operation.** Given a query vector **q** and *n* stored vectors, find the *k* vectors most similar to **q**. This is the *k*-nearest-neighbour (KNN) problem. Exact KNN requires computing the distance from **q** to every stored vector: O(*n* · *d*) work per query. At *n* = 1 000 000 and *d* = 768, that is roughly 768 million multiply-add operations for a single query, which becomes too slow for interactive use as the corpus or the query rate grows. Vector databases solve this by using approximate nearest-neighbour (ANN) algorithms (covered in §5) that trade a small accuracy loss for orders-of-magnitude speed gains. An ANN index can answer the same query in milliseconds by examining only a tiny fraction of the stored vectors.

**Use cases.** The ability to find "semantically similar" items powers a wide range of applications:

- **Semantic search:** find documents that match the *meaning* of a query, not just its keywords. A search for "how to fix a flat tyre" retrieves results about "changing a punctured wheel" even though no words overlap.
- **Recommendation:** given an item a user just viewed or purchased, return the *k* most similar items from the catalogue (§11), or surface content preferred by users with similar taste profiles.
- **Retrieval-Augmented Generation (RAG):** retrieve the most relevant passages from a knowledge base before prompting a large language model, so the model's answer is grounded in real documents rather than its training data alone (§12).
- **Duplicate and near-duplicate detection:** identify items that are semantically identical or extremely close to a given item. This is useful for deduplicating support tickets, detecting plagiarism, or clustering similar product listings.
- **Anomaly detection:** items whose vectors are far from all stored vectors are likely anomalous, enabling outlier detection without hand-crafted rules.
- **Multi-modal search:** find images matching a text description, or vice versa, by storing CLIP-style joint embeddings where text and image vectors share the same space.

**vs. relational databases.** SQL `WHERE` clauses perform exact matches and range queries on scalar values: equality, greater-than, `LIKE`, `IN`. There is no built-in notion of "nearest" for an array of floats. You cannot write `ORDER BY similarity(embedding, ?)` in standard SQL because the concept does not exist in the relational model. Extensions like **pgvector** (PostgreSQL) and **sqlite-vec** (SQLite / Turso) add vector storage and distance functions, and in some cases ANN indexes, to existing relational databases, letting you combine vector search with traditional filtering in a single query. This course uses the vector support available through the `libsql` crate, which means you get vector search without leaving the SQLite ecosystem you may already know.

**vs. full-text search (BM25 / TF-IDF).** Traditional keyword search scores documents by how often query terms appear, weighted by rarity across the corpus.
It works well when users know the exact vocabulary of the documents they want, but it cannot handle synonymy ("car" and "automobile" are unrelated tokens unless you maintain an explicit synonym list), and it has no concept of sentence-level meaning. Vector search captures both synonymy and broader conceptual similarity because the embedding model learns those relationships from data. In practice, **hybrid search**, which combines a BM25 keyword score with an ANN vector score, often outperforms either method alone and is a common pattern in production systems.

**Key metrics.** When evaluating a vector database or an ANN index, four numbers matter:

- **Recall@k:** the fraction of the true *k* nearest neighbours that the ANN algorithm actually returns. A recall@10 of 0.95 means that, on average, 95% of the true top-10 results are found; the remainder are replaced by slightly less similar vectors.
- **QPS (queries per second):** how many queries the index can serve per second at a given recall target. Higher is better; this is the throughput you care about in production.
- **Index build time:** the one-time cost paid to construct the search index from raw vectors. HNSW indexes, for example, require inserting each vector into a multi-layer graph, which can take minutes to hours for large datasets.
- **Memory footprint:** the RAM needed to hold the index. HNSW stores graph edges in RAM alongside the vectors themselves, which limits how large the index can grow on a single machine. Quantisation and disk-backed indexes reduce memory at the cost of recall or latency.

**Where sqlite-vec and Turso fit.** sqlite-vec is an excellent choice for embedded applications, local development, prototyping, and small-to-medium corpora of up to a few million vectors. It runs inside your application process with no separate server, and Turso adds cloud hosting, replication, and edge caching on top of the same SQLite foundation.
For larger-scale deployments (tens of millions of vectors, multi-tenancy, complex filtered search, or distributed indexing), dedicated vector databases such as Pinecone, Qdrant, or Weaviate provide additional infrastructure. The concepts you learn in this course transfer directly: the same embeddings, distance functions, and query patterns apply regardless of which engine you choose.

---

### 5. Under the Hood: ANN Algorithms

**Why not exact search?** Brute-force KNN computes the distance from the query vector to every stored vector: O(*n* · *d*) work per query. At *n* = 1 000 000 vectors, *d* = 768 dimensions, and 1 000 queries per second, that is roughly 768 billion multiply-add operations per second, which is impractical on a single commodity CPU. Approximate nearest-neighbour (ANN) algorithms find results in O(log *n*) or otherwise sub-linear time at the cost of occasionally missing a few true nearest neighbours. Two widely used families are HNSW and IVFFlat.

**HNSW – Hierarchical Navigable Small World.** HNSW is the dominant algorithm for in-memory ANN. Imagine a multi-level skip list where each level is a proximity graph. The top level is sparse, containing only a small subset of nodes connected by long-range edges that enable fast coarse navigation across the dataset. Each subsequent level adds more nodes and shorter-range edges, increasing density. The bottom level contains every vector, connected to its nearest neighbours by short-range edges that enable precise local search.

When a query arrives, the algorithm starts at an entry point on the top level and greedily moves to whichever neighbour is closest to the query vector. When no neighbour on the current level is closer than the current node, the algorithm descends one level and repeats the greedy walk with the denser graph. At the bottom level, it collects the *k* nearest candidates encountered during traversal and returns them as the result.
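The greedy walk described above can be sketched for a single graph layer. This is a deliberately simplified illustration, not a full HNSW implementation: it has one layer only, tracks a single best node rather than a candidate list, and uses a tiny hand-built graph. The names `greedy_search`, `l2_squared`, and `example_graph` are hypothetical helpers invented for this sketch.

```rust
/// Squared Euclidean distance (sufficient for comparisons; skips the square root).
fn l2_squared(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| (x - y).powi(2)).sum()
}

/// Greedy walk on one proximity-graph layer: from `entry`, keep hopping to the
/// neighbour closest to `query`; stop when no neighbour improves on the current node.
fn greedy_search(
    vectors: &[Vec<f32>],
    neighbours: &[Vec<usize>],
    entry: usize,
    query: &[f32],
) -> usize {
    let mut current = entry;
    let mut current_dist = l2_squared(&vectors[current], query);
    loop {
        let mut next = current;
        let mut next_dist = current_dist;
        for &candidate in &neighbours[current] {
            let d = l2_squared(&vectors[candidate], query);
            if d < next_dist {
                next = candidate;
                next_dist = d;
            }
        }
        if next == current {
            return current; // local minimum: no neighbour is closer to the query
        }
        current = next;
        current_dist = next_dist;
    }
}

/// Five points on a line, each linked to its immediate left and right neighbour.
fn example_graph() -> (Vec<Vec<f32>>, Vec<Vec<usize>>) {
    let vectors = vec![
        vec![0.0, 0.0],
        vec![1.0, 0.0],
        vec![2.0, 0.0],
        vec![3.0, 0.0],
        vec![4.0, 0.0],
    ];
    let neighbours = vec![vec![1], vec![0, 2], vec![1, 3], vec![2, 4], vec![3]];
    (vectors, neighbours)
}

fn main() {
    let (vectors, neighbours) = example_graph();
    let query = [3.2_f32, 0.1];
    let found = greedy_search(&vectors, &neighbours, 0, &query);
    println!("nearest node reached by greedy walk: {found}");
}
```

In a real HNSW index the node returned by this walk on an upper layer becomes the entry point for the next layer down, and the bottom layer keeps a list of candidates rather than a single node so that it can return *k* results.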
**HNSW key parameters:**

- **`M`**: the number of bidirectional connections each node maintains per layer. Higher *M* improves recall (the algorithm has more paths to explore) but increases memory consumption and slows down inserts because more edges must be evaluated and updated. A typical default is 16.
- **`ef_construction`**: the size of the dynamic candidate list used when inserting a new vector into the graph. Higher values produce a higher-quality index (better-connected graph) at the cost of slower index construction. A typical default is 200.
- **`ef_search`**: the size of the candidate list used during query-time traversal. Higher values improve recall at the cost of higher query latency. It should be at least *k*, and increasing it is the easiest way to trade latency for accuracy at query time.

HNSW supports incremental inserts with no full rebuild: each new vector is linked into the existing graph structure. The vector index created in §6 likewise accepts incremental inserts and needs no separate training step. The graph adds on the order of *n* · *M* edge entries (about 4 bytes each with 32-bit node IDs) on top of the vectors themselves.

**IVFFlat – Inverted File with flat (uncompressed) vector storage.** IVFFlat is a common approach for disk-based or GPU-accelerated ANN and is available in libraries such as Faiss and in pgvector. The idea is to partition the dataset into `nlist` Voronoi cells using k-means clustering during a one-time training step. Each cell is defined by a centroid vector, and every stored vector is assigned to the cell whose centroid is closest. At query time, the algorithm computes the distance from the query to all `nlist` centroids, selects the `nprobe` nearest centroids, and then performs exact brute-force search only within those cells, skipping the vast majority of the dataset entirely.

**IVFFlat key parameters:**

- **`nlist`**: the number of clusters (Voronoi cells). A common heuristic is to set `nlist` ≈ √*n*.
  More clusters mean each cell is smaller, so query-time search within a cell is faster, but training takes longer and very small cells increase the risk of a query's true neighbours falling in an unsearched cell.
- **`nprobe`**: the number of clusters examined at query time. Higher `nprobe` improves recall at the cost of higher latency. Setting `nprobe` = `nlist` degenerates to exact search; setting `nprobe` = 1 checks only the single most likely cluster.

Unlike HNSW, IVFFlat requires a training step (the k-means clustering) before any data can be inserted. Incremental inserts require assigning each new vector to an existing cluster, which can degrade quality over time as the data distribution drifts from the original centroids; periodic retraining is recommended for heavily updated datasets. IVFFlat uses less memory than HNSW for the same *n* because it does not store graph edges.

**What the §6 index uses.** The `libsql_vector_idx` index type you will create in §6 is a graph-based ANN index, which is why rows can be inserted incrementally with no training step. Turso's documentation describes it as an implementation of DiskANN, a graph index related to HNSW that is designed to work from disk, rather than HNSW itself. The HNSW walkthrough above is still a useful mental model for how a graph index navigates toward a query. Check the documentation for your libSQL version to see which tuning parameters are exposed; sensible defaults are chosen for broad applicability.

**Summary table.**

| Property | HNSW | IVFFlat |
|---|---|---|
| Query time | O(log *n*) | O(*nprobe* · *n* / *nlist*) |
| Insert | Incremental | Batch (requires training) |
| Memory | Higher (graph edges) | Lower |
| Typical recall@10 at defaults (dataset-dependent) | ~0.95+ | ~0.90+ (depends on *nprobe*) |
| Available in | Qdrant, Weaviate, pgvector, Faiss | pgvector, Faiss |

---

## Part 3 – Turso + sqlite-vec Basics

### 6. Setting Up

This section walks through everything you need before writing a single SQL query: adding the right crates, opening a local Turso connection, and enabling the vector support that the rest of the course relies on.

#### What You Are Building

Turso is a SQLite-compatible database with built-in support for vector similarity search.
In local development you use a file-backed SQLite database; in production the same code points at a Turso cloud database. The `libsql` crate (the Rust client for Turso) speaks the Turso wire protocol and also handles local SQLite files transparently.

#### Cargo.toml

Create a new binary project:

```sh
cargo new vec-demo
cd vec-demo
```

Replace the `[dependencies]` section of `Cargo.toml` with:

```toml
[dependencies]
libsql = "0.9"
tokio = { version = "1", features = ["full"] }
```

`libsql` is the official Rust client for Turso / libSQL databases. It supports both local SQLite files and remote Turso connections with the same API, making it straightforward to develop locally and deploy to the cloud. `tokio` provides the async runtime; all `libsql` operations are `async`.

Add the release-build optimisation profile from the project conventions:

```toml
[profile.release]
opt-level = "z"
lto = true
strip = true
codegen-units = 1
```

#### Opening a Local Connection

Replace `src/main.rs` with the following:

```rust
use libsql::{Builder, Database};

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Open (or create) a local SQLite file.
    let db: Database = Builder::new_local("vectors.db").build().await?;
    let conn = db.connect()?;

    // Verify the connection works
    let mut rows = conn.query("SELECT sqlite_version()", ()).await?;
    if let Some(row) = rows.next().await? {
        let version: String = row.get(0)?;
        println!("SQLite version: {version}");
    }

    Ok(())
}
```

Run it with `cargo run`. You should see output like:

```
SQLite version: 3.46.0
```

The exact version number depends on the `libsql` release you build against. A file named `vectors.db` will appear in the current directory. This is a standard SQLite database; you can open it with any SQLite client to inspect its contents.

#### Enabling Vector Support

Vector support is built into libSQL, so the `libsql` crate needs no separate extension and no extra installation step. Vector functions become available automatically once you use the right column types and functions in your SQL. Note that the standalone `sqlite-vec` SQLite extension exposes a different SQL interface; every construct used in this course is one of libSQL's built-in vector functions.
The key types and functions you will use throughout this course:

| Construct | Purpose |
|---|---|
| `F32_BLOB(d)` | Column type for storing a *d*-dimensional float32 vector |
| `vector(json_array)` | Creates a vector from a JSON array literal |
| `vector_extract(blob)` | Converts a stored vector blob back to a JSON array |
| `vector_distance_cos(a, b)` | Cosine distance between two vectors (0 = identical, 2 = opposite) |
| `libsql_vector_idx(col)` | Index type for fast approximate nearest-neighbour search |
| `vector_top_k(index_name, query, k)` | Table-valued function: returns the row IDs of the *k* nearest rows to a query vector, looked up through the named vector index |

#### Creating a Vector Table

Extend `main` to create a table that stores 3-dimensional float32 vectors:

```rust
conn.execute(
    "CREATE TABLE IF NOT EXISTS items (
        id        INTEGER PRIMARY KEY,
        label     TEXT NOT NULL,
        embedding F32_BLOB(3) NOT NULL
    )",
    (),
).await?;
```

`F32_BLOB(3)` declares a column that holds a 3-dimensional float32 vector stored as a binary blob. The `3` is the dimensionality; in real projects, use the actual size of your embedding model's output (e.g., `F32_BLOB(768)` for a 768-dimensional model).

#### Creating a Vector Index

Without an index, nearest-neighbour search performs a full table scan, computing the distance from the query to every stored vector. For small tables this is fine; at scale you need an index:

```rust
// The index expression wraps the column in libsql_vector_idx(...);
// SQLite's CREATE INDEX has no USING clause.
conn.execute(
    "CREATE INDEX IF NOT EXISTS items_vec_idx
     ON items (libsql_vector_idx(embedding))",
    (),
).await?;
```

This creates an approximate nearest-neighbour index over the `embedding` column. Queries that call `vector_top_k` with the index name (`items_vec_idx`) search through this index. The index is updated incrementally as rows are inserted or deleted, so no manual rebuild is required.
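For intuition about what the full table scan mentioned above computes, here is a minimal in-memory sketch in plain Rust: score every stored vector against the query, sort by cosine distance, and keep the first *k*. It does not touch the database. The helper names (`top_k`, `sample_items`, `cosine_distance`) and the sample data are assumptions made for illustration, and the distance function assumes no all-zero vectors.

```rust
/// Cosine distance (1 - cosine similarity); assumes neither vector is all zeros.
fn cosine_distance(a: &[f32], b: &[f32]) -> f32 {
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let norm_a = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let norm_b = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    1.0 - dot / (norm_a * norm_b)
}

/// Exact (brute-force) k-nearest-neighbour search: O(n * d) work per query.
fn top_k(items: &[(String, Vec<f32>)], query: &[f32], k: usize) -> Vec<(String, f32)> {
    // Score every stored vector against the query.
    let mut scored: Vec<(String, f32)> = items
        .iter()
        .map(|(label, v)| (label.clone(), cosine_distance(v, query)))
        .collect();
    // Smallest distance first, then keep the k best.
    scored.sort_by(|x, y| x.1.total_cmp(&y.1));
    scored.truncate(k);
    scored
}

/// Hypothetical rows shaped like the `items` table: a label plus a 3-d vector.
fn sample_items() -> Vec<(String, Vec<f32>)> {
    vec![
        ("x-axis".to_string(), vec![1.0, 0.0, 0.0]),
        ("y-axis".to_string(), vec![0.0, 1.0, 0.0]),
        ("diagonal".to_string(), vec![1.0, 1.0, 0.0]),
    ]
}

fn main() {
    let items = sample_items();
    let query = [1.0_f32, 0.1, 0.0];
    for (label, distance) in top_k(&items, &query, 2) {
        println!("{label}: cosine distance {distance:.3}");
    }
}
```

An ANN index exists to avoid the `map` over every row: it returns (approximately) the same top *k* while computing distances for only a small fraction of the stored vectors.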
#### Putting It Together

At this point your `main.rs` should look like this:

```rust
use libsql::{Builder, Database};

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let db: Database = Builder::new_local("vectors.db").build().await?;
    let conn = db.connect()?;

    // Verify connection
    let mut rows = conn.query("SELECT sqlite_version()", ()).await?;
    if let Some(row) = rows.next().await? {
        let version: String = row.get(0)?;
        println!("SQLite version: {version}");
    }

    // Create vector table
    conn.execute(
        "CREATE TABLE IF NOT EXISTS items (
            id        INTEGER PRIMARY KEY,
            label     TEXT NOT NULL,
            embedding F32_BLOB(3) NOT NULL
        )",
        (),
    ).await?;

    // Create the ANN vector index (the column is wrapped in libsql_vector_idx)
    conn.execute(
        "CREATE INDEX IF NOT EXISTS items_vec_idx
         ON items (libsql_vector_idx(embedding))",
        (),
    ).await?;

    println!("Database ready.");
    Ok(())
}
```

`cargo run` should print something like:

```
SQLite version: 3.46.0
Database ready.
```

You now have a working local vector database. Exercises 1 through 5 build on this foundation, adding data, querying it, and connecting the full embedding-to-search pipeline.

---

### 7. Exercise 1 – Storing and Retrieving Vectors

**Goal:** Insert a small set of labelled vectors into the `items` table created in §6, then retrieve them with a `SELECT` and deserialize the stored blob back into a Rust `Vec<f32>`.

🚧 Full content tracked in [nbd:081a55].

---

### 8. Exercise 2 – K-Nearest Neighbor Search

**Goal:** Use `vector_top_k` and `vector_distance_cos` to find the *k* vectors in the database most similar to a query vector, and display the results ranked by similarity score.

🚧 Full content tracked in [nbd:5674ce].

---

## Part 4 – Real Applications

### 9. Generating Embeddings in Rust

Before you can search by meaning, you need a way to convert text into vectors.
This section covers two approaches available in Rust: running a local embedding model with `fastembed-rs` (no API key, works offline, suited for smaller models) and calling an HTTP embedding API such as the OpenAI Embeddings endpoint (larger, higher-quality models at the cost of latency and a network dependency).

🚧 Full content tracked in [nbd:4c961f].

---

### 10. Exercise 3 – Semantic Document Search

**Goal:** Build a complete semantic search pipeline: embed a small corpus of text documents, store the embeddings in Turso, then accept a natural-language query, embed it, and return the top-*k* most relevant documents using vector similarity, all without any keyword matching.

🚧 Full content tracked in [nbd:1ef9f4].

---

### 11. Exercise 4 – Recommendation Engine

**Goal:** Implement item-based collaborative filtering using vector similarity. Store item feature vectors (or learned item embeddings) in Turso, then given a target item, retrieve the *k* most similar items as recommendations.

🚧 Full content tracked in [nbd:e8be9a].

---

### 12. Exercise 5 – Retrieval-Augmented Generation

**Goal:** Combine vector search with a language model to build a retrieval-augmented generation (RAG) pipeline: given a user question, retrieve the most relevant passages from a document store using semantic search, inject them into a prompt as context, and stream the language model's grounded answer back to the user.

🚧 Full content tracked in [nbd:5ed295].