---
# edu-paqf
title: '§8 Exercise 2: K-Nearest Neighbor Search'
status: completed
type: task
priority: normal
created_at: 2026-03-10T23:30:00Z
updated_at: 2026-03-10T23:30:00Z
---

## §8 Exercise 2 — K-Nearest Neighbor Search — Stub to fill

File: `edu/src/vector-db.md`, section `### 8. Exercise 2 — K-Nearest Neighbor Search`

Replace this stub line with the full exercise:
> **Goal:** Use `vector_top_k` and `vector_distance_cos` [...] 🚧 Full content tracked in [nbd:5674ce].

Follow the exercise format from `edu/src/markov.md`.

## Prerequisites (established in §7)

The reader has the `vec-demo` project with 6 rows in the `items` table (cat, dog, car, truck, python, rust), each with a 3-dimensional embedding.

## Goal

Given a query vector, use `vector_top_k` to find the 3 most similar items, join with the `items` table to retrieve labels and exact cosine distances, and display the results ranked by distance.

## Steps to cover

**Step 1 — Introduce `vector_top_k`.** Explain that this is a table-valued function (TVF) that returns row IDs of approximate nearest neighbours without a full table scan. Syntax:

```sql
SELECT i.rowid FROM vector_top_k('items', vector(?), ?) i
```
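
The first argument is a string literal naming what to search, the second is the query vector, and the third is k. In libSQL's documented usage the first argument is the name of the vector index rather than the table, and the matching row key is exposed as `id`, so check both against the schema created in §7. The function returns row identifiers only, so join back to `items` to get the other columns.

**Step 2 — Full KNN query.** Show the complete query combining the TVF with a JOIN and exact distance computation:
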
```sql
SELECT items.id, items.label, vector_distance_cos(items.embedding, vector(?)) AS dist
FROM vector_top_k('items', vector(?), ?) AS knn
JOIN items ON items.rowid = knn.rowid
ORDER BY dist ASC
```

Note: the query vector must be passed twice: once to `vector_top_k` (index traversal) and once to `vector_distance_cos` (exact distance). Both parameters take the same JSON array string, for example `[0.85, 0.15, 0.25]`.
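
As a minimal sketch of how that JSON array string might be built in the Rust solution, the helper below renders a slice of floats as a JSON array. The name `to_json_array` is hypothetical, and how the resulting string is bound to the two `vector(?)` placeholders depends on the database client set up in §7, which is not shown here. For the query `[0.85, 0.15, 0.25]` it produces `[0.85,0.15,0.25]`.

```rust
/// Render a query vector as a JSON array string.
/// Hypothetical helper for illustration; assumes all values are finite,
/// since NaN or infinity would not produce valid JSON.
fn to_json_array(v: &[f32]) -> String {
    // Rust's float formatting prints the shortest text that round-trips,
    // so 0.85_f32 is rendered as "0.85".
    let parts: Vec<String> = v.iter().map(|x| x.to_string()).collect();
    format!("[{}]", parts.join(","))
}
```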

**Step 3 — Run three queries and print results.**

Query vectors to use:
- `[0.85, 0.15, 0.25]` → nearest neighbours should be cat and dog (animal cluster)
- `[0.15, 0.85, 0.15]` → nearest neighbours should be car and truck (vehicle cluster)
- `[0.1, 0.05, 0.92]` → nearest neighbours should be rust and python (language cluster)

Expected output format:
```
Query: [0.85, 0.15, 0.25]
1. cat dist=0.0023
2. dog dist=0.0089
3. python dist=0.1834
```
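
The output format above could be produced with a small formatting helper. This is only a sketch: `format_hit` and `format_results` are hypothetical names, and the single-space layout and four decimal places are read off the sample output rather than specified anywhere in this task.

```rust
/// Format one ranked result line, e.g. "1. cat dist=0.0023".
/// Hypothetical helper; the layout is inferred from the sample output.
fn format_hit(rank: usize, label: &str, dist: f64) -> String {
    format!("{}. {} dist={:.4}", rank, label, dist)
}

/// Format the full result block for one query.
/// `hits` is assumed to be sorted by ascending distance already.
fn format_results(query: &str, hits: &[(&str, f64)]) -> String {
    let mut out = format!("Query: {}\n", query);
    for (i, (label, dist)) in hits.iter().enumerate() {
        // Ranks are 1-based in the expected output.
        out.push_str(&format_hit(i + 1, label, *dist));
        out.push('\n');
    }
    out
}
```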

**Step 4 — Explain ANN vs. exact search.** For 6 rows, `vector_top_k` effectively performs an exact search: the approximate index (libSQL's is based on DiskANN) has too few nodes to offer a shortcut. At scale (millions of rows) it returns approximate results, and some true nearest neighbours may be missed. `vector_distance_cos` always gives the exact distance for any specific pair.
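
To make the contrast between approximate and exact search concrete, the sketch below does the exact search by brute force: it scores every row with cosine distance and keeps the k smallest. It assumes cosine distance is defined as 1 minus cosine similarity, that embeddings are non-zero 3-dimensional `f32` arrays, and that the rows have already been loaded into memory. `cosine_distance` and `exact_top_k` are hypothetical helper names, not part of any library.

```rust
/// Cosine distance, taken here as 1 - (a . b) / (|a| * |b|).
/// Assumes both slices have the same length and non-zero norm.
fn cosine_distance(a: &[f32], b: &[f32]) -> f32 {
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let norm_a = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let norm_b = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    1.0 - dot / (norm_a * norm_b)
}

/// Exact k-nearest neighbours: a full scan over every row, O(n) per query.
/// This is the baseline that an approximate index trades accuracy against.
fn exact_top_k<'a>(
    rows: &[(&'a str, [f32; 3])],
    query: &[f32; 3],
    k: usize,
) -> Vec<(&'a str, f32)> {
    let mut scored: Vec<(&'a str, f32)> = rows
        .iter()
        .map(|(label, emb)| (*label, cosine_distance(emb, query)))
        .collect();
    // Smallest distance first; total_cmp gives a total order over f32.
    scored.sort_by(|a, b| a.1.total_cmp(&b.1));
    scored.truncate(k);
    scored
}
```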
## Reference solution

Provide the full `main.rs` inside a `<details><summary>Show full solution</summary>` block. The solution should re-run the setup from §7 (create the table, insert the data) and then run the three KNN queries.