Group
← Back
Part 1 of 4 · Vector search

Vector search - The problem

May 4, 2026MLAlgorithms

Suppose you have a list of N things and someone hands you a query. You want to find the thing in the list most similar to the query.

To make this tractable, you represent your things as points in some space and define a distance between points. "Most similar" becomes "closest". This reframing is the entire setup for nearest neighbor search, and it shows up everywhere:

  • A directory of restaurants stored as (latitude, longitude). Query: where you're standing. Closest entry: nearest restaurant.
  • A library of paint swatches stored as (R, G, B). Query: a color sampled from a photo. Closest entry: best matching paint.
  • A database of protein conformations stored as 3K-dimensional vectors (K atoms with x, y, z each). Query: a newly observed conformation. Closest entry: most structurally similar known protein.
  • A collection of sentences turned into 768-dimensional embedding vectors by a language model. Query: a search phrase. Closest entry: most semantically similar sentence in the database.

Same shape every time. Points in some space, a distance function, and the goal of finding the closest point to a query.

The problem has a few standard variants. Single nearest returns one closest point. k-nearest (k-NN, where NN stands for nearest neighbor) returns the k closest. Range search returns every point within distance r. Approximate nearest neighbor (ANN) returns a point that's close to the closest, accepting some error in exchange for speed.

The metric also varies. Each one captures a different notion of "how far apart":

  • Euclidean distance is the straight-line distance between two points, the d-dimensional Pythagoras (square root of the sum of squared coordinate differences).
  • Manhattan distance (also called L1, or taxicab distance) is the sum of absolute coordinate differences, named for the grid-walking distance between two intersections in a city where you can't cut through buildings.
  • Cosine distance measures the angle between two vectors and ignores their magnitudes entirely, which makes it the standard choice for embeddings where direction encodes meaning and length doesn't.
  • Hamming distance is the number of positions where two equal-length strings differ, useful for bit vectors and short binary codes.

The structure of the problem stays roughly the same regardless of which metric you pick, though the algorithms have to be careful about which one they assume. For the rest of this article I'll mean Euclidean distance and a single nearest neighbor, unless I say otherwise.

One dimension: easy

If your "points" are scalar numbers, the problem is trivial. Sort the list once. To answer a query Q, binary-search for the position where Q would fit, and check the two list entries on either side. The closer of the two is the answer.

Binary search itself is repeated halving: compare Q to the middle element of the sorted list, discard whichever half can't contain the answer, and recurse on the half that can. Each step halves the remaining search space, so after steps you're down to a single position. Twenty steps for a million-element list, thirty for a billion.

preprocess: sort list                   O(N log N)
query:      i = binary_search(list, Q)  O(log N)
            return whichever of list[i-1], list[i] is closer to Q

The cost grows so slowly that you can stop thinking about it.

The reason this works is structural. In one dimension, sorted order is a representation of geometric proximity. Two numbers that are next to each other in the sorted array are exactly the closest pair around them on the number line. Binary search is just descending into a recursive partition of a 1D space, with the sort-order acting as a perfectly informative summary of where every point lies relative to every other.

Two dimensions: still tractable, more work

In 2D you cannot just sort by one coordinate. Two points can sit next to each other in x but be far apart in y. Sorted order along one axis tells you nothing about closeness in the plane.

You need a structure that respects both axes at once. The classical answer is the kd-tree (k-dimensional tree, where the k is a placeholder for however many dimensions you happen to be working in). It's a binary tree that recursively cuts the plane into smaller and smaller rectangles, with each cut chosen so that roughly half the points fall on each side.

To make a single cut, you pick an axis and draw a line perpendicular to it. Pick the x-axis, you get a vertical line; pick y, you get a horizontal one. The reason for axis-aligned cuts is that "which side is the query on" reduces to a single coordinate comparison: is the query's x bigger or smaller than the line's x?

Where do you draw the line? Through the median point along that axis. The phrase needs unpacking. The median of a list of numbers is the middle entry once they're sorted: in [1, 3, 5, 8, 9], the median is 5. The "median point along an axis" extends this to 2D points: take all your points, sort them by just one of their coordinates (ignore the other one entirely), and pick the middle entry. Its coordinate on the chosen axis is the splitting value, and the cutting line passes through it.

Why the median, specifically? Because it splits the points into two groups of equal size. Equal-sized children means a balanced tree of depth , which is what makes the descent cheap. If you split at some arbitrary other point, one side could end up with most of the data and the tree would degenerate toward a list.

Here's a concrete build with seven points:

Pointxy
A14
B37
C52
D68
E83
F26
G75

At the root we'll split on . Sort the points by their -coordinate alone: A, F, B, C, D, G, E (with values ). The median, the 4th of 7, is C with . Draw a vertical line at . The three points with (A, F, B) become the left subtree. The three with (D, G, E) become the right subtree. C itself sits at the root and remembers that its splitting axis was and its splitting value was .

At depth 1 we split on , alternating axes as we descend. In the left subtree, the points sorted by are A (), F (), B (). Median: F with . Within the left half-plane (the region where ), draw a horizontal line at . A is below the line, B is above, F sits at this internal node.

Symmetric work on the right. Sort D, G, E by : E (), G (), D (). Median: G with . Within the right half-plane (), draw a horizontal line at . E goes below, D goes above, G sits at the node.

The resulting tree:

  • Root: C, splits the plane on .
    • Left child: F, splits the left half-plane on .
      • Left leaf: A.
      • Right leaf: B.
    • Right child: G, splits the right half-plane on .
      • Left leaf: E.
      • Right leaf: D.

The plane is now carved into four rectangles, each holding one leaf, with the cutting lines passing through C, F, and G. Each internal node is associated with a rectangular region (the part of the plane that ends up below it in the tree), and the regions get smaller as you descend.

Now the search. To find the nearest neighbor of a query Q, you walk down the tree to a leaf, then walk back up checking whether anything was missed.

The descent. Start at the root. At each internal node, compare Q's coordinate on the node's splitting axis to the node's splitting value. Go left if smaller, right if larger. Repeat until you hit a leaf. The point at that leaf is your first candidate for the nearest neighbor; track its distance from Q as your current best.

Why a single descent isn't enough. The descent only ever picked sides based on splitting lines, never on actual distances to points. It lands you in the leaf whose rectangular cell contains Q, which is a reasonable starting guess, but not necessarily the right answer. Q can sit near the edge of its cell, with the actual nearest neighbor a short hop across a splitting line in a neighboring cell. The leaf can't see across its own boundary. The backtrack pass is what catches those cases.

The pruning bound. As you walk back up, you keep track of the closest point seen so far, call its distance best. At each ancestor you have to ask: could the other subtree, the one the descent didn't enter, contain anything closer than best?

There's a clean geometric answer. Every kd-tree splitting line is axis-aligned, so the shortest distance from Q to that line is simply the gap along the splitting axis: , where and are the coordinates of the query and the splitting line on that axis. Any point on the far side of the line lives at least that far from Q, since reaching it from Q requires crossing the line. So the comparison is straightforward: if , the entire far subtree is too far away to matter and you skip it. Otherwise the line passes within of Q, the far side might genuinely contain a closer point, and you have to recurse: descend into the far subtree the same way you descended from the root, leaf-search and all.

While you're at the ancestor, also check whether the ancestor's own point (internal nodes hold real points, not just splitting lines) is closer than best, and update best if so.

A worked search. Take the tree built above and query . The actual nearest neighbor will turn out to be F, but it's a couple of levels above the leaf the descent reaches.

Descent. At C (splits ), , go left. At F (splits ), , go left. Hit leaf . Initial best: A at distance .

Back up to . Distance from Q to F is , better than , so best becomes F at . F's splitting line is . Gap from Q to the line: , which is less than . The far subtree might hold a closer point, so descend into it. The far side is the single leaf , at distance . Tie; nothing changes.

Back up to . Distance from Q to C is , worse than . C's splitting line is . Gap: , less than , can't prune. Descend the right subtree. At , distance to Q is , worse than . G's split is , sits exactly on the line, gap is , so both sides have to be searched: gives distance , gives , neither beats .

Final answer: F at distance .

The animation below walks through the build and the search one move at a time. Dashed gray lines are partitions; the yellow ring marks the node we're currently examining; the green ring is the running best, with a green segment to Q showing its distance; faded gray segments are points already visited that didn't improve on the best; the blue circle is the search radius (current best distance); the orange perpendicular shows the pruning bound, the minimum distance to anything in the far subtree.

1 / 16
Seven points

Seven points labeled A through G. We'll build a kd-tree on them, then search for the nearest neighbor of Q=(4, 5).

This particular query landed near every splitting line, so almost no pruning fired and the search visited every node. That's the worst case. On a typical query with reasonably spread-out data, most splits sit well away from Q, the bound fires at most levels, and the average cost works out to about . That's what earns the structure its keep.

There's another structure worth knowing about, the Voronoi diagram: a partition of the plane into cells, where each cell consists of all locations closest to one specific point. In 2D, you can precompute it in time, store it in space, and answer queries in by point-locating Q in the diagram. This gives you provably optimal query time in 2D.

Three dimensions, and a bit beyond

Both ideas extend to 3D. The kd-tree alternates among three axes instead of two. Pruning still works because a 3D bounding box still gives you a usable lower bound on the distance from Q to any point inside it. The Voronoi diagram extends too, but its combinatorial complexity grows, and it stops being a clean win past 3D.

Up to maybe 10 or 15 dimensions, kd-trees and their relatives (ball trees, R-trees, cover trees) give you sublinear search on roughly uniform data. The exact crossover depends on the data and the implementation, but the regime is well understood.

Above that, something starts to break. Pruning fires less and less. The tree starts to behave like a linear scan with bookkeeping overhead. To see why, we need to compute the expected distance and its variance, and to do that precisely we'll first lock down what those terms mean.

A short detour: density, expectation, variance

Three ideas show up in the next derivation and deserve a clean definition.

Probability density. Pick a uniformly random number from . What's the probability of landing exactly at ? Zero. The interval has uncountably many points, and if any single one of them carried positive probability the total across all of them would exceed 1. So probability has to live in intervals, not in points. The probability of landing in , for instance, is , since that interval covers 20% of the line. The general pattern declares a density , probability per unit length, and reads probabilities as areas under it:

The total area must be 1, since has to land somewhere. For uniform on , "every value equally likely" forces the density to be the same at every point in the interval. Call that constant . The area under over the interval is then a rectangle of width 1 (the length of ) and height , so its area is . Setting that equal to 1 gives . Hence on and elsewhere.

Expectation. A plain average of over an interval is : integrate the values and divide by the total weight, which here is just the interval's length. A weighted average gives each point its own weight instead of treating them all equally, and divides by the total weight . Probability densities are built to integrate to 1, so that division is already implicit, and the formula collapses to:

For uniform on , on the interval, so weighting by it doesn't change anything and the integral simplifies to . That is why the earlier derivation could write .

Variance. Variance is the expected squared deviation from the mean:

It quantifies how far values typically sit from the mean. The standard deviation has the same units as and is the more interpretable form.

The normal distribution. The bell-shaped density specified by a mean and a variance :

It dominates probability theory because of the central limit theorem: a sum of many independent random variables tends toward a normal distribution, regardless of the shape of the individual terms. Squared distance between two random points is exactly such a sum, with one term per dimension, which is why we will see its distribution become more and more bell-shaped (and tighter) as grows.

The chart below bundles all four ideas. The blue curve is the density; the green vertical line is the mean ; the dashed yellow lines mark . Drag the sliders, or click Animate σ to see the curve breathe.

blue: density  ·  green: mean μ  ·  dashed yellow: μ ± σ

With those tools in place, we can ask precisely how distances behave when grows.

What happens to distances as the dimension grows

Pick two random points and in dimensions. Each coordinate is a fresh random number between 0 and 1, picked so that every value in is equally likely (a uniform distribution: no clustering toward any part of the range, no bias) and so that knowing any one coordinate tells you nothing about any of the others (the coordinates are independent). and are two random points scattered into the unit cube with no preferred direction or region.

The squared distance between them is a Pythagoras sum over terms:

What follows is the full derivation that the mean of this sum grows like , the variance like , and their ratio (the coefficient of variation) shrinks like . If you trust the result, skip past it; the table and demo right after make the same point in concrete numbers.

▸ Show the full derivation (mean, variance, and coefficient of variation)

We want , the expected value of this -term sum. Two facts about expectation reduce the whole thing to a one-term computation.

First, expectation is linear: for any random variables and , . (This drops out of the integral definition of expectation, since the integral of a sum is the sum of the integrals.) Applied to our -term sum:

Second, every term has the same distribution. Each pair is two independent uniform draws on , identical setup to every other pair, so comes out to the same value for every . The sum collapses to copies of one number:

where and stand in for one independent uniform pair. The -dimensional answer is just times the one-term mean, so the rest of this section computes that one term.

Expand the square and take expectations of each piece. The cross-term factors because and are independent ( when the variables are independent), so:

To evaluate the right-hand side, we need and . From the detour, the expected value of any function for uniform on is the plain integral of over the interval:

Apply that with and :

is the midpoint of the interval, the obvious answer for a uniform distribution. is less obvious: squaring shrinks fractions ( gives , gives ), so is biased toward small values, and its average lands at , well below the midpoint of the range .

has the same moments by symmetry. Plug everything in:

Each squared coordinate difference has mean . Multiply by from the bridge above:

So the expected squared distance grows linearly with . The expected distance grows like , which keeps getting larger as increases.

That's the mean. But the mean alone doesn't tell us whether all pairs of points sit close to it or whether they spread out around it. To know that, we need the variance.

Now the variance of . From the detour, . Apply that with :

The squared piece is just , since we already have .

For the fourth moment, the same expand-and-take-expectations strategy from before applies. Expand :

Take expectations term by term. By independence every cross-term factors into a product of expectations. The single-variable moments for a uniform on are , so , , (and the same for by symmetry). Plug in:

Putting the two pieces together:

For independent random variables (which our coordinate differences are, by construction), variance is additive: . (The cross-term in vanishes when and are independent.) Combined with identical distribution, the -term sum collapses the same way the mean did:

And the standard deviation is:

Now the interesting ratio. Divide the standard deviation by the mean:

This is the coefficient of variation () of the squared distance. It tells you how big a typical fluctuation is, measured in units of the mean. As grows, the shrinks like . Concretely:

Dimension
10.1670.1971.183
20.3330.2790.837
30.5000.3420.683
101.6670.6240.374
10016.671.9720.118
768128.05.4650.043
100001666.719.720.012

The same effect, plotted: the distribution of squared distances normalised so the mean sits at 1. At low d the curve is a wide bump; bumping d up a couple of orders of magnitude collapses it into a near-spike. The clipping at the top of the chart for large d is the visual signature of the concentration.

dimension d2
density of normalised squared distance, mean fixed at 1 (log-scaled slider, d ranges 1 → 1000)

In 1D, the standard deviation is bigger than the mean. Distances are wildly variable. A pair of random points might be very close or very far apart, with no particular preference.

In 100D, the standard deviation is about 12% of the mean. Most random pairs sit within about 12% of the average squared distance.

In 768D, the standard deviation is 4% of the mean. Almost every pair has nearly the same squared distance.

In 10000D, it's 1%. The whole distribution of pairwise distances has collapsed into a tight band.

In short, for anyone who skipped the algebra: this is the law of large numbers wearing a geometric hat. Squared distance is a sum of independent contributions, one per coordinate. Adding more dimensions means more terms in the sum, and as you sum more independent random terms, the total settles closer to its expected value in relative terms. The typical distance grows, and the absolute spread between different pairs' distances grows too, but slower: the spread as a fraction of the typical distance shrinks as . By 768 dimensions, that fraction is around 4%, and every pair of random points sits at almost the same distance from every other. That is the curse of dimensionality, derived from first principles.

What this means for the algorithm

In 768 dimensions, take a query point and a million random database points. Compute the distance from the query to each. The numbers all come out around , with a spread of about 2%. The closest of the million is around 11.0 from the query. The farthest is around 11.6. The "nearest neighbor" is about 5% closer than the "farthest neighbor", out of a million candidates.

When everything is roughly equidistant, "nearest" stops being a useful concept. And exact algorithms that rely on geometric pruning lose their grip.

Recall how the kd-tree pruning works. At each node, you compare the splitting hyperplane's distance from Q against your current best distance. If the hyperplane is farther, you skip the other subtree. If closer, you descend.

In 2D with real spread in distances, your current best might be 0.1 while a hyperplane sits 0.3 away. You skip cleanly. Pruning fires.

In 768D, your current best is around 11.0, because that's roughly what every point's distance to Q is. The hyperplanes inside your unit cube are all closer than 11.0 to a typical query, because the cube's "size" along any axis is just 1, much smaller than the typical pair distance. Every hyperplane is closer than your best. Pruning never fires. You descend every branch of the tree. Linear scan with bookkeeping overhead.

This is not a kd-tree-specific problem. Every exact nearest-neighbor structure that relies on geometric separation, ball trees, R-trees, cover trees, locality-sensitive hashing in its provable forms, hits the same wall. The space itself stops carrying useful information about which points are closer than which. No partitioning scheme can extract progress from a uniform distribution of distances.

Why this matters in practice

The natural objection is: who cares about random points in [0, 1]^d? Real data isn't random. And that's correct. The curse as derived above applies to data that's uniformly random in the full ambient dimension. Real embedding vectors aren't.

A 768-dimensional sentence embedding doesn't fill the 768D space. It sits on a much lower-dimensional manifold inside it. The model that produced the embedding was trained to place semantically related sentences near each other and unrelated ones far apart, which means the data is structured, not random. The intrinsic dimension of the data, whatever it actually is, is much lower than 768. This is what makes nearest-neighbor search meaningful at all on real embeddings: the relevant variation lives in a small number of effective dimensions, and distances along those directions still discriminate.

But the engineering still has to fight the curse. Indexes that assume nice geometry in the ambient space, like kd-trees do when they prune based on coordinate-aligned hyperplanes, get fooled by the high ambient dimensionality even when the underlying data is intrinsically low-dimensional. The pruning bound is computed in the full 768D space, where the bound is too weak to be useful, even if the data only varies along 30 directions.

So we need algorithms that don't rely on partitioning the ambient space. And we typically also accept approximation, because exact answers in high dimensions are too expensive to be worth chasing when "close to closest" is good enough for the application.

That's where the field of approximate nearest neighbor search begins, and that's what the next article in this series picks up.