Vector search - On disk | Valentin Radu

The previous article ended at the wall HNSW hits when datasets get large. Every node holds a full-precision vector for the inner-loop distance computations, every edge is a pointer, and the dataset itself dominates memory: a billion 768-dimensional float32 vectors take 3 terabytes, far past what fits in RAM on a normal machine. This article works through how production systems push HNSW past that wall by serving the index from solid-state storage.

The HNSW algorithm itself doesn't change. What changes is how the data is laid out and how the inner loop reads it, because random reads from SSD cost roughly a thousand times more than random reads from RAM. Microsoft's DiskANN, built on the Vamana algorithm, is the standard production graph index for disk. It reorganises HNSW's storage and tweaks the construction so that each step of a graph traversal is one disk page read.

Where the bytes go

Take a 1-billion-vector index at 768 dimensions and look at where the memory goes:

Component	Bytes per vector	Total
Full float32 vectors	3072	3 TB
HNSW edges (`M = 16`, 4-byte indices, 2× at layer 0)	~128	128 GB
Other metadata (IDs, level, etc.)	~16	16 GB

The full vectors dwarf everything else. They're also what dominates the inner loop: every distance computation during graph traversal reads a 3072-byte vector. A modern server has at most a few terabytes of RAM, so a billion-vector index doesn't fit, even before considering replication, query memory, or any slack. The graph itself fits comfortably (128 GB of edges is fine), but the vectors do not. Either you compress the vectors to a few percent of their raw size, or you push them to disk, or you do both.

This article works through pushing them to disk. With NVMe SSDs at single-digit dollars per terabyte, storage is essentially free; the bottleneck shifts from raw capacity to IO latency.

Why graphs are SSD-friendly

Random access on an NVMe SSD is around 50 to 100 microseconds per page. Random access in RAM is around 100 nanoseconds. SSD random reads are roughly 1000× slower than RAM random reads, and that ratio shapes everything that follows.

For sequential reads, SSDs are much better; bandwidth is in the gigabytes per second, comparable to RAM throughput. Sequential is roughly $10 \times$ slower than RAM, not 1000×.

The reason for the gap is that every SSD read carries a fixed startup cost: some lookup work to find where the data lives, a brief warm-up of the chip area being read, and protocol overhead to ship the result back to the CPU. That cost is roughly the same whether you ask for one byte or a few kilobytes, so a small random read pays the full overhead for very little data. Sequential reads amortise the same overhead across long stretches of bytes, so once it has been paid the data flows at full bandwidth. The gap between the two regimes ends up at roughly two orders of magnitude on modern hardware.

So the goal becomes: do as few random reads per query as possible, and arrange them so they look as close to sequential as possible.

Brute-force search would read every vector in the database for every query. A billion 3072-byte vectors at full sequential bandwidth (around 5 GB/s per drive) takes about $3 \times 1 0^{12}$ bytes / $5 \times 1 0^{9}$ bytes/s ≈ 600 seconds: ten minutes per query, far too slow to be useful.

A graph traversal of HNSW reads at most $efSearch$ nodes per query. The parameter efSearch (introduced in Part 2) is the beam-search budget: the cap on how many candidates the search keeps in flight, and therefore on the number of graph nodes it visits. Production values are typically in the tens to hundreds, tuned per workload to trade recall against latency. If each node and its adjacency list fit in one SSD page (typically 4 to 8 KB), each visited node costs one random read of around 100 microseconds. So one graph hop costs 100 µs, and the total query cost is the per-hop cost times the number of hops: efSearch × 100 µs. With efSearch = 100 that totals 10 ms per query (not per hop). That's fast enough for most applications, especially when queries run in parallel.

The shape of the win: graph search visits a small constant number of nodes per query (in the tens to hundreds) regardless of how big the dataset is, while brute force visits every vector. On disk, that gap becomes the difference between milliseconds and minutes per query.

DiskANN and Vamana

Microsoft's DiskANN, published at NeurIPS 2019, is the standard disk-resident vector index. The graph construction algorithm at its heart is called Vamana.

Vamana differs from HNSW in three ways, all of them shaped by the constraints of running from disk.

Single layer, no hierarchy

Where HNSW stacks progressively sparser layers (each layer holding $1/ M$ of the points below), Vamana uses a single layer that contains every point. This avoids the fixed cost of fetching multiple layers' data per query and simplifies the storage layout so that one node lookup means one disk read. But a flat graph won't be small-world by accident; Vamana has to construct it that way.

Three parameters control the construction:

$R$ : the maximum out-degree per node. Each node ends up with at most $R$ outgoing edges. Typical production value: 64.
$L$ : the beam width used during the construction-time graph search. The same beam-search machinery from Part 2, with $L$ playing the role of efSearch. Typical production value: 100.
$α$ : the pruning strictness, between 1 and roughly 1.4. Used in the α-rule (next subsection). Typical: 1.0 in the first pass and 1.2 in the second.

The graph starts initialised with every dataset point holding $R$ random outgoing edges, a uniformly bad but connected starting topology. The algorithm then iterates over the dataset's $N$ points in a random order, processing one point at a time. For each current point $p$ , three steps run in sequence.

1. Greedy beam search from a fixed entry point. A beam search runs from a hard-coded entry node, typically the medoid of the dataset (the point closest to the geometric centre of all the data), toward $p$ . The beam search behaves like the one from Part 2: it maintains a result set capped at $L$ closest points to $p$ seen so far, and a candidate queue of unexpanded points. At each step it pops the closest unexpanded candidate, evaluates distances from $p$ to each of that node's graph neighbours, and pushes any neighbour that beats the worst entry in the result set.

The output we care about is not the final result set but the set $V$ of every node visited during the search. $V$ typically contains hundreds of points (more than $L$ , because nodes drop out of the result set as closer ones arrive but stay in $V$ ), and it is the candidate pool from which $p$ 's neighbours will be selected.

2. Diversified pruning. $V$ typically holds a few hundred candidates after the search; we want at most $R$ of them as $p$ 's real outgoing edges, picked so they span different directions away from $p$ rather than all clustering in the closest one. The selection runs the α-rule, detailed in the next subsection; its output is a list of up to $R$ neighbours. These overwrite $p$ 's previous outgoing edges in the graph: the random initial edges on the first pass, or the result of pass 1 on the second pass.

3. Bidirectional update. For each selected neighbour $j$ , the algorithm also adds a back-edge from $j$ to $p$ , so that the graph stays approximately undirected. If $j$ 's out-degree now exceeds $R$ , $j$ 's entire neighbour list is re-pruned with the same α-rule to bring it back to at most $R$ outgoing edges.

Repeat for every point. Steps 1 through 3 above describe what happens for one point $p$ . The same three steps run for every point in the dataset, processed in a random order chosen at construction time. With $N = 1 0^{9}$ points, that's a billion iterations of search-prune-update. Each iteration's cost is independent of $N$ : the beam search visits about $L$ nodes, the α-rule examines a candidate set of similar size, and the bidirectional update fires $R$ back-edges. So total construction time is linear in $N$ multiplied by the per-iteration constants, parallelisable across cores.

The randomised processing order matters. If points were processed in geographic order (sweeping the dataset region by region), early-processed points would all have tight local connections to their region and the construction would inherit that bias.

Critically, each iteration can refine points processed earlier. When the current point $p$ 's bidirectional update lands a back-edge on a previously-processed neighbour $j$ , $j$ 's neighbour list might overflow $R$ and trigger another α-prune. Edges originally chosen for $j$ when the graph was barely populated can be replaced by better ones once half the dataset has been inserted. The graph keeps converging toward a stable configuration as the pass progresses, with no point ever being "re-processed" explicitly.

The whole pass is then repeated a second time on the same dataset, but using the graph produced by the first pass as its starting topology rather than a random one, and with a higher $α$ . The first pass with $α = 1$ produces a tight, locally-connected backbone where every neighbour represents a unique direction. The second pass with $α = 1.2$ relaxes the diversification rule and adds long-range edges on top of that backbone. Two passes are typically enough; the result is a flat graph where every node has a balanced mix of short-range and long-range neighbours, all baked in by the α-rule rather than by stacking layers.

Total construction cost is on the order of $O (N \cdot L \cdot R)$ distance computations per pass, plus the cost of the bidirectional updates. For $N = 1 0^{9}, L = 100, R = 64$ , that's about $6 \times 1 0^{12}$ distance computations per pass, doable in tens of minutes on a multi-core machine.

Robust pruning: the α-rule

From $V$ we need to pick at most $R$ candidates as $p$ 's real neighbours. The most obvious choice would be the $R$ closest to $p$ , but that produces a bad graph for two reasons.

First, the $R$ nearest candidates all sit in $p$ 's immediate vicinity, so the resulting edges are all short-range. The graph as a whole then has no long-range edges, and greedy walks between far-apart points have to step through many short hops to cross any distance.

Second, two close candidates can land along the same line out from $p$ , one closer and one farther. Picking both gives $p$ two edges pointing in the same direction. The closer one already covers that direction (a search at $p$ aiming for the farther one would hop to the closer one first), so the farther edge is redundant and wastes a slot we could give to a candidate in a different direction.

The α-rule selects a more useful set. Walk the candidates from closest to farthest. Pick each one as a neighbour of $p$ , then prune any later candidate that sits in roughly the same direction as one already picked. The pruning leaves slots in $p$ 's neighbour budget for candidates that are farther out but cover new directions; some of these become the long-range edges that make the graph small-world. Here is the pseudocode:

RobustPrune(p, V, α, R):
  result = []
  while V is not empty and |result| < R:
    c* = candidate in V minimising dist(p, c*)   // closest remaining
    add c* to result
    remove c* from V
    for each c in V:
      if α · dist(c*, c) ≤ dist(p, c):
        remove c from V                          // dominated by c*
  return result

The dominance check is what does the work. Picture $p$ , the just-picked candidate $c^{*}$ , and some other candidate $c$ still sitting in $V$ . The condition

α \cdot dist (c^{*}, c) \leq dist (p, c)

asks whether $c^{*}$ is closer to $c$ than $p$ is, by at least a factor of $α$ . If the condition holds, $c^{*}$ effectively sits between $p$ and $c$ on the way over: a search starting at $p$ and aiming for $c$ would hop to $c^{*}$ first and finish the trip from there. The edge $p \to c$ would be redundant because the edge $p \to c^{*}$ already covers that direction. We say $c$ is dominated by $c^{*}$ and remove it from $V$ .

If the condition does not hold, $c$ sits in a meaningfully different direction from $c^{*}$ relative to $p$ , and stays in $V$ as a viable candidate for a later iteration to pick.

The parameter $α$ controls how strictly "same direction" is enforced.

$α = 1$ : the condition simplifies to $dist (c^{*}, c) \leq dist (p, c)$ , " $c^{*}$ is at least as close to $c$ as $p$ is." Easy to satisfy. Most candidates that fall on the same side of $p$ as $c^{*}$ get pruned, and the surviving neighbours span genuinely distinct directions. The resulting graph is locally connected with no long-range edges, similar in shape to a minimum spanning tree.
$α = 1.2$ (the standard production value): the condition tightens to $dist (c^{*}, c) \leq dist (p, c) /1.2 \approx 0.83 \cdot dist (p, c)$ , " $c^{*}$ has to be noticeably closer to $c$ than $p$ is, no more than about 83% of $p$ 's distance to $c$ ." Harder to satisfy. Fewer candidates get pruned, more survive. Among the survivors are some that are roughly in the same direction as $c^{*}$ but far enough out that the inequality fails for them; these farther survivors become the long-range edges that make the graph small-world.

Higher $α$ produces more long-range edges per node. Construction gets slower because fewer candidates are pruned at each step, so more bidirectional updates have to run. Querying gets faster because long-range edges shorten the average path to the answer. At very high $α$ the rule prunes almost nothing and the graph drifts toward random; recall drops.

A worked example

The demo below runs the α-rule on a small candidate set side by side: $α = 1$ on the left, $α = 1.2$ on the right. The new point $P$ sits at the centre with six candidates around it. The target out-degree is $R = 4$ . Step through the iterations to see how the same candidate set produces a different neighbour list under the two settings.

α = 1

α = 1.2

1 / 6

Initial state. Six candidates around P, sorted by distance to P. Will select up to R = 4 neighbours.

The single difference between the two runs is whether candidate $D$ (in the far north-east) survives the first iteration's pruning. Under $α = 1$ , $D$ is dominated by $A$ (the closest candidate $A$ sits 3.61 from $D$ , and $P$ sits 4.24 from $D$ , so $A$ is closer to $D$ than $P$ is, and $D$ is pruned). Under $α = 1.2$ , the threshold tightens: $D$ would only be dominated if $A$ were at least 20% closer to it than $P$ is, which $A$ is not, so $D$ survives. By the time the algorithm has selected three close-range neighbours and looks for a fourth, $D$ is the only survivor and gets picked. The $α = 1$ run terminates with three short edges; the $α = 1.2$ run adds a long-range fourth.

That single long-range edge is the kind of "shortcut" that gives small-world graphs their name. Across a graph of millions of nodes, a handful of such shortcuts per node are enough to keep the average distance between any two nodes small, even when most edges are short.

Sector-aligned layout

Each node is stored on disk as a contiguous block of fixed size, sized to match the SSD's natural read unit (typically 4 KB on modern NVMe drives, sometimes 8 KB). The block packs together everything a graph hop needs: a small header, the node's adjacency list, and the node's full-precision vector, all in one sequence of bytes.

neighbour count

4 B

neighbour IDs

R = 64, four-byte IDs

256 B

full-precision vector

d = 768, four-byte floats

3,072 B

padding

764 B

Total: 4,096 B (one 4 KB SSD page)

For $R = 64$ neighbours and $d = 768$ dimensions, the payload is $4 + 256 + 3072 = 3332$ bytes, padded to 4096 bytes. The page size constrains the maximum out-degree: at higher $R$ (e.g., 128) or higher $d$ (e.g., 1024), the layout has to switch to 8 KB pages or split the vector into a separate file.

Node IDs map directly to disk offsets: node $i$ lives at byte offset $i \times 4096$ , with no indirection or lookup table needed. Reading a node is one disk read at that offset; the SSD returns one 4 KB page, and the application picks the neighbour list and the vector out of the bytes. The work to parse the page costs nothing compared to the disk read itself.

Multiple reads can overlap in flight to hide their latency behind throughput. A modern NVMe drive can serve about a million 4 KB random reads per second when reads are pipelined this way.

A typical disk-resident search

A query against a DiskANN index follows the same shape as HNSW search. Maintain a candidate queue (closest unexpanded points by distance to Q), pop the closest unexpanded one, fetch its page from disk, evaluate Q's distance to each of its neighbours, push promising candidates into both the result set and the queue. Stop when the closest remaining candidate is farther from Q than the worst result.

Each iteration involves one random SSD read (around 100 microseconds) and a handful of distance computations against neighbours (in RAM, fast). The total wall-clock latency is dominated by SSD reads: efSearch iterations × 100 microseconds ≈ a few milliseconds per query at production settings.

Because each page read brings the full-precision vector along with the adjacency list, every distance computation along the way uses the original float32 coordinates. The result set is ranked by exact distance and there is no separate reranking pass.

Putting it together

A pure HNSW-on-disk deployment has a simple storage layout. The graph index, with each node and its full-precision vector colocated in one SSD page, lives entirely on disk. The OS page cache in RAM accelerates repeated reads of hot regions, but no other vector data is held in RAM persistently.

Rough guide by scale:

Dataset size	Standard recipe
Up to 1 M vectors	Plain HNSW in RAM.
1 M to 100 M	HNSW in RAM on a single machine.
100 M to 10 B	HNSW in RAM on a high-end machine, or DiskANN with the graph and full vectors on disk.
10 B+	DiskANN on a single beefy machine, or sharded across machines.

The series so far, and what's next

Three articles, working from the geometry outward:

The problem. Why nearest neighbor search is easy in 1D, manageable in 2D, and breaks in 768D. The curse of dimensionality, derived from first principles.
Small worlds. How to do approximate nearest neighbor search by walking on a graph. Naive proximity graphs and their failures, Watts-Strogatz, NSW, HNSW.
On disk. How to scale HNSW to billions by serving the heavy data from SSD with DiskANN and Vamana, where each graph hop is one disk page read.

The arc so far, in one sentence: pairwise distance information collapses in high dimensions, but the geometry of where points sit on their low-dimensional manifold can still be exploited by a graph that encodes neighbourhoods directly, and once that graph exists, you can lay it out on disk so that even billion-scale datasets answer queries in milliseconds.

Upcoming parts of the series will cover quantization, GPU-accelerated indexes, hybrid search combining vector NN with metadata filters, and streaming indexes for continuously-changing datasets.