Building HNSW From First Principles
Estimated reading time: 18-22 minutes | ~3,600 words
TL;DR
- I built HNSW from scratch in Go so I could justify every line of the algorithm, not just copy it.
- The real contract is simple: high recall, low latency, and memory that scales roughly with the dataset size.
- The neighbor-selection heuristic exists to keep the graph navigable on clustered data, not as a micro-optimization.
- The hierarchy is a hypothesis worth testing; on high-dimensional embeddings it may add little.
- Distance metric choice (L2 vs cosine vs dot product) changes hub formation and can flip conclusions.
- I used a tiny two-cluster experiment plus a handful of first-party benchmarks to keep the story honest.
Table of Contents
- The Contract
- A Baseline on GloVe-100 (20k Subset)
- Distance Is Not a Detail
- The Navigability Insight
- Build a Flat NSW Graph First
- Why the Heuristic Exists
- The Heuristic in Code
- Insertion in Code
- The Hierarchy as a Hypothesis
- Beam Search Is the Recall Knob
- Beam Search in Code
- The Practical Parameter Map
- Held-Out Query Check (20k Subset)
- A Quick Sweep on GloVe-100 (20k Subset)
- The FlatNav Twist
- A 100k Scale Check
- What Production Systems Add
- What I Built
- M Sweep on GloVe-100 (20k Subset)
- Closing Thoughts
The Contract
I wanted a vector index I could explain without hand-waving. The contract is simple: given n vectors of dimension d, find the approximate k nearest neighbors for a query vector q with high recall, bounded latency, and memory overhead that does not blow up.
For interactive semantic search, a reasonable target looks like this:
- High recall@k (for example, >= 95% depending on the application).
- Low latency (for example, < 10ms per query in a real-time service).
- Memory roughly proportional to O(n * d), plus moderate overhead.
These are example targets, not absolutes. The point is that brute force fails all three at scale, and any index worth building must hit this contract in practice.
I used GloVe as my baseline because it is real, dense, and semantic. It is also large enough (400k vectors in the 6B 100d set) to make the O(n) cost obvious.
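To make the O(n * d) cost concrete, here is a minimal brute-force k-NN sketch (illustrative, not the harness from this post): every query touches every vector, so latency grows linearly with n.

```go
package main

import (
	"fmt"
	"sort"
)

// l2sq returns squared Euclidean distance; sqrt is monotonic,
// so it can be skipped when only the ranking matters.
func l2sq(a, b []float32) float32 {
	var s float32
	for i := range a {
		d := a[i] - b[i]
		s += d * d
	}
	return s
}

// bruteForceKNN scans all n vectors per query: O(n*d) work, exact results.
func bruteForceKNN(data [][]float32, q []float32, k int) []int {
	idx := make([]int, len(data))
	for i := range idx {
		idx[i] = i
	}
	sort.Slice(idx, func(i, j int) bool {
		return l2sq(data[idx[i]], q) < l2sq(data[idx[j]], q)
	})
	if k > len(idx) {
		k = len(idx)
	}
	return idx[:k]
}

func main() {
	data := [][]float32{{0, 0}, {1, 1}, {5, 5}, {1.1, 0.9}}
	fmt.Println(bruteForceKNN(data, []float32{1, 1}, 2)) // closest two to (1,1)
}
```

This is the baseline every index has to beat: exact, trivial to verify, and hopeless at scale.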
A Baseline on GloVe-100 (20k Subset)
I ran my local harness on a 20,000-vector subset of glove.6B.100d.txt with 200 queries, k=10, M=16, efConstruction=100, and L2 distance. This is not meant to be a universal benchmark. It’s just a concrete baseline for this post.
Command I ran:
```shell
go run ./cmd/hnsw-demo -mode=glove -glove ~/Datasets/glove.6B.100d.txt -limit 20000 -queries 200 -k 10 -m 16 -efc 100 -ef 50 -metric l2 -compare -demo-word king
```
Queries are sampled from the indexed subset (not a held-out set). I exclude the query’s own index when computing brute-force ground truth and when scoring HNSW, so self-matches do not inflate recall. This is index introspection, not a train/test split: queries lie exactly on the data manifold, so recall numbers are optimistic compared to real out-of-distribution workloads. For studying the data structure’s behavior, that tradeoff is fine. I also ran a held-out check below to sanity-test this assumption.
Brute force vs HNSW on the same subset:
| Method | Recall@10 | QPS | p50 (us) | p95 (us) |
|---|---|---|---|---|
| Brute force | 1.000 | 744.7 | 1315 | 1563 |
| HNSW | 0.997 | 9,089.3 | 112 | 147 |
Post-build memory snapshot (Go RSS): ~157 MiB (heapAlloc ~65.6 MiB, heapSys ~151.4 MiB).
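For reference, the heap side of a snapshot like that can be captured with runtime.ReadMemStats; this is a sketch of the shape of it, not the exact harness code. Note that MemStats does not report RSS, so an OS-level number like the ~157 MiB above has to come from procfs, ps, or similar.

```go
package main

import (
	"fmt"
	"runtime"
)

// memSnapshot forces a GC so heapAlloc reflects live data, then reads
// the runtime's heap counters in MiB.
func memSnapshot() (heapAllocMiB, heapSysMiB float64) {
	runtime.GC()
	var ms runtime.MemStats
	runtime.ReadMemStats(&ms)
	return float64(ms.HeapAlloc) / (1 << 20), float64(ms.HeapSys) / (1 << 20)
}

func main() {
	alloc, sys := memSnapshot()
	fmt.Printf("heapAlloc=%.1f MiB heapSys=%.1f MiB\n", alloc, sys)
}
```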
I also used that same index for a quick sanity check:
```
Nearest neighbors for "king":
king (dist=0.0000)
prince (dist=16.7458)
queen (dist=18.3291)
monarch (dist=20.0182)
brother (dist=20.5814)
uncle (dist=21.7993)
nephew (dist=22.0494)
son (dist=22.1171)
throne (dist=22.3965)
kingdom (dist=22.5374)
```
The demo includes “king” itself because it is an indexed vector. The distances are squared L2, so the absolute values are not intuitive. The ordering is the signal.
Distance Is Not a Detail
Distance metric choice shapes the graph you end up building.
- L2 (Euclidean): Unbounded distances. In high dimensions, distance concentration causes some points to appear in many k-NN lists (the hubness phenomenon), often correlated with proximity to the data centroid. These hubs dominate graph traversal.
- Cosine (angular): Vectors are normalized to the unit sphere, so distances are bounded and hub formation tends to be weaker. Radovanović et al. study hubness in depth and coin “anti-hubs” for points that appear in almost no k-NN lists (JMLR 2010).
- Dot product: Equivalent to cosine on normalized vectors. On unnormalized vectors, magnitude dominates, and high-norm vectors become hubs for a different reason entirely.
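The three metrics can be written as small Go distance functions; on unit-normalized vectors, cosine distance and negated dot product rank identically, which is why the distinction only bites on unnormalized data.

```go
package main

import (
	"fmt"
	"math"
)

// l2 is Euclidean distance: unbounded, sensitive to magnitude.
func l2(a, b []float32) float64 {
	var s float64
	for i := range a {
		d := float64(a[i] - b[i])
		s += d * d
	}
	return math.Sqrt(s)
}

func dot(a, b []float32) float64 {
	var s float64
	for i := range a {
		s += float64(a[i]) * float64(b[i])
	}
	return s
}

func norm(a []float32) float64 { return math.Sqrt(dot(a, a)) }

// cosineDist is 1 - cos(angle): bounded in [0, 2], ignores magnitude.
func cosineDist(a, b []float32) float64 {
	return 1 - dot(a, b)/(norm(a)*norm(b))
}

func main() {
	a := []float32{1, 0}
	b := []float32{10, 0} // same direction, 10x the magnitude
	fmt.Println(l2(a, b))         // 9: magnitude dominates
	fmt.Println(cosineDist(a, b)) // 0: identical direction
	fmt.Println(-dot(a, b))       // -10: rewards high norms
}
```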
This matters because HNSW is a graph algorithm. If the metric changes hub formation, the graph’s navigation properties change too. That is why the FlatNav result later is not a contradiction but a consequence.
The Navigability Insight
Graph-based ANN only works if greedy or beam search can usually make progress.
The practical goal is this: for most queries q and most nodes v that are not among the true k nearest neighbors, at least one neighbor of v should be closer to q than v is. That is an empirical property of a good graph, not a guarantee. No finite graph on real data can enforce it universally.
Greedy descent alone gets stuck at local minima, which is why HNSW uses beam search with ef > 1. Beam search is the safety valve that lets you escape dead ends.
This is the first-principles leap. The algorithm is not about clever data structures. It is about preserving navigability under distance.
Kleinberg’s theorem is the theoretical lens I keep in my head (STOC 2000): greedy routing is efficient only when links span distance scales in the right way. HNSW bakes that into the graph instead of hoping it happens.
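The navigability property above can be measured directly: for a query, count how many non-nearest-neighbor nodes have at least one neighbor strictly closer to the query. This toy checker on a path graph is my own construction for illustration, not anything from the paper.

```go
package main

import "fmt"

// navigableFraction: among nodes that are not the query's nearest neighbor,
// what fraction have at least one neighbor strictly closer to the query?
// This is the empirical property greedy search depends on.
func navigableFraction(pos []float64, adj [][]int, q float64) float64 {
	abs := func(x float64) float64 {
		if x < 0 {
			return -x
		}
		return x
	}
	nn := 0 // the true nearest neighbor, excluded from the count
	for i := range pos {
		if abs(pos[i]-q) < abs(pos[nn]-q) {
			nn = i
		}
	}
	good, total := 0, 0
	for v := range pos {
		if v == nn {
			continue
		}
		total++
		for _, nb := range adj[v] {
			if abs(pos[nb]-q) < abs(pos[v]-q) {
				good++
				break
			}
		}
	}
	return float64(good) / float64(total)
}

func main() {
	// a path graph on the line: node i sits at coordinate i
	pos := []float64{0, 1, 2, 3, 4, 5, 6, 7, 8, 9}
	adj := make([][]int, 10)
	for i := 0; i < 10; i++ {
		if i > 0 {
			adj[i] = append(adj[i], i-1)
		}
		if i < 9 {
			adj[i] = append(adj[i], i+1)
		}
	}
	fmt.Println(navigableFraction(pos, adj, 9.5)) // every step makes progress
}
```

On a path graph toward an endpoint query, every node can make progress, so the fraction is 1. Break an edge or add a second cluster and the number drops, which is exactly the failure mode the heuristic exists to repair.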
Build a Flat NSW Graph First
I started by building a flat Navigable Small World (NSW) graph. It is a useful stepping stone because it shows what the data structure wants to be before you add hierarchy.
The recipe is simple:
- Insert points one at a time.
- For each new point, connect it to its `M` nearest existing neighbors.
- Early nodes create long-range edges because the graph is small.
- Later nodes create short-range edges because the local neighborhood is dense.
This already produces a graph with links at multiple distance scales. It is the simplest place to look for failure modes.
Here is the mental model I use:
```
early inserts -> long-range links
late inserts  -> short-range links
flat graph    -> implicit distance scales
```
It is also the fastest way to see why the heuristic exists.
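The recipe above fits in a few lines. This is a simplified flat-NSW insert using brute-force neighbor lookup (fine for a sketch; the full implementation later in the post searches the graph instead):

```go
package main

import (
	"fmt"
	"sort"
)

type nswGraph struct {
	vecs [][]float32
	adj  [][]int
	m    int
}

func l2sq(a, b []float32) float32 {
	var s float32
	for i := range a {
		d := a[i] - b[i]
		s += d * d
	}
	return s
}

// insert connects the new point to its m nearest existing nodes with
// bidirectional edges. Early points get long edges for free because the
// graph is still sparse; later points connect locally.
func (g *nswGraph) insert(v []float32) int {
	id := len(g.vecs)
	cand := make([]int, id)
	for i := range cand {
		cand[i] = i
	}
	sort.Slice(cand, func(i, j int) bool {
		return l2sq(g.vecs[cand[i]], v) < l2sq(g.vecs[cand[j]], v)
	})
	if len(cand) > g.m {
		cand = cand[:g.m]
	}
	g.vecs = append(g.vecs, v)
	g.adj = append(g.adj, append([]int(nil), cand...))
	for _, c := range cand {
		g.adj[c] = append(g.adj[c], id) // back-edge
	}
	return id
}

func main() {
	g := &nswGraph{m: 2}
	for _, v := range [][]float32{{0, 0}, {10, 10}, {0, 1}, {10, 11}} {
		g.insert(v)
	}
	// node 0 keeps its long-range edge to node 1 from the early, sparse phase
	fmt.Println(g.adj[0])
}
```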
Why the Heuristic Exists
The neighbor-selection heuristic in the HNSW paper (Malkov & Yashunin 2016) is not a micro-optimization. It exists to keep the graph navigable on clustered data.
I used a tiny 2D toy example: two well-separated clusters, M=8, and a simple selection rule (“take the closest M neighbors”) versus the heuristic rule (“accept a candidate only if it is closer to the query than any already-selected neighbor”).
Here is the exact output from my script:
```
Simple selection: 0/3200 edges cross clusters (0.000%)
Heuristic selection: 10/3200 edges cross clusters (0.312%)
Brute-force top-5 neighbors for query (1,1):
idx= 28 label=0 dist2=0.003
idx=162 label=0 dist2=0.020
idx=107 label=0 dist2=0.042
idx= 97 label=0 dist2=0.059
idx= 96 label=0 dist2=0.117
```
The simple selection creates no cross-cluster edges. The graph becomes effectively disconnected for greedy search between clusters. The heuristic introduces just a handful of bridge edges, which is enough to restore navigability.
As the Vespa team puts it, “If a node or cluster is disconnected from the main cluster you’ll never reach it” (Vespa HNSW blog). That line is about greedy reachability, not strict graph connectivity, and it is exactly what the heuristic repairs.
This is why I treat the heuristic as non-negotiable.
The Heuristic in Code
Here is the exact logic I use in my implementation. c.Dist is the distance from the candidate to the query, precomputed during search. I only keep a candidate if it is not “covered” by an already-selected neighbor.
```go
for _, c := range candidates {
	ok := true
	for _, s := range selected {
		if distance(c.ID, s) < c.Dist {
			ok = false
			break
		}
	}
	if ok {
		selected = append(selected, c.ID)
		if len(selected) >= M {
			break
		}
	}
}
```
This is a diversity rule disguised as geometry. It forces the neighborhood to spread out instead of collapsing into a single cluster.
Insertion in Code
Insertion is where everything comes together: greedy routing at higher layers, beam search at the target layer, the heuristic neighbor selection, and pruning back to M (or 2M at layer 0). Here is the core of my Add method, trimmed but faithful:
```go
func (g *Graph) Add(vector []float32) uint32 {
	id := uint32(len(g.Nodes))
	level := g.randLevel()
	n := Node{
		ID:          id,
		Vector:      vector,
		Level:       level,
		Connections: make([][]uint32, level+1),
	}
	g.Nodes = append(g.Nodes, n)

	if len(g.Nodes) == 1 {
		g.EntryPoint = id
		g.MaxLevel = level
		return id
	}

	ep := g.EntryPoint
	maxLevel := g.MaxLevel

	// Greedy descent through the layers above the new node's level.
	for l := maxLevel; l > level; l-- {
		best := g.searchLayer(vector, ep, 1, l)
		if len(best) > 0 {
			ep = best[0].ID
		}
	}

	// Beam search + heuristic selection at each layer the node joins.
	for l := min(maxLevel, level); l >= 0; l-- {
		cands := g.searchLayer(vector, ep, g.EfConstruction, l)
		targetM := g.M
		if l == 0 {
			targetM = g.M0
		}
		neighbors := g.selectNeighbors(id, cands, targetM, l)
		g.Nodes[id].Connections[l] = append(g.Nodes[id].Connections[l], neighbors...)
		for _, nb := range neighbors {
			g.addConnection(nb, id, l)
			g.pruneConnections(nb, l)
		}
		if len(neighbors) > 0 {
			ep = neighbors[0]
		}
	}

	if level > g.MaxLevel {
		g.EntryPoint = id
		g.MaxLevel = level
	}
	return id
}
```
I like this function because it is the whole story in one place. If I delete the hierarchy loop, I get a flat NSW. If I delete the heuristic, I get a graph that fails on clusters. If I delete the pruning, memory explodes.
The Hierarchy as a Hypothesis
Once the flat graph works, the hierarchy becomes a hypothesis. The intuition is that at large n, the flat graph has too many short edges and not enough long ones, so greedy search wanders before it finds the right neighborhood (the zoom-out problem).
The fix is multiscale navigation:
- Layer 0 contains all nodes (dense, short-range).
- Each higher layer contains roughly `n / M^l` nodes (sparse, long-range).
- Search starts at the top and descends through layers.
The layer assignment is geometric:
```
l = floor(-ln(U) / ln(M))
```
For M=16 and n=1,000,000, you expect around five layers. The top layer typically has just a handful of nodes.
The key properties I keep explicit in my implementation:
- Edges are bidirectional.
- Neighbor lists are pruned to `M` (upper layers) or `2M` (layer 0).
- The entry point is always the highest-level node.
Bidirectionality improves strong connectivity and reduces dead ends, but it is not strictly required for correctness. It is one more empirical lever.
I treat the hierarchy as a hypothesis I test, not a law I blindly accept. That framing makes the FlatNav result feel like a payoff rather than a contradiction.
Beam Search Is the Recall Knob
HNSW uses beam search inside SEARCH-LAYER. That choice is essential.
With ef=1, the algorithm is just greedy descent. It is fast and gets stuck easily.
With ef > 1, the algorithm keeps multiple candidates alive. This is how it escapes local minima, and it is why efSearch is the primary recall knob. If you only remember one tuning rule, it should be: raise efSearch first.
Beam Search in Code
The heart of the search is a small loop that keeps two heaps: a min-heap of candidates to explore next, and a max-heap of the best results seen so far. The stop condition is simple: when the closest unexplored candidate is worse than the worst current result, you are done.
```go
for candidates.Len() > 0 {
	cand := popMin(&candidates)
	worst := peekMax(&results)
	if cand.Dist > worst.Dist {
		break
	}
	for _, nb := range g.Nodes[cand.ID].Connections[level] {
		if visited[nb] {
			continue
		}
		visited[nb] = true
		d := g.Distance(query, g.Nodes[nb].Vector)
		if results.Len() < ef || d < worst.Dist {
			pushMin(&candidates, Neighbor{ID: nb, Dist: d})
			pushMax(&results, Neighbor{ID: nb, Dist: d})
			if results.Len() > ef {
				popMax(&results)
			}
			worst = peekMax(&results)
		}
	}
}
```
The moment ef goes above 1, this becomes a controlled best-first search over the graph rather than a brittle greedy walk.
The Practical Parameter Map
There are three knobs that matter in practice:
- `efSearch` controls the recall/latency tradeoff at query time.
- `M` controls memory overhead and graph connectivity.
- `efConstruction` controls build time and build quality.
The memory rule of thumb from hnswlib is:
```
memory approx n * (d * 4 + M * 8-10) bytes
```
In a tight C++ implementation like hnswlib, that means 1M vectors at 128d with M=16 land around 640-670 MB. In Go, slice headers and GC metadata add overhead, so actual RSS will be higher. At 768d, the graph overhead is a few percent either way because the vectors dominate.
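Plugging numbers into that rule of thumb is a one-liner. This check models the tight hnswlib layout, so it is a lower bound for a Go implementation:

```go
package main

import "fmt"

// estimateBytes applies the hnswlib rule of thumb:
// n * (d*4 bytes of float32 data + M * linkBytesPerM of graph links).
func estimateBytes(n, d, m, linkBytesPerM int) int {
	return n * (d*4 + m*linkBytesPerM)
}

func main() {
	n, d, m := 1_000_000, 128, 16
	lo := estimateBytes(n, d, m, 8)  // optimistic link overhead
	hi := estimateBytes(n, d, m, 10) // pessimistic link overhead
	fmt.Printf("%.0f-%.0f MB\n", float64(lo)/1e6, float64(hi)/1e6)
}
```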
My tuning loop looks like this:
- Fix `M` to a reasonable value (16 or 32).
- Sweep `efSearch` until recall saturates.
- If recall plateaus too low, increase `M`.
- If build time explodes, reduce `efConstruction`.
This is a way to keep the knobs grounded in empirical tradeoffs rather than folklore.
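The sweep step is mechanical enough to script. This sketch uses a toy saturating curve in place of a real benchmark run, just to show the stopping logic; `measure` is a stand-in, not a real API.

```go
package main

import "fmt"

// measure is a stand-in for a real benchmark run; it fakes a
// saturating recall curve so the loop logic can be demonstrated.
func measure(ef int) (recall float64) {
	return 1.0 - 0.5/float64(ef) // toy curve, not real data
}

func main() {
	const eps = 0.001 // stop when an ef doubling buys < 0.1% recall
	prev := 0.0
	for ef := 10; ef <= 800; ef *= 2 {
		r := measure(ef)
		fmt.Printf("ef=%d recall=%.4f\n", ef, r)
		if r-prev < eps {
			fmt.Println("recall saturated at ef =", ef)
			break
		}
		prev = r
	}
}
```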
Held-Out Query Check (20k Subset)
To make sure the in-subset queries were not hiding anything, I also ran a held-out query experiment: I held out 2,000 vectors, built the index on the remaining 18,000, and sampled queries only from the hold-out set.
Command I ran:
```shell
go run ./cmd/hnsw-demo -mode=glove -glove ~/Datasets/glove.6B.100d.txt -limit 20000 -holdout 2000 -queries 200 -k 10 -m 16 -efc 100 -ef 50 -metric l2
```
| Method | Recall@10 | QPS | p50 (us) | p95 (us) |
|---|---|---|---|---|
| Brute force | 1.000 | 792.2 | 1275 | 1340 |
| HNSW | 0.988 | 9,899.9 | 99 | 126 |
The recall drop is small but real. If you care about out-of-distribution behavior, you should run a proper held-out evaluation on your own embeddings.
A Quick Sweep on GloVe-100 (20k Subset)
I ran a small sweep on 20,000 vectors from glove.6B.100d.txt with 200 queries, k=10, M=16, efConstruction=100, and L2 distance. This is a local run on my laptop, so treat the absolute numbers as “directional,” not universal. The curve is the point.
Command I ran:
```shell
go run ./cmd/hnsw-demo -mode=sweep -glove ~/Datasets/glove.6B.100d.txt -limit 20000 -queries 200 -k 10 -m 16 -efc 100 -metric l2 -ef-list 10,25,50,100,200
```
| efSearch | recall@10 | QPS | p50 (us) | p95 (us) |
|---|---|---|---|---|
| 10 | 0.962 | 30,149.1 | 32 | 45 |
| 25 | 0.987 | 17,797.8 | 57 | 72 |
| 50 | 0.997 | 11,013.2 | 92 | 119 |
| 100 | 0.999 | 5,907.3 | 174 | 212 |
| 200 | 1.000 | 3,275.7 | 314 | 382 |
This is exactly the tradeoff I expect: as efSearch rises, recall climbs and latency/QPS degrade smoothly. It’s a knob that behaves like a knob, which is rare and useful.
The FlatNav Twist
The FlatNav paper (“Down with the Hierarchy” arXiv 2024) argues that for high-dimensional data (d >= 32), the hierarchy adds little benefit. The work was presented at the first Vector Databases workshop at ICML 2025 (VecDB@ICML 2025), not the main conference.
Their hypothesis is the “hub highway” effect: in high-dimensional L2 spaces, a few nodes become hubs and form a dense highway that can serve a similar role to the hierarchy’s long-range connections. The strength of this effect depends on intrinsic dimensionality, data distribution, and distance metric, so it is not a universal law.
The paper also notes that this effect is weaker under cosine distance, which lines up with the metric discussion earlier.
This gave me a concrete test to run: compare a flat NSW graph to full HNSW on the same dataset. On the same 20k GloVe-100 subset as above (k=10, M=16, efConstruction=100, efSearch=50), I got this:
| Index | Recall@10 | QPS | p50 (us) | p95 (us) |
|---|---|---|---|---|
| Flat NSW | 0.998 | 9,279.5 | 109 | 136 |
| HNSW | 0.997 | 9,089.3 | 112 | 147 |
In this local run, the flat graph was actually a hair faster at the same recall. That does not mean the hierarchy is useless, but it does suggest the graph structure itself is doing most of the work at this scale. A caveat: at 20k vectors, hierarchy effects are muted. The real regime shift likely starts at 1M+, where long-range sparsity in the flat graph becomes more costly.
It is also worth noting that GloVe-100’s nominal dimensionality (100) may not reflect its intrinsic dimensionality, which is what actually drives hubness and navigability. Two datasets with the same embedding dimension can behave very differently if their intrinsic dimensionality differs.
The natural follow-up is to repeat this at lower dimensions (GloVe-25) and under cosine distance, where the FlatNav paper’s hub-highway effect should be weaker.
A 100k Scale Check
I re-ran the exact same parameters on a 100,000-vector subset to see how the curve moves with scale. Build time was about 1m19s for HNSW and 1m20s for the flat graph on my laptop. RSS after build was ~853 MiB for HNSW.
Command I ran:
```shell
go run ./cmd/hnsw-demo -mode=glove -glove ~/Datasets/glove.6B.100d.txt -limit 100000 -queries 200 -k 10 -m 16 -efc 100 -ef 50 -metric l2 -compare -demo-word king
```
Brute force vs HNSW on the 100k subset:
| Method | Recall@10 | QPS | p50 (us) | p95 (us) |
|---|---|---|---|---|
| Brute force | 1.000 | 143.1 | 6993 | 7497 |
| HNSW | 0.960 | 4,403.7 | 229 | 322 |
Flat NSW vs HNSW at the same parameters:
| Index | Recall@10 | QPS | p50 (us) | p95 (us) |
|---|---|---|---|---|
| Flat NSW | 0.961 | 4,184.6 | 243 | 319 |
| HNSW | 0.960 | 4,403.7 | 229 | 322 |
The big change is recall. At 100k, the same M=16 and efSearch=50 settings are no longer enough for ~0.99 recall. That is exactly what the tuning loop predicts: if you want the old recall, you raise efSearch or increase M and pay the price.
What Production Systems Add
Once the algorithm is clear, production systems are easier to understand:
- hnswlib focuses on tight memory layout and SIMD distance kernels.
- Qdrant integrates filtering during graph traversal and adds healing for deletes.
- Weaviate pushes Go to SIMD with hand-written assembly and gets big speedups in distance functions.
All of them keep the same HNSW skeleton. The differences are in persistence, concurrency, filtering, quantization, and memory layout.
What I Built
I built a minimal HNSW implementation in Go. The full source is on GitHub. I kept it intentionally small: no persistence, no concurrency, no filtering, no quantization. I wanted the data structure, not the product.
The core structs (trimmed for clarity) look like this:
```go
type Node struct {
	ID          uint32
	Vector      []float32
	Level       int
	Connections [][]uint32
}

type Graph struct {
	Nodes          []Node
	EntryPoint     uint32
	MaxLevel       int
	M, M0          int
	EfConstruction int
	ML             float64
	Distance       DistanceFunc
	UseHeuristic   bool
	Flat           bool
}
```
One implementation detail worth calling out: when I prune neighbor lists, I also remove back-edges to keep the graph consistent. That deviates slightly from the paper, but it keeps the adjacency lists symmetric in practice.
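The back-edge cleanup can be sketched independent of the full Graph type. Assumptions in this sketch: adjacency lists as plain slices, and a keep-closest rule standing in for the real selection heuristic.

```go
package main

import (
	"fmt"
	"sort"
)

// removeEdge deletes b from adj[a] (one direction only).
func removeEdge(adj map[int][]int, a, b int) {
	list := adj[a]
	for i, x := range list {
		if x == b {
			adj[a] = append(list[:i], list[i+1:]...)
			return
		}
	}
}

// prune trims adj[v] to at most m neighbors (closest-first here, where the
// real code would apply the diversity heuristic) and removes the back-edge
// from every dropped neighbor, keeping the adjacency lists symmetric.
func prune(adj map[int][]int, dist func(a, b int) float64, v, m int) {
	list := adj[v]
	if len(list) <= m {
		return
	}
	sort.Slice(list, func(i, j int) bool {
		return dist(v, list[i]) < dist(v, list[j])
	})
	for _, dropped := range list[m:] {
		removeEdge(adj, dropped, v) // back-edge cleanup
	}
	adj[v] = list[:m]
}

func main() {
	pos := []float64{0, 1, 2, 9}
	dist := func(a, b int) float64 {
		d := pos[a] - pos[b]
		if d < 0 {
			d = -d
		}
		return d
	}
	adj := map[int][]int{0: {1, 2, 3}, 1: {0}, 2: {0}, 3: {0}}
	prune(adj, dist, 0, 2)
	fmt.Println(adj[0], adj[3]) // node 3 lost both directions of its edge to 0
}
```

Without the `removeEdge` call, node 3 would still point at node 0, and search arriving at 3 would follow an edge the other side no longer acknowledges.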
M Sweep on GloVe-100 (20k Subset)
I ran a quick M sweep on the same 20,000-vector subset (200 queries, k=10, L2 distance) to sanity-check the memory/recall tradeoff.
| M | Recall@10 | QPS | p50 (us) | p95 (us) | Build Time |
|---|---|---|---|---|---|
| 8 | 0.987 | 14,422.4 | 67 | 101 | 4.15s |
| 16 | 0.997 | 10,836.9 | 95 | 120 | 10.48s |
| 32 | 1.000 | 6,906.5 | 150 | 195 | 37.67s |
Closing Thoughts
I like HNSW because the data structure is the story. Every piece exists because something breaks without it.
- The heuristic exists because clustered data breaks navigability.
- The hierarchy exists because long-range jumps become rare at scale.
- Beam search exists because greedy search gets stuck.
The FlatNav work complicates the story in a good way. It suggests that some of the “obvious” machinery may be optional in the regimes I actually care about. That is exactly what I want from a weekend build.
This post also fits a pattern I’ve fallen into recently: build the thing, find the edge cases, and then compare it to the production version. I did that with my LSM storage engine and with the S3-based queue. HNSW feels like the next logical step.
If you are running embeddings at sub-billion scale and you can keep everything in RAM, HNSW is still a strong default. Beyond that, DiskANN (graph-based, SSD-optimized, powers Bing and SQL Server 2025) and IVF-PQ (quantization-based, the backbone of FAISS) are the established options for billion-scale search.
I am a huge fan of understanding the algorithm at this level because it turns tuning from folklore into engineering. That is the point of building it from scratch.