Building HNSW From First Principles

Estimated reading time: 18-22 minutes | ~3,600 words

TL;DR

  • I built HNSW from scratch in Go so I could justify every line of the algorithm, not just copy it.
  • The real contract is simple: high recall, low latency, and memory that scales roughly with the dataset size.
  • The neighbor-selection heuristic exists to keep the graph navigable on clustered data, not as a micro-optimization.
  • The hierarchy is a hypothesis worth testing; on high-dimensional embeddings it may add little.
  • Distance metric choice (L2 vs cosine vs dot product) changes hub formation and can flip conclusions.
  • I used a tiny two-cluster experiment plus a handful of first-party benchmarks to keep the story honest.

The Contract

I wanted a vector index I could explain without hand-waving. The contract is simple: given n vectors of dimension d, find the approximate k nearest neighbors for a query vector q with high recall, bounded latency, and memory overhead that does not blow up.

For interactive semantic search, a reasonable target looks like this:

  • High recall@k (for example, >= 95% depending on the application).
  • Low latency (for example, < 10ms per query in a real-time service).
  • Memory roughly proportional to O(n * d), plus moderate overhead.

These are example targets, not absolutes. The point is that brute force fails all three at scale, and any index worth building must hit this contract in practice.
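For scale, the brute-force baseline the index has to beat fits in a few lines. This is a sketch, not my actual harness; bruteForceKNN and l2Squared are illustrative names:

```go
package main

import (
	"fmt"
	"sort"
)

// l2Squared returns squared Euclidean distance; skipping the sqrt
// preserves the ordering and saves a call per comparison.
func l2Squared(a, b []float32) float32 {
	var sum float32
	for i := range a {
		d := a[i] - b[i]
		sum += d * d
	}
	return sum
}

// bruteForceKNN scans all n vectors for every query: O(n*d) per query,
// which is exactly the cost an index has to beat.
func bruteForceKNN(data [][]float32, q []float32, k int) []int {
	type pair struct {
		idx  int
		dist float32
	}
	pairs := make([]pair, len(data))
	for i, v := range data {
		pairs[i] = pair{i, l2Squared(q, v)}
	}
	sort.Slice(pairs, func(i, j int) bool { return pairs[i].dist < pairs[j].dist })
	out := make([]int, 0, k)
	for i := 0; i < k && i < len(pairs); i++ {
		out = append(out, pairs[i].idx)
	}
	return out
}

func main() {
	data := [][]float32{{0, 0}, {1, 1}, {5, 5}, {0.5, 0.5}}
	fmt.Println(bruteForceKNN(data, []float32{0.4, 0.4}, 2)) // prints [3 0]
}
```

Every query touches all n vectors, so latency grows linearly with the dataset. That linear scan is the failure mode the rest of this post exists to remove.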

I used GloVe as my baseline because it is real, dense, and semantic. It is also large enough (400k vectors in the glove.6B 100d set) to make the O(n) cost obvious.


A Baseline on GloVe-100 (20k Subset)

I ran my local harness on a 20,000-vector subset of glove.6B.100d.txt with 200 queries, k=10, M=16, efConstruction=100, and L2 distance. This is not meant to be a universal benchmark. It’s just a concrete baseline for this post.

Command I ran:

go run ./cmd/hnsw-demo -mode=glove -glove ~/Datasets/glove.6B.100d.txt -limit 20000 -queries 200 -k 10 -m 16 -efc 100 -ef 50 -metric l2 -compare -demo-word king

Queries are sampled from the indexed subset (not a held-out set). I exclude the query’s own index when computing brute-force ground truth and when scoring HNSW, so self-matches do not inflate recall. This is index introspection, not a train/test split: queries lie exactly on the data manifold, so recall numbers are optimistic compared to real out-of-distribution workloads. For studying the data structure’s behavior, that tradeoff is fine. I also ran a held-out check below to sanity-test this assumption.

Brute force vs HNSW on the same subset:

Method        Recall@10   QPS       p50 (us)   p95 (us)
Brute force   1.000       744.7     1315       1563
HNSW          0.997       9,089.3   112        147

Post-build memory snapshot (Go RSS): ~157 MiB (heapAlloc ~65.6 MiB, heapSys ~151.4 MiB).

I also used that same index for a quick sanity check:

Nearest neighbors for "king":
  king (dist=0.0000)
  prince (dist=16.7458)
  queen (dist=18.3291)
  monarch (dist=20.0182)
  brother (dist=20.5814)
  uncle (dist=21.7993)
  nephew (dist=22.0494)
  son (dist=22.1171)
  throne (dist=22.3965)
  kingdom (dist=22.5374)

The demo includes “king” itself because it is an indexed vector. The distances are squared L2, so the absolute values are not intuitive. The ordering is the signal.


Distance Is Not a Detail

Distance metric choice shapes the graph you end up building.

  • L2 (Euclidean): Unbounded distances. In high dimensions, distance concentration causes some points to appear in many k-NN lists (the hubness phenomenon), often correlated with proximity to the data centroid. These hubs dominate graph traversal.
  • Cosine (angular): Vectors are normalized to the unit sphere, so distances are bounded. Hub formation is weaker, which is why Radovanovic et al. call this the “anti-hub” effect in cosine spaces (JMLR 2010).
  • Dot product: Equivalent to cosine on normalized vectors. On unnormalized vectors, magnitude dominates, and high-norm vectors become hubs for a different reason entirely.

This matters because HNSW is a graph algorithm. If the metric changes hub formation, the graph’s navigation properties change too. That is why the FlatNav result later is not a contradiction; it is a consequence.
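To make the differences concrete, here are the three kernels as straightforward Go (a sketch; the benchmarks in this post only use the squared-L2 variant):

```go
package main

import (
	"fmt"
	"math"
)

// l2Squared: unbounded; monotonic in true L2, so fine for ranking.
func l2Squared(a, b []float32) float32 {
	var s float32
	for i := range a {
		d := a[i] - b[i]
		s += d * d
	}
	return s
}

// dot: on unnormalized vectors, magnitude dominates similarity.
func dot(a, b []float32) float32 {
	var s float32
	for i := range a {
		s += a[i] * b[i]
	}
	return s
}

// cosineDistance: 1 - cos(theta); bounded, insensitive to magnitude.
func cosineDistance(a, b []float32) float32 {
	var na, nb float32
	for i := range a {
		na += a[i] * a[i]
		nb += b[i] * b[i]
	}
	return 1 - dot(a, b)/float32(math.Sqrt(float64(na)*float64(nb)))
}

func main() {
	a := []float32{1, 0}
	b := []float32{10, 0} // same direction, 10x the magnitude
	fmt.Println(l2Squared(a, b), dot(a, b), cosineDistance(a, b))
}
```

The example in main is the whole story in miniature: cosine sees the two vectors as identical, while L2 and dot product both see the magnitude gap. Which of those is "right" depends entirely on what your embeddings encode in their norms.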


The Navigability Insight

Graph-based ANN only works if greedy or beam search can usually make progress.

The practical goal is this: for most queries q and most nodes v that are not among the true k nearest neighbors, at least one neighbor of v should be closer to q than v is. That is an empirical property of a good graph, not a guarantee. No finite graph on real data can enforce it universally.

Greedy descent alone gets stuck at local minima, which is why HNSW uses beam search with ef > 1. Beam search is the safety valve that lets you escape dead ends.

This is the first-principles leap. The algorithm is not about clever data structures. It is about preserving navigability under distance.

Kleinberg’s theorem is the theoretical lens I keep in my head (STOC 2000): greedy routing is efficient only when links span distance scales in the right way. HNSW bakes that into the graph instead of hoping it happens.


Build a Flat NSW Graph First

I started by building a flat Navigable Small World (NSW) graph. It is a useful stepping stone because it shows what the data structure wants to be before you add hierarchy.

The recipe is simple:

  • Insert points one at a time.
  • For each new point, connect it to its M nearest existing neighbors.
  • Early nodes create long-range edges because the graph is small.
  • Later nodes create short-range edges because the local neighborhood is dense.

This already produces a graph with links at multiple distance scales. It is the simplest place to look for failure modes.

Here is the mental model I use:

early inserts -> long-range links
late inserts  -> short-range links
flat graph    -> implicit distance scales

It is also the fastest way to see why the heuristic exists.
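The recipe fits in one method. A minimal sketch, with the simplifying assumption that neighbors are found by brute force over existing points (a real NSW build searches the partially built graph instead):

```go
package main

import (
	"fmt"
	"sort"
)

type flatNSW struct {
	vectors [][]float32
	edges   [][]int
	m       int
}

func l2sq(a, b []float32) float32 {
	var s float32
	for i := range a {
		d := a[i] - b[i]
		s += d * d
	}
	return s
}

// insert connects the new point to its m nearest existing nodes,
// bidirectionally. (Assumption: neighbors found by brute force here;
// a real NSW build searches the graph built so far.)
func (g *flatNSW) insert(v []float32) int {
	id := len(g.vectors)
	idxs := make([]int, id)
	for i := range idxs {
		idxs[i] = i
	}
	sort.Slice(idxs, func(a, b int) bool {
		return l2sq(g.vectors[idxs[a]], v) < l2sq(g.vectors[idxs[b]], v)
	})
	if len(idxs) > g.m {
		idxs = idxs[:g.m]
	}
	g.vectors = append(g.vectors, v)
	g.edges = append(g.edges, idxs)
	for _, nb := range idxs {
		g.edges[nb] = append(g.edges[nb], id) // back-edge
	}
	return id
}

func main() {
	g := &flatNSW{m: 2}
	// two far-apart pairs: a tiny two-cluster dataset
	for _, v := range [][]float32{{0, 0}, {10, 10}, {0, 1}, {10, 11}} {
		g.insert(v)
	}
	fmt.Println(g.edges[0]) // prints [1 2]
}
```

Node 0 keeps its early long-range edge to node 1 in the far cluster, because when node 1 arrived the graph had nothing closer to offer. That is the implicit multi-scale structure the mental model describes.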


Why the Heuristic Exists

The neighbor-selection heuristic in the HNSW paper (Malkov & Yashunin 2016) is not a micro-optimization. It exists to keep the graph navigable on clustered data.

I used a tiny 2D toy example: two well-separated clusters, M=8, and a simple selection rule (“take the closest M neighbors”) versus the heuristic rule (“accept a candidate only if it is closer to the query than any already-selected neighbor”).

Here is the exact output from my script:

Simple selection: 0/3200 edges cross clusters (0.000%)
Heuristic selection: 10/3200 edges cross clusters (0.312%)

Brute-force top-5 neighbors for query (1,1):
  idx= 28 label=0 dist2=0.003
  idx=162 label=0 dist2=0.020
  idx=107 label=0 dist2=0.042
  idx= 97 label=0 dist2=0.059
  idx= 96 label=0 dist2=0.117

The simple selection creates no cross-cluster edges. The graph becomes effectively disconnected for greedy search between clusters. The heuristic introduces just a handful of bridge edges, which is enough to restore navigability.

As the Vespa team puts it, “If a node or cluster is disconnected from the main cluster you’ll never reach it” (Vespa HNSW blog). That line is about greedy reachability, not strict graph connectivity, and it is exactly what the heuristic repairs.

This is why I treat the heuristic as non-negotiable.


The Heuristic in Code

Here is the exact logic I use in my implementation. c.Dist is the distance from the candidate to the query, precomputed during search. I only keep a candidate if it is not “covered” by an already-selected neighbor.

for _, c := range candidates {
    ok := true
    for _, s := range selected {
        if distance(c.ID, s) < c.Dist {
            ok = false
            break
        }
    }
    if ok {
        selected = append(selected, c.ID)
        if len(selected) >= M {
            break
        }
    }
}

This is a diversity rule disguised as geometry. It forces the neighborhood to spread out instead of collapsing into a single cluster.
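A self-contained way to watch the diversity rule work: the same logic as above, but on 2D points instead of node IDs. Candidates are assumed to arrive sorted by query distance, as they do coming out of searchLayer:

```go
package main

import (
	"fmt"
	"math"
)

type pt struct{ x, y float64 }

func d(a, b pt) float64 { return math.Hypot(a.x-b.x, a.y-b.y) }

// selectHeuristic keeps a candidate only if it is closer to the query
// than to every already-selected neighbor. Candidates must be sorted
// by distance to the query.
func selectHeuristic(q pt, cands []pt, m int) []pt {
	var selected []pt
	for _, c := range cands {
		ok := true
		for _, s := range selected {
			if d(c, s) < d(c, q) {
				ok = false
				break
			}
		}
		if ok {
			selected = append(selected, c)
			if len(selected) >= m {
				break
			}
		}
	}
	return selected
}

func main() {
	q := pt{0, 0}
	// two near-duplicates in one direction, one far point in another
	cands := []pt{{0.1, 0}, {0.2, 0}, {0, 5}}
	fmt.Println(selectHeuristic(q, cands, 2)) // prints [{0.1 0} {0 5}]
}
```

Closest-2 would take the two near-duplicates. The heuristic rejects the second one because it is covered by the first, and keeps the far point in a different direction instead. Scale that up and you get the cross-cluster bridges from the experiment above.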


Insertion in Code

Insertion is where everything comes together: greedy routing at higher layers, beam search at the target layer, the heuristic neighbor selection, and pruning back to M (or 2M at layer 0). Here is the core of my Add method, trimmed but faithful:

func (g *Graph) Add(vector []float32) uint32 {
    id := uint32(len(g.Nodes))
    level := g.randLevel()

    n := Node{
        ID:          id,
        Vector:      vector,
        Level:       level,
        Connections: make([][]uint32, level+1),
    }
    g.Nodes = append(g.Nodes, n)

    if len(g.Nodes) == 1 {
        g.EntryPoint = id
        g.MaxLevel = level
        return id
    }

    ep := g.EntryPoint
    maxLevel := g.MaxLevel

    for l := maxLevel; l > level; l-- {
        best := g.searchLayer(vector, ep, 1, l)
        if len(best) > 0 {
            ep = best[0].ID
        }
    }

    for l := min(maxLevel, level); l >= 0; l-- {
        cands := g.searchLayer(vector, ep, g.EfConstruction, l)
        targetM := g.M
        if l == 0 {
            targetM = g.M0
        }
        neighbors := g.selectNeighbors(id, cands, targetM, l)

        g.Nodes[id].Connections[l] = append(g.Nodes[id].Connections[l], neighbors...)
        for _, nb := range neighbors {
            g.addConnection(nb, id, l)
            g.pruneConnections(nb, l)
        }

        if len(neighbors) > 0 {
            ep = neighbors[0]
        }
    }

    if level > g.MaxLevel {
        g.EntryPoint = id
        g.MaxLevel = level
    }

    return id
}

I like this function because it is the whole story in one place. If I delete the hierarchy loop, I get a flat NSW. If I delete the heuristic, I get a graph that fails on clusters. If I delete the pruning, memory explodes.


The Hierarchy as a Hypothesis

Once the flat graph works, the hierarchy becomes a hypothesis. The intuition is that at large n, the flat graph has too many short edges and not enough long ones, so greedy search wanders before it finds the right neighborhood (the zoom-out problem).

The fix is multiscale navigation:

  • Layer 0 contains all nodes (dense, short-range).
  • Each higher layer contains roughly n / M^l nodes (sparse, long-range).
  • Search starts at the top and descends through layers.

The layer assignment is geometric:

l = floor(-ln(U) / ln(M)), where U is uniform on (0, 1)

For M=16 and n=1,000,000, you expect around five layers. The top layer typically has just a handful of nodes.

The key properties I keep explicit in my implementation:

  • Edges are bidirectional.
  • Neighbor lists are pruned to M (upper layers) or 2M (layer 0).
  • The entry point is always the highest-level node.

Bidirectionality improves strong connectivity and reduces dead ends, but it is not strictly required for correctness. It is one more empirical lever.

I treat the hierarchy as a hypothesis I test, not a law I blindly accept. That framing makes the FlatNav result feel like a payoff rather than a contradiction.


Beam Search Is the Recall Knob

HNSW uses beam search inside SEARCH-LAYER. That choice is essential.

With ef=1, the algorithm is just greedy descent. It is fast and gets stuck easily.

With ef > 1, the algorithm keeps multiple candidates alive. This is how it escapes local minima, and it is why efSearch is the primary recall knob. If you only remember one tuning rule, it should be: raise efSearch first.


Beam Search in Code

The heart of the search is a small loop that keeps two heaps: a min-heap of candidates to explore next, and a max-heap of the best results seen so far. The stop condition is simple: when the closest unexplored candidate is worse than the worst current result, you are done.

for candidates.Len() > 0 {
    cand := popMin(&candidates)
    worst := peekMax(&results)
    if cand.Dist > worst.Dist {
        break
    }
    for _, nb := range g.Nodes[cand.ID].Connections[level] {
        if visited[nb] {
            continue
        }
        visited[nb] = true
        d := g.Distance(query, g.Nodes[nb].Vector)
        if results.Len() < ef || d < worst.Dist {
            pushMin(&candidates, Neighbor{ID: nb, Dist: d})
            pushMax(&results, Neighbor{ID: nb, Dist: d})
            if results.Len() > ef {
                popMax(&results)
            }
            worst = peekMax(&results)
        }
    }
}

The moment ef goes above 1, this becomes a controlled best-first search over the graph rather than a brittle greedy walk.
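To see the difference ef makes, here is a toy graph with a deliberate local minimum: the entry node is close to the query, but the true nearest node sits behind a detour through a far node. Sorted slices stand in for the two heaps, which is fine at toy scale:

```go
package main

import (
	"fmt"
	"sort"
)

type toyGraph struct {
	val   []float32 // 1-D "vectors"
	edges [][]int
}

func dist(a, b float32) float32 {
	if a > b {
		return a - b
	}
	return b - a
}

// beamSearch is the SEARCH-LAYER loop; results is kept sorted so its
// last element is the worst current result.
func beamSearch(g toyGraph, entry int, q float32, ef int) []int {
	type cand struct {
		id int
		d  float32
	}
	visited := map[int]bool{entry: true}
	candidates := []cand{{entry, dist(g.val[entry], q)}}
	results := []cand{{entry, dist(g.val[entry], q)}}
	for len(candidates) > 0 {
		sort.Slice(candidates, func(i, j int) bool { return candidates[i].d < candidates[j].d })
		c := candidates[0]
		candidates = candidates[1:]
		if c.d > results[len(results)-1].d {
			break // closest unexplored candidate is worse than worst result
		}
		for _, nb := range g.edges[c.id] {
			if visited[nb] {
				continue
			}
			visited[nb] = true
			d := dist(g.val[nb], q)
			if len(results) < ef || d < results[len(results)-1].d {
				candidates = append(candidates, cand{nb, d})
				results = append(results, cand{nb, d})
				sort.Slice(results, func(i, j int) bool { return results[i].d < results[j].d })
				if len(results) > ef {
					results = results[:len(results)-1]
				}
			}
		}
	}
	ids := make([]int, len(results))
	for i, r := range results {
		ids[i] = r.id
	}
	return ids
}

func main() {
	// node 0 is a local minimum for q=7: its only neighbor (value 0)
	// is farther from q, and the true nearest (node 2, value 7) sits
	// behind that neighbor.
	g := toyGraph{
		val:   []float32{6, 0, 7},
		edges: [][]int{{1}, {0, 2}, {1}},
	}
	fmt.Println(beamSearch(g, 0, 7, 1)) // prints [0]: greedy is stuck
	fmt.Println(beamSearch(g, 0, 7, 2)) // prints [2 0]: beam escapes
}
```

With ef=1, the far neighbor never enters the result set, so the walk ends at the local minimum. With ef=2, that neighbor stays alive just long enough to expose the true nearest node behind it.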


The Practical Parameter Map

There are three knobs that matter in practice:

  • efSearch controls the recall/latency tradeoff at query time.
  • M controls memory overhead and graph connectivity.
  • efConstruction controls build time and build quality.

The memory rule of thumb from hnswlib is:

memory approx n * (d * 4 + M * 8-10) bytes

In a tight C++ implementation like hnswlib, that means 1M vectors at 128d with M=16 land around 640-670 MB. In Go, slice headers and GC metadata add overhead, so actual RSS will be higher. At 768d, the graph overhead is a few percent either way because the vectors dominate.
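The arithmetic behind those numbers, as a quick sketch (bytesPerLink is hnswlib's 8-10 rule-of-thumb figure, not a law):

```go
package main

import "fmt"

// estimateBytes applies the hnswlib rule of thumb:
// n * (d*4 + M*bytesPerLink), with bytesPerLink in the 8-10 range.
func estimateBytes(n, d, m, bytesPerLink int) int {
	return n * (d*4 + m*bytesPerLink)
}

func main() {
	n, d, m := 1_000_000, 128, 16
	fmt.Printf("low:  %d MB\n", estimateBytes(n, d, m, 8)/1_000_000)  // 640 MB
	fmt.Printf("high: %d MB\n", estimateBytes(n, d, m, 10)/1_000_000) // 672 MB
}
```

The same function also shows why the graph overhead fades at high d: at d=768 the vector term is 3,072 bytes per node against 128-160 bytes of links.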

My tuning loop looks like this:

  1. Fix M to a reasonable value (16 or 32).
  2. Sweep efSearch until recall saturates.
  3. If recall plateaus too low, increase M.
  4. If build time explodes, reduce efConstruction.

This is a way to keep the knobs grounded in empirical tradeoffs rather than folklore.


Held-Out Query Check (20k Subset)

To make sure the in-subset queries were not hiding anything, I also ran a held-out query experiment: I held out 2,000 vectors, built the index on the remaining 18,000, and sampled queries only from the hold-out set.

Command I ran:

go run ./cmd/hnsw-demo -mode=glove -glove ~/Datasets/glove.6B.100d.txt -limit 20000 -holdout 2000 -queries 200 -k 10 -m 16 -efc 100 -ef 50 -metric l2

Method        Recall@10   QPS       p50 (us)   p95 (us)
Brute force   1.000       792.2     1275       1340
HNSW          0.988       9,899.9   99         126

The recall drop is small but real. If you care about out-of-distribution behavior, you should run a proper held-out evaluation on your own embeddings.


A Quick Sweep on GloVe-100 (20k Subset)

I ran a small sweep on 20,000 vectors from glove.6B.100d.txt with 200 queries, k=10, M=16, efConstruction=100, and L2 distance. This is a local run on my laptop, so treat the absolute numbers as “directional,” not universal. The curve is the point.

Command I ran:

go run ./cmd/hnsw-demo -mode=sweep -glove ~/Datasets/glove.6B.100d.txt -limit 20000 -queries 200 -k 10 -m 16 -efc 100 -metric l2 -ef-list 10,25,50,100,200

efSearch   recall@10   QPS        p50 (us)   p95 (us)
10         0.962       30,149.1   32         45
25         0.987       17,797.8   57         72
50         0.997       11,013.2   92         119
100        0.999       5,907.3    174        212
200        1.000       3,275.7    314        382

This is exactly the tradeoff I expect: as efSearch rises, recall climbs and latency/QPS degrade smoothly. It’s a knob that behaves like a knob, which is rare and useful.


The FlatNav Twist

The FlatNav paper (“Down with the Hierarchy” arXiv 2024) argues that for high-dimensional data (d >= 32), the hierarchy adds little benefit. The work was presented at the first Vector Databases workshop at ICML 2025 (VecDB@ICML 2025), not the main conference.

Their hypothesis is the “hub highway” effect: in high-dimensional L2 spaces, a few nodes become hubs and form a dense highway that can serve a similar role to the hierarchy’s long-range connections. The strength of this effect depends on intrinsic dimensionality, data distribution, and distance metric, so it is not a universal law.

The paper also notes that this effect is weaker under cosine distance, which lines up with the metric discussion earlier.

This gave me a concrete test to run: compare a flat NSW graph to full HNSW on the same dataset. On the same 20k GloVe-100 subset as above (k=10, M=16, efConstruction=100, efSearch=50), I got this:

Index      Recall@10   QPS       p50 (us)   p95 (us)
Flat NSW   0.998       9,279.5   109        136
HNSW       0.997       9,089.3   112        147

In this local run, the flat graph was actually a hair faster at the same recall. That does not mean the hierarchy is useless, but it does suggest the graph structure itself is doing most of the work at this scale. A caveat: at 20k vectors, hierarchy effects are muted. The real regime shift likely starts at 1M+, where long-range sparsity in the flat graph becomes more costly.

It is also worth noting that GloVe-100’s nominal dimensionality (100) may not reflect its intrinsic dimensionality, which is what actually drives hubness and navigability. Two datasets with the same embedding dimension can behave very differently if their intrinsic dimensionality differs.

The natural follow-up is to repeat this at lower dimensions (GloVe-25) and under cosine distance, where the FlatNav paper’s hub-highway effect should be weaker.


A 100k Scale Check

I re-ran the exact same parameters on a 100,000-vector subset to see how the curve moves with scale. Build time was about 1m19s for HNSW and 1m20s for the flat graph on my laptop. RSS after build was ~853 MiB for HNSW.

Command I ran:

go run ./cmd/hnsw-demo -mode=glove -glove ~/Datasets/glove.6B.100d.txt -limit 100000 -queries 200 -k 10 -m 16 -efc 100 -ef 50 -metric l2 -compare -demo-word king

Brute force vs HNSW on the 100k subset:

Method        Recall@10   QPS       p50 (us)   p95 (us)
Brute force   1.000       143.1     6993       7497
HNSW          0.960       4,403.7   229        322

Flat NSW vs HNSW at the same parameters:

Index      Recall@10   QPS       p50 (us)   p95 (us)
Flat NSW   0.961       4,184.6   243        319
HNSW       0.960       4,403.7   229        322

The big change is recall. At 100k, the same M=16 and efSearch=50 settings are no longer enough for ~0.99 recall. That is exactly what the tuning loop predicts: if you want the old recall, you raise efSearch or increase M and pay the price.


What Production Systems Add

Once the algorithm is clear, production systems are easier to understand:

  • hnswlib focuses on tight memory layout and SIMD distance kernels.
  • Qdrant integrates filtering during graph traversal and adds healing for deletes.
  • Weaviate pushes Go to SIMD with hand-written assembly and gets big speedups in distance functions.

All of them keep the same HNSW skeleton. The differences are in persistence, concurrency, filtering, quantization, and memory layout.


What I Built

I built a minimal HNSW implementation in Go. The full source is on GitHub. I kept it intentionally small: no persistence, no concurrency, no filtering, no quantization. I wanted the data structure, not the product.

The core structs (trimmed for clarity) look like this:

type Node struct {
    ID          uint32
    Vector      []float32
    Level       int
    Connections [][]uint32
}

type Graph struct {
    Nodes          []Node
    EntryPoint     uint32
    MaxLevel       int
    M, M0          int
    EfConstruction int
    ML             float64
    Distance       DistanceFunc
    UseHeuristic   bool
    Flat           bool
}

One implementation detail worth calling out: when I prune neighbor lists, I also remove back-edges to keep the graph consistent. That deviates slightly from the paper, but it keeps the adjacency lists symmetric in practice.
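A sketch of that symmetric prune on a bare adjacency map (illustrative names; my real pruneConnections works per layer and reuses the selection heuristic rather than plain closest-m):

```go
package main

import (
	"fmt"
	"sort"
)

// pruneTo trims node u's neighbor list to the m closest and removes
// the back-edge from every dropped neighbor, keeping the adjacency
// symmetric. dist is the distance between two node IDs.
func pruneTo(adj map[int][]int, u, m int, dist func(a, b int) float32) {
	nbs := adj[u]
	if len(nbs) <= m {
		return
	}
	sort.Slice(nbs, func(i, j int) bool { return dist(u, nbs[i]) < dist(u, nbs[j]) })
	kept, dropped := nbs[:m], nbs[m:]
	adj[u] = kept
	for _, dn := range dropped {
		// filter u out of the dropped neighbor's list in place
		list := adj[dn][:0]
		for _, x := range adj[dn] {
			if x != u {
				list = append(list, x)
			}
		}
		adj[dn] = list
	}
}

func main() {
	pos := []float32{0, 1, 2, 9}
	dist := func(a, b int) float32 {
		d := pos[a] - pos[b]
		if d < 0 {
			d = -d
		}
		return d
	}
	adj := map[int][]int{0: {1, 2, 3}, 1: {0}, 2: {0}, 3: {0}}
	pruneTo(adj, 0, 2, dist)
	fmt.Println(adj[0], adj[3]) // prints [1 2] []: node 3 loses its back-edge
}
```

The paper leaves the dropped neighbor's back-edge in place, which makes the graph mildly asymmetric over time; removing it is a small consistency tax at prune time in exchange for adjacency lists you can reason about.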


M Sweep on GloVe-100 (20k Subset)

I ran a quick M sweep on the same 20,000-vector subset (200 queries, k=10, L2 distance) to sanity-check the memory/recall tradeoff.

M    Recall@10   QPS        p50 (us)   p95 (us)   Build Time
8    0.987       14,422.4   67         101        4.15s
16   0.997       10,836.9   95         120        10.48s
32   1.000       6,906.5    150        195        37.67s


Closing Thoughts

I like HNSW because the data structure is the story. Every piece exists because something breaks without it.

  • The heuristic exists because clustered data breaks navigability.
  • The hierarchy exists because long-range jumps become rare at scale.
  • Beam search exists because greedy search gets stuck.

The FlatNav work complicates the story in a good way. It suggests that some of the “obvious” machinery may be optional in the regimes I actually care about. That is exactly what I want from a weekend build.

This post also fits a pattern I’ve fallen into recently: build the thing, find the edge cases, and then compare it to the production version. I did that with my LSM storage engine and with the S3-based queue. HNSW feels like the next logical step.

If you are running embeddings at sub-billion scale and you can keep everything in RAM, HNSW is still a strong default. Beyond that, DiskANN (graph-based, SSD-optimized, powers Bing and SQL Server 2025) and IVF-PQ (quantization-based, the backbone of FAISS) are the established options for billion-scale search.

I am a huge fan of understanding the algorithm at this level because it turns tuning from folklore into engineering. That is the point of building it from scratch.