TL;DR

HNSW’s runtime cost, once the index stops fitting in last-level cache, is dominated by pointer-chasing during graph descent, not by distance math. Miss rates on the lower layers climb sharply once the working set crosses the L3 boundary, and the “vector hangover” that 2026 RAG teams are hitting is a second-order effect of graph metadata blowing the working-set budget. This post shows the miss profile directly, using perf c2c and hardware counters.

Methodology

| Field    | Value |
|----------|-------|
| CPU      | TBD — target Sapphire Rapids (8480+) and Genoa (EPYC 9654) |
| Microcode | pinned, recorded |
| Kernel   | Linux 6.18 LTS, vanilla, THP madvise |
| Compiler | clang 19, -O3 -march=native -mavx2 |
| Dataset  | SIFT-1M (academic) + 768-dim synthetic (1M–100M vectors, matches modern embeddings) |
| Index    | hnswlib reference + FAISS HNSW validation, M={16, 32}, ef={50, 200} |
| Governor | performance, turbo on |
| Repro    | lowlat-ms/lowlat-bench/hnsw-cache-pressure |

The question

  • Define “cache pressure” precisely: L1/L2/L3 miss rates, LLC occupancy, prefetcher effectiveness, TLB walks.
  • Single empirical question: where does the miss profile live during HNSW descent, and where is the cliff when the working set exceeds L3?
  • Why this angle is novel: every other HNSW post lives at the API/recall/throughput level. Nobody has shown the miss profile.

Introduction

Setup

  • hnswlib first for clarity (reference C++ implementation, clean code to instrument).
  • FAISS HNSW second for production validation.
  • Datasets: SIFT-1M baseline, 768-dim synthetic at 1M, 10M, and 100M to force the L3 and DRAM cliffs.
  • Concurrency: single-query trace first, then N-query sustained load.

Baseline

  • Trace a single query through the graph. Record cache misses per layer of descent.
  • Break time down: distance math vs pointer load vs memory stall.
  • Compare AoS vs SoA layout of vector storage on the same index.
  • Establish recall@10 and latency baseline at ef={50, 100, 200}.

Optimizations

  • N queries/sec sustained load: map working-set size against L2, L3, and DRAM footprint.
  • Find the cliff: the exact QPS at which p99 latency jumps by 10x.
  • Pathologies: pointer chasing on random neighbor lookups, false sharing on concurrent updates, cold-tier accesses.
  • Mitigations to test: prefetch neighbors before descent, per-thread LRU of recently-visited nodes, SoA vector storage, PQ quantization to shrink working set.

Results

  • Per-layer cache miss heatmap during query descent.
  • Working-set-vs-latency plot with the L3 cliff annotated.
  • perf c2c output showing false sharing hotspots under concurrent insert.
  • AoS-vs-SoA delta: how much of the runtime is memory-bound in each.

Limitations

  • Results from one HNSW implementation at a time don’t generalize to all: Weaviate, Qdrant, Milvus, and pgvector each lay out graph state differently.
  • Dataset shape matters: SIFT-1M is small enough that everyone’s results look good; 768-dim is where pain starts.
  • Hardware variation: L3 size varies wildly between Sapphire Rapids and Genoa, so the cliff location is hardware-shaped.
  • Not a quantization deep-dive — PQ/SQ/binary deserve their own post.

Reproducibility

  • Full repo: lowlat-ms/lowlat-bench/hnsw-cache-pressure, one-command driver for all configs.
  • Raw perf stat, perf c2c, perf mem outputs shipped alongside charts.
  • Dataset seeds and synthetic-vector generator pinned.
  • Same hardware and repro harness as post #1 (io_uring vs epoll) for cross-post comparability.

References