What HNSW actually does to your cache
When you serve queries against an HNSW index that exceeds your buffer cache, where does the cache pressure actually fall? Which levels miss, how often, and which graph traversal patterns are pathological?
- hardware
  - TBD — target: Intel Xeon Platinum 8480+ (Sapphire Rapids) and AMD EPYC 9654 (Genoa) for cross-uarch validation
- kernel
  - Linux 6.18 LTS, vanilla; THP madvise
- compiler
  - clang 19, -O3 -march=native -mavx2
- dataset
  - hnswlib (reference) + FAISS HNSW (production validation); SIFT-1M academic + 768-dim synthetic corpus to match modern RAG embeddings
TL;DR
HNSW’s runtime cost, once the index stops fitting in last-level cache, is dominated by pointer-chasing during graph descent — not by distance math. Miss rates on the lower layers climb steeply once the working set crosses the L3 boundary, and the “vector hangover” that RAG teams are hitting in 2026 is a second-order effect of graph metadata blowing the working-set budget. This post shows the miss profile directly, using `perf c2c` and hardware counters.
Methodology
| Field | Value |
|---|---|
| CPU | TBD — target Sapphire Rapids (8480+) and Genoa (EPYC 9654) |
| Microcode | pinned, recorded |
| Kernel | Linux 6.18 LTS, vanilla, THP madvise |
| Compiler | clang 19, -O3 -march=native -mavx2 |
| Dataset | SIFT-1M (academic) + 768-dim synthetic (1M–100M vectors, matches modern embeddings) |
| Index | hnswlib reference + FAISS HNSW validation, M={16, 32}, ef={50, 200} |
| Governor | performance, turbo on |
| Repro | lowlat-ms/lowlat-bench/hnsw-cache-pressure |
The question
- Define “cache pressure” precisely: L1/L2/L3 miss rates, LLC occupancy, prefetcher effectiveness, TLB walks.
- Single empirical question: where does the miss profile live during HNSW descent, and where is the cliff when the working set exceeds L3?
- Why this angle is novel: every other HNSW post lives at the API/recall/throughput level. Nobody has shown the miss profile.
Introduction
- Vector search is the hottest infra category of 2025–2026. HNSW is the dominant in-memory algorithm.
- Brief mechanism recap: multi-layer graph, descend from top, greedy search at the bottom.
- The “vector hangover” (https://tech-champion.com/database/the-vector-hangover-hnsw-index-memory-bloat-in-production-rag/) is the motivating real-world pain.
- Why mechanical sympathy is the missing lens here.
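To make the mechanism recap concrete, here is a toy sketch of the descent being instrumented — greedy hop-to-best on each upper layer, handing off an entry point to the ef-bounded search on layer 0. The graph layout is a deliberately simplified stand-in (nested vectors), not hnswlib’s actual node format; every `layers[l][cur]` access is the pointer-chase the post profiles.

```cpp
// Toy sketch of HNSW descent: greedy walk on each upper layer. The
// real bottom-layer search is a bounded best-first expansion over ef
// candidates; this shows only the upper-layer descent.
#include <cstddef>
#include <vector>

using Vec = std::vector<float>;

float l2(const Vec& a, const Vec& b) {
    float d = 0;
    for (std::size_t i = 0; i < a.size(); ++i) {
        float t = a[i] - b[i];
        d += t * t;
    }
    return d;
}

struct ToyHNSW {
    std::vector<Vec> points;
    // layers[l][node] = neighbor ids of `node` on layer l (layer 0 = bottom).
    std::vector<std::vector<std::vector<int>>> layers;
    int entry = 0;

    // Greedy descent: on each upper layer, hop to the closest neighbor
    // until no neighbor improves on the current node, then drop a layer.
    int descend(const Vec& q) const {
        int cur = entry;
        for (std::size_t l = layers.size() - 1; l >= 1; --l) {
            bool improved = true;
            while (improved) {
                improved = false;
                for (int nb : layers[l][cur])
                    if (l2(q, points[nb]) < l2(q, points[cur])) {
                        cur = nb;
                        improved = true;
                    }
            }
        }
        return cur;  // entry point for the ef-bounded search on layer 0
    }
};

// Tiny demo: 1-D points {0,1,2,3}; layer 0 is a chain, layer 1 links 0<->2.
int demo() {
    ToyHNSW g;
    g.points = {{0.f}, {1.f}, {2.f}, {3.f}};
    g.layers = {
        {{1}, {0, 2}, {1, 3}, {2}},   // layer 0: 0-1-2-3 chain
        {{2}, {}, {0}, {}},           // layer 1: 0 <-> 2
    };
    return g.descend({3.1f});         // greedy walk on layer 1 lands on node 2
}
```

Note that every distance call touches `points[nb]` — a second dependent load behind the neighbor-list load, which is why the AoS-vs-SoA comparison later matters.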
Setup
- hnswlib first for clarity (reference C++ implementation, clean code to instrument).
- FAISS HNSW second for production validation.
- Datasets: SIFT-1M baseline, 768-dim synthetic at 1M, 10M, and 100M to force the L3 and DRAM cliffs.
- Concurrency: single-query trace first, then N-query sustained load.
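The 768-dim synthetic corpus could be generated along these lines — a sketch only, and the distribution choice (i.i.d. Gaussian components, L2-normalized, the usual stand-in for sentence-embedding output) is an assumption, not a claim about the post’s actual generator:

```cpp
// Sketch: seeded 768-dim synthetic vectors as an embedding stand-in.
// Assumption: i.i.d. Gaussian components, L2-normalized like typical
// sentence-embedding output; seed pinned for reproducibility.
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <random>
#include <vector>

std::vector<float> make_corpus(std::size_t n, std::size_t dim, uint64_t seed) {
    std::mt19937_64 rng(seed);
    std::normal_distribution<float> gauss(0.0f, 1.0f);
    std::vector<float> flat(n * dim);       // row-major: vector i at flat[i*dim]
    for (std::size_t i = 0; i < n; ++i) {
        float norm2 = 0;
        for (std::size_t j = 0; j < dim; ++j) {
            float x = gauss(rng);
            flat[i * dim + j] = x;
            norm2 += x * x;
        }
        float inv = 1.0f / std::sqrt(norm2); // normalize to the unit sphere
        for (std::size_t j = 0; j < dim; ++j) flat[i * dim + j] *= inv;
    }
    return flat;
}
```

At 100M × 768 × 4 bytes this is ~286 GiB of raw vectors before any graph metadata, which is what forces the DRAM cliff the setup calls for.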
Baseline
- Trace a single query through the graph. Record cache misses per layer of descent.
- Break time down: distance math vs pointer load vs memory stall.
- Compare AoS vs SoA layout of vector storage on the same index.
- Establish recall@10 and latency baseline at ef={50, 100, 200}.
Optimizations
- N queries/sec sustained load: map working-set size against L2, L3, and DRAM footprint.
- Find the cliff: the exact QPS at which p99 latency jumps by 10x.
- Pathologies: pointer chasing on random neighbor lookups, false sharing on concurrent updates, cold-tier accesses.
- Mitigations to test: prefetch neighbors before descent, per-thread LRU of recently-visited nodes, SoA vector storage, PQ quantization to shrink working set.
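The “prefetch neighbors before descent” mitigation looks roughly like this in isolation — a minimal sketch using the GCC/clang `__builtin_prefetch` builtin (hnswlib does a variant of this internally; the function and layout here are hypothetical, chosen to show the idea, not copied from either library):

```cpp
// Sketch: prefetch every neighbor's vector before scoring the list, so
// the loads overlap with the distance math instead of serializing on it.
// Requires GCC/clang (__builtin_prefetch); vectors are AoS, row-major.
#include <cstddef>
#include <vector>

float l2(const float* a, const float* b, std::size_t dim) {
    float d = 0;
    for (std::size_t j = 0; j < dim; ++j) {
        float t = a[j] - b[j];
        d += t * t;
    }
    return d;
}

// Returns the id of the closest candidate to q among `nbrs`.
int best_neighbor(const std::vector<int>& nbrs, const float* vectors,
                  std::size_t dim, const float* q) {
    // Issue all prefetches up front; by the time the loop below reaches
    // neighbor k, its cache lines are (ideally) already in flight.
    for (int nb : nbrs)
        __builtin_prefetch(vectors + (std::size_t)nb * dim, 0 /*read*/, 1);

    int best = nbrs[0];
    float best_d = l2(vectors + (std::size_t)nbrs[0] * dim, q, dim);
    for (std::size_t k = 1; k < nbrs.size(); ++k) {
        float d = l2(vectors + (std::size_t)nbrs[k] * dim, q, dim);
        if (d < best_d) { best_d = d; best = nbrs[k]; }
    }
    return best;
}
```

Whether this pays off is exactly what the sustained-load experiment measures: prefetches that arrive too early get evicted, and ones that arrive too late just add instruction overhead.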
Results
- Per-layer cache miss heatmap during query descent.
- Working-set-vs-latency plot with the L3 cliff annotated.
- `perf c2c` output showing false sharing hotspots under concurrent insert.
- AoS-vs-SoA delta: how much of the runtime is memory-bound in each.
Limitations
- One HNSW implementation at a time doesn’t generalize to all (Weaviate, Qdrant, Milvus, pgvector each lay out graph state differently).
- Dataset shape matters: SIFT-1M is small enough that everyone’s results look good; 768-dim is where pain starts.
- Hardware variation: L3 size varies wildly between Sapphire Rapids and Genoa, so the cliff location is hardware-shaped.
- Not a quantization deep-dive — PQ/SQ/binary deserve their own post.
Reproducibility
- Full repo: `lowlat-ms/lowlat-bench/hnsw-cache-pressure`, one-command driver for all configs.
- Raw `perf stat`, `perf c2c`, and `perf mem` outputs shipped alongside charts.
- Dataset seeds and synthetic-vector generator pinned.
- Same hardware and repro harness as post #1 (io_uring vs epoll) for cross-post comparability.
References
- Malkov & Yashunin, Efficient and robust approximate nearest neighbor search using Hierarchical Navigable Small World graphs (2016) — https://arxiv.org/abs/1603.09320
- hnswlib reference implementation — https://github.com/nmslib/hnswlib
- FAISS — https://github.com/facebookresearch/faiss
- “The Vector Hangover: HNSW Index Memory Bloat in Production RAG” — https://tech-champion.com/database/the-vector-hangover-hnsw-index-memory-bloat-in-production-rag/
- P-HNSW: crash-consistent HNSW on persistent memory — https://www.mdpi.com/2076-3417/15/19/10554
- Vespa, “Billion-scale vector search using hybrid HNSW-IF” — https://medium.com/vespa/billion-scale-vector-search-using-hybrid-hnsw-if-96d7058037d3