TL;DR

HNSW’s runtime cost, once the index stops fitting in last-level cache, is dominated by pointer-chasing during graph descent, not by distance math. Miss rates on the lower layers climb sharply once the working set crosses the L3 boundary, and the “vector hangover” that 2026 RAG teams are hitting is a second-order effect of graph metadata blowing the working-set budget. This post shows the miss profile directly, using perf c2c and hardware counters.

Methodology

| Field    | Value |
|----------|-------|
| CPU      | TBD — target Sapphire Rapids (8480+) and Genoa (EPYC 9654) |
| Microcode | pinned, recorded |
| Kernel   | Linux 6.18 LTS, vanilla, THP madvise |
| Compiler | clang 19, -O3 -march=native -mavx2 |
| Dataset  | SIFT-1M (academic) + 768-dim synthetic (1M–100M vectors, matches modern embeddings) |
| Index    | hnswlib reference + FAISS HNSW validation, M={16, 32}, ef={50, 200} |
| Governor | performance, turbo on |
| Repro    | lowlat-ms/lowlat-bench/hnsw-cache-pressure |

The question

  • Define “cache pressure” precisely: L1/L2/L3 miss rates, LLC occupancy, prefetcher effectiveness, TLB walks.
  • Single empirical question: where does the miss profile live during HNSW descent, and where is the cliff when the working set exceeds L3?
  • Why this angle is novel: every other HNSW post lives at the API/recall/throughput level. Nobody has shown the miss profile.

Introduction

Setup

  • hnswlib first for clarity (reference C++ implementation, clean code to instrument).
  • FAISS HNSW second for production validation.
  • Datasets: SIFT-1M baseline, 768-dim synthetic at 1M, 10M, and 100M to force the L3 and DRAM cliffs.
  • Concurrency: single-query trace first, then N-query sustained load.

Baseline

  • Trace a single query through the graph. Record cache misses per layer of descent.
  • Break time down: distance math vs pointer load vs memory stall.
  • Compare AoS vs SoA layout of vector storage on the same index.
  • Establish recall@10 and latency baseline at ef={50, 100, 200}.

Optimizations

  • N queries/sec sustained load: map working-set size against L2, L3, and DRAM footprint.
  • Find the cliff: the exact QPS at which p99 latency jumps by 10x.
  • Pathologies: pointer chasing on random neighbor lookups, false sharing on concurrent updates, cold-tier accesses.
  • Mitigations to test: prefetch neighbors before descent, per-thread LRU of recently-visited nodes, SoA vector storage, PQ quantization to shrink working set.

Results

  • Per-layer cache miss heatmap during query descent.
  • Working-set-vs-latency plot with the L3 cliff annotated.
  • perf c2c output showing false sharing hotspots under concurrent insert.
  • AoS-vs-SoA delta: how much of the runtime is memory-bound in each.

Limitations

  • Results from one HNSW implementation at a time don’t generalize to all: Weaviate, Qdrant, Milvus, and pgvector each lay out graph state differently.
  • Dataset shape matters: SIFT-1M is small enough that everyone’s results look good; 768-dim is where pain starts.
  • Hardware variation: L3 size varies wildly between Sapphire Rapids and Genoa, so the cliff location is hardware-shaped.
  • Not a quantization deep-dive — PQ/SQ/binary deserve their own post.

Reproducibility

  • Full repo: lowlat-ms/lowlat-bench/hnsw-cache-pressure, one-command driver for all configs.
  • Raw perf stat, perf c2c, perf mem outputs shipped alongside charts.
  • Dataset seeds and synthetic-vector generator pinned.
  • Same hardware and repro harness as post #1 (io_uring vs epoll) for cross-post comparability.

References