prefetcher
CPU hardware that predicts upcoming memory accesses and pulls cache lines into L1/L2 before the demand load arrives; effective on regular stride patterns, useless on pointer chasing.
A prefetcher is a CPU hardware unit that predicts which memory addresses the program will need soon and issues speculative loads for them, so the cache line is already present when the demand load arrives. Modern x86 cores carry several prefetchers at the L1 and L2 levels, including the DCU (L1 data cache unit) streaming prefetcher, the DCU IP (per-instruction stride) prefetcher, and the L2 streamer and adjacent-cache-line prefetchers. AMD and Arm have their own variants.
Prefetchers are extremely effective on regular access patterns: sequential scans, strided loops, SIMD kernels over arrays. In those cases, the prefetcher can hide memory latency almost entirely — the profiler shows L1 hits even though the data wasn’t there at the start of the loop. This is why SoA layouts and tight stride loops feel “free” on modern CPUs.
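A minimal sketch of the two friendly patterns, in C; the function names are illustrative, not from any particular codebase:

```c
#include <stddef.h>
#include <stdint.h>

/* Constant stride of 1: after a few misses the streamer locks onto
 * the pattern and fetches lines ahead of the loop, so most loads hit
 * in cache even when the array starts cold. */
int64_t sum_sequential(const int32_t *a, size_t n) {
    int64_t s = 0;
    for (size_t i = 0; i < n; i++)
        s += a[i];
    return s;
}

/* A fixed stride greater than 1 is also detectable (e.g. stepping
 * over whole structs): a stride-tracking prefetcher still predicts
 * the next address, though each fetched line carries unused bytes. */
int64_t sum_strided(const int32_t *a, size_t n, size_t stride) {
    int64_t s = 0;
    for (size_t i = 0; i < n; i += stride)
        s += a[i];
    return s;
}
```

Both loops have addresses that are a pure function of the loop counter, which is exactly what the hardware can extrapolate.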
They are largely useless on pointer chasing. A linked list, a tree, or an HNSW graph traversal has a data-dependent access pattern: the next address isn’t known until the load of the current node completes. The prefetcher cannot predict what it cannot see, so each hop eats the full DRAM latency (100+ ns). This is the core mechanical-sympathy reason pointer-chasing data structures underperform flat ones.
Software hints exist: __builtin_prefetch on GCC/Clang compiles to one of the PREFETCHT0/T1/T2/NTA instructions on x86 (selected by its locality argument), explicitly requesting that a specific line be brought into cache. It helps when you can compute the next address at least one iteration before you dereference it — a common trick in graph-walking and hash-table code.
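A hedged sketch of that trick on a singly linked list (the node layout and function names are invented for illustration):

```c
#include <stddef.h>

/* Hypothetical list node used only for this example. */
struct node {
    struct node *next;
    long value;
};

/* Baseline: each iteration's load address depends on the previous
 * load, so the hardware prefetcher cannot run ahead of the walk. */
long sum_list(const struct node *n) {
    long s = 0;
    while (n) {
        s += n->value;
        n = n->next;
    }
    return s;
}

/* Software-prefetch variant: n->next is already in a register one
 * iteration before its fields are touched, so we hint its line into
 * cache while still working on the current node. Locality hint 3
 * maps to PREFETCHT0 (fetch into all cache levels) on x86. */
long sum_list_prefetch(const struct node *n) {
    long s = 0;
    while (n) {
        if (n->next)
            __builtin_prefetch(n->next, /*rw=*/0, /*locality=*/3);
        s += n->value;
        n = n->next;
    }
    return s;
}
```

Note the limit of the trick: prefetching only one hop ahead hides at most one iteration's worth of work, so graph code often prefetches several nodes ahead (e.g. a candidate's neighbors) to buy more latency headroom.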