memory bandwidth

Sustained rate at which a CPU can read from or write to main memory — measured in GB/s per socket — and one of the two fundamental ceilings (with latency) for data-intensive workloads.

also known as dram-bandwidth · memory-throughput

stack memory · cpu · bus / dma

Memory bandwidth is the sustained rate at which a CPU can read from or write to main memory, typically expressed in gigabytes per second per socket. A modern DDR5 server delivers on the order of 300–460 GB/s of peak memory bandwidth per socket: roughly 307 GB/s for 8-channel Sapphire Rapids and roughly 460 GB/s for 12-channel Genoa, both with DDR5-4800. Sustained STREAM Triad numbers are usually 70–80% of peak. Adding sockets multiplies local bandwidth, but cross-socket traffic is bottlenecked by UPI or Infinity Fabric, which is why workloads that chase cross-socket data rarely see linear scaling.

Memory bandwidth is one of the two fundamental memory ceilings; the other is latency. Streaming workloads (large-array SIMD kernels, columnar scans, low-arithmetic-intensity kernels such as matrix-vector products) are bandwidth-bound. Pointer-chasing workloads (linked lists, graph traversal, HNSW descent) are latency-bound: they cannot use the full bandwidth because each load depends on the previous one.

The practical tools and knobs:

  • STREAM — the canonical bandwidth benchmark; reports Copy/Scale/Add/Triad numbers.
  • Intel MLC (Memory Latency Checker) — measures latency vs bandwidth at varied access patterns.
  • Huge pages reduce page-walk overhead and improve sustained bandwidth on scan workloads (fewer TLB misses mean the DRAM subsystem sees more useful requests per unit time).
  • Memory channels populated symmetrically — half-populated DIMM slots halve aggregate bandwidth. This is a surprisingly common production misconfiguration.
  • NUMA balance — thread and data placement determine which socket’s memory controllers you actually hit.

Rule of thumb: if perf stat shows cycle_activity.stalls_l3_miss accounting for a large share of cycles, you’re bandwidth-bound. If mem_load_retired.l3_miss latency is high but stall counts are lower, you’re latency-bound. The fixes are different: bandwidth problems call for moving less data or adding channels; latency problems call for prefetching, batching, or flatter data structures.
