NUMA
Non-Uniform Memory Access — multi-socket and multi-chiplet systems where each CPU has faster access to its local memory than to remote memory on another node, making placement matter.
NUMA (Non-Uniform Memory Access) describes systems where main memory is divided into nodes, each attached to a specific CPU (socket, or on some modern parts, chiplet/die). A thread accessing memory on its local node pays the lowest latency; accessing memory on a remote node goes over an inter-socket interconnect (Intel UPI/QPI, AMD Infinity Fabric, Arm CCIX/CXL) and pays extra latency and contention-sensitive bandwidth.
On a two-socket Xeon or EPYC server, local DRAM latency might be ~80 ns while remote is ~130 ns — and sustained remote bandwidth is 50–70% of local. On AMD EPYC with multiple NUMA domains per socket (because of chiplets and NPS modes), even single-socket access can be non-uniform.
For latency-sensitive work:
- First-touch allocation: Linux places a page on the node of the first thread to touch it. Initializing an entire array in the master thread before handing it to workers puts every page on the master's node, a classic NUMA anti-pattern.
- `numactl --cpunodebind=<node> --membind=<node>` pins a process to specific CPU and memory nodes; `numa_alloc_local`/`mbind` give fine-grained per-allocation control.
- NUMA balancing: the Linux kernel periodically migrates pages and threads to co-locate them. This can hurt latency-sensitive workloads; disable it with `numa_balancing=0`.
- `perf stat -e node-load-misses,node-loads` and `numastat` surface NUMA misses.
Modern cache coherency (MESI/MOESI variants) extends across NUMA nodes but the coherence traffic is expensive. Avoiding unnecessary cross-node sharing is often the highest-leverage optimization on multi-socket servers.