NUMA

Non-Uniform Memory Access — multi-socket and multi-chiplet systems where each CPU has faster access to its local memory than to remote memory on another node, making placement matter.

also known as numa · non-uniform-memory-access

stack cpu · memory · bus / dma

NUMA (Non-Uniform Memory Access) describes systems where main memory is divided into nodes, each attached to a specific CPU (a socket, or on some modern parts, a chiplet/die). A thread accessing memory on its local node pays the lowest latency; accessing memory on a remote node goes over an inter-socket interconnect (Intel UPI/QPI, AMD Infinity Fabric, Arm CCIX/CXL) and pays extra latency, with lower and more contention-sensitive bandwidth.

On a two-socket Xeon or EPYC server, local DRAM latency might be ~80 ns while remote is ~130 ns, and sustained remote bandwidth is only 50–70% of local. On AMD EPYC with multiple NUMA domains per socket (a consequence of its chiplet layout and the NPS, NUMA-per-socket, BIOS modes), even single-socket access can be non-uniform.

For latency-sensitive work:

  • First-touch allocation: Linux physically backs a page on the node whose CPU first touches (writes) it, not when malloc returns. Initializing an array in the master thread before handing it to workers is therefore a NUMA anti-pattern: every page lands on the master's node.
  • numactl --cpunodebind=N --membind=N pins a process and its allocations to specific CPU and memory nodes.
  • libnuma's numa_alloc_local and the mbind(2) syscall give fine-grained, per-allocation control.
  • Automatic NUMA balancing: the Linux kernel periodically migrates pages and threads to co-locate them. This can hurt latency-sensitive workloads; disable it with sysctl kernel.numa_balancing=0.
  • perf stat -e node-load-misses,node-loads and numastat surface NUMA misses.

Modern cache coherence (MESI/MOESI variants) extends across NUMA nodes, but the coherence traffic is expensive: a cache line that bounces between sockets on every write costs far more than a local miss. Avoiding unnecessary cross-node sharing is often the highest-leverage optimization on multi-socket servers.
