SIMD

Single Instruction, Multiple Data — CPU instructions that perform the same operation on a vector of values in parallel: SSE, AVX, AVX2, and AVX-512 on x86; NEON and SVE on Arm.

also known as vectorization · vector-instructions

stack cpu

SIMD — Single Instruction, Multiple Data — is a class of CPU instructions that apply one operation to a vector of values in parallel. Instead of issuing four add instructions to add four pairs of integers, a single paddd (or AVX vpaddd, or an Arm NEON add with a .4s arrangement) does all four at once. Modern x86 offers SSE (128-bit vectors, 4x float32), AVX/AVX2 (256-bit, 8x float32), and AVX-512 (512-bit, 16x float32). Arm offers NEON (128-bit) and SVE/SVE2 (scalable, up to 2048-bit).

SIMD’s speedup isn’t just “more lanes”. It also eliminates per-element loop overhead (increment, compare, branch), reduces instruction-cache pressure, and lines up well with hardware memory prefetchers when data is contiguous. For numerical kernels, signal processing, string search, hashing, and vector distance math (as in HNSW), SIMD can deliver 4–16x speedups over scalar code.

Gotchas:

  • Alignment matters: 256-bit vmovaps faults unless the address is 32-byte aligned; unaligned vmovups carries no penalty on recent CPUs when the data happens to be aligned, though loads that split a cache line still cost extra.
  • Downclocking: heavy AVX-512 use triggers license-based frequency throttling on many Intel parts (notably the Skylake-SP era), which can erase the wider vectors' gains; AVX2 is often the sweet spot.
  • Data layout: structure-of-arrays (SoA) feeds SIMD naturally; array-of-structures (AoS) requires gathers or shuffles that can dominate the kernel.
  • Autovectorization is unreliable: LLVM/GCC will vectorize obvious loops but give up on anything with loop-carried dependencies, non-contiguous access, or mixed types. Hand-written intrinsics or libraries like Google's Highway are often needed.

sources