TL;DR

io_uring’s advantage over epoll is real but workload-shaped: on ping-pong traffic at small batch sizes, epoll can match or beat it. The crossover appears once you push queue depth past the per-syscall overhead floor, and SQPOLL plus registered buffers widen the gap on latency more than on raw throughput. This post sets the methodology bar for every subsequent lowlat.ms benchmark.

Methodology

CPU: TBD — target Sapphire Rapids (8480+) and Genoa (EPYC 9654)
Microcode: pinned, recorded per run
Kernel: Linux 6.18 LTS, vanilla
Compiler: clang 19, -O3 -march=native
Governor: performance, turbo on, SMT on (recorded both ways)
Hugepages: explicit on, mTHP off (control variable)
NUMA: pinned to one node; cross-node run as sensitivity check
Dataset: TCP echo, conn counts {1, 64, 1024, 8192}, msg sizes {64B, 1KB, 16KB}
Repro: lowlat-ms/lowlat-bench/io-uring-vs-epoll

The question

  • Restate the single empirical question: at what queue depth (QD), connection count, and message size does io_uring beat epoll on Linux 6.18 LTS?
  • Note the three implementations under test: blocking read(), epoll + non-blocking sockets, io_uring with multishot recv.
  • Optional fourth variant: io_uring + SQPOLL + registered fixed buffers.

Introduction

  • Why this is the first lowlat.ms post: most 2025–2026 coverage is opinion, not measurement.
  • Brief history: read() → epoll (2.5.44) → io_uring (5.1) → multishot recv (6.0).
  • What “fair” means here: same workload, same hardware, same kernel, three event loops.
  • Pointer to liburing issues #189 and #536 as documented cases where io_uring underperforms.

Setup

  • Hardware pinning: CPU affinity, IRQ affinity, NIC RSS queues mapped to cores.
  • NIC tuning: ethtool -C coalescing settings recorded; interrupts pinned.
  • Server topology: single-node run + cross-node run to isolate NUMA effects.
  • Client: separate machine driving load via wrk2-style open-loop generator (rate-locked, not closed-loop).

Baseline

  • Start with the blocking read() implementation as the floor.
  • Record throughput, p50/p99/p9999 latency, cycles/byte, syscalls/sec.
  • Build the epoll baseline next; confirm it beats blocking past ~64 connections as expected.
  • Sanity-check against public numbers from the liburing repo benchmarks.

Optimizations

  • io_uring, plain submit/wait, no SQPOLL: the “naive” variant.
  • io_uring + multishot recv: amortize SQE setup across many completions.
  • io_uring + SQPOLL + registered fixed buffers + registered files: the “throw everything at it” variant.
  • Sweep queue depth {1, 4, 16, 64, 256}; the crossover point is the headline number.

Results

  • Throughput curves per implementation across QD and connection counts.
  • Latency percentiles (p50/p99/p9999) at each QD — io_uring’s tail wins are the surprising part.
  • perf stat output: cache misses, branch mispredicts, syscall count per byte.
  • Efficiency plot: cycles per byte, not just raw RPS. Streaming vs ping-pong split out.

Limitations

  • Single workload shape. Streaming vs ping-pong behave differently; don’t overgeneralize.
  • Single kernel version. 6.18 LTS results do not necessarily hold on 6.6 or older.
  • Single NIC. 100GbE Mellanox vs Broadcom vs Intel has its own story.
  • No Linux native aio (io_submit; obsolete by 2026); no POSIX AIO (a glibc user-space thread pool, not comparable).

Reproducibility

  • Full repo: lowlat-ms/lowlat-bench/io-uring-vs-epoll with justfile for one-command runs.
  • Recorded kernel config, microcode, ethtool -k, numactl --hardware per run.
  • Raw perf output included alongside the chart data.
  • CI job: nightly re-run on a pinned bare-metal box, diff against baseline.

References