io_uring vs epoll vs blocking read on Linux 6.18 LTS — a fair benchmark
At what request rate, batch size, and connection count does io_uring beat epoll for a real network read workload on Linux 6.18 LTS — and does it hold across kernels, NUMA topologies, and SQPOLL modes?
- hardware
- TBD — target: Intel Xeon Platinum 8480+ (Sapphire Rapids, 56C) and AMD EPYC 9654 (Genoa, 96C) for cross-uarch validation
- kernel
- Linux 6.18 LTS, vanilla; microcode pinned and recorded
- compiler
- clang 19, -O3 -march=native
- dataset
- TCP echo workload, connection counts {1, 64, 1024, 8192}, message sizes {64B, 1KB, 16KB}
TL;DR
io_uring’s advantage over epoll is real but workload-shaped: at small batch sizes on ping-pong traffic, epoll can match or beat it; the crossover shows up once you push queue depth past the per-syscall overhead floor, and SQPOLL + registered buffers widen the gap on latency rather than raw throughput. This post sets the methodology bar for every subsequent lowlat.ms benchmark.
Methodology
| Field | Value |
|---|---|
| CPU | TBD — target Sapphire Rapids (8480+) and Genoa (EPYC 9654) |
| Microcode | pinned, recorded per run |
| Kernel | Linux 6.18 LTS, vanilla |
| Compiler | clang 19, -O3 -march=native |
| Governor | performance, turbo on, SMT on (recorded both ways) |
| Hugepages | explicit on, mTHP off (control variable) |
| NUMA | pinned to one node; cross-node run as sensitivity check |
| Dataset | TCP echo, conn counts {1, 64, 1024, 8192}, msg sizes {64B, 1KB, 16KB} |
| Repro | lowlat-ms/lowlat-bench/io-uring-vs-epoll |
The question
- Restate the single empirical question: at what QD, connection count, and message size does io_uring beat epoll on Linux 6.18 LTS?
- Note the three implementations under test: blocking `read()`, `epoll` + non-blocking sockets, `io_uring` with multishot recv.
- Optional fourth variant: `io_uring` + SQPOLL + registered fixed buffers.
Introduction
- Why this is the first lowlat.ms post: most 2025–2026 coverage is opinion, not measurement.
- Brief history: `read()` → `epoll` (2.5.44) → `io_uring` (5.1) → multishot recv (6.0).
- What “fair” means here: same workload, same hardware, same kernel, three event loops.
- Pointer to liburing issues #189 and #536 as documented cases where io_uring underperforms.
Setup
- Hardware pinning: CPU affinity, IRQ affinity, NIC RSS queues mapped to cores.
- NIC tuning: `ethtool -C` coalescing settings recorded; interrupts pinned.
- Server topology: single-node run + cross-node run to isolate NUMA effects.
- Client: separate machine driving load via a `wrk2`-style open-loop generator (rate-locked, not closed-loop).
Baseline
- Start with the blocking `read()` implementation as the floor.
- Record throughput, p50/p99/p9999 latency, cycles/byte, syscalls/sec.
- Build the `epoll` baseline next; confirm it beats blocking past ~64 connections as expected.
- Sanity-check against public numbers from the liburing repo benchmarks.
Optimizations
- io_uring, plain submit/wait, no SQPOLL: the “naive” variant.
- io_uring + multishot recv: amortize SQE setup across many completions.
- io_uring + SQPOLL + registered fixed buffers + registered files: the “throw everything at it” variant.
- Sweep queue depth {1, 4, 16, 64, 256}; the crossover point is the headline number.
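For reference, the "throw everything at it" variant arms roughly like the sketch below: SQPOLL so submissions need no syscall, one multishot recv SQE that keeps posting completions, and a kernel-selected buffer group. This is a trimmed illustration assuming liburing >= 2.4 and a >= 6.0 kernel; error handling and the completion loop are omitted, and it is not the repo's actual setup code:

```c
#include <liburing.h>

/* Arm a socket for multishot recv on an SQPOLL ring.
 * bgid names a provided-buffer group registered elsewhere
 * (e.g. via io_uring_setup_buf_ring). */
int setup_multishot(struct io_uring *ring, int sockfd, unsigned bgid)
{
    struct io_uring_params p = {
        .flags = IORING_SETUP_SQPOLL,
        .sq_thread_idle = 2000, /* ms before the SQ thread naps */
    };
    if (io_uring_queue_init_params(256, ring, &p) < 0)
        return -1;

    /* One SQE arms recv for the lifetime of the socket: each arriving
     * chunk posts a CQE (flagged IORING_CQE_F_MORE) with no resubmit. */
    struct io_uring_sqe *sqe = io_uring_get_sqe(ring);
    io_uring_prep_recv_multishot(sqe, sockfd, NULL, 0, 0);
    sqe->flags |= IOSQE_BUFFER_SELECT; /* kernel picks from the group */
    sqe->buf_group = bgid;
    return io_uring_submit(ring);
}
```

The point of the sketch: in steady state this variant's per-message syscall count approaches zero, which is why we plot cycles/byte rather than syscalls alone.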
Results
- Throughput curves per implementation across QD and connection counts.
- Latency percentiles (p50/p99/p9999) at each QD — io_uring’s tail wins are the surprising part.
- `perf stat` output: cache misses, branch mispredicts, syscall count per byte.
- Efficiency plot: cycles per byte, not just raw RPS. Streaming vs ping-pong split out.
Limitations
- Single workload shape. Streaming vs ping-pong behave differently; don’t overgeneralize.
- Single kernel version. 6.18 LTS results do not necessarily hold on 6.6 or older.
- Single NIC. 100GbE Mellanox vs Broadcom vs Intel has its own story.
- No Linux-native `aio` (`io_submit`; effectively obsolete by 2026); no POSIX AIO (glibc user-space thread pool, not comparable).
Reproducibility
- Full repo: `lowlat-ms/lowlat-bench/io-uring-vs-epoll` with a `justfile` for one-command runs.
- Recorded kernel config, microcode, `ethtool -k`, `numactl --hardware` per run.
- Raw `perf` output included alongside the chart data.
- CI job: nightly re-run on a pinned bare-metal box, diff against baseline.
References
- Jens Axboe, Efficient IO with io_uring — https://kernel.dk/io_uring.pdf
- liburing repository — https://github.com/axboe/liburing
- liburing issues #189, #536 — cases where io_uring underperforms epoll
- “From epoll to io_uring’s Multishot Receives” — https://codemia.io/blog/path/From-epoll-to-iourings-Multishot-Receives—Why-2025-Is-the-Year-We-Finally-Kill-the-Event-Loop
- Linux kernel source: `io_uring/` and `fs/eventpoll.c`
- Aleksey Shipilëv, Nanotrusting the Nanotime — https://shipilev.net/blog/2014/nanotrusting-nanotime/ (methodology reference)