TL;DR

io_uring’s advantage over epoll is real but workload-shaped: on ping-pong traffic at small batch sizes, epoll can match or beat it. The crossover appears once you push queue depth past the per-syscall overhead floor, and SQPOLL plus registered buffers widen the gap on latency more than on raw throughput. This post sets the methodology bar for every subsequent lowlat.ms benchmark.

Methodology

CPU: TBD — target Sapphire Rapids (8480+) and Genoa (EPYC 9654)
Microcode: pinned, recorded per run
Kernel: Linux 6.18 LTS, vanilla
Compiler: clang 19, -O3 -march=native
Governor: performance, turbo on, SMT on (recorded both ways)
Hugepages: explicit on, mTHP off (control variable)
NUMA: pinned to one node; cross-node run as sensitivity check
Dataset: TCP echo, conn counts {1, 64, 1024, 8192}, msg sizes {64B, 1KB, 16KB}
Repro: lowlat-ms/lowlat-bench/io-uring-vs-epoll

The question

  • Restate the single empirical question: at what queue depth (QD), connection count, and message size does io_uring beat epoll on Linux 6.18 LTS?
  • Note the three implementations under test: blocking read(), epoll + non-blocking sockets, io_uring with multishot recv.
  • Optional fourth variant: io_uring + SQPOLL + registered fixed buffers.

Introduction

  • Why this is the first lowlat.ms post: most 2025–2026 coverage is opinion, not measurement.
  • Brief history: read() → epoll (2.5.44) → io_uring (5.1) → multishot recv (6.0).
  • What “fair” means here: same workload, same hardware, same kernel, three event loops.
  • Pointer to liburing issues #189 and #536 as documented cases where io_uring underperforms.

Setup

  • Hardware pinning: CPU affinity, IRQ affinity, NIC RSS queues mapped to cores.
  • NIC tuning: ethtool -C coalescing settings recorded; interrupts pinned.
  • Server topology: single-node run + cross-node run to isolate NUMA effects.
  • Client: separate machine driving load via wrk2-style open-loop generator (rate-locked, not closed-loop).

Baseline

  • Start with the blocking read() implementation as the floor.
  • Record throughput, p50/p99/p9999 latency, cycles/byte, syscalls/sec.
  • Build the epoll baseline next; confirm it beats blocking past ~64 connections as expected.
  • Sanity-check against public numbers from the liburing repo benchmarks.

Optimizations

  • io_uring, plain submit/wait, no SQPOLL: the “naive” variant.
  • io_uring + multishot recv: amortize SQE setup across many completions.
  • io_uring + SQPOLL + registered fixed buffers + registered files: the “throw everything at it” variant.
  • Sweep queue depth {1, 4, 16, 64, 256}; the crossover point is the headline number.

Results

  • Throughput curves per implementation across QD and connection counts.
  • Latency percentiles (p50/p99/p9999) at each QD — io_uring’s tail wins are the surprising part.
  • perf stat output: cache misses, branch mispredicts, syscall count per byte.
  • Efficiency plot: cycles per byte, not just raw RPS. Streaming vs ping-pong split out.

Limitations

  • Single workload shape. Streaming vs ping-pong behave differently; don’t overgeneralize.
  • Single kernel version. 6.18 LTS results do not necessarily hold on 6.6 or older.
  • Single NIC. 100GbE Mellanox vs Broadcom vs Intel has its own story.
  • No Linux native aio (io_submit; obsolete by 2026); no POSIX AIO (a glibc user-space thread pool, not comparable).

Reproducibility

  • Full repo: lowlat-ms/lowlat-bench/io-uring-vs-epoll with justfile for one-command runs.
  • Recorded kernel config, microcode, ethtool -k, numactl --hardware per run.
  • Raw perf output included alongside the chart data.
  • CI job: nightly re-run on a pinned bare-metal box, diff against baseline.

References