io_uring vs epoll vs Blocking I/O: Latency Analysis
Blocking pipe I/O measured at 940 ns vs epoll at 1,120 ns per operation. How io_uring's ring buffers change the calculus.
- Hardware: Intel Xeon @ 2.60 GHz (x86_64)
- Kernel: Linux 6.1.158 SMP PREEMPT_DYNAMIC
- Runtime: Python 3.13.12 (CPython)
- Dataset: Pipe I/O with 64-byte messages, 25,000+ operations per config
TL;DR
Blocking pipe I/O completes a write-read cycle in 940 ns median on the test system, while epoll adds an extra notification step bringing the median to 1,120 ns—a ~19% overhead per operation. However, epoll’s advantage emerges at scale: its $O(1)$ readiness notification remains flat from 1 to 1,000 monitored file descriptors. io_uring takes this further by eliminating most syscalls entirely through shared ring buffers, with Axboe’s reference benchmarks reporting 1.7M 4K IOPS in polled mode [1].
Problem Statement
Every network server and storage engine must choose an I/O model. The choice determines how many syscalls occur per operation, how the application scales across connections, and the floor on per-operation latency. Linux provides three primary models that represent successive generations of I/O interface design: blocking I/O, epoll, and io_uring. Each makes different tradeoffs between simplicity, syscall overhead, and scalability.
Syscalls are not free. Each user-to-kernel transition requires saving register state, switching privilege levels, and—on systems with Kernel Page Table Isolation (KPTI) enabled as a Meltdown mitigation—flushing and switching page tables [5]. On the test system, even the minimal getpid() syscall costs 455 ns median. For I/O-intensive workloads processing millions of operations per second, the cumulative cost of syscalls becomes a dominant factor.
This post examines the architectural differences between the three models, measures their per-operation costs through pipe I/O benchmarks, and analyzes where each model provides advantages.
Background: Three I/O Models
Blocking I/O
The simplest model. A thread calls read() or write() and blocks until the operation completes. Each I/O operation requires exactly one syscall. To handle multiple connections simultaneously, the application typically spawns one thread per connection.
The model is straightforward to reason about: control flow proceeds linearly. The cost is one syscall per operation, plus the overhead of thread scheduling and context switching when managing many connections. With thousands of connections, thread stack memory consumption and scheduler overhead become bottlenecks.
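The blocking model in miniature, as a minimal sketch using Python's os-level pipe wrappers (the same primitives the benchmarks below use):

```python
import os

# Create a unidirectional pipe: r is the read end, w the write end.
r, w = os.pipe()

msg = b"x" * 64          # 64-byte message, matching the benchmark default
os.write(w, msg)         # one syscall: write(2)
data = os.read(r, 64)    # one syscall: read(2); blocks until data is available
assert data == msg

os.close(r)
os.close(w)
```

Each transfer costs exactly one syscall per side; there is no registration or notification step.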
epoll
Introduced in Linux kernel 2.5.45 (October 2002), epoll replaced the older select(2) and poll(2) interfaces [5]. Where select and poll operate in $O(n)$ time—scanning every monitored file descriptor on each call—epoll uses an in-kernel red-black tree to track registered file descriptors and maintains a ready list, achieving $O(1)$ notification for ready events [2][5].
The epoll API consists of three syscalls:
- epoll_create1(): creates an epoll instance
- epoll_ctl(): adds, modifies, or removes file descriptors from the interest list
- epoll_wait(): blocks until one or more file descriptors become ready
A single thread can monitor thousands of file descriptors. When epoll_wait returns, the application then issues the actual read() or write() syscalls on the ready descriptors. This means epoll requires at minimum two syscalls per I/O event: one epoll_wait plus one data-transfer syscall. When epoll_wait returns multiple ready events, the notification cost is amortized.
epoll supports both level-triggered (default) and edge-triggered (EPOLLET) modes [2]. Edge-triggered mode delivers events only when the state of a file descriptor changes, reducing redundant notifications but requiring the application to drain all available data on each event.
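The same pipe round trip through epoll, sketched with Python's select.epoll wrapper; note the two syscalls on the receive path (epoll_wait plus read):

```python
import os
import select

r, w = os.pipe()

ep = select.epoll()
ep.register(r, select.EPOLLIN)   # epoll_ctl(EPOLL_CTL_ADD): watch the read end

os.write(w, b"ping")             # make the read end ready
events = ep.poll(timeout=1.0)    # epoll_wait(): returns [(fd, eventmask), ...]
assert events and events[0][0] == r

data = os.read(r, 64)            # the actual data-transfer syscall
assert data == b"ping"

ep.close()
os.close(r)
os.close(w)
```

The default registration is level-triggered; passing select.EPOLLIN | select.EPOLLET would switch this descriptor to edge-triggered delivery.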
io_uring
Adopted in Linux kernel 5.1 (2019) and designed by Jens Axboe, io_uring addresses the fundamental limitation shared by all prior Linux I/O interfaces: the requirement for at least one syscall per I/O submission [1][6].
io_uring communicates between user space and the kernel through two shared ring buffers mapped into the application’s address space via mmap() [3][4]:
- Submission Queue (SQ): The application writes Submission Queue Entries (SQEs) describing desired I/O operations.
- Completion Queue (CQ): The kernel writes Completion Queue Events (CQEs) with the results of completed operations.
Because both queues reside in shared memory, the application can submit multiple I/O requests by writing SQEs to the ring buffer and then making a single io_uring_enter() syscall—or, in SQPOLL mode, no syscall at all. In SQPOLL mode, a dedicated kernel thread continuously polls the submission queue for new entries, eliminating the submission syscall entirely [1][7].
The syscall-per-operation cost for each model breaks down as follows:
| Model | Syscalls per I/O Operation | Notes |
|---|---|---|
| Blocking | 1 | read() or write() |
| epoll (single event) | 2 | epoll_wait() + read()/write() |
| epoll (N events batched) | $(N+1)/N$ | Amortized across batch |
| io_uring (batched) | $1/N$ | One io_uring_enter() for N operations |
| io_uring (SQPOLL) | 0 | Kernel thread polls SQ |
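The table can be expressed as a small helper; the formulas are taken directly from the rows above, and the model names are illustrative labels, not an API:

```python
def syscalls_per_op(model: str, batch: int = 1) -> float:
    """Syscalls per I/O operation for each model, per the table above."""
    if model == "blocking":
        return 1.0                    # one read() or write() per operation
    if model == "epoll":
        return (batch + 1) / batch    # epoll_wait() amortized over batch reads
    if model == "io_uring":
        return 1 / batch              # one io_uring_enter() submits batch SQEs
    if model == "io_uring_sqpoll":
        return 0.0                    # kernel thread polls the submission queue
    raise ValueError(model)

# At batch size 64, epoll still pays ~1.016 syscalls per operation,
# while batched io_uring pays ~0.016.
```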
Methodology
- Hardware: Intel Xeon Processor @ 2.60 GHz (x86_64)
- Kernel: Linux 6.1.158, SMP PREEMPT_DYNAMIC
- Runtime: Python 3.13.12 (CPython)
- Timing: time.perf_counter_ns() (nanosecond resolution)
- I/O primitive: Unix pipes (os.pipe())
- Message size: 64 bytes (default), also tested at 256, 1024, 4096, and 16384 bytes
- Repetitions: 30 independent runs per configuration, 5,000 operations per run (150,000 total operations per config for per-run medians; 25,000 aggregate samples for distribution statistics)
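A condensed sketch of the measurement harness (an assumed structure illustrating the method; the post does not include the actual benchmark script):

```python
import os
import statistics
import time

def bench_blocking_pipe(ops: int = 5000, size: int = 64) -> float:
    """Median nanoseconds per write+read cycle through a Unix pipe."""
    r, w = os.pipe()
    msg = b"x" * size
    samples = []
    for _ in range(ops):
        t0 = time.perf_counter_ns()
        os.write(w, msg)
        os.read(r, size)
        samples.append(time.perf_counter_ns() - t0)
    os.close(r)
    os.close(w)
    return statistics.median(samples)

print(f"blocking pipe: {bench_blocking_pipe():.0f} ns median")
```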
The benchmarks use Python’s os.read(), os.write(), and select.epoll() wrappers, which add a constant overhead to every measurement from CPython function call dispatch and object allocation. This overhead affects absolute numbers but does not change the relative comparison between models, as all paths go through the same Python layer before reaching the underlying syscall.
io_uring benchmarks could not be executed directly in the sandbox due to the absence of liburing Python bindings and the restricted syscall environment. io_uring performance data is drawn from cited external sources and clearly distinguished from measured results.
Experiment
Five benchmark configurations were executed:
- Syscall baseline: 100,000 iterations of getpid() via ctypes to measure bare syscall cost.
- Blocking pipe I/O: os.write() + os.read() of 64-byte messages through a Unix pipe. Measures the cost of two syscalls with actual data transfer.
- epoll pipe I/O: Data written to pipe, then epoll.poll() + os.read() timed as the “receive path.” Measures the cost of readiness notification plus data transfer.
- epoll FD scalability: 1 to 1,000 monitored file descriptors with single-event readiness. Measures whether epoll maintains $O(1)$ behavior.
- epoll batching: 1 to 64 pipes made ready simultaneously, measuring amortized per-event cost.
Each configuration was run 30 times with fresh pipe file descriptors per run.
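The syscall baseline can be reproduced with a sketch like the following; note that on libc versions that cached getpid() (glibc before 2.25), this would measure a cached library call rather than a true syscall:

```python
import ctypes
import statistics
import time

libc = ctypes.CDLL(None)   # the process's own libc

def bench_getpid(iters: int = 100_000) -> float:
    """Median nanoseconds per getpid() call issued through ctypes."""
    getpid = libc.getpid
    samples = []
    for _ in range(iters):
        t0 = time.perf_counter_ns()
        getpid()
        samples.append(time.perf_counter_ns() - t0)
    return statistics.median(samples)

print(f"getpid: {bench_getpid():.0f} ns median")
```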
Results
Per-Operation Latency
| Metric | getpid() | Blocking (write+read) | epoll (wait+read) |
|---|---|---|---|
| Median | 455 ns | 940 ns | 1,120 ns |
| Mean | 488 ns | 982 ns | 1,518 ns |
| Stddev | 673 ns | 736 ns | 1,980 ns |
| P99 | 862 ns | 1,354 ns | 3,381 ns |
Blocking I/O shows a 16% lower median (equivalently, epoll’s median is 19% higher) and a 60% lower P99 compared to epoll for single-connection pipe operations. The epoll path’s higher variance (stddev of 1,980 ns vs 736 ns) and elevated P99 indicate that epoll_wait() introduces additional tail latency, consistent with the extra kernel-side ready-list processing.
Single-Connection Throughput
For a single pipe with 64-byte messages:
- Blocking: 981,583 ops/sec (median across 30 runs)
- epoll: 580,817 ops/sec (median across 30 runs)
Blocking I/O achieves 1.69× the throughput of epoll in this single-connection scenario. This result is expected: epoll’s purpose is not to accelerate individual connections but to multiplex many connections on a single thread.
epoll Scalability
| Monitored FDs | epoll_wait + read Median (ns) |
|---|---|
| 1 | 1,210 |
| 10 | 1,232 |
| 100 | 1,295 |
| 500 | 1,343 |
| 1,000 | 1,341 |
The latency increase from 1 to 1,000 monitored file descriptors is 131 ns (10.8%), effectively flat and consistent with epoll’s $O(1)$ readiness notification. This is the property that made epoll the standard for high-connection-count servers, replacing select() and poll() which scale as $O(n)$ in the number of monitored descriptors [5].
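The scalability measurement can be sketched as follows (register N read ends, then make exactly one ready per iteration); pipe counts are kept modest here because each pipe consumes two file descriptors against the process's fd limit:

```python
import os
import select
import statistics
import time

def epoll_latency(n_pipes: int, ops: int = 2000) -> float:
    """Median ns for epoll_wait + read with n_pipes registered, one ready."""
    pipes = [os.pipe() for _ in range(n_pipes)]
    ep = select.epoll()
    for r, _ in pipes:
        ep.register(r, select.EPOLLIN)
    _, w0 = pipes[0]
    msg = b"x" * 64
    samples = []
    for _ in range(ops):
        os.write(w0, msg)                 # make exactly one fd ready
        t0 = time.perf_counter_ns()
        fd, _ = ep.poll()[0]              # epoll_wait()
        os.read(fd, 64)
        samples.append(time.perf_counter_ns() - t0)
    ep.close()
    for r, w in pipes:
        os.close(r)
        os.close(w)
    return statistics.median(samples)

for n in (1, 64, 256):
    print(f"{n:4d} fds: {epoll_latency(n):.0f} ns")
```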
Batching Amortization
| Batch Size | Amortized Cost per Event (ns) | Reduction vs Batch=1 |
|---|---|---|
| 1 | 1,334 | — |
| 4 | 758 | 43% |
| 16 | 607 | 54% |
| 64 | 549 | 59% |
When 64 events are ready per epoll_wait call, the amortized per-event cost drops to 549 ns—a 2.4× reduction from the unbatched case. This demonstrates the principle that io_uring exploits to its logical conclusion: amortizing the syscall cost across as many operations as possible.
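The amortization effect can be reproduced with a sketch that makes all pipes ready before a single epoll_wait call, then divides the elapsed time by the number of events returned:

```python
import os
import select
import statistics
import time

def amortized_ns_per_event(batch: int, rounds: int = 500) -> float:
    """Median amortized ns per event when `batch` pipes are ready at once."""
    pipes = [os.pipe() for _ in range(batch)]
    ep = select.epoll()
    for r, _ in pipes:
        ep.register(r, select.EPOLLIN)
    msg = b"x" * 64
    samples = []
    for _ in range(rounds):
        for _, w in pipes:                # make all `batch` fds ready
            os.write(w, msg)
        t0 = time.perf_counter_ns()
        events = ep.poll()                # one epoll_wait for the whole batch
        for fd, _ in events:
            os.read(fd, 64)
        samples.append((time.perf_counter_ns() - t0) / len(events))
    ep.close()
    for r, w in pipes:
        os.close(r)
        os.close(w)
    return statistics.median(samples)
```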
Syscall Count by Model
(Figure: syscalls per operation across the three models; the data matches the table in the Background section.)
Analysis
Syscall Cost Decomposition
The measured getpid() cost of 455 ns provides a baseline for a minimal syscall that performs no I/O. The blocking write+read pair at 940 ns for two syscalls yields approximately 470 ns per syscall—consistent with the getpid() baseline when accounting for the small amount of pipe buffer management the kernel performs.
The epoll path’s 1,120 ns for epoll_wait + read (two syscalls) shows that epoll_wait is slightly more expensive than a simple read, which is expected given that the kernel must check the ready list and return event information.
Where Blocking Wins
For single-connection or low-connection-count scenarios, blocking I/O delivers the lowest per-operation latency. Applications with a known, small number of connections—such as a database client with a connection pool of 8-16 connections—may achieve better latency with a thread-per-connection model than with epoll, provided thread count remains manageable.
Where epoll Wins
epoll’s advantage materializes with many concurrent connections. A single thread can monitor thousands of file descriptors with near-constant overhead. The $O(1)$ scaling measured here—1,210 ns at 1 FD to 1,341 ns at 1,000 FDs—means that a server handling 10,000 connections pays roughly the same notification cost as one handling 10.
The batching results further demonstrate that under high load (many simultaneous ready events), epoll’s amortized per-event cost approaches the cost of a single read() syscall, as the epoll_wait cost is shared across all ready events.
Where io_uring Wins
io_uring’s architectural advantage is the elimination of per-operation syscalls. In standard mode, a single io_uring_enter() call can submit an arbitrary number of SQEs, reducing the syscall count to $1/N$ per operation [1][3]. In SQPOLL mode, the kernel thread polls the submission queue, and the application writes SQEs to shared memory without any syscall [1][7].
According to Axboe’s io_uring design paper, polled io_uring achieved 1.7M 4K IOPS on a reference system, compared to 608K IOPS for Linux AIO [4]. This 2.8× improvement comes from multiple factors: fewer syscalls, zero-copy submission via shared memory, and polled completion avoiding interrupt overhead.
io_uring also supports operations that epoll cannot batch: accept(), connect(), send(), recv(), fsync(), and many others can all be submitted as SQEs in a single batch [3]. With epoll, each of these requires a separate syscall after readiness notification.
Security Considerations
io_uring’s expanded kernel attack surface is a documented concern. Google’s security team reported that 60% of exploits submitted to their bug bounty program in 2022 targeted io_uring vulnerabilities [6]. As a result, io_uring has been disabled in Android, ChromeOS, Google’s production servers, and Docker’s default seccomp profile [6]. Applications in security-sensitive environments should weigh the performance benefits against this attack surface.
Limitations
- Pipe I/O only. These benchmarks measure Unix pipe latency, not socket or disk I/O. Network I/O involves additional kernel networking stack overhead that would change the absolute numbers, though the relative syscall cost comparison remains applicable.
- Python overhead. CPython adds constant overhead to every measurement (~100-200 ns per function call). The absolute numbers are higher than what a C program would measure. The relative comparisons between models remain valid because the Python overhead is consistent across all paths.
- No io_uring measurements. The sandbox environment lacks liburing bindings and may restrict the io_uring_setup syscall. All io_uring performance data is cited from external sources. A complete comparison would require native C benchmarks with liburing on bare metal.
- Single-threaded. All benchmarks run single-threaded. The blocking I/O model’s real-world performance with thousands of connections would require threads, introducing context-switch overhead not captured here.
- Virtualized environment. The sandbox runs on a virtualized Intel Xeon. Bare-metal systems would show lower absolute latencies and potentially different relative behaviors, particularly for KPTI overhead which interacts with hypervisor-level page table management.
- Message size. The 64-byte default message size is small. Larger messages shift the bottleneck from syscall overhead to data copy time, potentially reducing the relative difference between models.
Conclusion
The three Linux I/O models represent a progression in minimizing syscall overhead. Blocking I/O delivers the lowest per-operation latency (940 ns median) for single-connection workloads by requiring exactly one syscall per data-transfer operation. epoll trades a ~19% per-operation overhead (1,120 ns median) for $O(1)$ scalability across thousands of connections, with amortized per-event cost dropping to 549 ns when 64 events batch per epoll_wait call. io_uring completes the progression by moving I/O submission and completion to shared memory ring buffers, reducing syscalls to near zero in SQPOLL mode.
The choice between these models is not universal. Blocking I/O suits low-connection-count, latency-sensitive applications. epoll remains the standard for high-connection-count servers where thread-per-connection is impractical. io_uring offers the highest theoretical throughput for syscall-dominated workloads, with the caveats of greater API complexity and a larger kernel attack surface.
Further Reading
- Lord of the io_uring — Deep tutorial and reference for io_uring programming
- LWN: The rapid growth of io_uring — Jonathan Corbet’s overview of io_uring’s expanding feature set
- liburing GitHub repository — The recommended user-space library for io_uring
- epoll(7) man page — Authoritative Linux documentation for the epoll API
- io_uring(7) man page — Authoritative Linux documentation for the io_uring API
References
[1] Axboe, Jens. “Efficient IO with io_uring.” kernel.dk, 2019. https://kernel.dk/io_uring.pdf
[2] “epoll(7) — Linux manual page.” man7.org. https://man7.org/linux/man-pages/man7/epoll.7.html
[3] “io_uring(7) — Linux manual page.” man7.org. https://man7.org/linux/man-pages/man7/io_uring.7.html
[4] “What is io_uring?” Lord of the io_uring documentation. https://unixism.net/loti/what_is_io_uring.html
[5] “epoll.” Wikipedia. https://en.wikipedia.org/wiki/Epoll
[6] “io_uring.” Wikipedia. https://en.wikipedia.org/wiki/Io_uring
[7] Corbet, Jonathan. “Ringing in a new asynchronous I/O API.” LWN.net, January 2019. https://lwn.net/Articles/776703/