TL;DR

Blocking pipe I/O completes a write-read cycle in 940 ns median on the test system, while epoll adds an extra notification step bringing the median to 1,120 ns—a ~19% overhead per operation. However, epoll’s advantage emerges at scale: its $O(1)$ readiness notification remains flat from 1 to 1,000 monitored file descriptors. io_uring takes this further by eliminating most syscalls entirely through shared ring buffers, with Axboe’s reference benchmarks reporting 1.7M 4K IOPS in polled mode [1].

Problem Statement

Every network server and storage engine must choose an I/O model. The choice determines how many syscalls occur per operation, how the application scales across connections, and the floor on per-operation latency. Linux provides three primary models that represent successive generations of I/O interface design: blocking I/O, epoll, and io_uring. Each makes different tradeoffs between simplicity, syscall overhead, and scalability.

Syscalls are not free. Each user-to-kernel transition requires saving register state, switching privilege levels, and—on systems with Kernel Page Table Isolation (KPTI) enabled as a Meltdown mitigation—flushing and switching page tables [5]. On the test system, even the minimal getpid() syscall costs 455 ns median. For I/O-intensive workloads processing millions of operations per second, the cumulative cost of syscalls becomes a dominant factor.
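The bare-syscall baseline can be measured directly. A minimal sketch of the approach, calling getpid() through ctypes (as the benchmark baseline later in this post does); note that each sample also includes the overhead of two perf_counter_ns() calls, so this is an upper bound rather than a pure syscall cost:

```python
import ctypes
import statistics
import time

# CDLL(None) loads the running process's symbols on Linux, giving access to libc.
libc = ctypes.CDLL(None)

def time_getpid(iterations=100_000):
    """Time individual getpid() syscalls; return the median in nanoseconds."""
    samples = []
    for _ in range(iterations):
        t0 = time.perf_counter_ns()
        libc.getpid()
        t1 = time.perf_counter_ns()
        samples.append(t1 - t0)
    return statistics.median(samples)

print(f"getpid median: {time_getpid()} ns")
```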

This post examines the architectural differences between the three models, measures their per-operation costs through pipe I/O benchmarks, and analyzes where each model provides advantages.

Background: Three I/O Models

Blocking I/O

The simplest model. A thread calls read() or write() and blocks until the operation completes. Each I/O operation requires exactly one syscall. To handle multiple connections simultaneously, the application typically spawns one thread per connection.

The model is straightforward to reason about: control flow proceeds linearly. The cost is one syscall per operation, plus the overhead of thread scheduling and context switching when managing many connections. With thousands of connections, thread stack memory consumption and scheduler overhead become bottlenecks.

epoll

Introduced in Linux kernel 2.5.45 (October 2002), epoll replaced the older select(2) and poll(2) interfaces [5]. Where select and poll operate in $O(n)$ time—scanning every monitored file descriptor on each call—epoll uses an in-kernel red-black tree to track registered file descriptors and maintains a ready list, achieving $O(1)$ notification for ready events [2][5].

The epoll API consists of three syscalls:

  • epoll_create1() — creates an epoll instance
  • epoll_ctl() — adds, modifies, or removes file descriptors from the interest list
  • epoll_wait() — blocks until one or more file descriptors become ready

A single thread can monitor thousands of file descriptors. When epoll_wait returns, the application then issues the actual read() or write() syscalls on the ready descriptors. This means epoll requires at minimum two syscalls per I/O event: one epoll_wait plus one data-transfer syscall. When epoll_wait returns multiple ready events, the notification cost is amortized.

epoll supports both level-triggered (default) and edge-triggered (EPOLLET) modes [2]. Edge-triggered mode delivers events only when the state of a file descriptor changes, reducing redundant notifications but requiring the application to drain all available data on each event.
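The three calls map directly onto Python's select.epoll wrapper, which the benchmarks below also use. A minimal sketch with a pipe as the monitored descriptor:

```python
import os
import select

r, w = os.pipe()
ep = select.epoll()                 # epoll_create1()
ep.register(r, select.EPOLLIN)      # epoll_ctl(EPOLL_CTL_ADD)

os.write(w, b"hello")               # make the read end ready
events = ep.poll(1.0)               # epoll_wait(), timeout in seconds
for fd, mask in events:
    if mask & select.EPOLLIN:
        data = os.read(fd, 4096)    # the actual data-transfer syscall

ep.unregister(r)
ep.close()
os.close(r)
os.close(w)
```

Passing select.EPOLLIN | select.EPOLLET to register() would switch this descriptor to edge-triggered mode.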

io_uring

Merged into Linux in kernel 5.1 (2019) and designed by Jens Axboe, io_uring addresses the fundamental limitation shared by all prior Linux I/O interfaces: the requirement of at least one syscall per I/O submission [1][6].


io_uring communicates between user space and the kernel through two shared ring buffers mapped into the application’s address space via mmap() [3][4]:

  • Submission Queue (SQ): The application writes Submission Queue Entries (SQEs) describing desired I/O operations.
  • Completion Queue (CQ): The kernel writes Completion Queue Events (CQEs) with the results of completed operations.

Because both queues reside in shared memory, the application can submit multiple I/O requests by writing SQEs to the ring buffer and then making a single io_uring_enter() syscall—or, in SQPOLL mode, no syscall at all. In SQPOLL mode, a dedicated kernel thread continuously polls the submission queue for new entries, eliminating the submission syscall entirely [1][7].

The syscall-per-operation cost for each model breaks down as follows:

| Model | Syscalls per I/O Operation | Notes |
|---|---|---|
| Blocking | 1 | read() or write() |
| epoll (single event) | 2 | epoll_wait() + read()/write() |
| epoll (N events batched) | $(N+1)/N$ | Amortized across batch |
| io_uring (batched) | $1/N$ | One io_uring_enter() for N operations |
| io_uring (SQPOLL) | 0 | Kernel thread polls SQ |
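These per-operation counts reduce to simple ratios. As a quick sanity check, a small helper (the model names here are illustrative labels, not any real API):

```python
def syscalls_per_op(model: str, batch: int = 1) -> float:
    """Syscalls per I/O operation under each model, per the table above."""
    if model == "blocking":
        return 1.0                  # one read()/write() per operation
    if model == "epoll":
        return (batch + 1) / batch  # one epoll_wait() amortized over N transfers
    if model == "io_uring":
        return 1 / batch            # one io_uring_enter() submits N SQEs
    if model == "io_uring_sqpoll":
        return 0.0                  # kernel thread polls the SQ; no syscall
    raise ValueError(f"unknown model: {model}")

print(syscalls_per_op("epoll", batch=10))  # 1.1 syscalls per operation
```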

Methodology

  • Hardware: Intel Xeon Processor @ 2.60 GHz (x86_64)
  • Kernel: Linux 6.1.158, SMP PREEMPT_DYNAMIC
  • Runtime: Python 3.13.12 (CPython)
  • Timing: time.perf_counter_ns() (nanosecond resolution)
  • I/O Primitive: Unix pipes (os.pipe())
  • Message Size: 64 bytes (default), also tested at 256, 1024, 4096, and 16384 bytes
  • Repetitions: 30 independent runs per configuration, 5,000 operations per run (150,000 total operations per config for per-run medians; 25,000 aggregate samples for distribution statistics)

The benchmarks use Python’s os.read(), os.write(), and select.epoll() wrappers, which add a constant overhead to every measurement from CPython function call dispatch and object allocation. This overhead affects absolute numbers but does not change the relative comparison between models, as all paths go through the same Python layer before reaching the underlying syscall.

io_uring benchmarks could not be executed directly in the sandbox due to the absence of liburing Python bindings and the restricted syscall environment. io_uring performance data is drawn from cited external sources and clearly distinguished from measured results.

Experiment

Five benchmark configurations were executed:

  1. Syscall baseline: 100,000 iterations of getpid() via ctypes to measure bare syscall cost.
  2. Blocking pipe I/O: os.write() + os.read() of 64-byte messages through a Unix pipe. Measures the cost of two syscalls with actual data transfer.
  3. epoll pipe I/O: Data written to pipe, then epoll.poll() + os.read() timed as the “receive path.” Measures the cost of readiness notification plus data transfer.
  4. epoll FD scalability: 1 to 1,000 monitored file descriptors with single-event readiness. Measures whether epoll maintains $O(1)$ behavior.
  5. epoll batching: 1 to 64 pipes made ready simultaneously, measuring amortized per-event cost.

Each configuration was run 30 times with fresh pipe file descriptors per run.
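Configuration 2 can be sketched roughly as follows; this is a simplified single-run loop, with the warm-up and per-run aggregation details left to the methodology described above:

```python
import os
import statistics
import time

def bench_blocking_pipe(ops=5_000, msg_size=64):
    """Time write()+read() round trips through a Unix pipe; return median ns/op."""
    r, w = os.pipe()
    payload = b"x" * msg_size
    samples = []
    try:
        for _ in range(ops):
            t0 = time.perf_counter_ns()
            os.write(w, payload)    # first syscall: data into the pipe buffer
            os.read(r, msg_size)    # second syscall: data back out
            t1 = time.perf_counter_ns()
            samples.append(t1 - t0)
    finally:
        os.close(r)
        os.close(w)
    return statistics.median(samples)

print(f"blocking pipe I/O: {bench_blocking_pipe()} ns median per write+read")
```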

Results

Per-Operation Latency

Figure: Per-operation latency for blocking vs epoll pipe I/O on 64-byte messages. Blocking I/O avoids the epoll_wait notification step, yielding lower single-connection latency.

| Metric | getpid() | Blocking (write+read) | epoll (wait+read) |
|---|---|---|---|
| Median | 455 ns | 940 ns | 1,120 ns |
| Mean | 488 ns | 982 ns | 1,518 ns |
| Stddev | 673 ns | 736 ns | 1,980 ns |
| P99 | 862 ns | 1,354 ns | 3,381 ns |

Blocking I/O shows a 19% lower median and a 60% lower P99 compared to epoll for single-connection pipe operations. The epoll path’s higher variance (stddev of 1,980 ns vs 736 ns) and elevated P99 indicate that epoll_wait() introduces additional tail latency, consistent with the extra kernel-side ready-list processing.

Single-Connection Throughput

Figure: Single-connection throughput comparison. Blocking I/O achieves 1.69× higher throughput than epoll for single-fd scenarios due to lower per-operation overhead.

For a single pipe with 64-byte messages:

  • Blocking: 981,583 ops/sec (median across 30 runs)
  • epoll: 580,817 ops/sec (median across 30 runs)

Blocking I/O achieves 1.69× the throughput of epoll in this single-connection scenario. This result is expected: epoll’s purpose is not to accelerate individual connections but to multiplex many connections on a single thread.

epoll Scalability

Figure: epoll notification latency remains nearly constant as the number of monitored file descriptors scales from 1 to 1,000, confirming O(1) readiness notification.

| Monitored FDs | epoll_wait + read Median (ns) |
|---|---|
| 1 | 1,210 |
| 10 | 1,232 |
| 100 | 1,295 |
| 500 | 1,343 |
| 1,000 | 1,341 |

The latency increase from 1 to 1,000 monitored file descriptors is 131 ns (10.8%), confirming epoll’s $O(1)$ readiness notification behavior. This is the property that made epoll the standard for high-connection-count servers, replacing select() and poll() which scale as $O(n)$ in the number of monitored descriptors [5].

Batching Amortization

Figure: Amortized per-event cost when multiple events are ready per epoll_wait call. Batching amortizes the syscall overhead, a pattern io_uring exploits further through its ring buffer design.

| Batch Size | Amortized Cost per Event (ns) | Reduction vs Batch=1 |
|---|---|---|
| 1 | 1,334 | |
| 4 | 758 | 43% |
| 16 | 607 | 54% |
| 64 | 549 | 59% |

When 64 events are ready per epoll_wait call, the amortized per-event cost drops to 549 ns—a 2.4× reduction from the unbatched case. This demonstrates the principle that io_uring exploits to its logical conclusion: amortizing the syscall cost across as many operations as possible.
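The batching setup is easy to reproduce: make several pipes ready at once, then collect all of their events with a single epoll_wait call. A minimal sketch:

```python
import os
import select

N = 8
pipes = [os.pipe() for _ in range(N)]
ep = select.epoll()
for r, _ in pipes:
    ep.register(r, select.EPOLLIN)

for _, w in pipes:
    os.write(w, b"x")           # every read end becomes ready simultaneously

# One epoll_wait() syscall returns up to N ready events at once.
events = ep.poll(1.0, N)        # (timeout seconds, maxevents)
print(f"one epoll_wait returned {len(events)} events")

for r, w in pipes:
    os.close(r)
    os.close(w)
ep.close()
```

Dividing the cost of that single ep.poll() call by len(events) gives the amortized per-event figure reported above.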

Syscall Count by Model

Figure: System call count per 100 I/O operations: blocking = 100, epoll (batch 1) = 200, epoll (batch 10) = 110, io_uring (batched) = 1, io_uring (SQPOLL) = 0. io_uring with SQPOLL mode eliminates syscalls entirely by using a kernel-side polling thread.

Analysis

Syscall Cost Decomposition

The measured getpid() cost of 455 ns provides a baseline for a minimal syscall that performs no I/O. The blocking write+read pair at 940 ns for two syscalls yields approximately 470 ns per syscall—consistent with the getpid() baseline when accounting for the small amount of pipe buffer management the kernel performs.

The epoll path’s 1,120 ns for epoll_wait + read (two syscalls) shows that epoll_wait is slightly more expensive than a simple read, which is expected given that the kernel must check the ready list and return event information.

Where Blocking Wins

For single-connection or low-connection-count scenarios, blocking I/O delivers the lowest per-operation latency. Applications with a known, small number of connections—such as a database client with a connection pool of 8-16 connections—may achieve better latency with a thread-per-connection model than with epoll, provided thread count remains manageable.

Where epoll Wins

epoll’s advantage materializes with many concurrent connections. A single thread can monitor thousands of file descriptors with near-constant overhead. The $O(1)$ scaling measured here—1,210 ns at 1 FD to 1,341 ns at 1,000 FDs—means that a server handling 10,000 connections pays roughly the same notification cost as one handling 10.

The batching results further demonstrate that under high load (many simultaneous ready events), epoll’s amortized per-event cost approaches the cost of a single read() syscall, as the epoll_wait cost is shared across all ready events.

Where io_uring Wins

io_uring’s architectural advantage is the elimination of per-operation syscalls. In standard mode, a single io_uring_enter() call can submit an arbitrary number of SQEs, reducing the syscall count to $1/N$ per operation [1][3]. In SQPOLL mode, the kernel thread polls the submission queue, and the application writes SQEs to shared memory without any syscall [1][7].

According to Axboe’s io_uring design paper, polled io_uring achieved 1.7M 4K IOPS on a reference system, compared to 608K IOPS for Linux AIO [1]. This 2.8× improvement comes from multiple factors: fewer syscalls, zero-copy submission via shared memory, and polled completion avoiding interrupt overhead.

io_uring also supports operations that epoll cannot batch: accept(), connect(), send(), recv(), fsync(), and many others can all be submitted as SQEs in a single batch [3]. With epoll, each of these requires a separate syscall after readiness notification.

Security Considerations

io_uring’s expanded kernel attack surface is a documented concern. Google’s security team reported that 60% of exploits submitted to their bug bounty program in 2022 targeted io_uring vulnerabilities [6]. As a result, io_uring has been disabled in Android, ChromeOS, Google’s production servers, and Docker’s default seccomp profile [6]. Applications in security-sensitive environments should weigh the performance benefits against this attack surface.

Limitations

  1. Pipe I/O only. These benchmarks measure Unix pipe latency, not socket or disk I/O. Network I/O involves additional kernel networking stack overhead that would change the absolute numbers, though the relative syscall cost comparison remains applicable.

  2. Python overhead. CPython adds constant overhead to every measurement (~100-200 ns per function call). The absolute numbers are higher than what a C program would measure. The relative comparisons between models remain valid because the Python overhead is consistent across all paths.

  3. No io_uring measurements. The sandbox environment lacks liburing bindings and may restrict the io_uring_setup syscall. All io_uring performance data is cited from external sources. A complete comparison would require native C benchmarks with liburing on bare metal.

  4. Single-threaded. All benchmarks run single-threaded. The blocking I/O model’s real-world performance with thousands of connections would require threads, introducing context-switch overhead not captured here.

  5. Virtualized environment. The sandbox runs on a virtualized Intel Xeon. Bare-metal systems would show lower absolute latencies and potentially different relative behaviors, particularly for KPTI overhead which interacts with hypervisor-level page table management.

  6. Message size. The 64-byte default message size is small. Larger messages shift the bottleneck from syscall overhead to data copy time, potentially reducing the relative difference between models.

Conclusion

The three Linux I/O models represent a progression in minimizing syscall overhead. Blocking I/O delivers the lowest per-operation latency (940 ns median) for single-connection workloads by requiring exactly one syscall per data-transfer operation. epoll trades a ~19% per-operation overhead (1,120 ns median) for $O(1)$ scalability across thousands of connections, with amortized per-event cost dropping to 549 ns when 64 events batch per epoll_wait call. io_uring completes the progression by moving I/O submission and completion to shared memory ring buffers, reducing syscalls to near zero in SQPOLL mode.

The choice between these models is not universal. Blocking I/O suits low-connection-count, latency-sensitive applications. epoll remains the standard for high-connection-count servers where thread-per-connection is impractical. io_uring offers the highest theoretical throughput for syscall-dominated workloads, with the caveats of greater API complexity and a larger kernel attack surface.

References

[1] Axboe, Jens. “Efficient IO with io_uring.” kernel.dk, 2019. https://kernel.dk/io_uring.pdf

[2] “epoll(7) — Linux manual page.” man7.org. https://man7.org/linux/man-pages/man7/epoll.7.html

[3] “io_uring(7) — Linux manual page.” man7.org. https://man7.org/linux/man-pages/man7/io_uring.7.html

[4] “What is io_uring?” Lord of the io_uring documentation. https://unixism.net/loti/what_is_io_uring.html

[5] “epoll.” Wikipedia. https://en.wikipedia.org/wiki/Epoll

[6] “io_uring.” Wikipedia. https://en.wikipedia.org/wiki/Io_uring

[7] Corbet, Jonathan. “Ringing in a new asynchronous I/O API.” LWN.net, January 2019. https://lwn.net/Articles/776703/