blk-mq
Linux's multi-queue block I/O layer, replacing the legacy single-queue path; uses per-CPU software queues and per-device hardware queues to scale to modern NVMe parallelism.
blk-mq is Linux’s multi-queue block I/O layer, introduced around kernel 3.13 and made the only path when the legacy single-queue layer was removed in kernel 5.0. It was created because NVMe SSDs can process hundreds of thousands of IOPS in parallel across dozens of hardware queues, and the single-queue code path (with its shared request queue lock) was the scaling bottleneck.
Architecturally, blk-mq splits the block layer into two tiers of queues:
- Software queues — one per CPU. An I/O submitted from CPU N lands in CPU N’s software queue without a cross-CPU lock.
- Hardware queues — one (or more) per storage device, mapped to the device’s actual hardware queues. The blk-mq core dispatches requests from each software queue into the hardware queue mapped to the submitting CPU.
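The software-to-hardware queue mapping described above is visible in sysfs: each hardware context directory under a device's mq/ directory contains a cpu_list file naming the CPUs whose software queues feed it. A minimal sketch, assuming at least one blk-mq device is present (device names vary by machine):

```shell
# For every multi-queue block device, show which CPUs feed each
# hardware queue. cpu_list names the CPUs whose per-CPU software
# queues dispatch into that hardware context.
for f in /sys/block/*/mq/*/cpu_list; do
    [ -e "$f" ] || continue   # glob matched nothing: no blk-mq device visible
    printf '%s -> CPUs %s\n' "${f%/cpu_list}" "$(cat "$f")"
done
```

On a typical NVMe device with one hardware queue per core, each cpu_list contains a single CPU; devices with fewer hardware queues than CPUs show several CPUs sharing one hardware context.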
This design eliminates the cross-CPU queue lock from the hot path entirely, which is why modern NVMe can sustain millions of IOPS per device without the kernel being the bottleneck.
Tunables of interest:
- /sys/block/<dev>/queue/scheduler — picks the I/O scheduler (none, mq-deadline, kyber, bfq). none is the right default for NVMe; scheduling overhead dominates at millions of IOPS.
- /sys/block/<dev>/queue/nr_requests — request depth per queue.
- /sys/block/<dev>/queue/io_poll and io_poll_delay — enable polled completions instead of interrupts, lowering tail latency on modern NVMe.
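A quick way to inspect these knobs; the device name nvme0n1 is an assumption, substitute any entry under /sys/block, and note that writing the scheduler requires root:

```shell
#!/bin/sh
# Print the blk-mq tunables listed above for one device.
DEV=${1:-nvme0n1}            # assumed device name; pass yours as argument 1
Q=/sys/block/$DEV/queue
for f in scheduler nr_requests io_poll io_poll_delay; do
    # Not every kernel/device exposes every attribute, so check first.
    [ -e "$Q/$f" ] && printf '%-14s %s\n' "$f:" "$(cat "$Q/$f")"
done
# To switch schedulers (as root): echo none > "$Q/scheduler"
```

The scheduler file shows all available schedulers with the active one in brackets, e.g. "[none] mq-deadline kyber bfq".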
blk-mq is the substrate io_uring sits on for file and block I/O. Understanding the queue mapping (software queue on submit CPU to hardware queue tag) is essential for reasoning about NVMe latency tails.