blk-mq

Linux's multi-queue block I/O layer, replacing the legacy single-queue path; uses per-CPU software queues and per-device hardware queues to scale to modern NVMe parallelism.

also known as blk-mq · multi-queue-block · blk_mq

stack kernel · storage

blk-mq is Linux’s multi-queue block I/O layer, introduced in kernel 3.13 and made the only path when the legacy single-queue code was removed in kernel 5.0. It was created because NVMe SSDs can process hundreds of thousands of IOPS in parallel across dozens of hardware queues, and the single-queue code path (with its shared request-queue lock) had become the scaling bottleneck.

Architecturally, blk-mq splits the block layer into two tiers of queues:

  • Software queues — one per CPU. An I/O submitted from CPU N lands in CPU N’s software queue without a cross-CPU lock.
  • Hardware queues — one or more per storage device, mapped to the device’s actual hardware submission queues. The blk-mq core dispatches requests from each software queue into the hardware queue mapped to the submitting CPU.

This design eliminates the shared cross-CPU queue lock from the hot path entirely, which is why modern NVMe devices can sustain millions of IOPS without the kernel being the bottleneck.
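The CPU-to-hardware-queue mapping is visible at runtime under `/sys/block/<dev>/mq/`: each hardware context (hctx) directory has a `cpu_list` file naming the CPUs whose software queues feed it. A minimal inspection sketch, assuming an NVMe namespace named `nvme0n1` (substitute your own device):

```shell
# Sketch: print which CPUs map to each blk-mq hardware queue.
# Assumes the device name nvme0n1; prints nothing if it is absent.
dev=nvme0n1
for hctx in /sys/block/"$dev"/mq/*/; do
  [ -r "$hctx"cpu_list ] || continue
  printf '%s -> CPUs %s\n' "$(basename "$hctx")" "$(cat "$hctx"cpu_list)"
done
```

On a machine with as many NVMe hardware queues as CPUs, each hctx typically maps to exactly one CPU; with fewer hardware queues, several CPUs share one hctx.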

Tunables of interest:

  • /sys/block/<dev>/queue/scheduler — picks the I/O scheduler (none, mq-deadline, kyber, bfq). none is the right default for NVMe; scheduling overhead dominates at millions of IOPS.
  • /sys/block/<dev>/queue/nr_requests — depth per queue.
  • /sys/block/<dev>/queue/io_poll and io_poll_delay — enable polled completions instead of interrupts, lowering tail latency on modern NVMe.
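These tunables are plain sysfs files, so they can be read and set with `cat` and `echo`. A hedged sketch, again assuming `nvme0n1` (writes require root; the bracketed entry marks the active scheduler):

```shell
# Sketch: inspect and tune blk-mq queue settings (assumes nvme0n1).
q=/sys/block/nvme0n1/queue
if [ -r "$q"/scheduler ]; then
  cat "$q"/scheduler          # e.g. "[none] mq-deadline kyber bfq"
  echo none > "$q"/scheduler  # select 'none' for low-overhead NVMe
  cat "$q"/nr_requests        # current per-queue request depth
fi
```

Changes made this way do not persist across reboots; a udev rule is the usual way to make them permanent.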

blk-mq is the substrate io_uring sits on for file and block I/O. Understanding the queue mapping — which software queue a submitting CPU uses, and which hardware queue and tag that maps to — is essential for reasoning about NVMe latency tails.

sources