PCIe TLP

Transaction Layer Packet — the unit of communication over a PCI Express link, carrying memory reads/writes, completions, and messages between CPU root complex and devices.

also known as pcie-tlp · transaction-layer-packet

stack bus / dma · storage

A TLP (Transaction Layer Packet) is the unit of communication at the top of the PCI Express protocol stack. Every memory read, memory write, configuration access, and message between the CPU root complex and a PCIe device is encoded as one or more TLPs. Each TLP carries a header (type, address, size, tag, attributes) and an optional data payload; transfers larger than the link’s negotiated maximum payload size (MPS) are split across multiple TLPs, with MPS typically 128–512 bytes on modern servers.
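The split-into-MPS-sized-chunks behavior sets a hard ceiling on link efficiency. A back-of-envelope sketch, where the overhead constants are assumptions (a 4-DW header for 64-bit addressing plus approximate data-link-layer framing; real figures vary with generation, ECRC, and TLP prefixes):

```python
import math

# Assumed per-TLP overhead; actual values depend on header format,
# ECRC, and link generation.
TLP_HEADER = 16     # bytes: 4-DW memory-write header (64-bit address)
LINK_OVERHEAD = 8   # bytes: sequence number + LCRC + framing (approx.)

def wire_bytes(transfer: int, mps: int) -> int:
    """Total bytes on the wire to move `transfer` payload bytes as write TLPs."""
    tlps = math.ceil(transfer / mps)
    return transfer + tlps * (TLP_HEADER + LINK_OVERHEAD)

def efficiency(transfer: int, mps: int) -> float:
    return transfer / wire_bytes(transfer, mps)

for mps in (128, 256, 512):
    print(f"MPS {mps}: {efficiency(4096, mps):.1%}")
# MPS 128: 84.2%   MPS 256: 91.4%   MPS 512: 95.5%
```

Doubling MPS roughly halves the per-transfer header tax, which is why the negotiated value matters below.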

Performance-relevant TLP facts:

  • MPS negotiation: the effective MPS for a path is capped by the smallest value supported by any device along it, and firmware/OS programs each device accordingly. The 128-byte default hurts effective throughput; performance-sensitive servers tune MPS to 256 or 512 bytes.
  • Maximum read request size (MRRS): separate from MPS, it governs the largest memory read a device may issue. A bigger MRRS means fewer round-trips on DMA reads.
  • Completion splitting: a single read request may be serviced by multiple completion TLPs, split at the Read Completion Boundary (RCB, typically 64 or 128 bytes), each with its own header overhead. Effective bandwidth is therefore always below wire rate.
  • Per-TLP overhead is roughly 20–30 bytes of header plus framing, which caps small-payload efficiency. A PCIe Gen5 x4 NVMe SSD has ~16 GB/s of raw link bandwidth per direction, yet the fastest drives deliver ~14 GB/s sequential because of TLP overhead and queue-depth effects.

Why it shows up in lowlat.ms territory:

  • NVMe latency: each 4 KB read is a sequence of TLPs — a doorbell write, a command fetch by the controller, DMA bursts for the data, a completion queue entry, and an interrupt. Understanding that structure explains why small random reads are the hardest case.
  • GPUDirect Storage / NVMe-oF: peer-to-peer DMA bypasses the CPU and lets one device move data directly to or from another device’s memory over PCIe.
  • PCIe switch topology affects latency: each switch hop adds TLP traversal cost, typically on the order of ~100 ns per hop.
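The NVMe point can be made concrete by counting the TLPs behind a single read. This is a deliberately simplified model (a hypothetical helper; it ignores PRP/SGL list fetches, doorbell batching, and interrupt coalescing):

```python
import math

def nvme_read_tlps(io_bytes: int, mps: int, rcb: int = 64) -> int:
    """Rough TLP count for one NVMe read — a simplified model only."""
    doorbell  = 1                          # host MWr to the SQ tail doorbell
    cmd_fetch = 1 + math.ceil(64 / rcb)    # device MRd + completions for a 64 B command
    data_dma  = math.ceil(io_bytes / mps)  # device MWr bursts into host memory
    cqe_write = 1                          # 16 B completion queue entry
    interrupt = 1                          # MSI-X MWr
    return doorbell + cmd_fetch + data_dma + cqe_write + interrupt

print(nvme_read_tlps(4096, 256))  # 21 TLPs for one 4 KB read at MPS 256
```

Even in this optimistic model, a 4 KB read costs ~20 TLPs, most of which are fixed overhead that does not shrink with I/O size — which is why small random reads are dominated by per-transaction cost rather than bandwidth.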

sources