Task Failed Successfully: Saturating NIC and Disk Bandwidth

Infrastructure
Hardware
Programming

The post is a case study in chasing a hidden systems bottleneck. The author was trying to saturate a NIC while reading from NVMe with io_uring and RDMA. An early fix, switching to READ_FIXED, removed the obvious overhead of pinning pages for each I/O, but the full deployment still stalled at around half of the expected throughput. The write-up walks through the dead ends first: io-wq backlog, request splitting, file descriptor lookup, and CRC work were each ruled out as the limiter. What finally broke the case open was that the workload kept scanning buffers of roughly 1 MiB backed by ordinary 4 KiB pages, which drove a heavy dTLB miss cost. Moving the read arena to hugepages brought throughput close to NIC saturation.
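As a rough illustration of why page size matters here, the arithmetic below counts how many distinct pages (and therefore address translations) a single buffer spans. The 2 MiB hugepage size is an assumption (it is the common x86-64 default; this summary does not state which size the author used), and the function name is hypothetical.

```python
KIB = 1024
MIB = 1024 * KIB

def pages_spanned(buffer_bytes: int, page_bytes: int) -> int:
    """Distinct pages covered by a contiguous, page-aligned buffer."""
    if buffer_bytes <= 0 or page_bytes <= 0:
        raise ValueError("sizes must be positive")
    return -(-buffer_bytes // page_bytes)  # ceiling division

# A 1 MiB buffer on ordinary 4 KiB pages.
small_pages = pages_spanned(1 * MIB, 4 * KIB)

# The same buffer on an assumed 2 MiB hugepage.
huge_pages = pages_spanned(1 * MIB, 2 * MIB)

print(f"4 KiB pages: {small_pages} translations per buffer")
print(f"2 MiB pages: {huge_pages} translation per buffer")
```

Under these assumptions each scan of a buffer needs 256 translations with 4 KiB pages and one with a 2 MiB page, which is the pressure on the dTLB that the hugepage change relieves.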

If a data path looks "CPU-bound" while disks and NICs are both under target, treat virtual memory behavior as a first-class suspect. Add top-down profiling and page-size experiments early, especially for large sequential buffers or zero-copy style pipelines.
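Before running a page-size experiment it helps to confirm what the machine actually offers. The sketch below parses the hugepage fields that Linux reports in /proc/meminfo; the helper name and the sample text are illustrative, not taken from the post.

```python
import re
from pathlib import Path

# Fields reported by Linux in /proc/meminfo for explicit hugepages.
FIELDS = ("HugePages_Total", "HugePages_Free", "Hugepagesize")

def parse_hugepage_info(meminfo_text: str) -> dict:
    """Extract hugepage counters from meminfo-style text.

    HugePages_Total and HugePages_Free are page counts;
    Hugepagesize is reported in kB.
    """
    info = {}
    for line in meminfo_text.splitlines():
        match = re.match(r"(\w+):\s+(\d+)", line)
        if match and match.group(1) in FIELDS:
            info[match.group(1)] = int(match.group(2))
    return info

# Illustrative sample, not measured on any real system.
sample = """MemTotal:       65536000 kB
HugePages_Total:     512
HugePages_Free:      500
Hugepagesize:       2048 kB
"""
sample_info = parse_hugepage_info(sample)
print(sample_info)

# On a Linux host, inspect the live values as well.
meminfo = Path("/proc/meminfo")
if meminfo.exists():
    print(parse_hugepage_info(meminfo.read_text()))
```

If HugePages_Total is zero, no explicit hugepage pool has been reserved, so a hugepage-backed arena would have to rely on transparent hugepages or a reservation made first.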

June 27, 2026
blog.mrcroxx.com

Key insights

Top-down profiling would have exposed VM stalls

Top-down CPU analysis changes the story from "CPU-bound" to "the core is stalled on the memory system and address translation." Looking at IPC first would have shown that the core was not retiring much useful work, which points toward virtual memory costs and TLB pressure rather than checksum code or syscall overhead.

When perf says a workload is burning CPU, do not stop there. Check IPC and stall breakdowns early so you can separate real compute from translation and memory-system overhead.
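A minimal sketch of that first check: compute IPC and the share of cycles spent in page walks from raw counter values. The `instructions` and `cycles` counts are generic perf events; the page-walk cycle counter is an assumption, since its event name varies by CPU. All numbers below are made up for illustration and are not from the post.

```python
def ipc(instructions: int, cycles: int) -> float:
    """Instructions retired per cycle."""
    if cycles <= 0:
        raise ValueError("cycles must be positive")
    return instructions / cycles

def walk_share(walk_cycles: int, cycles: int) -> float:
    """Fraction of all cycles spent walking page tables after TLB misses."""
    if cycles <= 0:
        raise ValueError("cycles must be positive")
    return walk_cycles / cycles

# Hypothetical counter readings for one measurement window.
instructions = 4_000_000_000
cycles = 10_000_000_000
walk_cycles = 3_000_000_000  # assumed CPU-specific page-walk counter

print(f"IPC: {ipc(instructions, cycles):.2f}")
print(f"page-walk share of cycles: {walk_share(walk_cycles, cycles):.0%}")
```

A busy core with low IPC and a large page-walk share is a hint that the time is going to address translation rather than to real compute.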

Attribution:

jeffbee #1

Peer-to-peer DMA could remove another hop

Using P2PDMA would let the NVMe device DMA data straight to the NIC instead of staging it through host memory. That does not answer the post’s immediate TLB problem, but it points to a more aggressive design where the entire data path avoids some memory traffic and CPU involvement, assuming the RDMA and CRC pieces can still work with that layout.

If your product lives on storage-to-network transfer, do not stop at page-size tuning. Check whether your hardware and kernel support direct device-to-device DMA, then evaluate the operational complexity it adds.

Attribution:

nycerrrrrrrrrr #1

Against the grain

Hugepages alone are not proof of AI insight

Guessing hugepages for a high-throughput data pipeline is basic systems folklore, so landing on the right knob does not mean the diagnosis was good. The useful part of debugging is explaining why the change works and ruling out nearby causes, because that is what transfers to the next failure instead of becoming cargo-cult tuning.

Treat AI suggestions like a rough checklist, not a root-cause analysis. Keep the evidence trail, because the explanation is what lets your team reuse the result safely.

Attribution:

ozgrakkurt #1

In plain English

CRC

Cyclic Redundancy Check, an error-detection calculation used to verify data integrity.

dTLB

Data Translation Lookaside Buffer, a small CPU cache that stores recent virtual-to-physical memory address translations for data access.

hugepages

Memory pages much larger than the normal 4 KiB size, used to reduce address-translation overhead.

io-wq

The io_uring worker-thread subsystem used when operations cannot be completed fully asynchronously.

io_uring

A Linux interface for high-performance asynchronous input and output operations.

IPC

Instructions per cycle, a CPU performance metric showing how much useful work the processor retires each clock cycle.

KiB

Kibibyte, a unit of data equal to 1,024 bytes.

NIC

Network interface card, the hardware that connects a machine to a network.

NVMe

Non-Volatile Memory Express, a high-speed storage protocol commonly used for solid-state drives over PCI Express.

P2PDMA

Peer-to-peer Direct Memory Access, a way for one PCI Express device to transfer data directly to another without routing it through main memory.

PCIe

Peripheral Component Interconnect Express, the high-speed bus used to connect devices like NVMe drives and network cards to a computer.

RDMA

Remote Direct Memory Access, a networking method that lets one machine access memory on another with very low CPU overhead.

READ_FIXED

An io_uring mode that uses pre-registered memory buffers so the kernel does not need to pin pages for each read operation.

TLB

Translation Lookaside Buffer, a CPU cache that speeds up virtual memory address translation.
