HN Debrief

Epoll vs. io_uring in Linux

The post compares epoll and io_uring through the lens of a small reverse proxy project. It contrasts the older Linux model, in which a program waits with epoll until descriptors are ready, with io_uring’s newer submission and completion rings, where user space and the kernel share queues to cut syscall overhead and support operation chaining. The comments mostly landed on a blunt conclusion: the API choice matters, but it is rarely the first bottleneck. Several people pointed out that a proxy can leave huge performance on the table through cross-core traffic, cache contention, allocator choices, and NIC queue layout long before epoll versus io_uring decides the outcome. That is why raw “CPU went up” or “req/s improved” anecdotes were treated with suspicion unless they also came with latency numbers and behavior at maximum load.

If you run latency-sensitive or very high-throughput Linux services, treat io_uring as a targeted optimization, not an automatic upgrade. Benchmark end-to-end with your actual workload, and check early whether your kernels, containers, and security policies even allow it before you redesign around it.
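
The readiness model in the summary above can be sketched in a few lines. This is a minimal illustration, not code from the post: the function name `echo_once_with_epoll` is hypothetical, and a socketpair stands in for a real client connection. Python's standard library has no io_uring binding, so only the epoll side is shown. Note the two separate steps, one call to learn that the descriptor is ready and a second to do the read; a completion-based design would submit the read itself and later collect its result.

```python
import select
import socket


def echo_once_with_epoll(payload: bytes) -> bytes:
    """Send payload over a socketpair and read it back using epoll readiness."""
    client, server = socket.socketpair()
    server.setblocking(False)
    ep = select.epoll()
    try:
        # Tell the kernel we care about "readable" events on this descriptor.
        ep.register(server.fileno(), select.EPOLLIN)
        client.sendall(payload)

        # Step 1: ask which descriptors are ready (one syscall).
        events = ep.poll(timeout=1.0)

        received = b""
        for fd, mask in events:
            if fd == server.fileno() and mask & select.EPOLLIN:
                # Step 2: a second syscall performs the actual read.
                received += server.recv(65536)
        return received
    finally:
        ep.close()
        client.close()
        server.close()
```

A real proxy would loop over `ep.poll()` and handle many descriptors per wakeup; the sketch keeps a single round trip so the wait-then-read split stays visible.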

Discussion mood

Interested and cautiously positive. People liked the walkthrough and generally agreed io_uring can be faster, but they pushed back on simplistic conclusions and kept steering the conversation toward workload-specific measurement, system architecture, and real deployment constraints like sandboxing and kernel security.

Key insights

  1. CPU and NIC affinity can dwarf API gains

    Aligning threads, listen sockets, and packet flow to specific CPUs can remove cross-core handoffs that kill a proxy’s throughput. The concrete claim was that on large multi-queue NIC systems, receive-side scaling, socket affinity like SO_INCOMING_CPU, and avoiding false sharing can produce order-of-magnitude gains that make epoll versus io_uring look like a second-order choice.

    Before rewriting around io_uring, profile how packets move across cores and NIC queues. On multi-core servers, add CPU pinning and data-layout checks to the benchmark plan, because that may buy more capacity than an API migration.

      Attribution:
    • toast0 #1 #2 #3
    • camkego #1
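
One way to try the affinity idea from this insight is to pin each worker to a CPU and give it its own listen socket. The sketch below is an assumption-laden illustration, not the commenters' code: `make_pinned_listener` is a hypothetical helper, the numeric fallback 49 for `SO_INCOMING_CPU` is the value used on common Linux architectures such as x86-64 and arm64, and the option is treated as best-effort because older kernels may reject it.

```python
import os
import socket

# Assumption: fall back to the common Linux value when this Python build
# does not export the constant.
SO_INCOMING_CPU = getattr(socket, "SO_INCOMING_CPU", 49)


def make_pinned_listener(cpu: int, port: int = 0) -> socket.socket:
    """Pin the calling process to one CPU and open a listener hinted to it."""
    # Keep this worker on a single core so its data stays in that core's cache.
    os.sched_setaffinity(0, {cpu})

    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    # SO_REUSEPORT lets every worker bind its own socket to the same port.
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEPORT, 1)
    try:
        # Best-effort hint associating this socket with the chosen CPU.
        sock.setsockopt(socket.SOL_SOCKET, SO_INCOMING_CPU, cpu)
    except OSError:
        pass  # kernel or sandbox does not support setting the option
    sock.bind(("127.0.0.1", port))
    sock.listen(128)
    return sock
```

In a multi-process proxy, each worker would call this once with a different CPU and the same port. Whether the kernel actually steers connections to the matching listener depends on kernel version and NIC queue configuration, so verify it with measurements rather than assuming it.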
  2. Higher CPU usage is not a regression by itself

    A busier CPU after switching to io_uring can mean the machine is spending less time in kernel overhead and more time doing useful work. The useful yardsticks are throughput, tail latency, and behavior at saturation, not whether htop shows a larger percentage. That correction matters because io_uring’s job is often to trade waiting and syscall churn for more active work.

    When you compare epoll and io_uring, collect p99 latency, max throughput, and system versus user CPU time. Do not treat lower CPU utilization as the win condition for an I/O stack.

      Attribution:
    • vlovich123 #1 #2 #3
    • saghm #1
    • FooBarWidget #1
    • toast0 #1
    • topspin #1
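
The yardsticks named in this insight can be collected with a small harness. This is a rough sketch under stated assumptions: `percentile` and `measure` are hypothetical helpers, the percentile uses the nearest-rank method, and the harness runs a single closed loop inside one process, which is far simpler than a real load generator driving a proxy to saturation.

```python
import math
import os
import time


def percentile(samples, p):
    """Nearest-rank percentile; p is an integer from 1 to 100."""
    ordered = sorted(samples)
    rank = math.ceil(p * len(ordered) / 100)
    return ordered[rank - 1]


def measure(operation, iterations):
    """Run operation repeatedly and report throughput, latency, and CPU split."""
    latencies = []
    cpu_before = os.times()
    wall_start = time.perf_counter()
    for _ in range(iterations):
        t0 = time.perf_counter()
        operation()
        latencies.append(time.perf_counter() - t0)
    wall = time.perf_counter() - wall_start
    cpu_after = os.times()
    return {
        "throughput_per_s": iterations / wall,
        "p50_s": percentile(latencies, 50),
        "p99_s": percentile(latencies, 99),
        # User versus system time shows where the CPU went, not just how much.
        "user_cpu_s": cpu_after.user - cpu_before.user,
        "system_cpu_s": cpu_after.system - cpu_before.system,
    }
```

Comparing two I/O backends then means comparing these reports side by side: a run with higher total CPU but better p99 and throughput, and a smaller share of system time, is not a regression.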
  3. Most async frameworks blunt io_uring’s advantages

    Libraries built around a poll-style event loop often treat io_uring as just another readiness backend. That misses the features that make it interesting, especially linked operations and low-syscall execution paths. The result is a common trap where swapping backends raises complexity and CPU cost without unlocking the design changes required for real gains.

    If your stack hides I/O behind a poll-shaped abstraction, expect limited benefit from flipping on io_uring. Check whether the framework can express chained operations and completion-driven flows before betting on benchmark wins.

      Attribution:
    • Asmod4n #1
    • MathMonkeyMan #1
  4. io_uring is broader than socket multiplexing

    The useful framing is not just “faster epoll.” io_uring can cover non-socket interfaces that have poor or no non-blocking user APIs, and it can express sequences of operations as one pipeline. That makes it attractive for file, storage, device, and mixed I/O paths even when pure network readiness handling sees only small gains.

    Look at io_uring first in code paths that mix network and file or device I/O, or where you need operation chaining. The value is often bigger there than in a clean socket-only event loop.

      Attribution:
    • Cloudef #1
    • lukeh #1
    • kshri24 #1
  5. Security support is ahead of deployment reality

    Per-operation filtering for io_uring now exists, but it is too new to count on in the environments most companies actually run. Enterprise kernels, seccomp policies, and container runtimes still lag, and recent vulnerability history means many platforms will keep blocking io_uring for a while even if the upstream kernel story improves.

    Treat io_uring availability as a deployment dependency, not a code dependency. Verify kernel versions, sandbox policy, and container runtime support before committing to it in a product roadmap.

      Attribution:
    • insanitybit #1 #2
    • Asmod4n #1 #2
    • cyphar #1
    • mort96 #1
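
Checking availability early can be as simple as probing the target environment. The sketch below rests on several assumptions that should be verified for your platform: syscall number 425 for `io_uring_setup` (true on x86-64 and arm64), the `kernel.io_uring_disabled` sysctl (present only on newer kernels), and the errno interpretation in `classify_probe_errno`, which is a heuristic rather than a documented contract. All function names are hypothetical. The probe passes a NULL parameter pointer, so a kernel that permits io_uring is expected to reject the call with an error instead of creating a ring.

```python
import ctypes
import errno
import os

# Assumption: io_uring_setup syscall number on x86-64 and arm64.
SYS_IO_URING_SETUP = 425


def classify_probe_errno(err: int) -> str:
    """Heuristic mapping from the probe's errno to a deployment status."""
    if err in (errno.EFAULT, errno.EINVAL):
        return "available"  # the kernel's io_uring code was reached
    if err == errno.ENOSYS:
        return "unsupported-or-filtered"  # old kernel or sandbox without it
    if err in (errno.EPERM, errno.EACCES):
        return "blocked-by-policy"  # sysctl, seccomp, or similar
    return "unknown"


def probe_io_uring() -> str:
    """Attempt io_uring_setup with invalid arguments and classify the result."""
    libc = ctypes.CDLL(None, use_errno=True)
    result = libc.syscall(
        ctypes.c_long(SYS_IO_URING_SETUP), ctypes.c_long(1), None
    )
    if result >= 0:
        os.close(result)  # unexpected success: release the descriptor
        return "available"
    return classify_probe_errno(ctypes.get_errno())


def read_io_uring_disabled():
    """Return the kernel.io_uring_disabled sysctl value, or None if absent."""
    try:
        with open("/proc/sys/kernel/io_uring_disabled") as f:
            return int(f.read().strip())
    except (OSError, ValueError):
        return None
```

Run the probe inside the same container image and security profile the service will use, since the answer on a developer laptop says little about production. A seccomp policy that kills the process instead of returning an error would terminate the probe, so run it in a throwaway process.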
  6. Busy-poll epoll is a serious low-latency option

    For dedicated proxy boxes, epoll-based busy polling tied to NAPI contexts can push latency down without jumping all the way to DPDK or a full kernel-bypass design. That puts a useful middle ground on the table for teams that need better packet responsiveness but cannot absorb the complexity of user-space networking stacks.

    If your goal is lower network latency rather than a general async rewrite, test epoll busy polling and NAPI-aware worker placement. It may get you close enough without the operational cost of DPDK or AF_XDP.

      Attribution:
    • buybackoff #1
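
The NAPI-aware worker placement mentioned in this insight might look like the sketch below. It is an illustration under assumptions, not the commenter's setup: `incoming_napi_id` and `pick_worker` are hypothetical helpers, the fallback value 56 for `SO_INCOMING_NAPI_ID` applies to common Linux architectures, and the modulo mapping is a simplification of what would normally be an explicit NAPI-ID-to-worker table. Enabling busy polling itself is a separate kernel setting and is not shown here.

```python
import socket

# Assumption: fall back to the common Linux value when this Python build
# does not export the constant.
SO_INCOMING_NAPI_ID = getattr(socket, "SO_INCOMING_NAPI_ID", 56)


def incoming_napi_id(conn):
    """Return the NAPI ID that last delivered packets to conn, or None."""
    try:
        napi_id = conn.getsockopt(socket.SOL_SOCKET, SO_INCOMING_NAPI_ID)
    except OSError:
        return None  # kernel or sandbox lacks busy-poll support
    # Zero means no NAPI context was recorded (for example, loopback traffic).
    return napi_id or None


def pick_worker(napi_id, worker_count, fallback=0):
    """Map a NAPI ID to a worker so each epoll set sees a single NAPI context."""
    if napi_id is None:
        return fallback
    return napi_id % worker_count
```

An acceptor thread would call `incoming_napi_id` on each new connection and hand the socket to the chosen worker, so that every worker's epoll instance busy-polls one receive queue rather than several.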

Against the grain

  1. Fast servers are not defined by the API

    A well-built server can perform well with either readiness multiplexing or completion-based async I/O, because implementation quality dominates in most real systems. The useful cross-platform perspective was that Windows has long had interfaces like Registered I/O, yet Linux servers were hardly uncompetitive before io_uring arrived.

    Do not let the existence of a newer kernel API force a rewrite narrative. If your existing epoll design is sound, demand evidence from your workload before paying the migration cost.

      Attribution:
    • up2isomorphism #1
    • RossBencina #1
    • muststopmyths #1
  2. The real next step may be kernel bypass

    Once you push hard on packets per second, the limiting factor can become the Linux network stack itself rather than epoll or io_uring. At that point, features like GSO and GRO help, but AF_XDP, DPDK, or even FPGA paths become the relevant comparison set if raw performance is the goal.

    If you are already near line-rate networking limits, benchmark against AF_XDP or DPDK instead of assuming io_uring is the endgame. That changes the engineering tradeoff from API design to operational complexity and hardware tuning.

      Attribution:
    • Cloudef #1
    • gafferongames #1
    • inigyou #1

In plain English

AF_XDP
A Linux socket family for high-performance packet processing that can bypass much of the normal network stack.
DPDK
Data Plane Development Kit, a set of user-space libraries and drivers for very high-speed packet processing outside the normal kernel network stack.
epoll
A Linux kernel API that lets a program wait for many file descriptors, such as sockets, to become ready for reading or writing.
false sharing
A performance problem where different CPU cores modify separate data that happens to sit on the same cache line, causing extra cache traffic.
GRO
Generic Receive Offload, a Linux networking feature that combines incoming packets to reduce per-packet processing overhead.
GSO
Generic Segmentation Offload, a Linux networking feature that lets large packets be split efficiently later in the stack or by hardware.
io_uring
A Linux kernel interface that uses shared submission and completion queues so programs can submit I/O work and receive results with less syscall overhead.
NAPI
New API, the Linux mechanism that manages how network drivers switch between interrupt-driven and polling-based packet processing.
NIC
Network interface card, the hardware that connects a machine to a network.
receive-side scaling
A network hardware and driver technique that spreads incoming packets across multiple CPU cores and NIC queues.
Registered I/O
A Windows networking API, often shortened to RIO, designed for high-performance asynchronous socket I/O.
reverse proxy
A server that sits in front of other servers, accepts client requests, and forwards them to backend services.
seccomp
Secure computing mode, a Linux feature that restricts which system calls a process is allowed to make.
SO_INCOMING_CPU
A Linux socket option that gets or sets the CPU affinity of a socket, typically so that a listener runs on the same CPU that services its NIC receive queue.
syscall
A request from a user-space program to the operating system kernel to perform a privileged operation.
tail latency
The slow end of the latency distribution, often measured as p95 or p99 response time rather than the average.

Reference links

Performance tooling and libraries

  • Concurrency Kit
    Suggested as a low-level concurrency library for building a high-performance proxy.
  • mimalloc
    Suggested allocator option for aligned and efficient memory use in networking code.
  • libxdp documentation
    Suggested for adding DDoS protection and more advanced layer 4 packet handling.
