HN Debrief

We found a bug in the hyper HTTP library

  • Programming
  • Infrastructure
  • Open Source
  • Developer Tools

Cloudflare’s post walks through a six-week hunt for a bug in hyper, the widely used Rust HTTP library. Under a narrow set of HTTP/1 conditions, hyper could decide a response was finished before all buffered bytes had actually been flushed to the socket. If the client was slow enough to fill the kernel send buffer, `poll_flush` returned `Poll::Pending`, that result got discarded, and hyper shut down the connection early. The visible symptom was rare truncated responses. The eventual fix was tiny. The path to finding it was not.
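The mechanism described above can be sketched with a mock connection. This is a simplified illustration, not hyper's actual code: `MockConn`, `finish_buggy`, and `finish_fixed` are hypothetical names, and this `poll_flush` omits the `Context` argument that real async I/O traits take.

```rust
use std::task::Poll;

/// Hypothetical stand-in for a buffered connection; not hyper's real types.
struct MockConn {
    buffered: Vec<u8>,  // bytes the application wrote that are not yet sent
    socket_room: usize, // free space in the simulated kernel send buffer
    sent: Vec<u8>,      // bytes that actually reached the peer
    closed: bool,
}

impl MockConn {
    fn new(body: &[u8], socket_room: usize) -> Self {
        MockConn { buffered: body.to_vec(), socket_room, sent: Vec::new(), closed: false }
    }

    /// Simplified flush: moves as many bytes as the socket has room for.
    fn poll_flush(&mut self) -> Poll<()> {
        let n = self.buffered.len().min(self.socket_room);
        self.sent.extend(self.buffered.drain(..n));
        self.socket_room -= n;
        if self.buffered.is_empty() { Poll::Ready(()) } else { Poll::Pending }
    }

    /// Simulates the slow client reading, which frees send-buffer space.
    fn client_reads(&mut self, n: usize) {
        self.socket_room += n;
    }

    /// Closes the connection; any bytes still buffered are lost.
    fn shutdown(&mut self) {
        self.closed = true;
        self.buffered.clear();
    }
}

/// Buggy shape: the flush result is discarded, so shutdown runs even on Pending.
fn finish_buggy(conn: &mut MockConn) -> Poll<()> {
    let _ = conn.poll_flush();
    conn.shutdown();
    Poll::Ready(())
}

/// Fixed shape: Pending propagates to the caller and shutdown waits for Ready.
fn finish_fixed(conn: &mut MockConn) -> Poll<()> {
    match conn.poll_flush() {
        Poll::Ready(()) => {
            conn.shutdown();
            Poll::Ready(())
        }
        Poll::Pending => Poll::Pending,
    }
}
```

With an 11-byte body and room for only 5 bytes, `finish_buggy` closes the connection after sending 5 bytes, while `finish_fixed` reports `Pending` and only closes once a later call has flushed everything.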

If you run async Rust in production, tighten your lint policy now around ignored `must_use` values and `let _ =` bindings. More broadly, treat graceful shutdown and backpressure paths as failure-prone code, because that is where “safe” stacks still lose data.
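As a rough illustration of that lint gap, the sketch below shows a discarded `Poll` that plain `rustc` accepts silently. The helper names (`try_flush`, `discards_result`, `handles_result`) are hypothetical. The `clippy::let_underscore_must_use` attribute only takes effect when the code is checked with `cargo clippy`; a real project would more likely enable the lint crate-wide than per function.

```rust
use std::task::Poll;

// Hypothetical stand-in for any call that returns a #[must_use] value.
fn try_flush(done: bool) -> Poll<()> {
    if done { Poll::Ready(()) } else { Poll::Pending }
}

// rustc's built-in `unused_must_use` lint fires on a bare `try_flush(false);`
// statement but stays silent for `let _ = ...`. Clippy's opt-in
// `let_underscore_must_use` lint covers that gap.
#[warn(clippy::let_underscore_must_use)]
fn discards_result() -> bool {
    let _ = try_flush(false); // Clippy flags this line; plain rustc does not.
    true // Claims success even though the flush is still Pending.
}

// An explicit match keeps the Pending branch visible to reviewers.
fn handles_result() -> bool {
    match try_flush(false) {
        Poll::Ready(()) => true,
        Poll::Pending => false,
    }
}
```

Where discarding a value really is intended, a scoped `#[allow(...)]` with a comment keeps the exception explicit in review.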

Discussion mood

Respectful of the writeup and unsurprised by the bug’s difficulty, with a slightly smug undercurrent about Rust not being magic. The strongest reactions focused on missed lint coverage and on the gap between Rust’s safety story and ordinary logic bugs in async I/O.

Key insights

  1. 01

    Clippy already had a likely tripwire

    Clippy can warn on exactly the pattern that hid this bug. `let_underscore_untyped` and `let_underscore_must_use` would have forced attention onto a `Poll` value being discarded, and `Poll` is already marked `#[must_use]`. The miss was not that Rust lacked any signal. It was that the useful signal lives in opt-in lint policy rather than the default compiler experience.

    Audit your Rust CI settings, not just your compiler version. Turn on stricter Clippy rules for ignored results and make exceptions explicit so reviewers see them.

  2. 02

    The type system never modeled buffer state

    The deeper failure was that the API did not encode enough about what flushing means for buffered data. Once `poll_flush` can return a nonterminal state that still obliges the caller to come back and finish the flush, the programmer is carrying protocol correctness in their head. Lints help, but they are standing in for missing structure in the API.

    When you design async interfaces, look for places where callers must remember invisible state transitions. If an API can leave work half-done, expose that in types or control flow instead of relying on comments and conventions.

      Attribution:
    • Twey #1
    • jerf #1
  3. 03

    This was a state-machine bug first

    Calling this a race condition is not wrong, but it obscures the practical lesson. The bug reads more like an invalid transition in the HTTP connection state machine, where partial flush got treated as completed work and shutdown became legal too early. Framing it that way points attention to transition invariants, not to vague fear about concurrency.

    Review async networking code as a protocol state machine. Check every path that moves from write to flush to close, especially when `Pending` is a normal outcome rather than an error.

      Attribution:
    • tetha #1
    • inexcf #1
    • wongarsu #1
  4. 04

    The repro story still leaves edge cases unclear

    One sharp question was why `curl --http1.1` with `Connection: close` did not seem to trigger the issue more reliably. If closing after the body is common and kernel buffers are finite, the bug sounds like it should show up more often than the writeup implies. That is a useful reminder that production-only failures usually depend on extra conditions the simplified narrative omits.

    Do not stop at the postmortem’s headline condition. For your own systems, keep asking what additional timing, buffering, or body-size constraints are required before you trust a repro model.

      Attribution:
    • nopurpose #1
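Insights 2 and 3 can be sketched together: treat the connection as an explicit state machine and make the close transition require evidence that a flush completed. This is one possible design under stated assumptions, not how hyper is structured; `Conn`, `ConnState`, and the `Flushed` token are hypothetical, and the socket is reduced to a `socket_room` byte count.

```rust
use std::task::Poll;

#[derive(Debug, PartialEq)]
enum ConnState {
    Writing,
    Flushing,
    Closed,
}

struct Conn {
    state: ConnState,
    pending_bytes: usize, // buffered bytes not yet handed to the socket
}

/// Token that only a completed flush produces. In a real module the inner
/// field would be private so callers cannot construct it themselves.
struct Flushed(());

impl Conn {
    /// Sends up to `socket_room` bytes; yields a `Flushed` token only when
    /// nothing is left in the buffer.
    fn poll_flush(&mut self, socket_room: usize) -> Poll<Flushed> {
        self.state = ConnState::Flushing;
        self.pending_bytes -= self.pending_bytes.min(socket_room);
        if self.pending_bytes == 0 {
            Poll::Ready(Flushed(()))
        } else {
            Poll::Pending
        }
    }

    /// Closing requires the token, so a discarded `Pending` cannot reach it.
    fn close(&mut self, _proof: Flushed) {
        self.state = ConnState::Closed;
    }
}
```

A caller that writes `let _ = conn.poll_flush(..)` has no `Flushed` value to pass to `close`, so the invalid transition fails to compile rather than truncating a response. The sketch does not stop a stale token from being used after further writes; a stricter design would have to tie the token to the buffer contents.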

Against the grain

  1. 01

    Response sampling may be harder than it sounds

    The complaint that Cloudflare should have noticed broken responses earlier ran into a practical challenge. Detecting truncation from live traffic without capturing sensitive payloads is not obviously easy at CDN scale, especially when the failure is rare and content can be customer data. That weakens the easy assumption that better observability would have made this trivial.

    If you depend on edge or proxy infrastructure, design privacy-safe integrity signals ahead of time. Byte counts, checksums, and end-to-end validation markers are easier to add before an incident than during one.

      Attribution:
    • ramon156 #1
  2. 02

    Rust may still be a huge step change

    Against the usual “Rust didn’t prevent this, therefore it is overhyped” line, one comment argued that Fred Brooks used “silver bullet” to mean an order-of-magnitude improvement, not perfection. By that standard, Rust could still qualify even while leaving logic bugs like this intact. The point is not that Rust solves everything. It is that reducing one giant class of failures can still be transformative.

    Judge language choices by what categories of bugs and operating cost they shrink in aggregate. Do not let one postmortem erase real gains in memory safety and reliability.

      Attribution:
    • tialaramex #1
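The byte-counting idea from the response-sampling point can be sketched without touching payloads: compare the length a response declared with the number of bytes the server handed to the socket. The `ResponseStats` record and `classify` function below are hypothetical, and the sketch assumes the proxy can observe both numbers per response.

```rust
use std::cmp::Ordering;

/// Hypothetical per-response record; no payload bytes are stored.
struct ResponseStats {
    declared_len: u64,  // e.g. taken from the Content-Length header
    bytes_written: u64, // body bytes actually handed to the socket
}

#[derive(Debug, PartialEq)]
enum Integrity {
    Complete,
    Truncated { missing: u64 },
    Overrun { extra: u64 },
}

/// Classifies a response using only its two counters.
fn classify(stats: &ResponseStats) -> Integrity {
    match stats.bytes_written.cmp(&stats.declared_len) {
        Ordering::Equal => Integrity::Complete,
        Ordering::Less => Integrity::Truncated {
            missing: stats.declared_len - stats.bytes_written,
        },
        Ordering::Greater => Integrity::Overrun {
            extra: stats.bytes_written - stats.declared_len,
        },
    }
}
```

A counter of `Truncated` results per route or origin is the kind of signal that can be exported safely. It only shows what the server believes it wrote, so it complements end-to-end checks rather than replacing them.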

In plain English

async
A programming model where operations can pause and resume later without blocking an entire thread.
CDN
Content delivery network, a distributed system of servers used to deliver web content quickly to users.
Clippy
Rust’s linter tool, which adds extra warnings and style or correctness checks beyond the compiler’s defaults.
HTTP/1
Hypertext Transfer Protocol version 1.x, the older, still widely used text-based version of the web protocol, which sends one response at a time over each TCP connection.
hyper
A popular HTTP library in Rust used to build clients and servers.
must_use
A Rust annotation that warns when a returned value should not be ignored because it likely carries important information.
Poll::Pending
A Rust async return state meaning the operation cannot complete yet and must be polled again later, normally after the task has been woken.
poll_flush
An async I/O operation that tries to push buffered output toward the underlying connection and may report that it is not finished yet.
state machine
A model where code moves through a defined set of states and only certain transitions are valid.
