Surprising Economics of Load-Balanced Systems

Infrastructure
Engineering
Performance

The post walks through a classic load-balancing and queueing result from Erlang-style models. Under a clean setup with Poisson arrivals, exponential service times, and effectively infinite waiting room, spreading work across more identical servers improves mean response time more than many people expect. The point is not that throughput becomes magical. It is that queueing delay falls sharply when load is pooled across many servers, because waiting time grows nonlinearly as any single queue approaches saturation.
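The effect can be reproduced with the standard Erlang C formula for an M/M/c queue. The sketch below is a minimal illustration under exactly the assumptions named above (Poisson arrivals, exponential service, one shared unbounded queue). The function names are hypothetical and are not taken from the post.

```python
def erlang_c(servers, offered_load):
    """Probability that an arrival has to wait in an M/M/c queue.

    offered_load is arrival_rate / service_rate and must be below servers.
    """
    # Erlang B via the usual numerically stable recurrence.
    blocking = 1.0
    for k in range(1, servers + 1):
        blocking = offered_load * blocking / (k + offered_load * blocking)
    utilization = offered_load / servers
    # Standard conversion from Erlang B to Erlang C.
    return blocking / (1.0 - utilization * (1.0 - blocking))


def mean_response_time(servers, arrival_rate, service_rate):
    """Mean time in system (waiting plus service) for an M/M/c queue."""
    offered_load = arrival_rate / service_rate
    if offered_load >= servers:
        raise ValueError("unstable: offered load must be below server count")
    wait_probability = erlang_c(servers, offered_load)
    mean_wait = wait_probability / (servers * service_rate - arrival_rate)
    return mean_wait + 1.0 / service_rate


if __name__ == "__main__":
    # Hold per-server utilization at 80% while the pool grows.
    for servers in (1, 2, 5, 10, 50):
        t = mean_response_time(servers, 0.8 * servers, 1.0)
        print(f"{servers:>3} servers: mean response time {t:.3f}")
```

With per-server utilization held constant, mean response time drops toward the bare service time as the pool grows. That is the whole result, and it holds only as long as the model's assumptions do.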

Use the article as intuition, not sizing guidance. For real systems, test with your own traffic traces, retries, and burst patterns, and decide explicitly where to queue, where to shed load, and where async design is cheaper than overprovisioning.

June 19, 2026
brooker.co.za
Discuss on HN

Discussion mood

Mostly positive on the queueing intuition, but skeptical of the article's framing and real-world applicability. People were comfortable with the math as a teaching model and uncomfortable with how much production complexity it leaves out, especially correlated bursts, retries, seasonality, and queue placement.

Key insights

Correlated bursts dominate real outages

Retries, thundering herds, and event-driven spikes like major sports broadcasts break the independence assumption that makes the Erlang result look clean. That shifts the problem from average-case efficiency to peak survivability, which is why many teams still overprovision synchronous systems or redesign them so clients absorb delay asynchronously.

If your biggest incidents come from synchronized demand, do not size from mean arrival rates. Model retry storms and coordinated spikes directly, then decide whether async workflows can remove pressure before you buy more headroom.
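As a rough first step toward modeling retry pressure, the sketch below estimates how much client retries multiply offered load. It assumes each attempt fails independently with a fixed probability, which is precisely the assumption correlated outages violate, so treat the result as a lower bound on trouble rather than a forecast. The function name and parameters are hypothetical.

```python
def retry_amplification(failure_prob, max_retries):
    """Expected attempts per original request.

    Assumes every attempt fails independently with probability failure_prob
    and that clients retry up to max_retries times. Real failures cluster,
    so the independence assumption is optimistic.
    """
    if not 0.0 <= failure_prob <= 1.0:
        raise ValueError("failure_prob must be between 0 and 1")
    if max_retries < 0:
        raise ValueError("max_retries must be non-negative")
    # Attempt k+1 happens only if the first k attempts all failed.
    return sum(failure_prob ** k for k in range(max_retries + 1))


if __name__ == "__main__":
    base_rate = 1000.0  # requests per second, an illustrative number
    for p in (0.01, 0.2, 0.5, 1.0):
        factor = retry_amplification(p, max_retries=3)
        print(f"failure {p:.2f}: {base_rate * factor:.0f} attempts/s")
```

During a full outage (failure probability 1.0) with three retries, the backend sees four times its normal request rate at the moment it is least able to serve it.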

Attribution:

bijowo1676 #1
crypttales #1

Simulation beats elegant closed forms

Short windows may justify Poisson-style approximations, but production traffic usually has daily cycles, non-stationary behavior, and heavy correlations that the simple model cannot capture. Running simulations with real traces or stronger traffic models gets you answers that are much closer to what your system will actually face, and the tooling cost is now low enough that this should be routine.

Add traffic simulation to capacity planning instead of arguing from abstract queueing results alone. Replay real request patterns before changing fleet size, autoscaling rules, or load-balancer policy.
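A trace replay does not need heavy tooling. The sketch below is a minimal example, assuming you can export arrival timestamps and per-request service durations from your own logs. It models a single shared first-come, first-served queue in front of identical servers, and the function name is hypothetical.

```python
import heapq


def replay_trace(arrival_times, service_times, servers):
    """Replay a request trace through `servers` identical workers.

    arrival_times must be sorted ascending. Requests are served first-come,
    first-served from one shared queue. Returns per-request latency
    (waiting plus service) in the same units as the inputs.
    """
    if servers < 1:
        raise ValueError("need at least one server")
    # Min-heap of the times at which each server next becomes free.
    free_at = [float("-inf")] * servers
    heapq.heapify(free_at)
    latencies = []
    for arrived, service in zip(arrival_times, service_times):
        # The next request in line takes whichever server frees up first.
        start = max(arrived, heapq.heappop(free_at))
        finish = start + service
        heapq.heappush(free_at, finish)
        latencies.append(finish - arrived)
    return latencies


if __name__ == "__main__":
    # A tiny burst: three requests at once, each needing one time unit.
    print(replay_trace([0, 0, 0], [1, 1, 1], servers=2))
    print(replay_trace([0, 0, 0], [1, 1, 1], servers=3))
```

Running the same trace at several fleet sizes gives a latency distribution per size, which can then be compared against what a closed-form model predicts for the same load.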

Attribution:

mjb #1

Queue placement changes the latency tradeoff

The missing practical question is not whether queues exist, but where and how large they should be. In the toy model, the load balancer already holds an infinite queue, so inserting another one just adds waiting. In production, limiting queue length and shedding excess work can cut tail latency even if it lowers completion rate under stress.

Treat queue length as a product decision, not a default. Set explicit bounds and overload behavior so bad bursts fail fast instead of silently stretching user-visible latency.
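The tradeoff can be seen in a small simulation. The sketch below assumes a single server, Poisson arrivals, and exponential service times, and rejects any request that arrives when the waiting room is full. The names and the overload numbers are illustrative assumptions, not measurements from the discussion.

```python
import random
from collections import deque


def simulate_bounded_queue(arrival_rate, service_rate, max_waiting,
                           n_requests, seed=0):
    """Single-server FIFO queue with an optional cap on waiting requests.

    max_waiting=None means unbounded. Returns (latencies, rejected), where
    latencies covers accepted requests only.
    """
    rng = random.Random(seed)
    in_system = deque()  # finish times of accepted, unfinished requests
    now = 0.0
    latencies = []
    rejected = 0
    for _ in range(n_requests):
        now += rng.expovariate(arrival_rate)
        # Drop requests that completed before this arrival.
        while in_system and in_system[0] <= now:
            in_system.popleft()
        # One request is in service; the rest of in_system is waiting.
        if max_waiting is not None and len(in_system) > max_waiting:
            rejected += 1  # shed load instead of queueing
            continue
        start = in_system[-1] if in_system else now
        finish = start + rng.expovariate(service_rate)
        in_system.append(finish)
        latencies.append(finish - now)
    return latencies, rejected


if __name__ == "__main__":
    # 20% overload: arrivals outpace the server.
    for cap in (None, 5):
        lat, rej = simulate_bounded_queue(1.2, 1.0, cap, 20000)
        lat.sort()
        p99 = lat[int(0.99 * (len(lat) - 1))]
        print(f"cap={cap}: rejected={rej}, p99 latency={p99:.1f}")
```

Under sustained overload the unbounded queue accepts everything and latency grows without limit, while the capped queue rejects a share of requests and keeps latency for the rest bounded by a handful of service times.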

Attribution:

megamalloc #1
mjb #1

The surprise is mostly about latency curves

People expect 'more servers means linearly better' when they think in throughput, because throughput does scale roughly that way. The unintuitive part shows up when you graph mean or tail response time, where queueing delay is highly nonlinear near saturation. That is why the result feels surprising even though the capacity story is mundane.

When explaining capacity to non-specialists, show utilization and latency curves together. Throughput charts alone hide the operational risk that appears long before you hit full capacity.
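For a single queue, the shape of the latency curve follows from the textbook M/M/1 result: mean response time equals service time divided by (1 - utilization). The sketch below tabulates throughput and latency side by side under that assumption. The function name is hypothetical.

```python
def mm1_mean_response(utilization, service_time=1.0):
    """Mean response time of an M/M/1 queue at the given utilization."""
    if not 0.0 <= utilization < 1.0:
        raise ValueError("utilization must be in [0, 1)")
    return service_time / (1.0 - utilization)


if __name__ == "__main__":
    print("utilization  throughput  mean response")
    for rho in (0.50, 0.80, 0.90, 0.95, 0.99):
        # Throughput grows linearly with utilization; latency does not.
        print(f"{rho:>11.2f}  {rho:>10.2f}  {mm1_mean_response(rho):>13.1f}")
```

Going from 50% to 90% utilization less than doubles throughput but multiplies mean response time by five, which is the risk a throughput-only chart hides.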

Attribution:

mjb #1
nilsherzig #1
physix #1

Against the grain

The article oversells a mundane point

Some readers thought the post manufactured surprise by contrasting against an implausible wrong answer and by using dramatic framing around a standard queueing intuition. That does not refute the math, but it weakens the piece as technical communication because it spends energy on spectacle instead of on the assumptions and boundaries that actually matter.

Be careful using this post to teach teammates. Pair it with a plain explanation of the model and its limits so people do not come away thinking they learned a general law of distributed systems.

Attribution:

PunchyHamster #1
antonvs #1
bigcat12345678 #1

In plain English

Erlang

In this context, the classic queueing mathematics developed by Agner Krarup Erlang, a pioneer of telephone traffic modeling.

load shedding

Intentionally dropping or rejecting some work when a system is overloaded so the remaining work can still complete quickly.

non-stationary

A process whose statistical behavior changes over time rather than staying constant.

Poisson arrivals

A mathematical model where requests arrive randomly and independently at a steady average rate.

tail latency

The slow end of response times, such as the 95th or 99th percentile, which often matters more to users than the average.

Reference links

Simulation tools

stability-sim.systems
Suggested as a practical way to test queueing behavior with real traffic patterns or stronger models than the blog post uses.