The post walks through an Erlang-style queueing result for a very clean setup: requests arrive independently at random, service times are memoryless, and a load balancer feeds a pool of identical one-at-a-time workers. Under that model, splitting work across more servers gives much better latency near high utilization than many people expect. Readers did not really dispute the math. They disputed how often this model matches production.
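To make the shape of that result concrete, here is a minimal sketch of the standard Erlang C formula, assuming the post's setup corresponds to an M/M/c queue (Poisson arrivals, exponential service times, one shared queue in front of c workers). The function names and the choice to normalize mean service time to 1.0 are illustrative choices made here, not taken from the post.

```python
from math import factorial


def erlang_c(servers: int, offered_load: float) -> float:
    """Probability that an arrival has to wait in an M/M/c queue.

    offered_load is a = arrival_rate / service_rate, in Erlangs.
    """
    if servers < 1:
        raise ValueError("need at least one server")
    if not 0 <= offered_load < servers:
        raise ValueError("unstable: offered load must be below server count")
    rho = offered_load / servers  # per-server utilization
    # Term for "all servers busy", inflated by the queueing factor 1/(1 - rho).
    top = offered_load ** servers / factorial(servers) / (1 - rho)
    # Normalizing sum over the states with at least one idle server.
    bottom = sum(offered_load ** k / factorial(k) for k in range(servers)) + top
    return top / bottom


def mean_wait(servers: int, utilization: float, mean_service_time: float = 1.0) -> float:
    """Mean time spent queued (excluding service) at a given utilization."""
    offered_load = servers * utilization
    p_wait = erlang_c(servers, offered_load)
    # Wq = C / (c*mu - lambda), rewritten in terms of utilization.
    return p_wait * mean_service_time / (servers * (1 - utilization))
```

Comparing `mean_wait(1, 0.9)` with `mean_wait(10, 0.9)` shows the effect the post describes: at the same 90% utilization, the single worker queues for about nine service times on average, while the pool of ten queues for less than one.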
The strongest pushback was that real traffic is rarely independent. Retries, thundering herds, daily cycles, sports events, launches, and bugs create correlated bursts. That turns the problem from neat steady-state economics into peak handling and failure containment. Several people said the practical answer is not just more autoscaling. It is asynchronous designs, intentional load shedding, and feature shedding so the system degrades instead of locking into a retry storm.
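One way to picture load shedding and feature shedding together is a bounded queue with two thresholds. The sketch below is hypothetical: the class name `AdmissionQueue`, the `soft_limit` and `hard_limit` parameters, and the `optional` flag are all invented here for illustration and do not come from the discussion. Optional work (a nonessential feature) is refused once the queue passes the soft limit, and all work is refused at the hard limit, so the backlog stays bounded instead of growing until clients time out and retry.

```python
from collections import deque


class AdmissionQueue:
    """Bounded FIFO queue that sheds optional work before required work."""

    def __init__(self, hard_limit: int, soft_limit: int):
        if not 0 <= soft_limit <= hard_limit:
            raise ValueError("need 0 <= soft_limit <= hard_limit")
        self.hard_limit = hard_limit
        self.soft_limit = soft_limit
        self._items = deque()
        self.shed_optional = 0  # count of optional requests refused
        self.shed_required = 0  # count of required requests refused

    def offer(self, request, optional: bool = False) -> bool:
        """Try to enqueue; return False if the request was shed."""
        limit = self.soft_limit if optional else self.hard_limit
        if len(self._items) >= limit:
            if optional:
                self.shed_optional += 1
            else:
                self.shed_required += 1
            return False
        self._items.append(request)
        return True

    def take(self):
        """Dequeue the oldest request, or None if the queue is empty."""
        return self._items.popleft() if self._items else None
```

A caller that gets `False` back would typically answer immediately with a cheap error or a degraded response. How clients then back off is a separate design decision that this sketch does not cover.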
The other big correction was architectural. Many real cloud load balancers are not maintaining one global queue and dispatching work the instant a worker becomes free. They are often stateless, random, connection-oriented, or layered on top of backends that already have their own queues and concurrency limits. In those systems, the article’s result is directionally interesting but not directly predictive. A few comments also flagged that exponential service times are convenient math, while production workloads often have heavy tails that worsen latency outliers. The practical takeaway was clear: use queueing theory to build intuition, then simulate with real traffic traces and the actual queueing behavior of your stack.
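A small simulation can show why the dispatch policy matters. The sketch below is a toy under stated assumptions, not a model of any particular load balancer: arrivals are synthetic Poisson, mean service time is normalized to 1.0, the "global" policy sends each request to whichever worker frees up first (equivalent to one shared FIFO queue), and the "random" policy picks a worker blindly and queues behind it. The names `simulate`, `exponential_service`, and `pareto_service` are invented here.

```python
import random


def exponential_service(rng: random.Random) -> float:
    """Memoryless service time with mean 1.0."""
    return rng.expovariate(1.0)


def pareto_service(rng: random.Random, shape: float = 2.1) -> float:
    """Heavy-tailed service time rescaled to mean 1.0 (requires shape > 1)."""
    return rng.paretovariate(shape) * (shape - 1) / shape


def simulate(num_workers, utilization, num_requests, policy, service_sampler, seed=0):
    """Return mean and p99 queueing delay for FIFO workers under a policy."""
    if num_requests < 1:
        raise ValueError("need at least one request")
    rng = random.Random(seed)
    arrival_rate = num_workers * utilization  # valid because mean service is 1.0
    free_at = [0.0] * num_workers  # time at which each worker's backlog drains
    now = 0.0
    waits = []
    for _ in range(num_requests):
        now += rng.expovariate(arrival_rate)  # next Poisson arrival
        service = service_sampler(rng)
        if policy == "global":
            # Shared queue: the request runs on the worker that frees up first.
            worker = min(range(num_workers), key=free_at.__getitem__)
        elif policy == "random":
            # Stateless balancer: pick a worker without looking at its backlog.
            worker = rng.randrange(num_workers)
        else:
            raise ValueError(f"unknown policy: {policy}")
        start = max(now, free_at[worker])
        free_at[worker] = start + service
        waits.append(start - now)
    waits.sort()
    return {"mean": sum(waits) / len(waits), "p99": waits[int(0.99 * len(waits))]}


if __name__ == "__main__":
    for policy in ("global", "random"):
        for sampler in (exponential_service, pareto_service):
            result = simulate(10, 0.8, 50_000, policy, sampler, seed=1)
            print(policy, sampler.__name__, result)
```

To move from intuition toward prediction, the synthetic arrival and service draws would be replaced with timestamps and durations from recorded traffic, and the policy branch would mirror what the real balancer and backends actually do, including any per-backend queues or concurrency limits.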