The post is a build-and-tuning writeup for a home inference machine that pairs an RTX 5080 with an RTX 3090 to run Qwen 3.6 27B Q8 at about 80 tokens per second. The setup matters because 27B-class open models are now crossing from novelty into genuinely usable local coding and agent workflows, especially when you stack quantization with MTP speculative decoding and enough VRAM to keep a large context in play.
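The VRAM point can be made concrete with rough arithmetic. The sketch below is a back-of-envelope estimate, not a measurement from the post. It assumes a GGUF-style Q8_0 quantization at about 8.5 bits per weight, and treats the two cards as a single pool of 16 GiB (RTX 5080) plus 24 GiB (RTX 3090). The function names are hypothetical, and the real split between weights, KV cache, and runtime overhead depends on the backend.

```python
# Back-of-envelope VRAM budget. Every constant here is an assumption
# for illustration, not a figure reported in the post.

GIB = 2**30


def weight_gib(params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of the quantized weights in GiB."""
    return params_billion * 1e9 * bits_per_weight / 8 / GIB


def headroom_gib(total_vram_gib: float, weights_gib: float) -> float:
    """VRAM left over for KV cache, activations, and runtime overhead."""
    return total_vram_gib - weights_gib


if __name__ == "__main__":
    weights = weight_gib(27, 8.5)   # 27B parameters, assumed ~8.5 bits/weight
    total = 16 + 24                 # RTX 5080 + RTX 3090, in GiB
    print(f"weights:  {weights:.1f} GiB")
    print(f"headroom: {headroom_gib(total, weights):.1f} GiB of {total} GiB")
```

Under these assumptions the weights alone exceed either card, which is why the build needs both GPUs, and the remaining headroom is what bounds the usable context length.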
People using similar hardware said the headline number is believable and not even unique. Several reported 60 to 80 tok/s on other budget multi-GPU builds, including old X99 boards and dual 3080 20 GB cards. Others pointed out that the result is highly sensitive to inference settings, model variant, and backend details, so the post reads more like one successful recipe than a general performance law. The most technical comments focused on sampler settings, MTP draft depth, n-gram speculative decoding, and whether peer-to-peer GPU paths are actually enabled.
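One reason draft depth draws so much attention is that its payoff depends heavily on how often drafted tokens are accepted. The sketch below uses the standard simplified model of speculative decoding, in which each drafted token is accepted independently with the same probability. That independence is an assumption, as are the function name and the sample acceptance rates. None of these numbers come from the thread, and the model ignores the cost of producing the drafts.

```python
def expected_tokens_per_pass(accept_rate: float, draft_depth: int) -> float:
    """Expected tokens emitted per verification pass of the main model.

    Simplified model: each of the `draft_depth` drafted tokens is accepted
    independently with probability `accept_rate`, drafting stops at the first
    rejection, and the main model always contributes one token of its own.
    """
    if not 0.0 <= accept_rate <= 1.0:
        raise ValueError("accept_rate must be between 0 and 1")
    if draft_depth < 0:
        raise ValueError("draft_depth must be non-negative")
    if accept_rate == 1.0:
        return float(draft_depth + 1)
    # Closed form of the geometric series 1 + a + a^2 + ... + a^depth
    return (1.0 - accept_rate ** (draft_depth + 1)) / (1.0 - accept_rate)


if __name__ == "__main__":
    # Hypothetical acceptance rates, chosen only to show the trend.
    for rate in (0.6, 0.8):
        row = ", ".join(
            f"depth {d}: {expected_tokens_per_pass(rate, d):.2f}"
            for d in range(1, 6)
        )
        print(f"accept_rate={rate}: {row}")
```

In this model the gain from each extra draft token shrinks geometrically, so deeper drafting only helps when the acceptance rate is high. That fits the commenters' point that throughput is sensitive to settings and workload rather than fixed by the hardware.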
The more interesting signal was about why anyone wants this in the first place. A lot of people said local Qwen is now good enough for routine coding, knowledge-heavy tasks, and tightly scoped agents, even when Claude or ChatGPT still wins on open-ended reasoning. The appeal is not just avoiding API bills. It is stable behavior, visible failure modes, control over context and internals, privacy, and the ability to build workflows that would be awkward through a hosted API. The ceiling is still lower for long reasoning chains and ambitious tasks, and several people flatly said the cloud is cheaper or faster for many workloads. Even so, the center of gravity has moved. Local inference is no longer a toy project for enthusiasts with DGX-class budgets. It is becoming a practical second lane for developers who want an owned, customizable model stack beside the frontier APIs.