RTX 5080 and RTX 3090 Setup: 80 Tok/s on Qwen 3.6 27B Q8

Discussion mood

Positive and excited, with a strong tinkerer vibe. People were impressed that commodity and even older consumer GPUs can now deliver useful local inference speeds, while staying blunt that cloud models still win on convenience, open-ended capability, and often total cost.

Key insights

01
Smaller local models fail in cleaner ways

For routine coding work, Qwen's misses are often easier to spot and recover from than Claude's. It tends to produce plainer code and more obvious dead ends, while frontier models can keep digging into clever but fragile solutions that create cleanup work. That makes a weaker local model surprisingly attractive when you care more about maintainability and bounded damage than peak intelligence.

Evaluate models on failure cleanup time, not just benchmark wins. For agentic coding, a model that is easier to interrupt and correct can outperform a smarter model in total developer time.
- sieste #1
- porridgeraisin #1
- freakynit #1
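
As a rough illustration of scoring on cleanup time, the sketch below computes expected minutes per task as review time plus failure rate times cleanup time. Every number in it is hypothetical and chosen only to show how a model that fails more often can still cost less developer time when its failures are cheap to fix.

```python
def expected_minutes_per_task(review_minutes: float,
                              failure_rate: float,
                              cleanup_minutes: float) -> float:
    """Average developer time per task, counting cleanup only when the model fails."""
    return review_minutes + failure_rate * cleanup_minutes


# Hypothetical figures, not measurements from the discussion.
# "smart" fails less often but leaves expensive cleanup; "plain" fails more but cheaply.
smart = expected_minutes_per_task(review_minutes=5.0, failure_rate=0.15, cleanup_minutes=60.0)
plain = expected_minutes_per_task(review_minutes=5.0, failure_rate=0.30, cleanup_minutes=10.0)

print(f"smart model: {smart:.1f} min/task")
print(f"plain model: {plain:.1f} min/task")
```

The point of the formula is that failure rate and cleanup cost multiply, so either one alone is a poor ranking signal.
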
02
Knowledge-backed local agents are already useful

The strongest use case was not general chat. It was a local agent with a lot of persistent task context, product history, and live inventory data that can act inside a constrained workflow. Once the answer is mostly in the context window, Qwen becomes competitive enough to handle real household automation that larger hosted assistants still have not nailed cleanly.

If you have a domain with structured context and repeatable actions, build the retrieval and tool layer before chasing a larger model. That is where local inference starts to look like a product, not a demo.
- eurekin #1
03
The speed number depends on finicky tuning

Several highly specific corrections landed on the same point. The reported throughput is not just about the two GPUs. It depends on exact sampling parameters, how MTP is configured, whether n-gram speculation is added, which Qwen variant you use, and whether the machine actually has GPU peer-to-peer paths working. Small config changes can move speed, context length, and stability enough that copying the hardware alone will not reproduce the result.

Treat local inference posts as reproducible experiments, not shopping lists. Capture backend version, flags, quant, and topology alongside the hardware or your team will waste days chasing phantom performance gaps.
- DiabloD3 #1
- iMil #1 #2
- cybertim #1 #2
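
One way to act on this is to save a small manifest next to every throughput number. The sketch below is a minimal example; the field names, the backend name, the version string, and the flag are all hypothetical placeholders, and only the model, quant, GPUs, and 80 tok/s figure come from the post title.

```python
import json
import platform
from dataclasses import asdict, dataclass, field
from typing import List, Optional


@dataclass
class RunManifest:
    """Settings needed to rerun one local-inference measurement."""
    backend: str
    backend_version: str
    model: str
    quant: str
    flags: List[str] = field(default_factory=list)
    gpus: List[str] = field(default_factory=list)
    p2p_enabled: Optional[bool] = None        # None means "not checked"
    tokens_per_second: Optional[float] = None
    host: str = field(default_factory=platform.platform)

    def to_json(self) -> str:
        return json.dumps(asdict(self), indent=2, sort_keys=True)


manifest = RunManifest(
    backend="example-backend",        # hypothetical placeholder
    backend_version="0.0.0",          # hypothetical placeholder
    model="Qwen 3.6 27B",
    quant="Q8",
    flags=["--example-flag"],         # hypothetical placeholder
    gpus=["RTX 5080", "RTX 3090"],
    tokens_per_second=80.0,
)
print(manifest.to_json())
```

Keeping `p2p_enabled` as a three-state value makes it visible when nobody verified the peer-to-peer path, which is one of the variables the commenters flagged.
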
04
Local inference is being bought as strategic control

The case for owning hardware was framed less as immediate ROI and more as insurance. People want predictable access, privacy, full control of sampler behavior and model internals, and a way to keep working if API pricing, terms, or regulation shift. Hosted inference is still the default for refinement and convenience, but many now want a local lane so they are not fully dependent on rented intelligence.

If AI is becoming part of your core workflow, design for supplier optionality now. A modest local stack can be worth it even if it loses on pure economics today.
- deng #1
- redfloatplane #1
- medfield #1
- alexhans #1
- PeterStuer #1
- Der_Einzige #1
05
Dense 27B and MoE serve different hardware niches

Comments made a useful distinction that the post itself mostly glosses over. Dense Qwen 27B uses all of its weights for every token, so it is demanding to run and rewards strong GPU memory bandwidth and careful tuning. MoE variants like Qwen 35B A3B can be much easier to run on weaker or mixed hardware because only part of the model activates for each token, which is why people reported decent speeds on ordinary desktops, Apple Silicon laptops, and AMD setups that would struggle with dense 27B at the same quality target.

Choose the model architecture for your hardware first. If you are trying to make local inference work on constrained or non-NVIDIA machines, start with MoE models before optimizing dense ones.
- ydj #1
- DiabloD3 #1
- stared #1
- alexjplant #1
- mappu #1
- ThunderSizzle #1
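
A back-of-envelope way to see the dense versus MoE gap is to assume single-stream decoding is limited by how many weight bytes must be read per token. The sketch below makes several loud assumptions: roughly 1 byte per weight at Q8, about 3 billion active parameters for the A3B variant (inferred from its name), and a hypothetical round bandwidth of 900 GB/s. It ignores KV cache reads, compute limits, and multi-GPU overhead.

```python
def decode_ceiling_tok_s(active_params_billion: float,
                         bytes_per_weight: float,
                         bandwidth_gb_s: float) -> float:
    """Upper bound on decode speed if every active weight is read once per token."""
    bytes_per_token = active_params_billion * 1e9 * bytes_per_weight
    return bandwidth_gb_s * 1e9 / bytes_per_token


BANDWIDTH_GB_S = 900.0  # hypothetical round figure, not a measured value

dense = decode_ceiling_tok_s(27.0, 1.0, BANDWIDTH_GB_S)  # all 27B weights per token
moe = decode_ceiling_tok_s(3.0, 1.0, BANDWIDTH_GB_S)     # assumed ~3B active per token

print(f"dense 27B ceiling:      {dense:.0f} tok/s")
print(f"MoE ~3B active ceiling: {moe:.0f} tok/s")
```

Techniques such as MTP and speculative decoding can beat this per-token ceiling because one pass over the weights may confirm several tokens, so treat the result as a sanity check and not a prediction.
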

Against the grain

01
Hosted Qwen can feel much slower than top APIs

Using Qwen through OpenRouter and DeepInfra still produced 60-second waits for a full answer in one report, while Claude Haiku or Gemini Flash-class models finished in a few seconds. That cuts against the idea that open models are automatically the practical choice once the weights exist. The experience depends heavily on who is serving them and how aggressively their serving stack is tuned.

Benchmark the actual provider, not the model family name. If your users care about response completion time, compare served open models against the fastest proprietary APIs before committing.
- neals #1
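
A provider comparison can be as simple as timing full responses for the same prompt. The sketch below is a minimal harness; `fake_provider` is a hypothetical stand-in that just sleeps, and it would need to be replaced with a real client call for each provider under test.

```python
import statistics
import time
from typing import Callable, List


def completion_times(call: Callable[[str], str], prompt: str, runs: int = 5) -> List[float]:
    """Wall-clock seconds for each full response from one provider."""
    times = []
    for _ in range(runs):
        start = time.perf_counter()
        call(prompt)  # blocks until the whole answer has arrived
        times.append(time.perf_counter() - start)
    return times


def fake_provider(prompt: str) -> str:
    # Hypothetical stand-in; swap in a real API call per provider.
    time.sleep(0.01)
    return "ok"


times = completion_times(fake_provider, "Explain MoE in one paragraph.", runs=3)
print(f"median {statistics.median(times):.3f}s, worst {max(times):.3f}s")
```

Reporting the worst run alongside the median matters here, since the complaint in the discussion was about occasional long waits for a complete answer.
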
02
Cloud economics still beat hobby rigs for many users

For straightforward token consumption, paying a few dollars per million tokens can be cheaper and much simpler than buying multiple GPUs, power supplies, and dealing with heat and noise. One reply also pushed back on exaggerated OpenRouter complaints, noting that many limits come from the underlying provider rather than the router itself. If you do not need deep inference controls or privacy, local ownership is easy to over-romanticize.

Do the math on your real monthly usage before pitching on-prem inference internally. Many teams should keep local setups for experimentation and fallback, not as the default production path.
- deng #1
- jubilanti #1
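
The math in question can be sketched as a break-even calculation. All prices, token volumes, and power costs below are hypothetical inputs, and the model deliberately ignores resale value, setup time, and quality differences between models.

```python
def breakeven_months(hardware_cost: float,
                     monthly_tokens_millions: float,
                     api_price_per_million: float,
                     monthly_power_cost: float) -> float:
    """Months until owned hardware is cheaper than the API; inf if it never is."""
    monthly_api_bill = monthly_tokens_millions * api_price_per_million
    monthly_saving = monthly_api_bill - monthly_power_cost
    if monthly_saving <= 0:
        return float("inf")
    return hardware_cost / monthly_saving


# Hypothetical scenarios: same rig and API price, different usage.
light = breakeven_months(2000.0, monthly_tokens_millions=20.0,
                         api_price_per_million=3.0, monthly_power_cost=15.0)
heavy = breakeven_months(2000.0, monthly_tokens_millions=100.0,
                         api_price_per_million=3.0, monthly_power_cost=25.0)

print(f"light usage: {light:.1f} months to break even")
print(f"heavy usage: {heavy:.1f} months to break even")
```

With low monthly volume the payback period stretches to years, which matches the argument that many users are better served by the API.
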
03
Power and thermals are a real product constraint

Even when the headline sounds desktop-friendly, two high-end GPUs can pull hundreds of watts, dump serious heat into a room, and demand careful PSU and power-limit choices. Some people reported lower real draw than feared, especially with power caps, but nobody disputed that noise and thermals become part of the operating model once you move beyond a single card.

Budget for facilities, not just silicon. If you want local inference in an office or home environment, include power limits, cooling, and acoustic tradeoffs in the decision from day one.
- well_ackshually #1
- washadjeffmad #1
- iMil #1
- pier25 #1
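
To put the power side into numbers, the sketch below converts average draw into monthly energy and cost. The wattages, hours of use, and electricity price are hypothetical; real draw under a power cap has to be measured on the actual cards.

```python
def monthly_energy_kwh(avg_watts: float, hours_per_day: float, days: int = 30) -> float:
    """Energy used in a month at a given average draw."""
    return avg_watts * hours_per_day * days / 1000.0


def monthly_power_cost(avg_watts: float, hours_per_day: float,
                       price_per_kwh: float, days: int = 30) -> float:
    return monthly_energy_kwh(avg_watts, hours_per_day, days) * price_per_kwh


# Hypothetical figures: two cards at stock limits versus the same cards power-capped.
uncapped = monthly_power_cost(avg_watts=600.0, hours_per_day=8.0, price_per_kwh=0.30)
capped = monthly_power_cost(avg_watts=450.0, hours_per_day=8.0, price_per_kwh=0.30)

print(f"uncapped: {uncapped:.2f} per month")
print(f"capped:   {capped:.2f} per month")
```

Nearly all of that energy ends up as heat in the room, so the same watt figure feeds both the electricity bill and the cooling plan.
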

In plain English

DeepInfra: A cloud provider that serves machine learning models through an API.
MoE: Mixture of Experts, a model design where only some sub-networks are activated for each token.
MTP: Multi-Token Prediction, a technique where a model predicts multiple future tokens to speed up decoding under some conditions.
n-gram speculative decoding: A decoding optimization that proposes likely next token sequences based on repeated token patterns and then verifies them with the main model.
OpenRouter: A service that routes requests across many AI model providers through one API and also publishes model rankings and pricing views.
peer-to-peer GPU: A hardware path that lets GPUs exchange data directly without routing everything through the CPU or system memory.
Q8: An 8-bit quantization level for model weights that preserves more quality than lower-bit formats while still reducing memory use.
Qwen 3.6 27B: A 27-billion-parameter open-weight language model from Qwen, discussed here as a local coding model.
tok/s: Tokens per second, a speed measure for how quickly an AI model generates output.
VRAM: Video random-access memory, the memory attached to a graphics processor. For local inference it holds the model weights and the context cache, so its size limits which models and context lengths fit.
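
To make the n-gram speculative decoding entry concrete, here is a minimal sketch of the idea on integer token IDs. It is an illustration under stated assumptions and not any backend's implementation: `toy_model` is a hypothetical stand-in for the main model, verification is greedy, and real engines check all draft tokens in one batched pass while this loop checks them one at a time for clarity.

```python
from typing import Callable, List, Sequence


def propose_from_ngrams(tokens: Sequence[int], n: int = 2, k: int = 4) -> List[int]:
    """Copy up to k tokens that followed the most recent earlier match of the last n tokens."""
    if len(tokens) < n:
        return []
    tail = list(tokens[-n:])
    # Scan backwards over earlier positions, skipping the tail itself.
    for start in range(len(tokens) - n - 1, -1, -1):
        if list(tokens[start:start + n]) == tail:
            return list(tokens[start + n:start + n + k])
    return []


def accept_prefix(context: Sequence[int], draft: Sequence[int],
                  next_token: Callable[[List[int]], int]) -> List[int]:
    """Keep draft tokens only while they match what the main model would emit."""
    accepted: List[int] = []
    ctx = list(context)
    for tok in draft:
        if next_token(ctx) != tok:
            break
        accepted.append(tok)
        ctx.append(tok)
    return accepted


def toy_model(ctx: List[int]) -> int:
    # Hypothetical main model that cycles 1, 2, 3, 4, 1, 2, ...
    return ctx[-1] % 4 + 1


history = [1, 2, 3, 4, 1, 2]
draft = propose_from_ngrams(history, n=2, k=4)
accepted = accept_prefix(history, draft, toy_model)
print("draft:", draft)
print("accepted:", accepted)
```

The speedup comes from accepted tokens being confirmed without a separate generation step each, which is why the benefit depends on how repetitive the text is.
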

Reference links

Model files and quantized variants

Qwen3.6-27B uncensored heretic v2 GGUF
Suggested as a better abliterated Qwen 3.6 27B variant than the one used in the post.
Qwen3.6-27B-ExCal-EXL3
Recommended as an alternative quantized build that might perform better.
Qwen3.6-27B-ExCal-Micro-EXL3
Recommended as a smaller EXL3 variant that may fit on a single GPU.
Qwen3.6-27B-MTP-exl3
Linked as a draft model for MTP speculative decoding.
Qwen3.6-27B-DFlash-exl3
Linked as another draft model option for faster decoding.

Hardware references

RTX 5080 specs on Flopper
Used to ground discussion of the 5080’s memory and performance characteristics.
RTX 3090 specs on Flopper
Used to ground discussion of the 3090’s memory and bandwidth characteristics.
AliExpress PCIe riser used in the build
The exact riser cable the author said they bought for the setup.

Performance tracking and setup guides

Spark Arena leaderboard
Suggested as a community source for recipes and real-world inference performance data.
Aider with OpenRouter blog post
Referenced as an example of people publishing practical local and hybrid coding workflows.