Two Qwen3 models on one DGX Spark: the residency math
- AI
- Infrastructure
- Hardware
- Open Source
The post is a hands-on note about running two open-weight Qwen3 models on a single DGX Spark by treating memory as a residency problem rather than a simple sum of model file sizes. The author’s key claim is operational: shared CUDA overhead creates a real floor of about 5 GiB, so the right playbook is to load the larger model first, then size the second one against actual observed residency. The post also reports a failure mode with Qwen3-Next in “thinking” mode: automatic tool choice failed, and parser settings were not the cause. The model reasoned inside `<think>` and never emitted a tool call. Swapping from the Thinking backbone to the Instruct backbone fixed it.
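The residency arithmetic can be sketched roughly as follows. This is a minimal illustration, not the author's script: the function names, the reserve parameter, and every number in the usage note are hypothetical, and it assumes the shared CUDA floor is paid once rather than per model. Only the roughly 5 GiB floor comes from the post.

```python
# Approximate shared CUDA overhead reported in the post (GiB).
SHARED_FLOOR_GIB = 5.0


def planned_residency_gib(model_footprints_gib, shared_floor_gib=SHARED_FLOOR_GIB):
    """Pre-load estimate: the shared floor counted once (an assumption)
    plus each model's expected footprint (weights, KV cache, buffers)."""
    return shared_floor_gib + sum(model_footprints_gib)


def second_model_budget_gib(total_gib, observed_after_first_gib, reserve_gib=0.0):
    """Memory left for the second model, based on usage actually measured
    after the larger model has loaded. Never returns a negative budget."""
    return max(total_gib - observed_after_first_gib - reserve_gib, 0.0)
```

For example, with hypothetical footprints of 40 GiB and 20 GiB, `planned_residency_gib([40.0, 20.0])` gives 65.0 rather than the naive 60. Once the larger model is loaded and, say, 70 GiB is observed in use out of an assumed 120 GiB usable, `second_model_budget_gib(120.0, 70.0, reserve_gib=8.0)` leaves 42.0 GiB for the second model.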
If you are sizing local inference hardware, budget against measured memory residency and framework overhead, not brochure VRAM numbers or target utilization settings. And if your workflow depends on tool calling or coding quality, test the exact model variant and quantization first, because those choices can break behavior long before raw token speed does.
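One way to test a model variant for the tool-calling failure described above is to classify each assistant message before trusting it. The sketch below assumes an OpenAI-style chat message dictionary with `content` and `tool_calls` keys and reasoning wrapped in `<think>` tags inside `content`. The function name and labels are hypothetical, and servers that return reasoning in a separate field would need a different check.

```python
import re

# Matches a complete <think>...</think> block, including newlines.
THINK_BLOCK = re.compile(r"<think>.*?</think>", re.DOTALL)


def classify_message(message):
    """Label an assistant message as 'tool_call', 'think_only', or 'text_only'."""
    if message.get("tool_calls"):
        return "tool_call"
    content = message.get("content") or ""
    # Drop closed reasoning blocks, then anything after an unclosed <think>.
    visible = THINK_BLOCK.sub("", content)
    visible = visible.split("<think>", 1)[0].strip()
    if "<think>" in content and not visible:
        # The model reasoned but produced neither a tool call nor an answer.
        return "think_only"
    return "text_only"
```

Running a handful of prompts that should trigger a tool and counting the `think_only` results gives a quick signal before committing to a backbone or quantization.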
- devashish.me