HN Debrief

Two Qwen3 models on one DGX Spark: the residency math

  • AI
  • Infrastructure
  • Hardware
  • Open Source

The post is a hands-on note about running two open-weight Qwen3 models on a single DGX Spark by treating memory as a residency problem, not a simple sum of model file sizes. The author’s key claim is operational: shared CUDA overhead creates a real floor of about 5 GiB, so the right playbook is to load the larger model first, then size the second one against actual observed residency. The post also reports a failure mode with Qwen3-Next in “thinking” mode. Automatic tool choice did not fail because of parser settings. The model reasoned inside `<think>` and simply never emitted a tool call. Swapping from the Thinking backbone to the Instruct backbone fixed it.

If you are sizing local inference hardware, budget against measured memory residency and framework overhead, not brochure VRAM numbers or target utilization settings. And if your workflow depends on tool calling or coding quality, test the exact model variant and quantization first, because those choices can break behavior long before raw token speed does.

Discussion mood

Cautiously enthusiastic. People are excited that local LLMs are getting genuinely useful, especially for coding, but the mood is grounded by hard limits on speed, memory, quantization quality, and total cost.

Key insights

  1. 01

    Two Spark gains came from the stack

    The jump from roughly 11 to 14 tokens per second on one Spark to 40 to 50 on two was not just “more hardware.” It came from moving to vLLM, using two-way tensor parallelism, and getting multi-token prediction and tuned kernels working together. That turns the headline performance claim into a software story as much as a hardware one. It also explains why early results are uneven. People are using custom builds from Nvidia forum posts, which means support is still immature and benchmarks are sensitive to implementation details.

    Do not compare local inference boxes without pinning the serving stack and model features. If you are evaluating hardware, benchmark llama.cpp, vLLM, tensor parallelism, and speculative or multi-token decoding as a package.

      Attribution:
    • wolttam #1 #2
    • ttsiodras #1
  2. 02

    Quantization is where usefulness dies

    People with plenty of memory said the biggest trap is believing a model that fits is a model that works. Very low quantizations made large models feel worse in practice, especially for programming. The pattern was consistent. Around 2-bit was poor, 4-bit was merely acceptable, and 6-bit or higher was where coding started to hold together. One exception was DwarfStar with DeepSeek V4 Flash at IQ2_XXS, but even there the recommendation stayed the same: prefer a smaller model at Q8-class quality over a giant model crushed to fit.

    Choose your target quality floor before you buy hardware. If your main use case is coding or agents, test smaller higher-precision models first instead of planning around aggressive low-bit quants.

      Attribution:
    • pet_the_bird #1
    • embedding-shape #1
  3. 03

    Most local hardware looks better on paper

    The strongest practical advice was to stop thinking in terms of advertised memory and start thinking in terms of real throughput and concurrency. A 128 GB MacBook Pro or a single Spark can launch interesting models, but that does not mean they will be responsive enough for daily work. Even rented 8x H200 setups were reported as expensive and more concurrency-limited than expected for huge models like GLM 5.2. The point was not that local is dead. It was that many buyer fantasies are built around “can run” instead of “can run well.”

    Before buying, define acceptable latency and concurrency for your workflow and test against those numbers. If a setup only looks good when measured by model size, it is probably the wrong setup.

      Attribution:
    • ericd #1 #2
    • cpburns2009 #1
    • zackify #1
  4. 04

    Open model competition is sorting by runnability

    The emerging market split people described is revealing. DeepSeek V4 Flash is winning attention for speed and efficient serving. GLM 5.2 is seen as stronger on raw capability, but much harder to run because of memory overhead such as KV-cache footprint. Qwen still matters because it has models that fit on laptop-class hardware, even if it is no longer setting the pace at the top end. That reframes the post’s two-model residency trick. Efficient memory use is not an optimization detail. It is becoming a core competitive feature for open models.

    Track model families by deployment profile, not just benchmark score. For product planning, shortlist one “smart” model, one “fast” model, and one laptop-class fallback instead of betting on a single winner.

      Attribution:
    • simonw #1 #2 #3
    • CamperBob2 #1
    • zozbot234 #1
  5. 05

    Tool calling failure was a model choice bug

    The most actionable detail in the whole story is that the broken `tool_choice="auto"` behavior was not a parser or framework problem. The thinking variant internally reasoned and then never emitted the tool call at all. The instruct variant did. That is a sharp reminder that “same family, different backbone” can change agent behavior in ways no serving flag will fix.

    When evaluating models for agents, include tool-call emission as a first-class acceptance test. Do not assume a thinking model is a drop-in upgrade over the instruct version.

      Attribution:
    • barrkel #1
    • devashish86 #1

Against the grain

  1. 01

    Qwen may not stay behind for long

    The claim that Qwen is fading got pushback from people who think the absence of open 3.7 weights is a release strategy issue, not a capability issue. The view here is that Qwen still owns the laptop-friendly end of the market and has enough influence from the 3.6 line that a stronger open release could quickly matter again.

    Do not lock your stack to the current open-model pecking order. If portability and local deployment matter, keep watching Qwen releases even if DeepSeek and GLM have more momentum today.

      Attribution:
    • roger_ #1
    • syhol #1
    • simonw #1
  2. 02

    Local setups can be worth it anyway

    Not everyone framed local inference as a bad economic trade. One commenter running multiple models on a Spark said the point was freedom to tinker, no token-bill anxiety, and control over access, not beating cloud models on raw capability. That argument changes the buying decision from pure ROI to ownership and operational independence.

    If you are evaluating local hardware, separate “best model per dollar” from “best control per dollar.” For some teams, predictable access and zero metered usage are the product requirement.

      Attribution:
    • verdverm #1 #2

In plain english

CUDA
Compute Unified Device Architecture, Nvidia’s software platform for running general-purpose computation on its GPUs.
decode
The stage of inference where the model generates output tokens one by one or in accelerated batches.
DGX Spark
A compact Nvidia AI system for local model inference and development, built around Nvidia GPU hardware.
GiB
Gibibyte, a unit of digital memory equal to 1,024 mebibytes and slightly larger than a gigabyte in binary measurement.
H200
An Nvidia data center GPU used for high-end AI training and inference workloads.
Instruct
A model variant tuned to follow direct instructions and produce usable answers rather than extended internal reasoning.
IQ2_XXS
A very aggressive low-bit quantization format used in llama.cpp-style model deployments to make large models fit into less memory.
KV-cache
Key-value cache, the memory a transformer model stores from previous tokens so it can continue generation efficiently over long contexts.
multi-token prediction
A technique where a model predicts more than one future token at a time, which can be used to speed up decoding in some setups.
open-weight
A model release where the trained weights are provided for others to download and run, though the training data and full process may not be open.
prefill
The stage of inference where the model processes the input prompt and builds its internal state before generating output tokens.
Q8
An 8-bit quantization level commonly used as a higher-quality compressed format for model weights.
Qwen3-Next
A model variant in Alibaba’s Qwen family of open-weight language models.
Strix Halo
A codename for AMD Ryzen AI Max systems that combine CPU, GPU, and large shared memory for local AI workloads.
tensor parallelism
A way to split one model’s computation across multiple GPUs or machines so they can serve it together.
tokens per second
A throughput measure for language models showing how many text tokens they can process or generate per second.
vLLM
An open-source inference server for large language models that focuses on high throughput and efficient memory use.

Reference links

Model runtimes and benchmarks

Papers and technical references

  • DeepSeek-V3 paper
    Quoted for the claim that multi-token prediction modules can be repurposed for speculative decoding during inference.
  • DeepSeek-V4 paper
    Quoted to show that DeepSeek V4 keeps the same multi-token prediction strategy as DeepSeek V3.

Hosted inference and evaluation

  • Exoscale Dedicated Inference
    Suggested as an easy way to test a specific Hugging Face model and quantization through an OpenAI-compatible API before buying hardware.