DSpark: Speculative decoding accelerates LLM inference [pdf]
- AI
- Open Source
- Infrastructure
- Developer Tools
- Economics
DeepSeek posted a paper describing DSpark, an inference system built on speculative decoding. Instead of the full model generating every token one at a time, a smaller draft path proposes likely next tokens and the main model verifies them in batches, keeping the ones it agrees with. The paper claims this removes enough bottlenecks to deliver materially higher throughput in production, and commenters pointed out that DeepSeek says it already replaced its prior MTP-1 setup with DSpark shortly after the V4 preview launch. That deployment is the practical story here, not “speculative decoding exists”: Google published the core idea in 2022. What DeepSeek is showing is an implementation that preserves the speedup at larger scale by improving the drafter and the verification policy, and that ships in released model weights and code.
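For readers who want the draft-then-verify loop in concrete terms, here is a minimal sketch of the generic greedy variant of speculative decoding. It is not DSpark's drafter or verification policy, which this summary does not detail. The two toy "models" and every name in the block (`target_model`, `draft_model`, `speculative_decode`, `k`) are hypothetical stand-ins chosen for illustration. The original formulation also uses a sampling-based acceptance rule; the exact-match rule here is a simplification.

```python
from typing import List, Tuple

# Hypothetical toy models: each maps a token prefix to one "greedy" next token.
# A real system would call a large LLM and a small drafter here.

def target_model(prefix: List[int]) -> int:
    """Stand-in for the expensive main model."""
    return (3 * prefix[-1] + 1) % 17

def draft_model(prefix: List[int]) -> int:
    """Stand-in for the cheap drafter: agrees with the target most of the time."""
    last = prefix[-1]
    if last % 5 == 0:
        return (3 * last + 2) % 17  # deliberately wrong guess
    return (3 * last + 1) % 17

def greedy_decode(prompt: List[int], n_new: int) -> List[int]:
    """Baseline: one target call per generated token."""
    tokens = list(prompt)
    for _ in range(n_new):
        tokens.append(target_model(tokens))
    return tokens

def speculative_decode(prompt: List[int], n_new: int, k: int = 4) -> Tuple[List[int], int]:
    """Greedy speculative decoding. Returns (tokens, number of verification passes)."""
    tokens = list(prompt)
    passes = 0
    while len(tokens) - len(prompt) < n_new:
        # 1. The drafter proposes k tokens autoregressively (cheap).
        draft: List[int] = []
        for _ in range(k):
            draft.append(draft_model(tokens + draft))

        # 2. The target checks every drafted position. A real model scores all
        #    positions in one batched forward pass, so we count this as one pass.
        passes += 1
        accepted: List[int] = []
        for i in range(k):
            expected = target_model(tokens + draft[:i])
            if draft[i] == expected:
                accepted.append(draft[i])
            else:
                accepted.append(expected)  # replace the first mismatch, drop the rest
                break
        else:
            # Every draft token matched, so the same pass yields one extra token.
            accepted.append(target_model(tokens + draft))
        tokens.extend(accepted)
    return tokens[: len(prompt) + n_new], passes

if __name__ == "__main__":
    out, passes = speculative_decode([1], 16, k=4)
    print(out == greedy_decode([1], 16), passes)
```

In this greedy variant the output matches plain target-only decoding token for token, so the saving comes entirely from needing fewer main-model passes. How many fewer depends on how often the drafter agrees with the target, which is why improvements to the drafter and the verification policy matter.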
If you buy or build on LLMs, track inference engineering as closely as model benchmarks. Price, latency, and deployability are now moving through systems tricks like DSpark, and those gains can erase a closed provider’s margin faster than another headline model release.
- github.com
- Discuss on HN