Speculative KV coding: losslessly compressing KV cache by up to ~4×
- AI
- Infrastructure
- Hardware
The post sketches a way to losslessly compress an LLM’s KV cache. Instead of storing every key and value vector directly, a small deterministic predictor produces an approximation of each vector and the system stores only the residual (the difference between the true values and the prediction), which is then entropy-coded. Because the predictor is deterministic, the decoder can rerun it and apply the residual to recover the original values exactly. The pitch is straightforward: KV cache size grows with context length, and for long-context inference that cache can dominate VRAM use and memory bandwidth even after the usual caching tricks.
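To make the predict, residual, entropy-code loop concrete, here is a minimal Python sketch. It is not the post's implementation. The predictor (`predict`, which just guesses that each row repeats the previous one) and the entropy coder (`zlib`) are stand-ins chosen for illustration, and the toy data is synthetic. The residual is taken as an XOR of fp16 bit patterns so the round trip is exact.

```python
import random
import struct
import zlib


def to_bits(x: float) -> int:
    """Return the IEEE 754 half-precision bit pattern of x as a 16-bit int."""
    return struct.unpack("<H", struct.pack("<e", x))[0]


def predict(prev_row):
    """Hypothetical stand-in predictor: guess the row repeats the previous one."""
    return prev_row


def encode(rows):
    """Store only the XOR residual against the prediction, then compress it."""
    prev = [0] * len(rows[0])
    residuals = []
    for row in rows:
        guess = predict(prev)
        residuals.extend(actual ^ g for actual, g in zip(row, guess))
        prev = row
    raw = struct.pack(f"<{len(residuals)}H", *residuals)
    return zlib.compress(raw, 9)  # zlib stands in for a real entropy coder


def decode(blob, width):
    """Rerun the same predictor and XOR the residual back in."""
    raw = zlib.decompress(blob)
    residuals = struct.unpack(f"<{len(raw) // 2}H", raw)
    prev = [0] * width
    rows = []
    for i in range(0, len(residuals), width):
        guess = predict(prev)
        row = [r ^ g for r, g in zip(residuals[i:i + width], guess)]
        rows.append(row)
        prev = row
    return rows


# Synthetic "cache": 32 rows of 64 fp16 values that drift slowly between rows.
random.seed(0)
base = [random.uniform(-1.0, 1.0) for _ in range(64)]
rows = [[to_bits(b + random.gauss(0.0, 0.01)) for b in base] for _ in range(32)]

blob = encode(rows)
restored = decode(blob, width=64)
```

The round trip is bit-exact regardless of how good the predictor is; a better predictor only makes the residual more compressible. How much this saves on real KV tensors depends on the predictor and coder the post actually uses, which this sketch does not attempt to reproduce.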
If you run large-model inference, treat this as a bandwidth and VRAM tradeoff, not a free storage win. It looks most relevant for very large models and long contexts where decode is memory-bound, and much less relevant for small models or setups that can cheaply persist KV elsewhere.
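For a rough sense of scale, the standard KV cache size estimate is 2 (keys and values) × layers × KV heads × head dimension × sequence length × bytes per element. The sketch below plugs in hypothetical model dimensions, chosen only for illustration and not taken from the post, and applies the headline best-case ratio of 4×.

```python
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, bytes_per_elem=2):
    """Size of an uncompressed KV cache for one sequence, in bytes."""
    return 2 * layers * kv_heads * head_dim * seq_len * bytes_per_elem


# Hypothetical configuration: 80 layers, 8 KV heads, head_dim 128, fp16,
# and a 128k-token (131072) context.
full = kv_cache_bytes(layers=80, kv_heads=8, head_dim=128, seq_len=131072)
full_gib = full / 2**30       # 40.0 GiB uncompressed
best_case_gib = full_gib / 4  # 10.0 GiB at the best-case 4x ratio
```

Under these assumptions a single long sequence carries tens of gibibytes of cache, which is the regime where the tradeoff is worth evaluating. With a short context or a small model the same formula yields a cache small enough that the extra decode work is harder to justify.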
- fergusfinn.com