The paper proposes treating a document’s key-value cache as a tradable artifact. A publisher would precompute the cache for a model, then other agents could buy it, load it, and skip the expensive prefill pass before generating answers. In plain terms, it is pitching a market for precomputed intermediate model state.
Commenters tore into it because the paper stays in the easiest possible setting and presents it like a new architecture. The core idea only works cleanly when the cached text is a shared prefix rooted at the start of the prompt. That is already how prefix caching works in production systems like vLLM, llama.cpp, and commercial APIs. Several commenters pointed out that the paper even “proves” that decoding from a cached prefill matches recomputing that same prefill, which reads less like a result than a restatement of deterministic computation.
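To make the shared-prefix restriction concrete, here is a minimal sketch of a prefix cache. It is a toy under stated assumptions, not the implementation used by vLLM, llama.cpp, or any commercial API: the class name PrefixCache, the block size of 4, the model ids, and the token ids are all hypothetical, and the stored value is a plain integer standing in for real KV tensors.

```python
import hashlib


class PrefixCache:
    """Toy prefix cache keyed by model id plus the exact token prefix."""

    def __init__(self, model_id, block_size=4, store=None):
        self.model_id = model_id
        self.block_size = block_size
        # Maps key -> prefix length; a real system would hold KV tensors here.
        self.store = {} if store is None else store

    def _key(self, tokens):
        # The key covers every token from the start of the prompt, so a
        # block is only reusable when everything before it is identical.
        payload = self.model_id + "|" + ",".join(str(t) for t in tokens)
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()

    def _block_ends(self, tokens):
        return range(self.block_size, len(tokens) + 1, self.block_size)

    def insert(self, tokens):
        # Register one entry per block-aligned prefix of the prompt.
        for end in self._block_ends(tokens):
            self.store[self._key(tokens[:end])] = end

    def reusable_tokens(self, tokens):
        # Walk block by block and stop at the first miss.
        reused = 0
        for end in self._block_ends(tokens):
            if self._key(tokens[:end]) not in self.store:
                break
            reused = end
        return reused


document = list(range(100, 112))  # 12 made-up token ids
cache = PrefixCache("model-a")
cache.insert(document)

# Same document at the start of the prompt: all 12 cached tokens are reused.
hit = cache.reusable_tokens(document + [7, 8, 9])
# One extra token in front shifts every block, so nothing is reused.
shifted = cache.reusable_tokens([1] + document)
# Same store, different model id: the keys never match.
other_model = PrefixCache("model-b", store=cache.store).reusable_tokens(document)
```

In this sketch the cache only pays off when the document sits at the very start of the prompt and the model id matches, which is the setting commenters said the paper never leaves.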
The harder problem is the one the paper mostly sidesteps. KV caches are not portable blobs you can splice anywhere. They depend on token order, the tokens that came before them, positional encoding, and the exact model weights. That means a cache computed for one model is useless for another, and even within one model you cannot generally paste together independently computed cache segments without approximation tricks that may hurt accuracy. Commenters cited papers like CacheBlend and AsyncReasoning as examples of actual work in that direction, but the consensus was that this submission adds little beyond the framing.
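The dependence on preceding tokens can be illustrated with a toy causal attention layer. This is a sketch under simplifying assumptions, not a real transformer: there are no learned projections and no positional encoding, the 2-dimensional embeddings are made up, and the function names are hypothetical. The output of this layer stands in for the hidden state that a following layer would project into its keys and values.

```python
import math


def causal_attention(embeddings):
    """One toy causal self-attention pass over a list of vectors.

    Position i attends only to positions 0..i. Queries, keys, and values
    are the raw embeddings (no learned projections).
    """
    outputs = []
    for i, query in enumerate(embeddings):
        visible = embeddings[: i + 1]
        scores = [sum(a * b for a, b in zip(query, key)) for key in visible]
        top = max(scores)
        weights = [math.exp(s - top) for s in scores]  # stable softmax
        total = sum(weights)
        mixed = [
            sum(w * v[d] for w, v in zip(weights, visible)) / total
            for d in range(len(query))
        ]
        outputs.append(mixed)
    return outputs


def max_diff(a, b):
    """Largest absolute element-wise difference between two vector lists."""
    return max(abs(x - y) for u, v in zip(a, b) for x, y in zip(u, v))


prefix = [[1.0, 0.0], [0.0, 1.0]]
segment = [[1.0, 1.0], [0.5, -0.5]]
full = causal_attention(prefix + segment)

# Prefix states are identical whether or not more text follows,
# which is why prefix caching is exact.
prefix_gap = max_diff(causal_attention(prefix), full[: len(prefix)])

# Segment states computed in isolation differ from the same segment
# computed after the prefix, so they cannot simply be spliced in.
segment_gap = max_diff(causal_attention(segment), full[len(prefix):])
```

Here prefix_gap is zero while segment_gap is not, even in this stripped-down setting. Positional encodings and model-specific weights, both omitted from the sketch, add further reasons a cache segment is tied to its original context.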
The economic case also fell apart. Even if the technical hurdles were solved, commenters argued there is little reason for an inference provider to buy a cache from a third party when it can compute and store the same cache itself, usually for a small one-time cost. Several people said the practical answer today is simpler: use prefix caching where prompts repeat, and avoid dragging giant documents into context in the first place. Search or retrieve the relevant fragments instead.