HN Debrief

Can I Buy Your KV Cache?

  • AI
  • Infrastructure
  • Developer Tools

The paper proposes treating a document’s key-value cache as a tradable artifact. A publisher would precompute the cache for a model, then other agents could buy it, load it, and skip the expensive prefill pass before generating answers. In plain terms, it is pitching a market for precomputed intermediate model state.
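
To get a feel for how heavy such an artifact would be, here is a back-of-the-envelope sketch of KV cache size. The formula (one key vector and one value vector per layer, per KV head, per token) is the standard accounting; the model shape is a hypothetical configuration chosen for illustration and is not taken from the paper.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, num_tokens, bytes_per_value=2):
    """Estimate KV cache size: a key and a value vector per layer, KV head, and token."""
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_value
    return per_token * num_tokens

# Hypothetical model shape (assumption): 32 layers, 32 KV heads,
# head dimension 128, 16-bit values.
size = kv_cache_bytes(num_layers=32, num_kv_heads=32, head_dim=128, num_tokens=8000)
print(f"{size / 2**30:.2f} GiB for an 8,000-token document")
```

Under these assumed numbers an 8,000-token document works out to roughly 3.9 GiB of cache, so any buyer is weighing transfer and storage cost against the compute saved by skipping prefill. Models with fewer KV heads or quantized caches would shrink this figure.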

Treat this as a reminder that KV caching is already table stakes in LLM serving, while cross-context and cross-model cache reuse is still a real research problem. If you are building agent or retrieval systems, you will get more practical wins from provider-side prefix caching and better document retrieval than from betting on a secondary market for cached activations.
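
As a rough illustration of why prefix caching only helps with exact repeated prefixes, here is a minimal sketch of a block-hashed prefix cache. It is a toy under stated assumptions, not any particular server's implementation: `BLOCK_SIZE`, `PrefixCache`, and the string placeholders standing in for K/V tensors are all hypothetical.

```python
import hashlib

BLOCK_SIZE = 4  # tokens per cached block (hypothetical value)

def block_hashes(tokens):
    """Hash each full block chained with the previous hash, so a block's
    hash identifies the entire prefix that ends at that block."""
    hashes, prev = [], b""
    usable = len(tokens) - len(tokens) % BLOCK_SIZE  # ignore a trailing partial block
    for start in range(0, usable, BLOCK_SIZE):
        block = tokens[start:start + BLOCK_SIZE]
        prev = hashlib.sha256(prev + repr(block).encode()).digest()
        hashes.append(prev)
    return hashes

class PrefixCache:
    def __init__(self):
        self.blocks = {}  # prefix hash -> stand-in for that block's K/V tensors

    def store(self, tokens):
        for i, h in enumerate(block_hashes(tokens)):
            self.blocks.setdefault(h, f"kv-block-{i}")

    def cached_prefix_len(self, tokens):
        """Number of leading tokens whose K/V could be reused without prefill."""
        reused = 0
        for h in block_hashes(tokens):
            if h not in self.blocks:
                break
            reused += BLOCK_SIZE
        return reused

cache = PrefixCache()
doc = list(range(100, 112))          # 12 "document" tokens
cache.store(doc + [1, 2, 3, 4])      # first request: document plus question A

same_prefix = doc + [9, 9, 9, 9]     # same document, new question
shifted = [7] + doc + [9, 9, 9]      # same document, one token inserted in front

print(cache.cached_prefix_len(same_prefix))  # 12
print(cache.cached_prefix_len(shifted))      # 0
```

The second lookup reuses all 12 document tokens, while a single token inserted in front drops reuse to zero. That cliff is the gap between ordinary prefix caching and the cross-context reuse the paper would need.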

Discussion mood

Strongly negative. Most comments dismiss the paper as superficial, saying it repackages widely deployed prefix caching, ducks the real technical constraints on KV reuse, and offers an implausible business model for something providers can already compute and store themselves.

Key insights

  1. AsyncReasoning shows what real KV reuse looks like

    AsyncReasoning is useful here because it targets the actual open problem, not simple shared-prefix caching. It gives multiple agents different views over the same cache and compensates for positional embedding mismatches by rotating query projections per block. The result is only approximate, but that is the point. Nontrivial KV reuse requires explicit machinery to manage broken assumptions about order and position.

    If you want reusable caches beyond exact repeated prefixes, look at approximation methods like AsyncReasoning instead of market design. Expect accuracy trade-offs and model-specific engineering work, not a plug-and-play artifact you can trade like a file.

      Attribution:
    • dvmazur #1
  2. Good primers were buried in the replies

    Two concrete learning resources stood out. The paper at arXiv:2207.09238 was recommended for the math behind KV caching, and 3Blue1Brown’s transformer video was suggested as the gentler on-ramp. That matters because a lot of the confusion in this story comes from people using "KV cache" as a buzzword without understanding what is actually being cached.

    If your team is making product or infra decisions around LLM serving, get at least one engineer grounded in the mechanics first. It will help you separate normal prefix caching from the much harder problem of cache composition and reuse.

      Attribution:
    • wren6991 #1
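
The positional mismatch mentioned in the first insight can be illustrated with rotary position embeddings (RoPE), assuming the model uses them. With RoPE, an attention score depends only on the relative offset between query and key, so a cached key can be viewed from a shifted position by applying an extra rotation to the key or a counter-rotation to the query. The sketch below shows that generic property with 2-D vectors and a single assumed frequency `theta`. It is not the method from the paper that insight discusses, and real models use many frequencies across many dimensions.

```python
import math

def rotate(vec, position, theta=0.5):
    """Apply a RoPE-style rotation to a 2-D vector: angle = position * theta."""
    angle = position * theta
    x, y = vec
    return (x * math.cos(angle) - y * math.sin(angle),
            x * math.sin(angle) + y * math.cos(angle))

def dot(a, b):
    return a[0] * b[0] + a[1] * b[1]

q, k = (1.0, 2.0), (0.5, -1.0)   # unrotated query and key projections (made up)

# Key cached while its token sat at position 3; query issued from position 10.
cached_key = rotate(k, 3)
score_original = dot(rotate(q, 10), cached_key)

# Another view places everything 5 positions later: token at 8, query at 15.
shift = 5
# Option 1: re-rotate the cached key by the shift.
score_rekeyed = dot(rotate(q, 15), rotate(cached_key, shift))
# Option 2: leave the cache untouched and rotate the query back instead.
score_requeried = dot(rotate(rotate(q, 15), -shift), cached_key)
# Ignoring the shift changes the relative offset and therefore the score.
score_naive = dot(rotate(q, 15), cached_key)

print(score_original, score_rekeyed, score_requeried, score_naive)
```

In this toy the first three scores agree up to floating-point error and the naive one does not. The correction is exact here only because the example is a single attention score with one frequency; it says nothing about how close a real multi-layer model gets.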

Against the grain

  1. The market idea is coherent if composition works

    The charitable read is that the paper is not about caching a system prompt, but about a future where KV(A || B) can be assembled from KV(A) and KV(B), popular documents can be identified ahead of time, and buying a cache beats recomputing it. That still depends on unsolved technical work and questionable economics, but it frames the paper as a speculative market design layered on top of a real research target.

    Do not dismiss the entire concept just because this paper is weak. If composable KV reuse becomes reliable, new distribution and pricing models for shared context could appear around high-traffic corpora or enterprise knowledge bases.

      Attribution:
    • entrope #1
  2. Cloudflare scraping products point in this direction

    One commenter connected the paper to Cloudflare’s recent scraping-related work and argued that shared precomputation markets fit the broader trend of monetizing access patterns around AI pipelines. That does not rescue the paper’s implementation details, but it does suggest the commercial instinct is not crazy. Infrastructure vendors are already looking for places to insert caching and tollbooths around repeated model work.

    Watch infrastructure companies more than academic proposals here. If this idea goes anywhere, it will likely show up first as a serving feature or platform pricing primitive, not as an open market for standalone KV blobs.

      Attribution:
    • TuringNYC #1
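
To see why assembling KV(A || B) from KV(A) and KV(B) is the hard part of the charitable read above, consider a toy causal attention layer. The sketch uses scalar embeddings, identity projections, and no positional encoding, all of which are simplifying assumptions, and the embedding values are made up. First-layer keys and values are per-token projections and concatenate cleanly, but the hidden states that feed the next layer's keys and values depend on everything the token attended to.

```python
import math

def attend(query, keys, values):
    """Softmax attention of one scalar query over scalar keys and values."""
    weights = [math.exp(query * k) for k in keys]
    total = sum(weights)
    return sum(w * v for w, v in zip(weights, values)) / total

def layer1_hidden(tokens):
    """Causal self-attention with identity projections (q = k = v = embedding)
    plus a residual connection. Returns one hidden state per token."""
    return [x + attend(x, tokens[:i + 1], tokens[:i + 1])
            for i, x in enumerate(tokens)]

A = [0.9, -0.4, 0.3]   # hypothetical embeddings for document A
B = [0.5, 1.2]         # hypothetical embeddings for document B

# Layer-1 K/V are just per-token projections here, so they concatenate cleanly.
kv1_A, kv1_B, kv1_AB = A, B, A + B

# Layer-2 K/V would be projections of these hidden states, which depend on context.
h_B_alone = layer1_hidden(B)
h_B_in_context = layer1_hidden(A + B)[len(A):]

print("B alone:      ", h_B_alone)
print("B following A:", h_B_in_context)
```

The two printed lists differ, which means a cache for B computed in isolation does not match the cache B would have had after A from the second layer onward. Any stitching scheme has to either recompute part of the cache or accept an approximation.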

In plain English

AsyncReasoning
A research paper proposing concurrent cache views for multiple agents, using transformations to tolerate positional inconsistencies.
CacheBlend
A research paper on approximately stitching together independently computed KV cache segments for transformers.
inference
Running a trained model to produce outputs from new inputs.
KV cache
Key-value cache, a memory structure that stores intermediate attention data so the model does not recompute everything for each new token.
llama.cpp
A widely used open source C and C++ inference engine for running language models locally.
positional embedding
A method transformers use to encode token order so the model can tell where each token appears in a sequence.
prefill
The stage where a model processes the input context before it begins generating output tokens.
prefix caching
Reusing cached model state for an identical prompt prefix that appears again in later requests.
vLLM
An open source high-throughput inference server for large language models.
