Can I Buy Your KV Cache?

AI
Infrastructure
Developer Tools

The paper proposes treating a document’s key-value cache as a tradable artifact. A publisher would precompute the cache for a model, then other agents could buy it, load it, and skip the expensive prefill pass before generating answers. In plain terms, it is pitching a market for precomputed intermediate model state.

Treat this as a reminder that KV caching is already table stakes in LLM serving, while cross-context and cross-model cache reuse is still a real research problem. If you are building agent or retrieval systems, you will get more practical wins from provider-side prefix caching and better document retrieval than from betting on a secondary market for cached activations.
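For readers who have not looked inside a serving stack, exact prefix caching can be sketched roughly as follows. This is a minimal illustration, not the API of any real provider or engine: the block size, the `PrefixCache` class, and the `"kv-block"` placeholder are all hypothetical, and a real system would store tensors and evict old blocks.

```python
import hashlib

BLOCK = 4  # tokens per cache block (hypothetical size)

def block_keys(tokens):
    """Chain-hash each full block so a key identifies the whole prefix up to it."""
    keys, prev = [], b""
    for i in range(0, len(tokens) - len(tokens) % BLOCK, BLOCK):
        chunk = ",".join(map(str, tokens[i:i + BLOCK])).encode()
        prev = hashlib.sha256(prev + chunk).digest()
        keys.append(prev)
    return keys

class PrefixCache:
    def __init__(self):
        self.blocks = {}  # key -> placeholder standing in for stored KV tensors

    def insert(self, tokens):
        for key in block_keys(tokens):
            self.blocks.setdefault(key, "kv-block")

    def cached_prefix_len(self, tokens):
        """Number of leading tokens whose KV state could be reused."""
        n = 0
        for key in block_keys(tokens):
            if key not in self.blocks:
                break
            n += BLOCK
        return n

cache = PrefixCache()
cache.insert([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])  # stores two full blocks
hit = cache.cached_prefix_len([1, 2, 3, 4, 5, 6, 7, 8, 42, 43, 44, 45])
miss = cache.cached_prefix_len([0, 1, 2, 3, 4, 5, 6, 7, 8])
print(hit, miss)  # 8 0
```

The second lookup misses entirely because a single extra token at the front shifts every block. That is why this kind of reuse only pays off for identical prefixes.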

June 12, 2026
arxiv.org

Key insights

AsyncReasoning shows what real KV reuse looks like

AsyncReasoning is useful here because it targets the actual open problem, not simple shared-prefix caching. It gives multiple agents different views over the same cache and compensates for positional embedding mismatches by rotating query projections per block. The result is only approximate, but that is the point. Nontrivial KV reuse requires explicit machinery to manage broken assumptions about order and position.
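To see why rotating the query can stand in for re-encoding a cached key, here is a small sketch of the relative-position property of rotary position embeddings (RoPE). It assumes the model uses RoPE and is not the paper's actual implementation; the vectors and positions are made up for illustration.

```python
import math

def rope(vec, pos, base=10000.0):
    """Apply a rotary position embedding to an even-length vector."""
    d = len(vec)
    out = []
    for i in range(0, d, 2):
        theta = pos * base ** (-i / d)  # one rotation angle per pair of dimensions
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out += [x * c - y * s, x * s + y * c]
    return out

def score(q, k):
    return sum(a * b for a, b in zip(q, k))

q = [0.3, -1.2, 0.8, 0.5]
k = [1.0, 0.4, -0.7, 0.2]

cached_key = rope(k, 3)  # the key was cached at position 3

# In a new view the key should sit at position 10 and the query at position 12.
exact = score(rope(q, 12), rope(k, 10))

# Instead of re-rotating the cached key, shift the query by the same offset.
compensated = score(rope(q, 12 + (3 - 10)), cached_key)

# Ignoring the mismatch gives a different attention score.
naive = score(rope(q, 12), cached_key)
```

Here `exact` and `compensated` agree up to floating-point error because the score depends only on the distance between the two positions. This covers positions only. It does nothing about cached values that were computed without seeing the other agent's tokens, which is where the approximation comes from.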

If you want reusable caches beyond exact repeated prefixes, look at approximation methods like AsyncReasoning instead of market design. Expect accuracy trade-offs and model-specific engineering work, not a plug-and-play artifact you can trade like a file.

Attribution:

dvmazur #1

Good primers were buried in the replies

Two concrete learning resources stood out. The paper at arXiv:2207.09238 was recommended for the math behind KV caching, and 3Blue1Brown’s transformer video was suggested as the gentler on-ramp. That matters because a lot of the confusion in this story comes from people using "KV cache" as a buzzword without understanding what is actually being cached.

If your team is making product or infra decisions around LLM serving, get at least one engineer grounded in the mechanics first. It will help you separate normal prefix caching from the much harder problem of cache composition and reuse.
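As a concrete picture of what is actually being cached, here is a toy single-head attention step in plain Python. The 2x2 weights and the function names are hypothetical, and a real model has many layers and heads, but the shape of the idea is the same: prefill produces one key and one value vector per prompt token, and each later step only adds one more of each.

```python
import math

# Hypothetical 2x2 projection weights for a single attention head.
W_Q = [[0.5, -0.2], [0.1, 0.8]]
W_K = [[0.3, 0.7], [-0.4, 0.2]]
W_V = [[1.0, 0.5], [0.0, -1.0]]

def matvec(w, x):
    return [sum(a * b for a, b in zip(row, x)) for row in w]

def attend(q, keys, values):
    """Softmax-weighted sum of the values for one query."""
    scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(len(q)) for k in keys]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [sum(e / total * v[j] for e, v in zip(exps, values))
            for j in range(len(values[0]))]

def prefill(embeddings):
    """The cached artifact: one key and one value vector per prompt token."""
    return ([matvec(W_K, x) for x in embeddings],
            [matvec(W_V, x) for x in embeddings])

def step_with_cache(x, cache):
    keys, values = cache
    keys.append(matvec(W_K, x))    # only the new token is projected
    values.append(matvec(W_V, x))
    return attend(matvec(W_Q, x), keys, values)

def step_without_cache(prompt, x):
    keys, values = prefill(prompt + [x])  # recompute every key and value
    return attend(matvec(W_Q, x), keys, values)

prompt = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
new_token = [0.2, -0.3]
cache = prefill(prompt)
with_cache = step_with_cache(new_token, cache)
without_cache = step_without_cache(prompt, new_token)
```

Both paths return the same output. The cached path just skips reprocessing the prompt, which is all that ordinary KV caching promises.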

Attribution:

wren6991 #1

Against the grain

The market idea is coherent if composition works

The charitable read is that the paper is not about caching a system prompt, but about a future where KV(A || B) can be assembled from KV(A) and KV(B), popular documents can be identified ahead of time, and buying a cache beats recomputing it. That still depends on unsolved technical work and questionable economics, but it frames the paper as a speculative market design layered on top of a real research target.
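A toy example shows why KV(A || B) is not simply KV(A) followed by KV(B). The sketch below assumes a two-layer causal attention stack with a residual connection and reuses the same hypothetical weights for both layers to stay short. It has no positional encoding at all, so the mismatch it shows comes purely from B's tokens having attended to A in the first layer.

```python
import math

# Hypothetical 2x2 weights, shared by both layers for brevity.
W_Q = [[0.5, -0.2], [0.1, 0.8]]
W_K = [[0.3, 0.7], [-0.4, 0.2]]
W_V = [[1.0, 0.5], [0.0, -1.0]]

def matvec(w, x):
    return [sum(a * b for a, b in zip(row, x)) for row in w]

def causal_layer(xs):
    """One attention layer with a residual connection; token i sees tokens 0..i."""
    keys = [matvec(W_K, x) for x in xs]
    values = [matvec(W_V, x) for x in xs]
    out = []
    for i, x in enumerate(xs):
        q = matvec(W_Q, x)
        scores = [sum(a * b for a, b in zip(q, k)) for k in keys[:i + 1]]
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]
        total = sum(exps)
        mix = [sum(e / total * v[j] for e, v in zip(exps, values)) for j in range(2)]
        out.append([a + b for a, b in zip(x, mix)])
    return out

def layer2_keys(xs):
    """Keys a second layer would cache; they depend on first-layer outputs."""
    return [matvec(W_K, h) for h in causal_layer(xs)]

doc_a = [[1.0, 0.0], [0.0, 1.0]]
doc_b = [[0.5, -1.0], [2.0, 1.0]]

keys_b_in_context = layer2_keys(doc_a + doc_b)[len(doc_a):]
keys_b_alone = layer2_keys(doc_b)

gap = max(abs(x - y)
          for kc, ka in zip(keys_b_in_context, keys_b_alone)
          for x, y in zip(kc, ka))
```

`gap` is nonzero: the second-layer keys for B differ depending on whether A came first. Stitching independently computed caches therefore needs either selective recomputation or an accepted approximation.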

Do not dismiss the entire concept just because this paper is weak. If composable KV reuse becomes reliable, new distribution and pricing models for shared context could appear around high-traffic corpora or enterprise knowledge bases.
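Whether buying beats recomputing is mostly arithmetic on cache size, bandwidth, and prefill speed. The sketch below uses the standard size formula for a KV cache. Every number in it is a hypothetical placeholder (a 32-layer model with 8 KV heads of dimension 128 in 16-bit precision, plus made-up bandwidth and prefill rates), so substitute measured values before drawing conclusions.

```python
def kv_cache_bytes(tokens, layers, kv_heads, head_dim, bytes_per_value=2):
    # Factor of 2: one key and one value vector per token, per layer, per KV head.
    return 2 * tokens * layers * kv_heads * head_dim * bytes_per_value

def transfer_seconds(size_bytes, bandwidth_bytes_per_s):
    return size_bytes / bandwidth_bytes_per_s

def buying_wins(tokens, size_bytes, bandwidth_bytes_per_s, prefill_tokens_per_s):
    """True if downloading the cache is faster than recomputing it locally."""
    recompute_seconds = tokens / prefill_tokens_per_s
    return transfer_seconds(size_bytes, bandwidth_bytes_per_s) < recompute_seconds

tokens = 10_000
size = kv_cache_bytes(tokens, layers=32, kv_heads=8, head_dim=128)
print(size)  # 1310720000 bytes, roughly 1.3 GB for one document

# Hypothetical link speeds: about 1 Gbit/s versus about 100 Gbit/s.
slow_link = buying_wins(tokens, size, 125e6, prefill_tokens_per_s=5_000)
fast_link = buying_wins(tokens, size, 12.5e9, prefill_tokens_per_s=5_000)
```

Under these made-up numbers the purchase only wins on the fast link, which fits the expectation that this would surface inside a datacenter before it surfaces as an open market. Price, storage, and compression are left out of the comparison.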

Attribution:

entrope #1

Cloudflare scraping products point in this direction

One commenter connected the paper to Cloudflare’s recent scraping-related work and argued that shared precomputation markets fit the broader trend of monetizing access patterns around AI pipelines. That does not rescue the paper’s implementation details, but it does suggest the commercial instinct is not crazy. Infrastructure vendors are already looking for places to insert caching and tollbooths around repeated model work.

Watch infrastructure companies more than academic proposals here. If this idea goes anywhere, it will likely show up first as a serving feature or platform pricing primitive, not as an open market for standalone KV blobs.

Attribution:

TuringNYC #1

In plain English

AsyncReasoning

A research paper proposing concurrent cache views for multiple agents, using transformations to tolerate positional inconsistencies.

CacheBlend

A research paper on approximately stitching together independently computed KV cache segments for transformers.

inference

Running a trained AI model to produce outputs, as opposed to training the model.

KV cache

Key-value cache: the per-token key and value tensors a transformer stores for each attention layer, so that it does not have to reprocess the entire prompt on each generation step.

llama.cpp

A popular open-source project for running language models efficiently on local hardware.

positional embedding

A method transformers use to encode token order so the model can tell where each token appears in a sequence.

prefill

The inference phase where a model processes the input prompt before generating new tokens.

prefix caching

Reusing cached model state for an identical prompt prefix that appears again in later requests.

vLLM

An open-source inference engine for serving large language models efficiently.

Reference links

Research papers on KV caching and reuse

Can I Buy Your KV Cache?
The submitted paper proposing a market for precomputed KV caches.
AsyncReasoning
Cited as an example of approximate cache reuse across different cache views and agent concurrency.
CacheBlend
Referenced as prior work on stitching together independent KV cache prefills.
A mathematical primer on KV caching
Recommended as a more rigorous explanation of the mechanics behind KV caching.

Introductory explainer

3Blue1Brown transformer video
Suggested as a gentler introduction to the concepts behind transformer attention and KV caching.
