Tokenomics: Quantifying Where Tokens Are Used in Agentic Software Engineering

AI
Developer Tools
Economics

The paper tries to put numbers on a question most teams are still hand-waving away: where tokens actually get burned when you use agentic coding systems. Across 30 software engineering tasks, it reports that code review and debugging are the biggest sinks and that input tokens make up the majority of usage on average. That lines up with what many people are seeing in practice: agents spend far more time reading code, tool outputs, and prior context than producing code. For large codebases, that imbalance can get extreme. Several commenters said they regularly see input-to-output token ratios closer to 10:1, with agents ingesting huge amounts of context to make tiny edits.

If you are deploying coding agents, start tracking token spend by workflow stage now instead of treating it as one blended bill. The obvious levers are reducing context bloat, caching repeated prefixes, and pushing cheap models into refinement, review, and other grunt work before you negotiate anything with vendors.
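As a rough illustration of per-stage tracking, here is a minimal Python sketch. It is not from the paper or the discussion. The `Call` record, the stage labels, and the token counts are all made up for the example, and it assumes your agent framework can tag each model call with a stage and report input and output token counts.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Call:
    stage: str          # hypothetical stage label, e.g. "review" or "debug"
    input_tokens: int
    output_tokens: int


def spend_by_stage(calls):
    """Sum input and output tokens per workflow stage."""
    totals = defaultdict(lambda: {"input": 0, "output": 0})
    for call in calls:
        totals[call.stage]["input"] += call.input_tokens
        totals[call.stage]["output"] += call.output_tokens
    return dict(totals)


def input_share(totals):
    """Fraction of each stage's tokens that were input tokens."""
    return {
        stage: t["input"] / (t["input"] + t["output"])
        for stage, t in totals.items()
        if t["input"] + t["output"] > 0
    }


# Made-up numbers, for illustration only.
calls = [
    Call("review", 9000, 600),
    Call("debug", 12000, 1500),
    Call("review", 11000, 400),
]
totals = spend_by_stage(calls)
shares = input_share(totals)
```

Keeping input and output separate per stage matters because the two are usually priced differently, so a blended count hides where the money goes.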

June 7, 2026
arxiv.org
Discuss on HN

Key insights

Caching matters more after tool calls

Repeated tool use means the model often gets the whole conversation and prior context sent back again to decide the next step. That makes prefix caching one of the few immediately available cost levers for agentic coding, especially when agents bounce through many read and inspect actions before making a change.

Audit your agent traces for repeated context replay around tool calls. If your provider supports prompt caching, design the workflow so stable prefixes stay stable and get reused.
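One way to start that audit is to measure how much of each request simply repeats the start of the previous one. The Python sketch below is an illustration, not anything from the paper or the commenters. It assumes you can export each request in a trace as a token list, and it uses whitespace-split words as a stand-in for real tokenizer output. Providers differ in how they match and bill cached prefixes, so treat the reused counts as an upper bound on what caching could cover.

```python
def shared_prefix_len(a, b):
    """Number of leading tokens two sequences have in common."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def replay_report(requests):
    """For each request after the first, count tokens that repeat the
    previous request's prefix versus tokens that are new."""
    report = []
    for prev, cur in zip(requests, requests[1:]):
        reused = shared_prefix_len(prev, cur)
        report.append({"reused": reused, "new": len(cur) - reused})
    return report


# Toy trace: each turn resends everything and appends one tool result.
trace = [
    "system rules | user task".split(),
    "system rules | user task | tool read_file result".split(),
    "system rules | user task | tool read_file result | tool grep result".split(),
]
report = replay_report(trace)
```

If the reused count drops to near zero partway through a trace, something early in the prompt (a timestamp, a reordered tool list) is probably changing between turns and breaking the stable prefix.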

Attribution:

bob1029 #1 #2
kolinko #1
Phemist #1

Huge context windows are often a workflow bug

When an agent reads massive parts of a codebase to change a line, the failure is usually in retrieval and abstraction, not just model pricing. Better code navigation, summaries of what modules do, and higher-level structure can replace brute-force context stuffing and cut both cost and error rate.

Invest in codebase maps, language-server hooks, and module summaries before buying more tokens. Measure how much context each successful edit actually needed and tune toward that.
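A module summary can be as simple as a list of top-level names with the first line of each docstring. The sketch below uses Python's standard `ast` module to build one. It is an illustrative assumption about what a "codebase map" might contain, not a method described in the paper or the discussion, and the `example` source is invented.

```python
import ast


def summarize_module(source):
    """Return one line per top-level function or class: its kind, its name,
    and the first line of its docstring when it has one."""
    tree = ast.parse(source)
    summary = []
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            kind = "class" if isinstance(node, ast.ClassDef) else "def"
            line = f"{kind} {node.name}"
            doc = ast.get_docstring(node)
            if doc:
                line += f": {doc.splitlines()[0]}"
            summary.append(line)
    return summary


# Invented module source, for illustration only.
example = '''
def load(path):
    """Read a config file from disk."""
    return open(path).read()

class Cache:
    """In-memory cache keyed by file path.

    Longer notes the agent does not need.
    """
    def get(self, key):
        return None

def _helper():
    pass
'''
summary = summarize_module(example)
```

Handing an agent a few lines like these per file, and letting it request full source only for the modules it decides to touch, is one way to replace brute-force context stuffing with something closer to navigation.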

Attribution:

bob1029 #1
uxhacker #1
frumiousirc #1

Agents waste tokens on bad tests

A recurring failure mode is that coding agents generate piles of unit tests, then spend more tokens fixing tests that do not reflect the real behavior you care about. Without explicit requirements for runtime checks or broader validation, they default to verbose test churn that looks productive while burning budget.

Specify the validation strategy up front for coding tasks. Tell the agent when to run dynamic checks, integration tests, or smoke tests so it does not hide behind unit-test volume.
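One lightweight way to make the validation strategy explicit is to keep it as data and render it into the task prompt. The Python sketch below is a hypothetical illustration: the `ValidationPlan` class, the wording of the instructions, and the example commands are placeholders, not something the commenters proposed.

```python
from dataclasses import dataclass, field


@dataclass
class ValidationPlan:
    max_new_unit_tests: int = 3
    required_checks: list = field(default_factory=list)

    def to_prompt(self):
        """Render the plan as plain-text instructions for the agent."""
        lines = ["Validation requirements for this task:"]
        for i, check in enumerate(self.required_checks, start=1):
            lines.append(f"{i}. Run `{check}` and report the result before finishing.")
        lines.append(
            f"Add at most {self.max_new_unit_tests} new unit tests; "
            "do not edit existing tests to make them pass without explaining why."
        )
        return "\n".join(lines)


# Placeholder commands; substitute whatever your project actually runs.
plan = ValidationPlan(required_checks=["pytest tests/integration", "make smoke"])
prompt = plan.to_prompt()
```

Whether a cap on new unit tests is the right constraint depends on the task. The point is that the agent is told which checks count, rather than being left to choose test volume as its default.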

Attribution:

sakuraiben #1
drivebyhooting #1
make3 #1

Prompt refinement is a cheap place to save money

Several builders said the biggest practical gain came before coding starts. A cheap model that interrogates the request, exposes missing constraints, and rewrites the task can reduce downstream waste. The catch is that weak refinement questions can steer the whole process wrong, so this step needs quality control.

Add a refinement stage before any expensive planning or coding pass. Track whether better task framing lowers total tokens and rework rather than judging refinement by its own cost.
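Judging refinement by total tokens rather than its own cost comes down to a simple comparison between runs with and without the stage. The Python sketch below shows the arithmetic. The stage names and token counts are invented for the example, and a real comparison would need matched tasks and enough runs to be meaningful.

```python
from statistics import mean


def total_tokens(run):
    """Sum tokens across all stages of one run."""
    return sum(run.values())


def refinement_effect(baseline_runs, refined_runs):
    """Compare mean total tokens per task with and without refinement."""
    base = mean(total_tokens(r) for r in baseline_runs)
    refined = mean(total_tokens(r) for r in refined_runs)
    return {
        "baseline_mean": base,
        "refined_mean": refined,
        "saved_fraction": (base - refined) / base,
    }


# Invented numbers: refinement adds a small stage but cuts rework.
baseline = [
    {"coding": 40000, "rework": 20000},
    {"coding": 50000, "rework": 10000},
]
refined = [
    {"refinement": 2000, "coding": 38000, "rework": 5000},
    {"refinement": 3000, "coding": 42000, "rework": 0},
]
effect = refinement_effect(baseline, refined)
```

A negative `saved_fraction` would mean the refinement stage is costing more than it saves, which is the signal to check the quality of the refinement questions.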

Attribution:

monkeydust #1 #2
Cherub0774 #1
whattheheckheck #1

Mixed-model pipelines will pressure pricing

The strongest market read was that token pricing will not stay attached to today's frontier API rates for every step. Open-weight models and cheap inference providers are already good enough for review, refinement, and other lower-stakes work. That pushes teams toward routing workflows by task instead of paying flagship prices for every token.

Break your agent stack into stages and test model substitution step by step. The savings are more likely to come from routing and provider competition than from waiting for one vendor to cut prices.
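To see how much routing could matter before running any experiments, you can price the same token usage under different stage-to-model assignments. The Python sketch below does that. The model names, per-million-token prices, and usage figures are hypothetical placeholders, not real vendor rates or numbers from the paper.

```python
# Hypothetical prices in dollars per million tokens.
PRICE_PER_MTOK = {
    "frontier-model": {"input": 10.0, "output": 30.0},
    "small-model": {"input": 0.5, "output": 1.5},
}

# Hypothetical routing: cheap model for lower-stakes stages.
ROUTES = {
    "refinement": "small-model",
    "planning": "frontier-model",
    "coding": "frontier-model",
    "review": "small-model",
}


def stage_cost(model, input_tokens, output_tokens):
    """Dollar cost of one stage on one model."""
    price = PRICE_PER_MTOK[model]
    return (input_tokens * price["input"] + output_tokens * price["output"]) / 1_000_000


def pipeline_cost(usage, routes):
    """Total cost of a pipeline given per-stage (input, output) token usage."""
    return sum(
        stage_cost(routes[stage], inp, out) for stage, (inp, out) in usage.items()
    )


# Invented per-stage usage as (input_tokens, output_tokens).
usage = {
    "refinement": (4000, 1000),
    "planning": (20000, 2000),
    "coding": (100000, 10000),
    "review": (60000, 2000),
}
all_frontier = {stage: "frontier-model" for stage in usage}
routed_cost = pipeline_cost(usage, ROUTES)
frontier_cost = pipeline_cost(usage, all_frontier)
```

This only prices tokens. It assumes the cheaper model produces the same token counts and acceptable quality at each stage, which is exactly what step-by-step substitution tests are meant to check.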

Attribution:

jpatt #1
avianlyric #1
mobelkh #1
oersted #1

Against the grain

Token costs may not stay painful

Faster hardware and better cost-performance could make today's obsession with token thrift look temporary. If inference keeps getting cheaper underneath the APIs, teams that over-optimize every prompt now may be solving a short-lived pricing problem instead of the longer-term product problem.

Do the easy efficiency wins, but avoid architecture that only makes sense under permanently scarce tokens. Revisit your assumptions as hardware and serving costs move.

Attribution:

Retric #1 #2

On-prem inference is not an easy escape

The hope that local hardware will quickly erase cloud token costs runs into ugly systems details. Commodity NPUs are not automatically good at inference, and smaller models lose real capability that extra chain-of-thought or agent scaffolding does not fully restore. Cheap local inference helps some batch workloads, but it is not a universal replacement.

Test local inference on your actual tasks before planning around it. Batch review jobs may fit, but do not assume on-prem hardware can absorb your full coding-agent workload.

Attribution:

emsign #1
zozbot234 #1

GitHub pricing shock was a product change

The sharp Copilot quota jump that triggered some of the cynicism is not clean evidence that all token pricing is arbitrary. Part of the pain came from a specific packaging change, not just from providers randomly deciding what a token should cost.

Separate vendor-specific plan changes from underlying model economics when you forecast spend. Your biggest risk may be contract terms and packaging, not raw inference cost alone.

Attribution:

altmanaltman #1

In plain English

abstract syntax tree

A tree representation of source code structure that captures the program's syntax in a form tools can analyze.

agentic

Describing a workflow where the model takes multiple steps, uses tools, and iterates toward a goal rather than answering once.

API

Application programming interface, a way for software to call another service programmatically.

inference

Running a trained model to generate outputs from new inputs.

prefix caching

A serving optimization that reuses computation for repeated prompt prefixes so repeated agent turns cost less.

prompt

The input text and instructions given to a language model.

quota

A usage limit set by a vendor, such as a cap on tokens or requests within a billing period.

Reference links

Paper and related reading

Tokenomics: Quantifying Where Tokens Are Used in Agentic Software Engineering
The submitted paper measuring token use across software engineering tasks.
Steering LLM Thinking with Budget Guidance
Prior paper mentioned as related work on controlling model token use with budget guidance.
The Coming Age of Tokenomics
A Substack post shared as earlier commentary on AI token economics.

Builds and demos

Multi-agent system demo video
A demo of a homebuilt multi-agent workflow with prompt refinement and parallel strategies.
