Don't trust large context windows

AI
Developer Tools
Programming

The post argues that large context windows are not the fix they appear to be. In practice, coding agents often degrade well before they hit the headline limit: they forget instructions, reintroduce bad ideas, or drift into what the author calls a "dumb zone." That claim landed because plenty of people have seen the same pattern, even if they disagree about where the failure starts. The more useful distinction is context quality, not raw token count. Long windows full of stale plans, failed attempts, contradictory instructions, or noisy tool output are what poison a session. A large window packed with focused, relevant material can still work.

If your team is leaning on million-token marketing, stop assuming a single long-running agent session will stay sharp. Build agent workflows that keep the working set small, enforce constraints in code rather than prompts, and carry state forward through structured artifacts like plans, docs, and tests.

June 14, 2026
garrit.xyz

Key insights

Make the root agent an orchestrator

Turning the top-level agent into a thin coordinator changes the problem. Instead of letting the main thread read files, call tools, and accumulate every detour, it can only dispatch subcalls and receive summaries back. That keeps the long-lived context small while still allowing huge total token usage in short-lived branches. The important part is that the restriction is enforced in the tool implementation, not merely requested in the prompt, because the model will routinely ignore soft instructions once the session gets messy.

If you are building or choosing an agent harness, prioritize hard permissions and subagent boundaries over clever prompt wording. A root orchestrator plus disposable worker contexts is a better default than one omnipotent session.

Attribution:

bob1029 #1 #2 #3 #4
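
As an illustration of enforcing the restriction in the tool layer rather than the prompt, here is a minimal sketch. Every name in it (`ToolBox`, `run_worker`, `dispatch`, the stub tools) is hypothetical and not taken from any particular harness; the worker is a stand-in for a real model call.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


class ToolNotPermitted(Exception):
    """Raised when an agent requests a tool outside its allowlist."""


@dataclass
class ToolBox:
    """Holds tool implementations and enforces a per-agent allowlist."""
    tools: Dict[str, Callable[..., str]]
    allowed: frozenset

    def call(self, name: str, *args: str) -> str:
        # The check lives in code, so a messy prompt cannot talk its way past it.
        if name not in self.allowed:
            raise ToolNotPermitted(name)
        return self.tools[name](*args)


# Stub tools standing in for real file and test-runner access (assumptions).
def read_file(path: str) -> str:
    return f"<contents of {path}>"


def run_tests() -> str:
    return "12 passed"


def run_worker(task: str, toolbox: ToolBox) -> str:
    """Hypothetical worker: does the noisy work in a throwaway context."""
    scratch: List[str] = []  # discarded when the function returns
    scratch.append(toolbox.call("read_file", "src/app.py"))
    scratch.append(toolbox.call("run_tests"))
    # Only a short summary travels back to the root.
    return f"{task}: inspected {len(scratch)} tool results"


WORKER_TOOLS = {"read_file": read_file, "run_tests": run_tests}
worker_box = ToolBox(WORKER_TOOLS, frozenset(WORKER_TOOLS))


def dispatch(task: str) -> str:
    return run_worker(task, worker_box)


# The root sees every tool name but is only permitted to dispatch.
root_box = ToolBox({**WORKER_TOOLS, "dispatch": dispatch}, frozenset({"dispatch"}))

root_context: List[str] = []
root_context.append(root_box.call("dispatch", "fix login bug"))
```

In this sketch the root's long-lived context grows by one summary line per subtask, while a direct `root_box.call("read_file", ...)` raises `ToolNotPermitted` no matter what the prompt says.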

Docs and tests beat AI memory

The useful alternative to memory features is ordinary software discipline. Concise Markdown plans, indexes, and checklists checked into the repo give the model durable context without polluting every turn with stale facts. Tests, lint rules, and style enforcement do even more because they turn preferences into deterministic guardrails. Several people were blunt that auto-saved memories often preserve the wrong thing and then spread that mistake into later sessions.

Do not rely on vendor memory layers for important project state or coding rules. Persist state in files the model can reload, and encode must-follow constraints in tests and tooling.

Attribution:

SwellJoe #1 #2
justinclift #1
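
To make "turn preferences into deterministic guardrails" concrete, here is a small sketch of one such check. The rule itself (no bare `print` calls) and the function name `find_print_calls` are assumptions chosen for illustration; the point is only that the rule is enforced by code the model cannot forget.

```python
import ast
from typing import List


def find_print_calls(source: str) -> List[int]:
    """Return the line numbers of print(...) calls in Python source."""
    tree = ast.parse(source)
    return sorted(
        node.lineno
        for node in ast.walk(tree)
        if isinstance(node, ast.Call)
        and isinstance(node.func, ast.Name)
        and node.func.id == "print"
    )


# Example rule check: one snippet that violates the rule, one that does not.
violations = find_print_calls("x = 1\nprint(x)\n")
clean = find_print_calls("import logging\nlogging.info('x')\n")
```

A test in the repo could run a check like this over the source tree and fail on any violation, so the constraint holds in every session without being restated in a prompt or stored in a memory layer.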

Context performance is highly model and harness specific

Blanket claims about "models" in general were treated as almost useless. Long-context behavior depends on the specific model version, attention design, training, agent harness, compaction strategy, and even the task shape. That explains why one person reports clean runs at 700k tokens while another sees obvious degradation below 100k. Both can be right inside their own setup. The wrong move is turning either experience into a universal rule.

Benchmark the exact stack you plan to use in production. Results from another model version or coding tool are weak evidence for your workflow, even if the vendor name is the same.

Attribution:

kelnos #1
HarHarVeryFunny #1
deliciousturkey #1

Bad context matters more than big context

What makes a session go bad is often not length by itself but contaminated state. Failed attempts, repeated wrong assumptions, and old instructions keep exerting weight simply because they remain in the window. That is why a focused 700k-token initial load can sometimes work, while a chaotic 80k-token session can already be lost. Several people described this as a tainted path problem, where once the model starts reasoning from bad premises it keeps snapping back to them.

Watch for friction, repeated missteps, and stale assumptions as the signal to reset. Trigger compaction or a fresh session based on context quality, not just token count.

Attribution:

wood_spirit #1
doginasuit #1
nijave #1
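
One way to act on quality signals rather than token count alone is to track them explicitly. The sketch below is a rough heuristic, not a method described in the discussion: the class name `SessionHealth`, the idea of an "error signature," and all thresholds are assumptions to be tuned against your own sessions.

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import List


@dataclass
class SessionHealth:
    """Tracks signs of a contaminated session. All thresholds are assumed."""
    max_repeats: int = 2             # same error seen more than this many times
    max_failures: int = 3            # failed turns in a row
    token_soft_limit: int = 150_000  # fallback only, not the main signal
    errors: Counter = field(default_factory=Counter)
    consecutive_failures: int = 0
    tokens_used: int = 0

    def record_turn(self, tokens: int, ok: bool, error_signature: str = "") -> None:
        self.tokens_used += tokens
        if ok:
            self.consecutive_failures = 0
            return
        self.consecutive_failures += 1
        if error_signature:
            self.errors[error_signature] += 1

    def reasons_to_reset(self) -> List[str]:
        reasons = []
        if any(count > self.max_repeats for count in self.errors.values()):
            reasons.append("same error keeps recurring")
        if self.consecutive_failures >= self.max_failures:
            reasons.append("consecutive failed turns")
        if self.tokens_used > self.token_soft_limit:
            reasons.append("token soft limit exceeded")
        return reasons


health = SessionHealth()
for _ in range(3):
    health.record_turn(10_000, ok=False, error_signature="ImportError: foo")
reset_reasons = health.reasons_to_reset()
```

Here the session asks for a reset at 30k tokens, well under the soft limit, because the same mistake keeps coming back.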

Structured planning files stabilize multi-session work

Teams are getting better results by making the model write product and design artifacts before it writes code. PRDs, design docs, phased implementation plans, and end-of-phase summaries give each coding session a clean brief and a stable record of past decisions. That reduces drift, makes handoffs easier, and creates better review inputs for both humans and other models. The notable part is that these files are lightweight and ad hoc, not heavyweight process theater.

For any non-trivial feature, have the model produce a short design artifact first and carry progress forward through phase summaries. This is a cheap way to improve consistency without trusting one long session.

Attribution:

kristianc #1 #2
magicalhippo #1
SeriousM #1
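
The handoff between sessions can be sketched as a brief assembled from files instead of chat history. The file names used here (`design.md`, `phase-N-summary.md`, `phase-N-plan.md`) and the function `build_session_brief` are hypothetical conventions, not ones prescribed in the discussion.

```python
import tempfile
from pathlib import Path
from typing import List


def build_session_brief(plan_dir: Path, phase: int) -> str:
    """Assemble a fresh-session brief from planning files on disk."""
    parts: List[str] = []
    design = plan_dir / "design.md"
    if design.exists():
        parts.append(design.read_text(encoding="utf-8").strip())
    # Include summaries of completed phases only, in order.
    for n in range(1, phase):
        summary = plan_dir / f"phase-{n}-summary.md"
        if summary.exists():
            parts.append(summary.read_text(encoding="utf-8").strip())
    plan = plan_dir / f"phase-{phase}-plan.md"
    if plan.exists():
        parts.append(plan.read_text(encoding="utf-8").strip())
    return "\n\n---\n\n".join(parts)


# Demo against a temporary directory standing in for a repo's plan folder.
with tempfile.TemporaryDirectory() as tmp:
    plan_dir = Path(tmp)
    (plan_dir / "design.md").write_text("# Design\nUse a queue.", encoding="utf-8")
    (plan_dir / "phase-1-summary.md").write_text("Phase 1 done: schema added.", encoding="utf-8")
    (plan_dir / "phase-2-plan.md").write_text("Phase 2: add worker.", encoding="utf-8")
    brief = build_session_brief(plan_dir, phase=2)
```

Each new session starts from `brief` alone, so failed attempts and abandoned ideas from earlier phases never re-enter the window unless someone wrote them into a summary on purpose.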

Turn anecdotes into rerunnable evals

The cleanest pushback against context folklore was not that degradation is fake, but that teams should stop arguing from vibes. The cited studies may already be dated for current frontier models, yet the right fix is straightforward. Build bounded tasks, rerun them across versions, and keep benchmarks current as models and harnesses change. Otherwise every conversation about long context collapses into irreconcilable war stories.

Set up a small internal eval suite for your actual agent tasks and rerun it on every model or harness change. Without that, you will optimize around memorable anecdotes and vendor demos.

Attribution:

lordgrenville #1
bhy #1
nijave #1
skybrian #1
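
A rerunnable eval suite does not need much machinery. The following is a minimal sketch under stated assumptions: `EvalCase`, `run_suite`, and `fake_agent` are hypothetical names, and `fake_agent` is a stand-in for a call into your real model and harness.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class EvalCase:
    name: str
    prompt: str
    check: Callable[[str], bool]  # deterministic pass/fail on the output


def run_suite(
    agent: Callable[[str], str], cases: List[EvalCase], repeats: int = 3
) -> Dict[str, float]:
    """Return the pass rate per case; repeats smooth over nondeterminism."""
    results: Dict[str, float] = {}
    for case in cases:
        passes = sum(1 for _ in range(repeats) if case.check(agent(case.prompt)))
        results[case.name] = passes / repeats
    return results


def fake_agent(prompt: str) -> str:
    # Stand-in for a real model + harness call (assumption for the demo).
    if "add" in prompt:
        return "def add(a, b):\n    return a + b"
    return "TODO"


CASES = [
    EvalCase("add-function", "Write add(a, b)", lambda out: "return a + b" in out),
    EvalCase("slugify", "Write slugify(title)", lambda out: "def slugify" in out),
]

report = run_suite(fake_agent, CASES)
```

Swapping `fake_agent` for each model or harness version and storing `report` alongside the version gives you a comparison that survives longer than any single anecdote.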

Against the grain

Some users really do get good results deep into 1M context

There is credible resistance to the article's severity. Multiple people said recent Claude variants stayed usable at 400k to 900k tokens, especially on coherent tasks, and that newer releases appear materially better than earlier ones. The useful correction is that large-context performance is not uniformly terrible. In some workflows it is a real upgrade, not just marketing fluff.

Do not overcorrect into tiny-context dogma if your current stack is already handling long focused sessions well. Measure where your failure curve actually starts before redesigning your whole workflow.

Attribution:

kelnos #1
pdantix #1
csomar #1
daishi55 #1
kuboble #1
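
Finding where the failure curve starts can be as simple as bucketing past runs by context size. This is a rough sketch: the function name, the bucket size, and the pass-rate threshold are all assumptions, and the sample data is made up purely to show the shape of the input.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Optional, Tuple


def first_degraded_bucket(
    runs: Iterable[Tuple[int, bool]],
    bucket_size: int = 100_000,
    min_pass_rate: float = 0.8,
) -> Optional[int]:
    """Return the start of the smallest token bucket whose pass rate is too low.

    Each run is (context_tokens_at_time_of_task, task_passed).
    Returns None when every bucket meets the threshold.
    """
    totals: Dict[int, List[int]] = defaultdict(lambda: [0, 0])  # start -> [passes, runs]
    for tokens, passed in runs:
        start = (tokens // bucket_size) * bucket_size
        totals[start][0] += int(passed)
        totals[start][1] += 1
    for start in sorted(totals):
        passes, count = totals[start]
        if passes / count < min_pass_rate:
            return start
    return None


# Illustrative observations only, not measurements from any real model.
sample_runs = [
    (50_000, True), (80_000, True),
    (150_000, True), (180_000, True),
    (420_000, True), (450_000, False), (470_000, False),
]
degraded_from = first_degraded_bucket(sample_runs)
```

With enough runs per bucket, a result like this tells you whether your own stack degrades at 80k or holds past 400k, which is the number that should drive any workflow redesign.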

The field is too young for polished rigor

Some comments argued that the folk-wisdom feel is partly unavoidable at this stage. Coding-agent practice is only a few years old, model behavior changes every few weeks, and even influential architecture papers have openly admitted they did not fully understand why things worked. That does not excuse superstition, but it does mean expecting mature engineering theory right now is unrealistic.

Plan for a moving target. Favor workflows and benchmarks that can be updated quickly instead of waiting for clean universal laws about agent behavior.

Attribution:

darkwater #1
dindunuf #1
conditionnumber #1

Brute-force multi-model retries can be rational

One commenter rejected session micromanagement entirely and treated agents as black boxes. The workflow is simple: send the same task to several models, compare outputs, reset hard when the result is bad, and track which systems actually deliver. It looks wasteful on its face, but the argument is that interactive back-and-forth can consume even more context and lock you deeper into a bad trajectory. For some feature work, repeated clean starts may be cheaper than nursing one deteriorating session.

If you are spending lots of time rescuing drifting sessions, compare that cost against parallel first-pass runs from a few models. A blunt output-based workflow may outperform elaborate context management for some classes of task.

Attribution:

mg #1 #2 #3
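
The black-box workflow can be sketched as a parallel fan-out with an output-only scoring step. This is an illustration, not the commenter's actual tooling: `best_first_pass`, the stub models, and the scoring function are hypothetical, and real model calls would replace the lambdas.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, Optional, Tuple


def best_first_pass(
    task: str,
    models: Dict[str, Callable[[str], str]],
    score: Callable[[str], float],
    min_score: float,
) -> Optional[Tuple[str, str]]:
    """Send one task to every model in a clean context and keep the best output.

    Returns (model_name, output), or None when nothing clears min_score,
    which is the signal to reset hard rather than iterate in-session.
    Assumes `models` is non-empty.
    """
    with ThreadPoolExecutor(max_workers=len(models)) as pool:
        futures = {name: pool.submit(fn, task) for name, fn in models.items()}
        outputs = {name: future.result() for name, future in futures.items()}
    name, output = max(outputs.items(), key=lambda item: score(item[1]))
    return (name, output) if score(output) >= min_score else None


# Stub models and scorer for the demo (assumptions, not real APIs).
stub_models = {
    "model_a": lambda task: "TODO",
    "model_b": lambda task: f"done: {task}",
}


def stub_score(output: str) -> float:
    return 1.0 if output.startswith("done") else 0.0


winner = best_first_pass("add retry logic", stub_models, stub_score, min_score=0.5)
```

In practice the scorer would be something objective, such as whether the project's test suite passes on the returned change, and logging which model wins over time covers the "track which systems actually deliver" part.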

In plain English

agent

An AI tool that can take multi-step actions such as reading files, editing code, running commands, or using other tools.

compaction

A technique where an agent summarizes earlier conversation history to fit more work into a model's limited context window.

evals

Evaluations, usually repeatable tests used to measure how well a model performs on specific tasks.

harness

The surrounding tooling and workflow that controls how a model is called, what tools it can use, and how results are checked.

LLM

Large language model, a machine learning system trained to generate and understand text.

token

A chunk of text that a language model processes as a single unit. Token counts measure input and output size and are the usual basis for billing.

Reference links

Benchmarks and papers

Evaluating the Sensitivity of LLMs to Prior Context
Shared as a paper directly relevant to whether earlier context degrades later performance.
On Layer Normalization in the Transformer Architecture
Quoted for its candid line about not fully explaining why certain architectures work.

Agent workflow tools and docs

OpenCode agents documentation
Referenced as an example of a customizable agent harness with per-agent tool controls.
Claude Code sub-agents documentation
Mentioned as documentation for delegating work to subagents to keep the main context cleaner.
Visual Studio Code custom agents documentation
Cited as another environment that supports agent customization and delegation.
Pi
Mentioned both as a tool being extended with a manual /last command and as a lightweight environment for understanding model behavior.
Transposing the agent loop
Shared as a write-up of a workflow based on many short agent loops generated from structured state.