The post argues that coding agents become much more useful once you stop treating them like autocomplete and start building a “harness” around them. In practice that means strict architectural constraints, lots of automated tests, CI gates, local and browser-based verification, observability access, in-repo docs and work logs, and review loops where agents critique and fix each other’s output. OpenAI says a team of three used this approach to build an internal beta that later influenced the Codex app, ending up with about a million lines of mostly AI-written code and unusually high pull request throughput.
What people bought was the workflow, not the flex. The durable takeaway was that “harness engineering” is mostly a new label for disciplined software engineering adapted to unreliable agents. Give the model a clear dependency graph, high-fidelity local environments, static analysis, tests it can run itself, and compact project memory it can update instead of endlessly rediscovering context. Several people said this matches what actually works in their own repos. The article’s concrete ideas around enforced layers, provider boundaries, work logs, browser tests, and feeding traces and metrics back into the agent landed as the real value.
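To make the "enforced layers" idea concrete, here is a minimal sketch of a dependency check that an agent or a CI gate could run. It is not the tooling described in the post. The layer names (`models`, `providers`, `services`, `ui`), the rule that a module may import only from its own layer or a lower one, and the function names are all assumptions made for illustration.

```python
import ast
from typing import List, Optional

# Hypothetical layer order, lowest first. Assumed rule: a module may import
# from its own layer or any layer below it, never from a layer above.
LAYERS = ["models", "providers", "services", "ui"]


def layer_of(module: str) -> Optional[int]:
    """Return the layer index for a dotted module path, or None if unlayered."""
    top = module.split(".")[0]
    return LAYERS.index(top) if top in LAYERS else None


def find_violations(module_name: str, source: str) -> List[str]:
    """List imports in `source` that reach into a higher layer than `module_name`."""
    own = layer_of(module_name)
    if own is None:
        return []  # modules outside the layered tree are not checked
    violations = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            targets = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            targets = [node.module]  # absolute "from x import y" only
        else:
            continue
        for target in targets:
            dep = layer_of(target)
            if dep is not None and dep > own:
                violations.append(
                    f"{module_name} (layer '{LAYERS[own]}') imports "
                    f"{target} (layer '{LAYERS[dep]}')"
                )
    return violations
```

A CI step could call `find_violations` on every file and fail the build when the list is non-empty. The messages are plain text, so they can be handed straight back to the agent as the thing to fix, which is the kind of self-serve feedback loop the discussion credited.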
What people rejected was the way OpenAI chose to prove the point. A million lines of code was read less as evidence of capability than as a warning about duplication, verbosity, and future maintenance cost. Many treated the piece as a scale demo, not a quality demo. That distinction mattered more once the author clarified the hidden project was primarily an Electron app with small hosted backend services, and that the team spent much of its time tuning prompts, skills, and docs, while discarding more than half of agent runs that missed the bar. That made the strongest version of the claim narrower and more believable: with enough scaffolding, agents can keep making progress inside a large codebase. It did not convince many people that agent-first development is already a good default for long-lived production systems, or that raw throughput says much about product quality, economics, or maintainability.