Harness engineering: Leveraging Codex in an agent-first world

The post argues that coding agents become much more useful once you stop treating them like autocomplete and start building a “harness” around them. In practice that means strict architectural constraints, lots of automated tests, CI gates, local and browser-based verification, observability access, in-repo docs and work logs, and review loops where agents critique and fix each other’s output. OpenAI says a team of three used this approach to build an internal beta that later influenced the Codex app, ending up with about a million lines of mostly AI-written code and unusually high pull request throughput.

Steal the operational ideas, not the marketing metric. If you want agents to work on real codebases, invest in deterministic checks, narrow architecture boundaries, and self-verification first, then measure quality, maintenance burden, and spend rather than celebrating raw code volume.

June 7, 2026
openai.com

Discussion mood

Interested but skeptical. People liked the concrete process tips and many said the setup mirrors what already works for them, but the dominant reaction was distrust of LOC-heavy marketing, concern about slop and maintenance, and frustration that OpenAI withheld the product and repo while presenting a quality-neutral throughput story as evidence.

Key insights

In-repo work logs become agent memory

Keeping logs, plans, and prior decisions inside the repo gives future agent sessions something better than a fresh start. The useful part is not “documentation” in the old sense. It is searchable operational memory about what was tried, why it failed, and what remains open. That changes the workflow from repeated rediscovery to cumulative progress, though stale error logs can poison later sessions if you do not keep them curated.

If you use agents across days or weeks, add a lightweight work-log convention to the repo and make the agent update it after meaningful changes. Treat those files as active memory that needs pruning, not as archival docs.

Attribution:

shepherdjerred #1 #2 #3
DenisM #1
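
As a rough illustration of the work-log convention described in this insight, the sketch below models log entries and a pruning rule. The `LogEntry` fields, the status names, and the 30-day cutoff are all assumptions made for the example, not anything the article or the commenters specified.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass
class LogEntry:
    """One line of in-repo agent memory (hypothetical schema)."""
    day: date
    summary: str
    status: str  # assumed values: "open", "done", or "failed"


def format_entry(entry):
    """Render an entry as a single searchable line for a work-log file."""
    return f"- {entry.day.isoformat()} [{entry.status}] {entry.summary}"


def prune(entries, today, max_age_days=30):
    """Keep open items regardless of age; drop closed or failed ones past the cutoff."""
    cutoff = today - timedelta(days=max_age_days)
    return [e for e in entries if e.status == "open" or e.day >= cutoff]
```

The pruning rule is the part that matters: open questions stay visible however old they are, while stale failure notes age out instead of steering later sessions.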

Small files and compact docs help agents think

Agents do worse when they have to slurp huge files or bloated markdown into context just to solve a local problem. People reported better results when they kept files small, condensed stale docs, and structured project knowledge as linked pages or indexes so the model can fetch only what it needs. This is not just about human readability. It is about avoiding a context-length death spiral where irrelevant tokens degrade future generations.

Refactor for agent navigation, not just human taste. Split oversized files, collapse stale docs, and provide an index or static site so tools can select context instead of brute-forcing it.

Attribution:

stult #1
everforward #1
satvikpendem #1
vibcdingenjoyer #1
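
One way to keep files small is to make size a mechanical check rather than a matter of taste. The sketch below lists files that exceed a line budget; the 400-line default and the suffix list are arbitrary assumptions, and the function name is hypothetical.

```python
from pathlib import Path


def oversized_files(root, max_lines=400, suffixes=(".py", ".md")):
    """Return (relative_path, line_count) for each file over the line budget."""
    offenders = []
    for path in sorted(Path(root).rglob("*")):
        if path.is_file() and path.suffix in suffixes:
            # Count lines lazily so large files are not loaded into memory.
            with path.open(encoding="utf-8", errors="replace") as handle:
                count = sum(1 for _ in handle)
            if count > max_lines:
                offenders.append((path.relative_to(root).as_posix(), count))
    return offenders
```

Run in CI, a non-empty result could fail the build, which turns "split oversized files" from a suggestion into a constraint the agent has to satisfy.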

The fancy harness is mostly enforceable architecture

The most practical reading of the article is not mystical agent orchestration. It is plain layered architecture with hard mechanical boundaries. Separate domains, force one-way dependencies, keep cross-cutting concerns behind explicit provider interfaces, and reject illegal imports in CI. Several readers pointed out that this is standard discipline in languages and shops that already care about dependency control. Agents just make those constraints mandatory earlier because they generate spaghetti faster than humans do.

Before adding more agent autonomy, make your dependency graph machine-checkable. Illegal imports, layer violations, giant files, and duplicate code should fail automatically, not wait for code review.

Attribution:

nimonian #1
shepherdjerred #1
iso1337 #1
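
A machine-checkable dependency rule can be quite small. The sketch below uses Python's standard `ast` module to flag imports that point from a lower layer to a higher one. The layer names in `LAYERS` and the rule that a module may only import from its own layer or below are assumptions for illustration; the article does not publish its actual layering.

```python
import ast

# Hypothetical layer order, lowest first. A module may import from its own
# layer or any layer below it, never from a layer above.
LAYERS = ["domain", "services", "ui"]


def layer_of(module_name):
    """Return the layer index for a dotted module name, or None if unlayered."""
    top = module_name.split(".")[0]
    return LAYERS.index(top) if top in LAYERS else None


def illegal_imports(module_name, source):
    """List imported modules that sit in a higher layer than module_name."""
    own = layer_of(module_name)
    if own is None:
        return []
    violations = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            targets = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            targets = [node.module]
        else:
            continue  # not an absolute import statement
        for target in targets:
            target_layer = layer_of(target)
            if target_layer is not None and target_layer > own:
                violations.append(target)
    return violations
```

For example, `illegal_imports("domain.user", "import ui.widgets")` reports `ui.widgets`, while a `ui` module importing from `domain` passes. Relative imports are ignored here for brevity, so a real check would need to resolve them too.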

Verification beats autonomy claims

The teams getting decent results all converged on the same idea: let the agent do more only after you give it ways to prove itself wrong. High-fidelity local environments, smoke tests, end-to-end tests, browser automation, static analysis, observability, and coverage checks matter more than clever prompts. People said that agents left unattended hardcode values, bloat files, and generate slop. Tight verification loops are what make the whole setup tolerable.

Put budget into testability and local reproducibility before you chase full autonomy. If the agent cannot reliably validate its own work, you are scaling supervision problems, not engineering output.

Attribution:

shepherdjerred #1 #2
c0rruptbytes #1
mohsen1 #1
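
A verification loop of this kind can start as a plain gate runner that the agent must pass before its work is accepted. The sketch below runs a list of commands in order and stops at the first failure. The function name and the `(name, argv)` shape are assumptions, and the actual commands (tests, linters, browser checks) are left to the caller.

```python
import subprocess


def run_checks(checks):
    """Run (name, argv) checks in order; return (all_passed, report).

    The report lists (name, passed) for every check that ran, so the
    first failing gate is always the last item.
    """
    report = []
    for name, argv in checks:
        result = subprocess.run(argv, capture_output=True, text=True)
        passed = result.returncode == 0
        report.append((name, passed))
        if not passed:
            return False, report  # fail fast so feedback stays focused
    return True, report
```

Failing fast keeps the feedback the agent receives short and specific, at the cost of hiding later failures until the first one is fixed.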

Engineers shift from coding to harness tuning

The most revealing detail from the OpenAI responses was that the engineers’ main job became adjusting skills, prompts, doc files, and constraints whenever the model produced bad code. More than half of runs were thrown away. That reframes the labor. The gain is not “three people replaced a large team.” It is that the scarce human skill moves up a level into steering, guardrail design, and triage while code generation becomes cheap and disposable.

Plan for engineering time to move into harness maintenance and failure analysis. If you adopt agents seriously, success depends on who can design constraints and recognize when to discard output, not on who can accept the most generated code.

Attribution:

zbrock #1 #2
therealdrag0 #1

The hidden project was a narrow internal prototype

Once pressed, OpenAI disclosed that the unnamed system was mainly an Electron app with a small backend, built first as an internal prototype and later feeding into the Codex app. That makes the article easier to interpret. The claim is about a constrained product category and an internal environment with privileged access to tools, models, and iteration speed. It is not evidence that the same process is already proven for databases, safety-critical services, or large public production systems.

Use this as evidence that harnessed agents can accelerate internal product prototyping. Do not treat it as proof that the same setup generalizes to high-correctness or infrastructure-heavy systems without a lot more evidence.

Attribution:

zbrock #1 #2 #3

Against the grain

Software may not be getting worse because of AI

Blaming recent product annoyances on generated code is too convenient. Plenty of bad software predates coding agents, and much of engineering has always been cleaning up mistakes from capable humans. The stronger point is that anecdotes about glitchier apps do not tell you whether AI made quality worse. They mostly show that software quality is hard to judge casually.

Do not attribute every regression or ugly UX decision to agent coding without evidence. Track defect rates, incident data, and maintenance outcomes inside your own team before drawing conclusions.

Attribution:

Art9681 #1

LOC is crude but codebase scale still matters

Several people pushed back on the idea that mentioning code size is automatically meaningless. For this use case, the intended claim is not “more lines means better software.” It is “agents did not collapse when operating inside a large-ish evolving codebase.” That does answer one longstanding objection, even if LOC remains a lousy stand-in for quality or productivity.

Do not overread the metric, but do separate scale capability from quality capability. A system that can survive a big repo is useful progress even if it tells you little about elegance or long-term maintenance.

Attribution:

jstummbillig #1
B-Con #1
zbrock #1

Expertise still matters when the typing disappears

The bleakest reading in the comments was that agent-first development makes senior engineering skill obsolete. A credible rebuttal was that domain knowledge, architectural vocabulary, and the ability to specify good requirements become even more valuable when models can execute quickly. The architect analogy landed here. Better tools compress implementation time, but they do not erase the advantage of knowing what to ask for and how to judge it.

Invest in engineers who can specify systems clearly and evaluate tradeoffs, not just produce syntax. Those skills remain valuable even if code generation itself keeps getting cheaper.

Attribution:

linsomniac #1 #2 #3

In plain English

CI

Continuous Integration, the automated process that runs builds and tests when code changes are submitted.

dependency graph

The map of which modules or packages depend on which others inside a codebase.

Electron

A framework for building desktop apps with web technologies like JavaScript, HTML, and CSS, bundled with a Chromium browser runtime.

end-to-end tests

Tests that exercise a system from the user-facing interface through the full stack to verify complete workflows.

LOC

Lines of code, a simple count of how much source code exists or was added, often used as a rough and controversial productivity metric.

observability

Tools and data such as logs, metrics, and traces that help engineers understand what a running system is doing.

provider boundaries

Explicit interfaces through which shared services like auth or telemetry are accessed, instead of letting code depend on them directly everywhere.

static analysis

Automated checking of code without running it, used to catch errors, style violations, or architecture problems.

Reference links

OpenAI and related talks

Latent Space interview on harness engineering
Interview mentioned as a more detailed explanation of the approach from one of the people behind the post
Ryan’s London talk on harness engineering
Video talk version of the harness engineering ideas

Example repos and workflows

shepherdjerred monorepo docs and logs
Example of keeping agent work logs and project memory in-repo
tsz project
A public project offered as a comparable experiment in large-project AI workflows
smallos
Homebrew operating system repo cited as an example of updating docs and using agents with validation
rebuild-and-ruin
Side project shared as an example of a single-agent workflow backed by many validation rules
docs-cli
Tool for managing specs and reproducibility-oriented agent workflows
Agent Playbook Suite blog
Companion framework and writeup for specs-as-source and agent workflow ideas

Harness and skill references

ShipSmooth lifecycle skill
Concrete example of a lifecycle-oriented agent skill file inspired by the article
ShipSmooth demo
Demo linked alongside the skill-based workflow example
ShipSmooth refine rules
Example rules added to improve generated code quality
Matt Pocock talk on AI coding workflow
Recommended talk showing another practical approach to agent-assisted development

Architecture and engineering references

Hexagonal architecture
Background on the layered architecture style referenced in explanations of the article
Dependency injection
Background on the provider terminology used in the post
ghuntley Ralph Wiggum Loop
Linked to explain the self-review and iteration loop described in the article

Papers and historical references

Language Models are Unsupervised Multitask Learners
Cited in a side discussion about GPT-2 showing transfer abilities beyond explicit training examples
STEPS 2007 Progress Report
Reference for the claim that core personal computing can fit in a far smaller codebase
STEPS final report
Final report for the STEPS project mentioned in the code-size comparison

Other references

Operant conditioning explainer
Linked in a comment comparing LLM usage to variable-ratio reward systems