When I reject AI code even if it works

AI
Programming
Developer Tools
Software Engineering

The post lays out a simple rule set for AI-generated code: passing tests is only the floor. Code still gets rejected if the author cannot explain it in plain language, if the change is larger than the problem, if it introduces abstractions before they are needed, or if it leaves the system harder to maintain. That framing landed with most readers because it maps cleanly onto normal engineering judgment. Plenty of people said you should hold coworker code to the same bar. The difference is that AI makes it cheap to produce a lot of plausible code very quickly, which turns ordinary bad habits into a scaling problem.

The strongest comments came from people using coding agents in domains where hidden mistakes are expensive. In machine learning, several described subtle data leakage and evaluation errors that looked fine until an expert inspected the logic. In payments and other high-integrity systems, people reported code that passed many tests while quietly violating invariants or layering on absurd abstractions. The recurring theme was not that AI always fails. It was that it fails in ways that look polished, and junior or rushed engineers often cannot tell when the polish is fake.

That pushed the conversation toward process, not model quality. Readers kept circling back to review culture, accountability, and scope control. Good use cases were narrow and boring: boilerplate, ports between frameworks or languages, syntax help, API discovery, and isolated low-risk tools. Bad use cases were agent-written feature work in unfamiliar codebases, large refactors, and any code path where the operator could not defend the design under pressure. Several people said the real danger is teams that already reward ticket closure, giant diffs, and shallow review. In those environments, AI does not create the dysfunction. It turns it into comprehension debt and tech debt at machine speed.

A smaller but notable group argued that this is being overstated. Some are successfully using multiple models to critique plans and implementations, plus custom linting and hook-based checks to catch recurring AI mistakes before a human review. Others said the comparison to libraries and large inherited codebases is unavoidable. We already ship software built on components we do not fully understand line by line.

The unresolved question was whether teams can build the right abstraction layer for AI-produced code, where behavior is verified by tests, contracts, and architecture boundaries instead of direct human inspection of every line. Most people were not ready to trust that yet, especially for production systems that matter.

Treat AI code like accelerated junior output, not trusted automation. If your team cannot enforce understanding, scoped changes, and strong review, AI will amplify your existing process failures faster than it creates value.

June 21, 2026
vinibrasil.com

Key insights

ML leakage errors look legitimate

Model-generated machine learning code can hide evaluation bugs that are hard to spot even for experienced practitioners. The concrete example was data leakage in calibration and holdout logic, where the code and metrics looked plausible until someone with domain knowledge traced the data flow and found that future information had leaked into evaluation. That changes the risk profile of AI coding in ML because passing tests and decent scores do not tell you the experiment is valid.

Do not let agents design or validate ML evaluation pipelines without expert review. Add explicit checks for leakage, row splits, and label contamination before you trust any reported metric.
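As a rough illustration of what "explicit checks" could look like, here is a minimal sketch in plain Python. It assumes each row is a dict with a unique `id`, a comparable timestamp `ts`, and a `label` field; those field names and both function names are hypothetical and do not come from the discussion. It covers only the crudest leakage patterns and does not replace an expert tracing the data flow.

```python
def check_split(train_rows, test_rows, key="id", time_key="ts"):
    """Return a list of human-readable problems found in a train/test split."""
    problems = []

    # Row-level leakage: the same record must not appear on both sides.
    train_ids = {row[key] for row in train_rows}
    test_ids = {row[key] for row in test_rows}
    overlap = train_ids & test_ids
    if overlap:
        problems.append(f"{len(overlap)} row id(s) appear in both train and test")

    # Temporal leakage: every training row should precede every test row.
    if train_rows and test_rows:
        latest_train = max(row[time_key] for row in train_rows)
        earliest_test = min(row[time_key] for row in test_rows)
        if latest_train >= earliest_test:
            problems.append("training data is not strictly earlier than test data")

    return problems


def check_label_contamination(rows, feature_keys, label_key="label"):
    """Return the features that equal the label on every row (a crude red flag)."""
    suspicious = []
    for name in feature_keys:
        if rows and all(row[name] == row[label_key] for row in rows):
            suspicious.append(name)
    return suspicious
```

Checks like these are cheap to run before any metric is reported; a non-empty result is a reason to stop and inspect the pipeline, while an empty result proves nothing about subtler leakage.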

Attribution:

abhgh #1
nostrebored #1

Agents default to needless abstraction

The common failure mode was not just wrong code. It was code that solved simple problems with elaborate scaffolding, duplicate helpers, and architecture that fights the existing system. Examples ranged from payment flows with subtle accounting errors, to frontend loops that replaced an obvious database aggregation, to layout changes that ignored established patterns. The code often looked polished at a glance, which is exactly why it is dangerous.

Constrain agents with examples from your existing codebase and ask for the smallest change that fits local conventions. Reject any first pass that expands the abstraction surface faster than the feature demands.

Attribution:

figassis #1
itopaloglu83 #1
danfritz #1

AI removes the senior engineer's refusal instinct

Good senior engineers do not start invasive work by guessing. They ask questions, map the system, write tests around unknown behavior, and sometimes refuse work until they understand it. Coding agents never do that on their own. They charge ahead with full confidence, which means the user has to supply the caution and the stopping power that an experienced human would naturally bring.

Build explicit stop conditions into your workflow for unknown code paths, risky changes, and missing context. If the task would require a human to slow down and investigate first, your agent workflow should do the same.
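One way to make stop conditions explicit is a small gate that runs before an agent is allowed to start. The sketch below is an assumption about how such a gate might look, not a description of any tool mentioned in the thread; the `Task` fields, the `RISKY_PREFIXES` list, and the `stop_reasons` function are all hypothetical.

```python
from dataclasses import dataclass, field

# Hypothetical list of paths where a human should investigate first.
RISKY_PREFIXES = ("payments/", "migrations/", "auth/")


@dataclass
class Task:
    description: str
    touched_paths: list
    has_tests_for_touched_code: bool
    open_questions: list = field(default_factory=list)


def stop_reasons(task):
    """Return the reasons this task should pause for a human; empty means proceed."""
    reasons = []

    risky = [p for p in task.touched_paths if p.startswith(RISKY_PREFIXES)]
    if risky:
        reasons.append("risky code paths touched: " + ", ".join(risky))

    if not task.has_tests_for_touched_code:
        reasons.append("no tests pin down the current behavior of the touched code")

    if task.open_questions:
        reasons.append(f"{len(task.open_questions)} open question(s) are unanswered")

    return reasons
```

The point of the design is that the caution lives in the workflow rather than in the model: the agent only runs when `stop_reasons` returns an empty list, and every non-empty result names what a person has to resolve first.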

Attribution:

kerkeslager #1
mkozlows #1
Agentlien #1

AI accelerates existing org dysfunction

The problem is bigger than model quality. Organizations already reward giant commits, hero behavior, weak review, and shipping at any cost. AI lets those same incentives produce more code, more hidden coupling, and more cleanup work for the few people who still understand the system. The likely outcome is not dramatic 'software bankruptcy' announcements but slower delivery, senior attrition, and expensive rewrite or modernization efforts that arrive too late.

Audit your incentives before you roll out coding agents broadly. If you reward output volume over code ownership and review quality, AI adoption will show up later as retention, reliability, and velocity problems.

Attribution:

busterarm #1
onion2k #1
danaris #1

The productive middle ground is narrow

The strongest practical pattern was using AI as a power tool rather than an autonomous coder. People reported real gains when they kept architectural control and used models for boilerplate, examples, docs spelunking, ports, tests, and other fussy but readable work. That middle ground exists, but only when changes are scoped tightly enough that a human can still absorb and defend them.

Start with use cases where review is cheap and failure is reversible. Measure value by reduced drudgery, not by total lines generated or number of tickets closed.

Attribution:

coffeefirst #1
Snacklive #1
resonious #1
unknownfuture #1
teaearlgraycold #1

The real open question is verification

A few comments pushed past the usual 'just read the code' answer. They pointed out that software already relies on abstractions, third-party libraries, and code no single person fully understands. The harder problem is whether AI-generated modules can be trusted through contracts, tests, isolation, and architecture boundaries rather than line-by-line comprehension. That is a more interesting question than whether every generated diff feels elegant.

Invest in stronger interface contracts, property tests, and sandboxed component boundaries if you want AI use to scale responsibly. Without those verification layers, you are forced back to expensive manual code reading.
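To make the idea of verifying behavior through a contract concrete, here is a minimal sketch that uses only the Python standard library. The example function `split_cents` and its contract are hypothetical, chosen to echo the payment invariants mentioned in the discussion; a real project would more likely use a dedicated property-testing library.

```python
import random


def split_cents(total, parts):
    """Split an amount in cents into `parts` shares that differ by at most 1."""
    if total < 0 or parts <= 0:
        raise ValueError("total must be >= 0 and parts must be > 0")
    base, remainder = divmod(total, parts)
    # The first `remainder` shares absorb one extra cent each.
    return [base + 1 if i < remainder else base for i in range(parts)]


def check_split_contract(total, parts, shares):
    """The contract any implementation must satisfy, whoever wrote it."""
    assert len(shares) == parts, "wrong number of shares"
    assert sum(shares) == total, "money was created or destroyed"
    assert all(s >= 0 for s in shares), "negative share"
    assert max(shares) - min(shares) <= 1, "shares are not evenly spread"


def run_property_test(implementation, trials=1000, seed=0):
    """Check the contract against many random inputs; return the trial count."""
    rng = random.Random(seed)
    for _ in range(trials):
        total = rng.randint(0, 10_000_000)
        parts = rng.randint(1, 50)
        check_split_contract(total, parts, implementation(total, parts))
    return trials
```

The contract is written once and applies to any implementation passed to `run_property_test`, which is the property the thread was reaching for: the reviewer defends the invariants, and the generated code either satisfies them or fails loudly.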

Attribution:

edanm #1
CraigJPerry #1

Against the grain

Custom lint harnesses can tame agents

One experienced user argued that rejecting rough first drafts misses the point. In this view, the right move is to encode recurring AI mistakes into custom linters, pre-commit hooks, and agent feedback loops so the model fixes dumb patterns before a human ever looks at the change. The claim is not blind trust. It is that teams can move quality checks earlier and make agent output conform to local rules at scale.

If your codebase has clear mechanical standards, try turning them into executable checks instead of relying on repeated human correction. This works best for structural mistakes, duplication, and known house-style failures.

Attribution:

cadamsdotcom #1 #2 #3

Throughput pressure is already changing behavior

One senior developer said most of their daily output is now AI-written and admitted they can no longer review everything in depth. The justification was speed pressure from management and market expectations, plus the belief that experienced developers can still tell when deep scrutiny is required. It is an uncomfortable data point because it sounds irresponsible to many readers, yet it is probably close to how adoption is actually happening inside companies.

Assume some teams are already trading review depth for delivery speed. If you lead engineering, decide that policy explicitly now rather than letting it emerge from deadline pressure and tool availability.

Attribution:

SunboX #1 #2 #3

Multi-model review can work on greenfield projects

A few builders described using several models to critique one another's plans and implementations while keeping architecture docs current and inserting human direction at key points. They reported greenfield projects growing to tens of thousands of lines over six months with acceptable results, even when the human did not understand every detail before merge. That does not settle the long-term maintainability question, but it shows some teams are operationalizing agent-heavy workflows rather than merely experimenting.

If you want to test agent-heavy development, do it first on greenfield internal projects with clear rollback options. Track maintenance burden over time, not just initial delivery speed.

Attribution:

BobbyTables2 #1
wwind123 #1

In plain English

API

Application programming interface, the defined way one piece of software interacts with another.

data leakage

A modeling mistake where information from the evaluation or future data accidentally reaches the training process, making results look better than they really are.

ML

Machine learning, a field of software that learns patterns from data to make predictions or decisions.

Reference links

Benchmarks and evaluation

Bullshit Benchmark viewer
Shared as a benchmark for whether models push back on bad prompts instead of agreeing with them.

Background concepts

Wikipedia on the Gell-Mann amnesia effect
Used to frame why developers trust AI in unfamiliar domains after seeing obvious mistakes in areas they know well.

Tools and projects

opair on GitLab
A proposed agent harness meant to keep the human engaged like a pair-programming partner instead of letting the model disappear into autonomous mode.
CodeLeash pre-commit config
Shared as an example of custom hooks and lint rules used to constrain agent output.
CodeLeash code quality checker
Example script for blocking AI-generated code patterns and feeding actionable errors back to the agent.
CodeLeash tests for code quality checker
Shared to show the linting approach itself was test-driven and integrated into the workflow.

Outage and risk case studies

Our first outage from LLM-written code
Cited as a concrete example of production failure caused by code written with heavy LLM assistance.
Reuters on Amazon cloud outages involving AI tools
Referenced as a reminder that AI-assisted tooling can introduce operational risk even at major infrastructure companies.
