A new era for software testing

AI
Software Testing
Developer Tools
Programming

The post argues that software testing is entering a new phase where AI agents can act like QA engineers, explore products through scenarios, and help offset the flood of quickly generated code. The appealing part is not that AI somehow invented testing. It is that a model can cheaply write and execute lots of checks against user-visible behavior, which sounds especially useful when code is produced faster than humans can review every path.

If you adopt AI for testing, treat it as a force multiplier for your existing test strategy, not a replacement for it. Put most of the value at stable boundaries like user flows, APIs, and contracts, then use mutation testing or similar checks to verify that the tests actually fail when the product breaks.

June 11, 2026
antirez.com

Discussion mood

Cautiously skeptical. People liked AI as a way to reduce the drudgery of writing tests and to expand scenario coverage, but the dominant reaction was that this mostly repackages established testing ideas and can easily create expensive, brittle, low-signal suites that look rigorous without catching real bugs.

Key insights

LLMs need a human-chosen test boundary

The practical win comes from constraining the model before it starts generating cases. A parameterized test with a hand-written `cases` array gives the model a clear contract to extend, which avoids the common failure mode where it either mirrors implementation details or mocks away the whole point of the test. That framing also explains why some developers now ask for fewer tests, not more, and bias the suite toward integration points and edges where behavior is stable and meaningful.

Have engineers define the shape of the test and the boundary under test first. Then use the model to enumerate cases or fill gaps instead of asking it to invent a testing strategy from scratch.
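A minimal sketch of that pattern, using Python's standard `unittest` module. The function `normalize_username` and every row in `CASES` are hypothetical and exist only for illustration; the point is that an engineer fixes the boundary and the row shape, and a model is only asked to append rows.

```python
import unittest


def normalize_username(raw):
    """Hypothetical function under test: trims whitespace and lowercases."""
    return raw.strip().lower()


# Written by an engineer. This list is the contract a model may extend.
# Each row is (description, input, expected output).
CASES = [
    ("already normalized", "alice", "alice"),
    ("mixed case", "Alice", "alice"),
    ("surrounding whitespace", "  bob  ", "bob"),
    # Rows a model might be asked to add go below this line.
    ("empty input", "", ""),
]


class NormalizeUsernameTest(unittest.TestCase):
    def test_cases(self):
        for description, raw, expected in CASES:
            # subTest reports each failing row separately.
            with self.subTest(description):
                self.assertEqual(normalize_username(raw), expected)


def run_cases():
    """Run the suite programmatically and report whether every row passed."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(NormalizeUsernameTest)
    result = unittest.TextTestRunner(verbosity=0).run(suite)
    return result.wasSuccessful()
```

Because the test body never changes, reviewing a model's contribution comes down to reading the new rows, which is much cheaper than reviewing freshly generated test logic.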

Attribution:

mplanchard #1
dkn #1 #2
dcastm #1

Coverage is not evidence the tests work

A side-project example made the core failure mode concrete: a test still passed when the production method returned an empty string. That is exactly the kind of tautological or weak assertion AI can generate at scale while still delivering impressive coverage numbers. Mutation testing was the one technique repeatedly treated as a real backstop, because it checks whether the suite actually fails when behavior is deliberately broken.

Add mutation testing or an equivalent fault-injection check before trusting AI-written suites. If you cannot show the tests fail under realistic breakage, ignore the coverage dashboard.
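The failure mode and the backstop can be sketched in a few lines. Everything here is hypothetical: `render_greeting` stands in for the production method, and the mutant is written by hand, whereas a real mutation testing tool would generate many such variants automatically.

```python
def render_greeting(name):
    """Hypothetical production function."""
    return f"Hello, {name}!"


def mutant_empty_string(name):
    """Hand-made mutant: behavior deliberately broken to return an empty string."""
    return ""


def weak_test(fn):
    # Executes the code, so it counts toward coverage, but only checks the
    # return type. An empty string still satisfies it.
    return isinstance(fn("Ada"), str)


def strong_test(fn):
    # Pins down the user-visible behavior.
    return fn("Ada") == "Hello, Ada!"


def kills_mutant(test, original, mutant):
    """A test kills a mutant if it passes on the original and fails on the mutant."""
    return test(original) and not test(mutant)
```

Under these assumptions, `kills_mutant(weak_test, render_greeting, mutant_empty_string)` is `False` while the same call with `strong_test` is `True`, even though both tests give identical line coverage of `render_greeting`.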

Attribution:

marshalhq #1
pfdietz #1
rglover #1

Agentic QA needs role separation

Treating one giant prompt file as a universal QA engineer does not scale. The stronger pattern is to break testing work into specialized roles such as architect, executor, and resolver, then further split by layer like Playwright, API, contract, or unit tests. That mirrors how human teams build test systems and avoids the fantasy that one behavioral file can encode the whole quality function for a large codebase.

If you are operationalizing AI testing in production, model your prompts and workflows around specific test layers and responsibilities. Measure each layer separately so failures and maintenance costs stay attributable.
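One way to keep results attributable is to tag every run with its layer and role and aggregate per layer. This is only a sketch: the `RunRecord` fields and the `summarize_by_layer` helper are assumptions made for illustration, not something described in the discussion.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class RunRecord:
    layer: str    # e.g. "playwright", "api", "contract", "unit"
    role: str     # e.g. "architect", "executor", "resolver"
    passed: bool
    flaky: bool   # True if the result changed on retry


def summarize_by_layer(runs):
    """Aggregate totals, failures, and flaky runs separately for each layer."""
    summary = defaultdict(lambda: {"total": 0, "failed": 0, "flaky": 0})
    for run in runs:
        stats = summary[run.layer]
        stats["total"] += 1
        if not run.passed:
            stats["failed"] += 1
        if run.flaky:
            stats["flaky"] += 1
    return dict(summary)
```

With per-layer numbers, a spike in flaky browser tests does not get averaged away by a large, stable unit suite.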

Attribution:

avensec #1

AI is reviving specs and BDD discipline

The interesting shift is not a new testing category. It is that AI rewards teams for writing the artifacts many teams had stopped valuing, like clear docs, user flows, Gherkin features, and outside-in specifications. Several commenters pointed out that people are rediscovering compilers, acceptance tests, and behavior-driven development because machines need explicit instructions and benefit from formalized intent.

Invest in clearer product flows, executable specs, and acceptance criteria. Even if you never hand them to an AI agent, they improve testability and make automation far less brittle.
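As a rough illustration of an outside-in acceptance check, the sketch below writes one scenario in Given/When/Then form as comments inside a plain Python function. The `Cart` class and the scenario are hypothetical; a real behavior-driven setup would usually keep the scenario in a Gherkin feature file and bind each step to code.

```python
class Cart:
    """Hypothetical system under test, exercised only through public methods."""

    def __init__(self):
        self._items = {}

    def add(self, sku, quantity=1):
        self._items[sku] = self._items.get(sku, 0) + quantity

    def remove(self, sku):
        self._items.pop(sku, None)

    def item_count(self):
        return sum(self._items.values())


def scenario_removing_an_item_updates_the_count():
    # Given a cart containing a book and two pens
    cart = Cart()
    cart.add("book")
    cart.add("pen", quantity=2)
    # When the shopper removes the pens
    cart.remove("pen")
    # Then only the book is counted
    count = cart.item_count()
    assert count == 1
    return count
```

The scenario never touches `_items`, so the internal storage can change without the check needing to be rewritten.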

Attribution:

acdha #1
inigyou #1
righthand #1
rahoulb #1

Against the grain

Scenario tests can outlast internals

The case for AI-heavy scenario testing is that user behavior changes more slowly than implementation details. For applications with frequently changing requirements, integration and user-flow tests can stay useful while unit suites turn into maintenance ballast that mostly tracks refactors. That makes scenario-first testing a better fit for product software than the unit-test-heavy culture many teams inherited.

If your product changes faster than its core user journeys, spend more of your testing budget on flows and contracts than on deep unit coverage. Keep unit tests concentrated around stable algorithms and specs.

Attribution:

simianwords #1 #2
skydhash #1

Strong test scaffolding can unlock ambitious solo builds

One builder claimed LLMs let them attempt a compiler for a memory-safe language despite having no compiler background, and backed the claim with multi-layer testing, fuzzing, mutation testing, and high combined coverage. The point is not that the model can be trusted casually. It is that with heavy instrumentation and metrics, AI may widen the set of technically hard projects a single developer can push into working territory.

Do not dismiss AI-assisted development outright for complex systems. If you want to try it on ambitious work, pair it with unusually strong measurement and verification from day one.

Attribution:

onlyrealcuzzo #1

In plain English

AI

Artificial intelligence, software that performs tasks like generating text or analyzing information in ways associated with human reasoning.

API

Application Programming Interface, a defined interface that one piece of software uses to call another, such as the HTTP endpoints a service exposes.

end-to-end testing

Testing that exercises a full user workflow through a system from the outside, often through the user interface.

Gherkin

A structured plain-language format used in behavior-driven development to write executable scenarios such as Given, When, Then.

mutation testing

A method that deliberately introduces small changes into the code under test to check whether the test suite fails in response.

parameterized test

A test pattern where the same test logic runs against a list of different input and expected-output cases.

Playwright

A browser automation tool used to script and test web applications.

QA

Quality assurance, the work of testing software and checking that it behaves correctly.

token

A unit of text that AI models process and that many AI services use for billing.
