Launch HN: TesterArmy (YC P26) – Agents that test web and mobile apps

AI
Developer Tools
Startups
Testing

TesterArmy is pitching an agentic testing platform for web and mobile apps that runs end-to-end checks before deploy and in production. Instead of writing and maintaining scripts, teams describe flows in natural language and the service drives the app, handles things like auth and email one-time passcodes, then reports failures through Slack or Discord. The founders position it less as “AI writes your tests” and more as a cloud QA worker with a harness built to keep runs stable. They say this matters most for brittle UI automation and dynamic flows such as AI chat, where selectors, waits, and mocks break down fast.

If your team already has strong internal LLM-based test generation, this product is competing with a workflow you may not need to outsource. If your bottleneck is auth, OTP, mobile binaries, flaky selectors, and dynamic flows that scripted tests keep missing, agentic QA is worth a trial, but you should demand evidence on stability, run time, and cost before rolling it into CI.

June 18, 2026
tester.army

Key insights

Missing benchmarks are the biggest gap

For a testing product, the absence of benchmarks is not a cosmetic issue. It leaves buyers unable to judge pass rate stability, token efficiency, or whether the agent actually beats Playwright MCP or other LLM-driven setups on cost and speed. The founders explained that they cache trajectories and optimize context, but they still asked customers to trust internal tuning instead of publishing hard numbers.

Do not evaluate this category on demos alone. Ask for measured rerun consistency, average runtime by test length, and real cost per successful assertion before you put it on a release path.
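As a rough sketch of how those three numbers could be computed from a pilot's run log: the `Run` record, its field names, and the sample values below are hypothetical, not an export format TesterArmy is known to provide.

```python
from collections import defaultdict
from dataclasses import dataclass
from statistics import mean


@dataclass
class Run:
    test_id: str        # hypothetical identifier for a test flow
    steps: int          # number of steps in the flow (proxy for test length)
    passed: bool        # outcome of this run
    seconds: float      # wall-clock runtime
    cost_usd: float     # billed cost for this run
    assertions_ok: int  # assertions that passed in this run


def rerun_consistency(runs):
    """Share of tests whose repeated runs on an unchanged build all agree."""
    outcomes = defaultdict(set)
    for r in runs:
        outcomes[r.test_id].add(r.passed)
    return sum(1 for o in outcomes.values() if len(o) == 1) / len(outcomes)


def runtime_by_length(runs, bucket_size=5):
    """Average runtime in seconds, grouped into buckets of flow length."""
    buckets = defaultdict(list)
    for r in runs:
        buckets[(r.steps // bucket_size) * bucket_size].append(r.seconds)
    return {start: mean(secs) for start, secs in sorted(buckets.items())}


def cost_per_successful_assertion(runs):
    """Total spend divided by the number of assertions that passed."""
    ok = sum(r.assertions_ok for r in runs)
    return sum(r.cost_usd for r in runs) / ok if ok else float("inf")


# Illustrative data only: two flows, each run twice against the same build.
sample_runs = [
    Run("login", 4, True, 30.0, 0.10, 3),
    Run("login", 4, True, 34.0, 0.10, 3),
    Run("checkout", 12, True, 90.0, 0.30, 5),
    Run("checkout", 12, False, 110.0, 0.40, 2),
]
```

Counting failed runs in the cost numerator is a deliberate choice here: a vendor that bills for flaky runs should look more expensive per useful result, not less.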

Attribution:

pranshuchittora #1 #2
okwasniewski #1 #2

PR validation is the clearest wedge

The strongest concrete use case was not broad production monitoring. It was validating pull requests against preview environments. The founders said they generate a test plan from code changes, and one user reported that it often replaces the manual smoke check before shipping. That is a tighter and more credible workflow than trying to replace an entire regression suite at once.

If you trial agentic testing, start with preview deployments and changed-path smoke tests. That gives you a bounded place to compare it against human QA and scripted checks without betting the whole CI pipeline.
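One way to keep such a trial bounded is to map changed paths to a short list of natural-language flows. The sketch below is an illustration only, not how TesterArmy generates its test plans; the path prefixes and flow descriptions are made up.

```python
# Hypothetical mapping from source path prefixes to natural-language smoke flows.
FLOWS_BY_PREFIX = {
    "src/auth/": ["Sign in with email and a one-time passcode"],
    "src/checkout/": ["Add an item to the cart and complete checkout"],
    "src/chat/": ["Send a chat message and confirm a reply appears"],
}


def select_smoke_flows(changed_files, flows_by_prefix=FLOWS_BY_PREFIX):
    """Return the flows whose path prefix matches a changed file, without duplicates."""
    selected = []
    for prefix, flows in flows_by_prefix.items():
        if any(path.startswith(prefix) for path in changed_files):
            for flow in flows:
                if flow not in selected:
                    selected.append(flow)
    return selected
```

A pull request touching only `src/auth/login.ts` would then trigger just the sign-in flow, which makes it easy to compare the agent's verdict with the manual smoke check for the same change.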

Attribution:

msencenb #1
okwasniewski #1
pensono #1

Mobile support is a real differentiator

The web story sounded crowded, but the mobile story had more teeth. TesterArmy claims it runs native app binaries in the cloud, supports iOS and Android, and uses a hybrid of vision plus accessibility APIs rather than pure vision. People also pushed on simulator coverage, Apple platform breadth, and bad-network scenarios, which shows where mobile teams will judge the product. Right now the answer is partial support, not full lab coverage.

If your hardest testing problems are in native mobile, this is where the product may earn its keep. Check device and platform coverage first, then ask about connectivity simulation and environment variance before assuming it can replace your current setup.

Attribution:

tcoff91 #1
okwasniewski #1 #2 #3 #4
yohguy #1
peterspath #1
jaggederest #1

The domain choice is hurting enterprise adoption

Several comments turned a naming nit into an operational issue. Using a .army domain triggered spam filtering and corporate firewall problems, and the founders confirmed they had already seen emails land in spam at larger companies and planned to move to .com. For a testing vendor that needs access, alerts, and trust inside enterprise environments, this is not a branding footnote.

For startup teams selling into enterprises, infrastructure trust signals start with boring things like domains and deliverability. Clean those up early because they affect adoption before buyers ever test the product.

Attribution:

iknownthing #1
okwasniewski #1 #2
thih9 #1
tootubular #1

Against the grain

In-house agents may already be enough

For teams that already use Claude, Codex, or Opus to generate evals and run background agents inside their own sandbox, external testing can feel like extra indirection. In that setup, the coding model already knows the code paths and can spin up targeted checks faster than a third-party service can discover them from the outside.

Before adding a vendor, compare against your current internal loop honestly. If your own coding agents can generate and run useful evals inside the repo, the bar for outsourcing should be much higher than “natural language tests.”

Attribution:

dbbk #1
Obertr #1

Flaky agent runs can be worse than no tests

The harshest criticism was that an unreliable agent produces the most dangerous kind of signal. If a test system fails for reasons unrelated to product regressions, teams stop trusting it and either ignore alerts or waste time chasing ghosts. That cuts directly against the product’s core promise of confidence.

Treat false positives as a first-class metric during any pilot. If the tool cannot stay quiet when nothing changed, it will not survive contact with an engineering team.
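A minimal sketch of that metric, assuming the pilot reruns the suite repeatedly against a build known to be good, so every failure counts as a false alarm. The 2% threshold is an arbitrary placeholder, not a figure from the discussion.

```python
def false_alarm_rate(results):
    """Fraction of runs against an unchanged, known-good build that failed.

    `results` is a list of booleans, where True means the run passed.
    """
    if not results:
        raise ValueError("need at least one run")
    return sum(1 for passed in results if not passed) / len(results)


def quiet_enough(results, max_rate=0.02):
    """True if the false-alarm rate is at or below the chosen threshold."""
    return false_alarm_rate(results) <= max_rate
```

For example, two failures in fifty reruns of an unchanged build give a rate of 0.04, which would fail the placeholder threshold.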

Attribution:

antifarben #1
Lionga #1

Security concerns limit outsourcing appetite

Some people were uneasy about handing QA execution to an outside service at all. The founders answered that the product can work from a URL or app build without codebase access, but that does not remove concerns around credentials, test accounts, inbox access, and production-like environments. For many teams, those controls decide the purchase more than prompt quality does.

Map the trust boundary before you test the product. Decide whether a URL-only mode is enough for your use case, and what secrets, inboxes, or environments you are willing to expose to a third party.

Attribution:

zuzululu #1
okwasniewski #1

In plain English

accessibility APIs

Software interfaces that expose UI structure and elements to assistive technologies, which can also help automation tools interact with apps more reliably.

CRUD

Create, Read, Update, Delete: the basic operations used in many simple business applications that manage stored data.

Cypress

A web testing tool used to automate browser-based end-to-end tests.

LLM

Large language model, a machine learning system that generates and edits text or code from prompts.

OTP

One-time password, a short code often used for login verification or two-factor authentication.

Playwright

A browser automation framework commonly used for end-to-end web testing.

Playwright MCP

A Model Context Protocol (MCP) server for Playwright that lets an AI agent control a browser and inspect results.

Reference links

Product links

TesterArmy
The launched product, an agentic testing platform for web and mobile apps
TesterArmy demo video
Product demo showing how the service works

Competing or comparable tools

Revyl
Mentioned as a point of comparison for mobile agent-driven testing
