Why eval startups fail (2025)

AI
Developer Tools
Startups

The post says “eval startups” fail when they try to sell generic judgments about which AI model is best. In practice, an eval is just a test for model behavior. You define prompts, score outputs, and turn that into a metric. The catch is that model quality depends heavily on the exact task, prompt setup, budget, and tolerance for weird failures. That makes broad benchmark vendors hard to monetize. Their output goes stale fast as providers quietly update models, and the teams that care most usually have enough technical ability to build custom evals themselves.
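As a rough illustration of "define prompts, score outputs, and turn that into a metric", here is a minimal sketch. The cases, the `fake_model` stand-in, and the pass/fail checks are all hypothetical and are not taken from the post; a real harness would call an actual model and use checks written for your own task.

```python
from typing import Callable

# Hypothetical eval cases: each pairs a prompt with a check on the output.
CASES = [
    {"prompt": "Return the ISO code for Japan.",
     "check": lambda out: "JP" in out},
    {"prompt": "Reply with valid JSON containing key 'ok'.",
     "check": lambda out: out.strip().startswith("{")},
]

def fake_model(prompt: str) -> str:
    """Stand-in for a real model call (assumption: swap in your own client)."""
    canned = {"Return the ISO code for Japan.": "JP"}
    return canned.get(prompt, "I am not sure.")

def run_eval(model: Callable[[str], str], cases: list) -> float:
    """Return the fraction of cases whose output passes its check."""
    passed = sum(1 for case in cases if case["check"](model(case["prompt"])))
    return passed / len(cases)

pass_rate = run_eval(fake_model, CASES)
print(f"pass rate: {pass_rate:.2f}")
```

The resulting number only means something for these prompts and these checks, which is the post's point about generic rankings.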

If you ship LLM features, stop looking for a universal benchmark to make product decisions. Build task-specific evals tied to your own prompts, failure modes, cost targets, and compliance needs, and treat third-party tools as infrastructure rather than outsourced judgment.

June 24, 2026
thomasliao.com

Key insights

The market is really three products

What people are actually buying breaks into distinct jobs. One class watches production behavior over time, including drift, latency, cost, and malformed outputs. Another is safety testing for sensitive domains. A third is a simple harness for model and cost comparison on a company’s own tasks. That split explains why “eval startup” is too blurry to be a useful business category. Most viable products are operational tooling, not a scoreboard.

When evaluating vendors, first decide which job you need solved. A drift monitor, a red-team service, and a model-swap harness should not be compared as if they are one market.
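To make the first of those jobs concrete, the sketch below shows one way a drift monitor might compare a recent window of production calls against a baseline window. The record shape, the metric names, and the 10% tolerance are assumptions chosen for illustration, not a description of any product mentioned in the thread.

```python
import json

def is_malformed(output: str) -> bool:
    """Treat any output that is not parseable JSON as malformed (assumed format)."""
    try:
        json.loads(output)
        return False
    except json.JSONDecodeError:
        return True

def summarize(window: list) -> dict:
    """window: list of (output_text, latency_seconds, cost_dollars) tuples."""
    n = len(window)
    return {
        "malformed_rate": sum(is_malformed(o) for o, _, _ in window) / n,
        "avg_latency": sum(lat for _, lat, _ in window) / n,
        "avg_cost": sum(cost for _, _, cost in window) / n,
    }

def drifted(baseline: dict, current: dict, tolerance: float = 0.10) -> list:
    """Names of metrics that got worse by more than `tolerance` (relative)."""
    return [k for k in baseline if current[k] > baseline[k] * (1 + tolerance)]

# Made-up sample windows, for illustration only.
baseline_window = [('{"a": 1}', 0.8, 0.002), ('{"a": 2}', 1.0, 0.002)]
current_window = [('{"a": 1}', 0.9, 0.002), ("oops", 1.5, 0.002)]

flags = drifted(summarize(baseline_window), summarize(current_window))
print(flags)
```

A red-team service or a model-swap harness would look quite different from this, which is why the three should not be compared as one market.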

Attribution:

michaelbuckbee #1
gavinboston #1

Evals help redesign the product surface

Used well, evals are not just for ranking models. They are a way to reshape your APIs and prompts so models make fewer mistakes in the first place. One commenter described using eval harnesses to test how well LLMs write code against a complex data-analysis API, then simplifying the interface until weaker models also succeed. That is a more durable win than chasing whichever frontier model tops a benchmark this week.

Use evals during product design, not only vendor selection. If a task only works on the most expensive model, that is often a sign your interface or workflow needs work.

Attribution:

paddy_m #1

Benchmark vendors are not eval tooling

Several commenters said the article collapses very different businesses into one label. Platforms like Arize, Promptfoo, deepeval, Comet Opik, and Braintrust are not mainly selling a universal answer to “which model wins.” They help teams create golden datasets, enforce behavior constraints, and run their own checks. That distinction matters because tooling can be sticky even when generic public rankings are not.

Separate “decision support” from “execution infrastructure” in your market map. Tooling that helps teams produce and maintain their own evals can have a much stronger place in the stack than benchmark publishing.

Attribution:

alexhans #1
redwood #1

Model rankings flip across real tasks

The strongest concrete defense of evals was that task-specific measurement really does change decisions. An author reply pointed to big reversals between DeepSWE and FrontierCode, where one model leads on one coding benchmark and loses badly on another. Commenters extended that to prompting and model-tier choices. The useful question is often whether a much cheaper or local model is good enough once the harness is tuned for the actual job.

Do not buy into a single leaderboard. Run comparisons on your own workload and include prompt variants, lower-cost models, and local models before locking in spend.
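One way to frame that comparison is "cheapest configuration that clears a quality bar on our own workload". The sketch below assumes you already have pass rates from your own harness. The model labels, prompt-variant names, pass rates, and costs are placeholder values invented for the example, not measurements of any real model.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Result:
    model: str           # hypothetical model label
    prompt_variant: str  # hypothetical prompt variant name
    pass_rate: float     # fraction of your task cases passed
    cost_per_1k: float   # assumed dollars per 1,000 requests

def cheapest_good_enough(results: list, min_pass_rate: float) -> Optional[Result]:
    """Pick the lowest-cost configuration that meets the quality bar, if any."""
    ok = [r for r in results if r.pass_rate >= min_pass_rate]
    return min(ok, key=lambda r: r.cost_per_1k) if ok else None

# Placeholder numbers for illustration only.
RESULTS = [
    Result("frontier-large", "baseline", 0.96, 40.0),
    Result("mid-tier", "baseline", 0.88, 6.0),
    Result("mid-tier", "with-examples", 0.93, 7.5),
    Result("local-small", "with-examples", 0.81, 0.5),
]

choice = cheapest_good_enough(RESULTS, min_pass_rate=0.90)
print(choice)
```

With these placeholder values, a prompt variant lifts the mid-tier model over the bar, so the selection changes without touching the most expensive option.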

Attribution:

thomasliao #1
unchar1 #1
moomin #1
jmalicki #1

Independent auditing may be the durable niche

A credible business case emerged around external verification for systems that need high reliability or public accountability. Governments and other regulated buyers are unlikely to build deep in-house capability for evaluating LLM systems, and they may not trust vendors to grade themselves. That points to an auditor model, where the value is independence, repeatability, and evidence for procurement or compliance, not broad model rankings.

If you sell into government, healthcare, finance, or other high-assurance settings, plan for third-party evaluation requirements. Evidence generation and audit trails may become part of the product, not an afterthought.

Attribution:

jampekka #1
intended #1

Developer-tool economics are the deeper problem

The hard part is not proving evals are useful. It is selling a developer-facing product to customers who can often build a rough version themselves. Commenters tied eval startups to the old problem of devtools monetization. Teams want control, are opinionated, and often prefer to absorb some maintenance rather than pay for a narrow tool. That weakens the “shovels in a gold rush” story for pure eval products.

If you are building in this space, budget for a long sales cycle and weak willingness to pay from small teams. The stronger path is to attach to a larger operational budget like observability, security, or compliance.

Attribution:

jdw64 #1
noelwelsh #1
brandensilva #1

Against the grain

Medical safety may resist Goodhart effects

One pushback was that the usual benchmark-gaming critique is less fatal in domains where reality imposes hard constraints. In medical AI safety, biology does not care about leaderboard theater. The challenge is still designing the right metrics and test sets, but the outputs are more tightly anchored to real-world outcomes than in softer consumer use cases.

Do not dismiss all safety evals as vanity metrics. In high-stakes domains, invest in domain-specific datasets and outcome-linked metrics because the measurement can map more directly to real risk.

Attribution:

0xWTF #1

Runtime evals matter more than static suites

Another useful counterpoint was that AI evals are not just one-off benchmark runs. Because outputs are non-deterministic and user inputs are open-ended, teams may need to score production samples over time, including tone, anomalies, and regressions. That makes evals less like a static report and more like a living quality system.

Build evaluation into operations, not only pre-launch testing. Sample real traffic, review failures, and refresh datasets continuously if the product depends on LLM behavior.
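A minimal sketch of the sampling step, assuming each production request carries a stable string ID: hash the ID so the same request is always in or out of the sample, score the sampled outputs, and keep the failures for human review and dataset refresh. The record fields and the `passes` check are hypothetical.

```python
import hashlib

def in_sample(request_id: str, rate: float) -> bool:
    """Deterministically select roughly `rate` of requests by hashing the ID."""
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big")  # integer in [0, 2**64)
    return bucket < rate * 2**64

def review_queue(traffic: list, rate: float, passes) -> list:
    """Return sampled records whose output fails the `passes` check."""
    return [
        record for record in traffic
        if in_sample(record["id"], rate) and not passes(record["output"])
    ]

# Made-up traffic records, for illustration only.
TRAFFIC = [
    {"id": "req-1", "output": "Your order ships Tuesday."},
    {"id": "req-2", "output": "   "},
    {"id": "req-3", "output": "Refund issued."},
]

def non_empty(output: str) -> bool:
    """Trivial placeholder check; a real one might score tone or format."""
    return bool(output.strip())

failures = review_queue(TRAFFIC, rate=1.0, passes=non_empty)
print([record["id"] for record in failures])
```

Hashing rather than random sampling keeps the sample reproducible, which makes it easier to re-score the same requests after a prompt or model change.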

Attribution:

diegof79 #1
rockyj #1

A simple harness can still be enough

Not everyone bought the grand framing. One commenter called evals glorified integration tests. Another largely agreed, arguing that what users actually need is a clean harness for running their own use cases across models. That cuts against the idea that the category requires deep proprietary magic. The value may be convenience and workflow, not superior judgment.

If your need is straightforward, start with lightweight internal tooling before committing to a platform. The right first step may be a practical test harness rather than a full eval vendor.

Attribution:

h1fra #1
hilariously #1
pydry #1

In plain English

AI

Artificial intelligence, software systems that perform tasks such as prediction, generation, or decision-making that usually require human-like intelligence.

API

Application Programming Interface, a way for software to call another service programmatically.

DeepSWE

A benchmark mentioned in the comments for measuring how well models perform software engineering tasks.

drift

A change over time in a model’s behavior or in the data it sees, which can degrade performance or alter outputs.

eval

Short for evaluation, a structured test used to measure how well an artificial intelligence model performs on a task.

FrontierCode

A coding benchmark mentioned in the comments that scores model performance on realistic repository tasks using multiple quality criteria.

inference

Running a trained AI model to produce outputs such as predictions or generated text, as opposed to training the model.

LLM

Large Language Model, a type of AI system that generates and analyzes text.

observability

Tools and practices for monitoring how a software system behaves in production, including errors, latency, cost, and unusual outputs.

QA

Quality assurance, the processes used to check that products are built consistently and without defects.

red-team

An adversarial testing approach that deliberately probes a system for failures, unsafe behavior, or security weaknesses.

Reference links

Eval tools and platforms

evvl.ai
Example of a lightweight eval tool for comparing model cost, performance, and quality on custom tasks.
Example evvl eval
Concrete example of what a custom model comparison eval looks like in practice.
Endpoint Evaluator
Example of an API-based observability and evaluation product focused on drift and behavior changes.

Benchmarks and background reading

Massive Multitask Language Understanding benchmark
Cited as a simple example of an AI evaluation based on multiple-choice questions and accuracy.
FrontierCode
Referenced as a more complex coding eval and later used to show ranking reversals across benchmarks.
Are We Learning Yet? A Meta Review of Evaluation Failures Across Machine Learning
Background paper cited to show how definitions and failure modes of evaluation have changed over time.
DeepSWE
Benchmark cited to illustrate that model rankings can differ sharply across supposedly similar coding tasks.
