CursorBench 3.1

AI
Developer Tools
Programming
Benchmarks

Cursor posted CursorBench 3.1, an internal benchmark for coding agents that plots models by task performance, cost, and related metrics. The headline claim is that Cursor’s Composer 2.5 lands close to expensive frontier models like GPT-5.5 and Opus 4.8 on this benchmark while costing much less. That did not persuade many people. The main reaction was that a vendor-run benchmark is always suspect, especially when the vendor’s own model looks unusually strong compared with independent evals like DeepSWE. Cursor replied that Composer used to score better on other public composites, that DeepSWE emphasizes long-horizon work where Composer is weaker, and that CursorBench includes held-out tasks from Cursor’s own private engineering work.

Treat CursorBench as a product-specific eval, not a neutral market ranking. If you buy coding models for a team, test them on your actual task mix and include speed, review burden, and subscription economics, because those factors carried most of the useful signal in the discussion.

July 2, 2026
cursor.com

Discussion mood

Mostly skeptical. People distrusted a self-published benchmark that flatters Cursor’s own model, and many said their hands-on experience does not support parity with GPT-5.5 or Opus on harder work. The positive comments were real but narrower: Composer is widely liked as a fast, low-cost workhorse for routine tasks.

Key insights

Composer breaks on nonstandard engineering work

For tasks outside mainstream web development, Composer 2.5 was described as the dangerous kind of wrong. It keeps moving, invents assumptions, and writes tests that validate those assumptions. That changes the comparison from raw capability to trustworthiness. A model that asks for clarification can save more time than one that produces plausible code faster.

Separate your eval suite by domain. If your team touches physics, optimization, infra internals, or any other edge-case-heavy work, measure false confidence and review burden, not just task completion.
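
One way to report those metrics per domain is sketched below. The `TaskResult` fields, the domain labels, and the definition of false confidence (the share of "done" claims a reviewer rejected) are assumptions made for illustration, not anything described in the discussion.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class TaskResult:
    domain: str            # hypothetical label, e.g. "web" or "physics"
    completed: bool        # the agent reported the task as done
    correct: bool          # a human reviewer confirmed the result
    review_minutes: float  # human time spent checking the output


def summarize_by_domain(results):
    """Return completion rate, false-confidence rate, and mean review time per domain."""
    groups = defaultdict(list)
    for r in results:
        groups[r.domain].append(r)

    summary = {}
    for domain, rows in groups.items():
        n = len(rows)
        claimed = [r for r in rows if r.completed]
        rejected = sum(1 for r in claimed if not r.correct)
        summary[domain] = {
            "tasks": n,
            "completion_rate": len(claimed) / n,
            # Assumed definition: share of "done" claims the reviewer rejected.
            "false_confidence_rate": rejected / len(claimed) if claimed else 0.0,
            "mean_review_minutes": sum(r.review_minutes for r in rows) / n,
        }
    return summary
```

Keeping the domains in separate rows stops a strong web score from hiding a high rejection rate on harder work.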

Attribution:

subhobroto #1 #2

Wall-clock speed is deciding model choice

Latency kept showing up as the constraint that benchmarks miss. Several people said Opus can produce better code, but waiting 30 to 60 minutes for a draft makes it unusable for normal iteration. GPT-5.5 earned credit not just for quality but for fast responses on both trivial questions and substantial implementation. That is why some people accept weaker code from Composer. They can review and steer it in five minutes instead of losing an hour.

Track elapsed time to useful draft as a first-class metric in any internal bakeoff. A slower model has to reduce follow-up work by a lot before it is actually cheaper for your team.
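
The break-even reasoning can be made explicit with a rough model in which every round costs one draft plus one review. That model and both function names are assumptions for illustration; real sessions overlap waiting with other work.

```python
def minutes_to_accepted_change(draft_minutes, review_minutes, rounds):
    """Total elapsed time when each round needs one draft and one review."""
    return rounds * (draft_minutes + review_minutes)


def max_rounds_for_slow_model(fast_total, slow_draft_minutes, slow_review_minutes):
    """Largest number of rounds the slow model can take and still match or beat fast_total."""
    per_round = slow_draft_minutes + slow_review_minutes
    return int(fast_total // per_round)


# Illustrative inputs only: a 5-minute draft versus a 45-minute draft,
# each followed by a 10-minute review.
fast_total = minutes_to_accepted_change(5, 10, rounds=4)
slow_budget = max_rounds_for_slow_model(fast_total, 45, 10)
```

With these made-up inputs the fast model spends 60 minutes over four rounds, so the slow model only wins if it lands an acceptable change in a single round.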

Attribution:

datadrivenangel #1
anon7000 #1
subhobroto #1
tekacs #1 #2
tyre #1

Context window still matters for planning

Large context was not framed as a spec-sheet brag. It was tied to a concrete workflow: long planning sessions that require loading lots of code, previous plan files, and product context, then iterating over that material for a long time. In that setup, automatic compaction destroys quality because the model forgets the exact files and constraints that shaped earlier decisions.

If you use agents for planning, architecture, or migration work, test what happens after repeated compactions. A model that looks fine on short coding tasks may collapse when a conversation has to carry state for hours.
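
A minimal sketch of such a test follows. It assumes a hypothetical `session` object with `compact()` and `ask(prompt)` methods, which no specific tool is known to expose in this form, and it scores recall by plain substring matching, which is a crude proxy.

```python
def recall_after_compactions(session, constraints, rounds):
    """Compact repeatedly and record what fraction of planted constraints survives.

    `session` is a hypothetical wrapper around an agent conversation. It must
    provide compact() and ask(prompt) -> str; adapt it to your own tooling.
    """
    scores = []
    for _ in range(rounds):
        session.compact()
        answer = session.ask("List every constraint and file path we agreed on.")
        kept = sum(1 for c in constraints if c.lower() in answer.lower())
        scores.append(kept / len(constraints))
    return scores
```

A score that falls with each round suggests the compaction step is dropping the files and constraints that shaped earlier decisions.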

Attribution:

tekacs #1
pbowyer #1

Subscription pricing muddies cost claims

The cost story around Composer 2.5 was attacked as much as the quality story. People pointed out that Composer is bundled inside Cursor subscriptions, while comparisons to external models often use per-token pricing or Cursor passthrough pricing that does not match what direct subscriptions from OpenAI or Anthropic include. That makes "fraction of the price" feel more like marketing than an apples-to-apples buying decision.

When comparing coding models for procurement, normalize to the plan your team would actually buy. Include seat cost, usage caps, passthrough markup, and how much frontier-model access each subscription really delivers.
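
As a rough illustration of that normalization, the sketch below reduces a plan to a seat price, an included usage allowance, and an overage rate with an optional passthrough markup. All parameter names and the pricing structure are assumptions; real plans differ and should be modeled from their actual terms.

```python
def monthly_cost_per_seat(seat_price, included_usage, expected_usage,
                          overage_rate, markup=0.0):
    """Seat price plus overage, with an optional passthrough markup on the overage.

    Usage is measured in whatever unit the plan meters (requests, tokens,
    credits); the unit only has to be consistent across the arguments.
    """
    overage_units = max(0.0, expected_usage - included_usage)
    return seat_price + overage_units * overage_rate * (1.0 + markup)
```

Running the same expected usage through each candidate plan gives figures that can be compared directly, which per-token list prices alone do not.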

Attribution:

BugsJustFindMe #1 #2
giancarlostoro #1

The effective workflow is multi-model

The most concrete operating pattern was not picking a single winner. People are using stronger models to interrogate requirements and review output, then cheaper models to execute quickly. One detailed review loop was to end every project with an adversarial back-and-forth about naming, alternatives, evidence, and edge cases, like reviewing a junior engineer’s pull request. That frames coding agents less as autonomous coders and more as draft generators inside a structured QA process.

Design your agent workflow like a pipeline. Use separate prompts and possibly separate models for planning, implementation, and review instead of forcing one model to do everything badly.
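
The pipeline idea can be sketched as below. Each model is treated as a plain function from prompt to text, and approval is signaled by a review that starts with "APPROVE". Both are assumptions made for illustration, not an interface any vendor provides.

```python
from typing import Callable

Model = Callable[[str], str]  # hypothetical interface: prompt in, text out


def run_pipeline(task: str, planner: Model, implementer: Model,
                 reviewer: Model, max_rounds: int = 2) -> dict:
    """Plan with one model, implement with another, then review adversarially."""
    plan = planner(f"Clarify requirements and write a plan for: {task}")
    draft = implementer(f"Implement this plan:\n{plan}")
    reviews = []
    for _ in range(max_rounds):
        review = reviewer(
            f"Challenge naming, alternatives, evidence, and edge cases:\n{draft}"
        )
        reviews.append(review)
        # Assumed convention: the reviewer starts its reply with APPROVE when satisfied.
        if review.strip().upper().startswith("APPROVE"):
            break
        draft = implementer(
            f"Revise the draft to address this review:\n{review}\n\nDraft:\n{draft}"
        )
    # If max_rounds runs out, the final draft was revised but not re-reviewed.
    return {"plan": plan, "draft": draft, "reviews": reviews}
```

Passing the models in as arguments makes it easy to put a stronger model in the planner and reviewer slots and a cheaper one in the implementer slot.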

Attribution:

bmurphy1976 #1 #2
__natty__ #1

Harness design is distorting benchmark rankings

The disagreement was not simply "vendor benchmark bad, public benchmark good." People argued that public evals like DeepSWE also encode strong assumptions through their harness and task style. DeepSWE was criticized for long-horizon bias, limited model support, cache-insensitive cost accounting, and tasks that look more like rigid PR execution than normal interactive coding. Cursor itself said Composer performs better on Terminal-Bench and SWE-bench Multilingual. The bigger point is that eval results are extremely sensitive to what the harness rewards.

Read benchmark methodology before using rankings in strategy or procurement. If the harness does not resemble your toolchain and task style, its leaderboard is mostly measuring someone else’s workflow.

Attribution:

burmanm #1
extr #1
leerob #1

Against the grain

Routine app work does not need frontier models

For ordinary full-stack work with established patterns, Composer 2.5 got strong support as the better tradeoff. People using it daily said once project conventions are in place, it follows them well, stays fast, and is hard to beat on time and money. In that environment, the extra capability of Opus or GPT does not show up often enough to justify the friction.

If your roadmap is mostly CRUD apps, UI work, and incremental feature delivery, run a cheaper-model-first workflow and escalate only when a task actually fails.
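
A cheaper-model-first loop might look like the sketch below. The `models` list and the `passes` check (for example, a test run) are hypothetical stand-ins for whatever your own tooling provides.

```python
def solve_with_escalation(task, models, passes):
    """Try models in order, cheapest first, and stop at the first output that passes.

    models: list of (name, callable) pairs, ordered from cheapest to most capable.
    passes: callable that returns True when an output is acceptable.
    """
    attempts = []
    for name, model in models:
        output = model(task)
        attempts.append(name)
        if passes(output):
            return {"model": name, "output": output, "attempts": attempts}
    # Nothing passed: hand the task back to a human with the attempt log.
    return {"model": None, "output": None, "attempts": attempts}
```

Logging `attempts` shows how often escalation actually happens, which is the number that justifies or undermines the cheaper default.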

Attribution:

apwheele #1
soyin #1
danfritz #1
shockembopper #1
simondotau #1

Private held-out tasks are a fairer eval slice

Cursor pushed back on the easy overfitting critique by saying many benchmark tasks come from real internal engineering work on a private codebase held out from training. That does not make CursorBench neutral, but it does make it more than a toy benchmark tuned to public datasets. It may capture the product team’s target workload better than generic public evals do.

Do not dismiss private evals just because they are private. Ask whether the task source is held out and whether it resembles the work your own engineers actually do.

Attribution:

leerob #1

Token efficiency can outweigh raw model quality

A few comments pushed back on the idea that frontier quality alone should decide model choice. They argued that unnecessary subagents, verbose reasoning, irrelevant reads, and heavy token use create real cost and UX pain. Even when Anthropic models are stronger on some tasks, that strength can be offset if the model overthinks routine work or burns budget on steps it could have skipped.

Measure token burn and agent behavior, not just success rate. A model that is slightly better but consistently expands tasks beyond what was asked may be the worse production choice.
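
One simple way to combine the two signals is tokens spent per successful task, sketched below. The metric and the function name are assumptions chosen for illustration; failed runs still count toward the token total.

```python
def tokens_per_success(runs):
    """Total tokens across all runs divided by the number of successful runs.

    runs: iterable of (tokens_used, succeeded) pairs.
    Returns infinity when nothing succeeded.
    """
    runs = list(runs)
    total_tokens = sum(tokens for tokens, _ in runs)
    successes = sum(1 for _, ok in runs if ok)
    return total_tokens / successes if successes else float("inf")
```

A model with a higher success rate can still score worse here if each attempt uses far more tokens.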

Attribution:

cherryteastain #1
gkbrk #1
o10449366 #1
cbg0 #1

In plain English

compaction

A process where an agent compresses earlier conversation history to fit within the model’s context window.

context window

The amount of text, code, and conversation history a model can keep in memory at one time.

DeepSWE

A public benchmark for software engineering agents that emphasizes solving larger, longer multi-step coding tasks.

GitHub Actions

GitHub’s automation system for running builds, tests, and deployment workflows.

GPT-5.5

An OpenAI model version discussed here as a strong coding model with fast responses.

Opus

A high-end Claude model variant from Anthropic.

SWE-bench Multilingual

A benchmark for software engineering tasks across multiple programming languages or language contexts.

Terminal-Bench

A benchmark that tests how well models complete tasks through terminal-style command line interactions.

Reference links

Benchmarks and eval trackers

Artificial Analysis coding agents benchmark
Used to challenge CursorBench by showing Composer 2.5 trailing GPT-5.5 and Opus on other evals, especially DeepSWE.
LLM Pareto Frontier
Shared as an example of cost-performance frontier visualization for comparing models.

Vendor documentation and model reports

Composer 2 technical report
Cursor linked its technical report in response to claims that Composer was just a forked model.
Anthropic Claude Fable 5 and Mythos 5 announcement
Cited as supporting evidence for claims about Opus max versus xhigh behavior.

Commentary and prior discussion

Earlier Hacker News discussion of Composer 2 and Kimi Base 2.5
Referenced for background on Composer 2 being post-trained from Kimi Base 2.5.
Linked comment on Composer reliability by domain
Referenced as a detailed writeup of why Composer works for standard web work but fails on harder engineering tasks.
Simon Willison on Claude Sonnet 5 tokenizer changes
Linked to support a claim that Sonnet 5 increases token usage by about 30 percent.

Open source tooling

Aider pull request #3781
Shared as evidence from someone who built an early coding-agent workflow and has been comparing model behavior over time.