CursorBench 3.1
- AI
- Developer Tools
- Programming
- Benchmarks
Cursor posted CursorBench 3.1, an internal benchmark for coding agents that plots models by task performance, cost, and related metrics. The headline claim is that Cursor’s Composer 2.5 lands close to expensive frontier models like GPT-5.5 and Opus 4.8 on this benchmark while costing much less. Many readers were not persuaded. The main objection was that a vendor-run benchmark is inherently suspect, especially when the vendor’s own model looks unusually strong compared with its results on independent evals like DeepSWE. Cursor replied with three points: Composer used to score better on other public composites, DeepSWE emphasizes long-horizon work where Composer is weaker, and CursorBench includes held-out tasks drawn from Cursor’s own private engineering work.
Treat CursorBench as a product-specific eval, not a neutral market ranking. If you buy coding models for a team, test them on your actual task mix with speed, review burden, and subscription economics included, because those factors dominated the useful signal here.
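One way to make that comparison concrete is a small scoring script run over your own task log. The sketch below is illustrative only: the `TaskRun` fields, the `summarize` function, and the default reviewer rate are all assumptions, not anything Cursor or CursorBench defines. It folds review time into a per-model cost per passing task. A flat subscription fee is not modeled and would need to be amortized across tasks separately.

```python
from collections import defaultdict
from dataclasses import dataclass
from statistics import median


@dataclass
class TaskRun:
    """One attempt by one model on one of your own tasks (hypothetical schema)."""
    model: str
    passed: bool           # did the change survive your tests and review?
    model_cost_usd: float  # metered usage cost for this attempt
    wall_seconds: float    # time until the agent finished
    review_minutes: float  # human time spent checking or fixing the output


def summarize(runs, reviewer_usd_per_hour=100.0):
    """Return per-model pass rate, median latency, and cost per passing task.

    Cost counts every attempt (failed ones included) plus reviewer time,
    priced at an assumed hourly rate.
    """
    by_model = defaultdict(list)
    for run in runs:
        by_model[run.model].append(run)

    report = {}
    for model, model_runs in by_model.items():
        passes = sum(1 for r in model_runs if r.passed)
        total_cost = sum(
            r.model_cost_usd + r.review_minutes / 60 * reviewer_usd_per_hour
            for r in model_runs
        )
        report[model] = {
            "pass_rate": passes / len(model_runs),
            "median_seconds": median(r.wall_seconds for r in model_runs),
            # No passing runs means the cost per pass is unbounded.
            "cost_per_pass_usd": total_cost / passes if passes else float("inf"),
        }
    return report
```

Feeding it a few dozen real tasks per model is usually enough to see whether a cheaper model's extra review burden cancels out its lower metered price.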
- cursor.com
- Discuss on HN