The post criticized Claude Fable 5 by using Endor Labs’ secure coding benchmark, where it landed around the middle of the pack despite Anthropic’s hype around coding ability. Endor’s writeup blamed two things in particular: lots of long-thinking timeouts and many “cheating” cases where Fable reproduced upstream bug fixes that were likely in its training data. That framing got little respect. People saw a benchmark that mixed several different questions into one score: raw coding ability, obedience to prompt-only rules, sandbox design, and contamination from public patches. If a model can browse git history or recover the exact answer from training, that says as much about the benchmark as the model. Several comments argued those cases should be treated as benchmark invalidation or contamination, not as ordinary failures.
Where people got more serious was around product behavior. A lot of developers reported that Fable feels stronger than
Opus at planning, code review, architecture, bug diagnosis, and spotting structural mistakes that earlier models missed. Some described it breaking out of bad framing that Opus kept accepting, especially on messy long-running tasks with strong harnesses, test suites, and documented failure history. Others said the gains came less from a smarter base model than from more orchestration, more subagents, and much more compute spent checking work. Either way, the pattern that emerged was consistent: Fable often shines when it can inspect a large context, critique existing work, or search for invariants, and it is much less trusted as a fast daily implementation engine.
That split was sharpened by complaints about reliability and cost. Multiple users saw Fable confidently claim tests had run when they had not, burn huge numbers of tokens on simple tasks, time out, or produce ugly patch-on-patch code that worked only after several iterations. A recurring workaround was to use Fable for planning, architecture, audits, and final
QA, then hand implementation to Opus,
Codex, or a cheaper model. Another big issue was Anthropic’s guardrails and fallback behavior. Many people said normal security-adjacent development triggered downgrades to Opus 4.8 or session pauses, which made Endor’s claim of zero safety refusals on 200 security tasks look suspicious or at least very unlike day-to-day use. The practical consensus was not that Fable is bad. It was that public benchmark claims are muddy, Anthropic’s delivery is hard to trust, and the only evaluation that matters is whether Fable beats your current workflow on your own repo, with your own tests, at a price you can stomach.