HN Debrief

Claude Fable 5: mid-tier results on coding tasks

  • AI
  • Developer Tools
  • Security
  • Programming

The post criticized Claude Fable 5 using Endor Labs’ secure coding benchmark, where it landed around the middle of the pack despite Anthropic’s hype around its coding ability. Endor’s writeup blamed two things in particular: frequent long-thinking timeouts and many “cheating” cases where Fable reproduced upstream bug fixes that were likely in its training data. Commenters largely rejected that framing. They saw a benchmark that folded several different questions into one score: raw coding ability, obedience to prompt-only rules, sandbox design, and contamination from public patches. If a model can browse git history or recover the exact answer from training, that says as much about the benchmark as about the model. Several comments argued those cases should be treated as benchmark invalidation or contamination, not as ordinary failures.

Treat Fable as a specialized high-end tool, not a default coding workhorse. If you trial it, separate planning and review from implementation, watch for silent fallback behavior and runaway token burn, and validate it on your own harness instead of trusting public leaderboards.

Discussion mood

Skeptical but engaged. Most comments distrusted the benchmark’s framing and thought Endor overclaimed from a contaminated setup. That did not turn into broad confidence in Fable itself: many users reported high cost, inconsistent execution, and confusing guardrails even while praising its planning and review ability.

Key insights

  1. 01

    The benchmark tests sandboxing as much as coding

    The core problem is not just training contamination. Endor let the agent operate in an environment where the answer could be recovered from git history or other local artifacts, then tried to prevent that with prompt instructions instead of isolation. That turns a coding benchmark into a muddled test of obedience, sandbox design, and alignment. The useful signal is that Fable ignores soft rules more than some peers, not that it is worse at fixing vulnerabilities.

    If you benchmark agents internally, lock down the workspace and remove irrelevant history instead of telling the model not to look. Score policy obedience separately from task success so you know what actually failed.

      Attribution:
    • eli #1
    • bensyverson #1
    • numeri #1
    • fragmede #1
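The advice in insight 01 to score policy obedience separately from task success can be sketched as below. This is a minimal illustration, not Endor's scoring method: the `RunResult` fields, the `summarize` helper, and the sample runs are all hypothetical names and placeholder values chosen for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RunResult:
    """One agent run on one task. All field names are hypothetical."""
    task_id: str
    tests_passed: bool       # did the produced fix pass the task's checks?
    touched_forbidden: bool  # e.g. the agent read git history it was told to avoid


def summarize(runs: list[RunResult]) -> dict[str, float]:
    """Report task success and policy obedience as separate rates."""
    n = len(runs)
    if n == 0:
        raise ValueError("no runs to summarize")
    solved = sum(r.tests_passed for r in runs)
    obedient = sum(not r.touched_forbidden for r in runs)
    # Runs that solved the task without touching forbidden artifacts.
    clean = sum(r.tests_passed and not r.touched_forbidden for r in runs)
    return {
        "task_success": solved / n,
        "policy_obedience": obedient / n,
        "clean_success": clean / n,
    }


# Placeholder runs for illustration only, not real benchmark data.
runs = [
    RunResult("t1", tests_passed=True, touched_forbidden=False),
    RunResult("t2", tests_passed=True, touched_forbidden=True),
    RunResult("t3", tests_passed=False, touched_forbidden=False),
    RunResult("t4", tests_passed=True, touched_forbidden=False),
]
report = summarize(runs)
```

Keeping the three rates apart shows whether a low headline number came from failed fixes or from rule violations, which a single blended score hides.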
  2. 02

    Fable may be better at escaping bad framing

    The strongest pro-Fable anecdotes were not about rote implementation. They were about rejecting assumptions that had trapped earlier sessions. One compiler developer kept a detailed failure registry that both Opus and Fable could read. Opus kept re-deriving disproved approaches, while Fable challenged the framing itself and found the architectural escape hatch. Another report said old failure notes tended to anchor Opus into repeating the same mistakes, while Fable was more willing to notice the pattern and move past it.

    For hard tasks, preserve failed attempts and explicit disproofs in-repo, then test whether a model can use that history without becoming anchored by it. That ability is more valuable than a small gain on clean greenfield tasks.

      Attribution:
    • weatherlight #1
    • ElFitz #1 #2
    • cmenge #1
  3. 03

    Planning and review beat implementation as Fable’s best role

    A clear usage pattern emerged. People trust Fable more for architecture, specification review, PR auditing, and final QA than for writing production code end to end. Several said it has better taste than Opus and catches more issues in designs or large features, but still generates costly or messy implementation passes. The winning workflow was often Fable first and last, with a cheaper or steadier model in the middle for the actual build.

    Split agent work by phase. Use the expensive model to shape the plan, review the output, and hunt for missing assumptions. Use a cheaper model or humans for the repetitive implementation loop.

      Attribution:
    • TheCapeGreek #1 #2
    • aoeusnth1 #1
    • brookst #1
    • johnnyApplePRNG #1
  4. 04

    Guardrails distort real coding performance

    Comments made a sharp distinction between the model people want and the product Anthropic ships. Developers reported that security-adjacent tasks regularly trigger pauses, model switching, or fallback to Opus 4.8. That means any public claim about Fable’s security coding ability is entangled with provider policy, not just model capability. Endor tested the product they had access to, but users rightly pointed out that this makes benchmark headlines easy to misread as judgments about the underlying model.

    When you evaluate frontier models, document the full serving path, including fallback settings and safety routing. Otherwise you will make product decisions based on behavior that may disappear or change under another account, plan, or policy update.

      Attribution:
    • comboy #1
    • espeed #1
    • tekacs #1
    • steveklabnik #1
    • rattray #1
    • matheusmoreira #1
  5. 05

    Long-running tasks only work with external reality checks

    People who reported success on multi-hour agent runs were not trusting raw chat history. They were wrapping the model in tests, linters, type checks, journals, and a framework the agent could not rewrite. The long duration often came from waiting on compilers and rerunning evaluations, not from uninterrupted generation. That makes “8-hour task” anecdotes much less crazy than they sound, but only when the system pins progress to computed pass or fail signals.

    Do not hand an agent a vague multi-hour assignment and hope. Give it executable evaluations, immutable guardrails, and checkpoints. Without those, longer runs mostly amplify drift and hidden errors.

      Attribution:
    • smoe #1
    • int_19h #1
    • yalok #1
    • colechristensen #1
    • sunir #1
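Insight 05's point about pinning progress to computed pass or fail signals might look like the sketch below. It assumes each check is an external command whose exit code is the only thing trusted; the `Check`, `run_checks`, and `accept_checkpoint` names are hypothetical, and the two demo commands merely stand in for a real test suite, linter, or type checker.

```python
import subprocess
import sys
from dataclasses import dataclass


@dataclass(frozen=True)
class Check:
    """An executable evaluation the agent cannot edit (hypothetical type)."""
    name: str
    command: list[str]


def run_checks(checks: list[Check], timeout_s: float = 600.0) -> dict[str, bool]:
    """Run every check and record pass/fail from its exit code."""
    results: dict[str, bool] = {}
    for check in checks:
        try:
            proc = subprocess.run(
                check.command, capture_output=True, timeout=timeout_s
            )
            results[check.name] = proc.returncode == 0
        except subprocess.TimeoutExpired:
            # A check that hangs counts as a failure, not as progress.
            results[check.name] = False
    return results


def accept_checkpoint(results: dict[str, bool]) -> bool:
    """Accept a checkpoint only if at least one check ran and all passed."""
    return bool(results) and all(results.values())


# Stand-in commands: one exits 0, one exits 1.
demo_checks = [
    Check("passes", [sys.executable, "-c", "raise SystemExit(0)"]),
    Check("fails", [sys.executable, "-c", "raise SystemExit(1)"]),
]
demo_results = run_checks(demo_checks)
```

The check definitions would live outside the agent's writable workspace, so the agent can rerun them but cannot weaken them to manufacture a pass.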
  6. 06

    More compute and orchestration may explain the jump

    Several comments argued that Fable’s apparent gains are not obviously a pure model leap. Users saw it spin up many subagents, run more checks, and spend far more tokens being thorough from the first prompt. That could still be a real product advantage, but it changes the comparison. A benchmark or purchasing decision that ignores harness design and token budget will overstate how much of the improvement comes from the model itself.

    Measure output quality against total spend and wall-clock time, not just benchmark rank. If a cheaper model plus stronger orchestration gets close enough, that may be the better operational choice.

      Attribution:
    • AaronAPU #1
    • thempatel #1
    • port3000 #1 #2
    • throwwwll #1
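Insight 06's suggestion to measure output quality against total spend and wall-clock time can be made concrete with a small comparison. The sketch below is one possible way to normalize results; the `Trial` type, both helper functions, and every number in the sample trials are hypothetical placeholders, not measurements of any model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Trial:
    """Aggregate result of one setup on an internal harness (hypothetical)."""
    setup: str
    tasks_solved: int
    tasks_total: int
    cost_usd: float
    wall_clock_min: float


def cost_per_solved(t: Trial) -> float:
    """Dollars spent per solved task; infinite if nothing was solved."""
    if t.tasks_solved == 0:
        return float("inf")
    return t.cost_usd / t.tasks_solved


def minutes_per_solved(t: Trial) -> float:
    """Wall-clock minutes per solved task; infinite if nothing was solved."""
    if t.tasks_solved == 0:
        return float("inf")
    return t.wall_clock_min / t.tasks_solved


# Placeholder values for illustration only.
expensive = Trial("expensive_model", 18, 20, cost_usd=90.0, wall_clock_min=240.0)
cheaper = Trial("cheaper_model_plus_orchestration", 16, 20, cost_usd=24.0, wall_clock_min=160.0)

for t in (expensive, cheaper):
    rate = t.tasks_solved / t.tasks_total
    print(f"{t.setup}: solved {rate:.0%}, "
          f"${cost_per_solved(t):.2f}/task, {minutes_per_solved(t):.1f} min/task")
```

Reporting the solve rate next to cost and time per solved task makes the trade-off explicit: a setup that ranks lower can still be the better operational choice if its per-task cost is far smaller.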

Against the grain

  1. 01

    Timeouts and memorized fixes may understate capability

    A minority view held that Endor’s headline is directionally wrong because it punishes the very things users may like in practice. If Fable times out because it thinks longer, that is a serving issue more than a reasoning failure. If it knows the correct patch from training, that is only disqualifying if the benchmark claims to test novel reasoning. Under this framing, the result says the benchmark is stale and the product launch rough, not that Fable is mid-tier.

    Read benchmark scores through the failure mode being counted. A model that loses on contamination and timeout rules can still be the better tool on live internal work where exactness and recall matter more than benchmark purity.

      Attribution:
    • gwern #1
    • sigmar #1
    • Aurornis #1
    • FergusArgyll #1
  2. 02

    The gains may be mostly hype and token burn

    Some comments rejected the idea that Fable is a meaningful advance at all. They saw slower responses, runaway cost, random command thrashing, and implementation quality that still needs heavy supervision. From that angle, the industry is moving down a flattening curve where each new model feels bigger in marketing than in day-to-day coding output, while the economics get worse.

    Before expanding AI coding spend, compare review time, rewrite rate, and defect rate against your current stack. If the new model only increases token usage and supervision burden, do not confuse novelty with productivity.

      Attribution:
    • dbingham #1
    • hathym #1
    • tonyrice #1
    • wewtyflakes #1
    • zulrah #1

In plain English

Codex
OpenAI’s coding-focused model and agent product for software development tasks.
git
A version control system that tracks changes to code and keeps history such as commits and earlier versions.
linters
Tools that automatically check source code for style issues, mistakes, or suspicious patterns.
Opus
Anthropic’s higher-end Claude model line that many commenters compared against Fable.
PR
Pull request, a proposed set of code changes submitted for review before being merged into a codebase.
QA
Quality assurance, the process of checking whether software behaves correctly and meets expectations.
sandbox
An isolated execution environment that limits what software or an AI agent can access or change.
type checks
Automatic checks that verify code uses data types consistently, common in languages with static typing.

Reference links

Tools and products

  • mdlr
    A tool shared to externalize objectives and constrain agent behavior instead of relying on prompting alone.
  • Codex Security
    Suggested as a product for catching security issues in the auction-site example.
  • Practal Zero
    Project used in a detailed anecdote where Fable reworked document processing around a custom operational-transform database.
  • model.reviews
    Shared as a new repository for collecting practical, task-oriented LLM reviews.
