HN Debrief

Claude Fable 5: mid-tier results on coding tasks

  • AI
  • Developer Tools
  • Security
  • Programming

The post is an Endor Labs evaluation of Claude Fable 5 on secure coding tasks. It argues that despite Anthropic’s launch buzz and strong cyber claims, Fable landed in the middle of the pack on the authors’ benchmark. Two things dragged the score down. First, extended reasoning caused many timeouts. Second, the model often emitted exact upstream fixes, which Endor labeled as cheating because the benchmark rewinds repositories to vulnerable commits and asks the model to patch them.

Treat this as a warning about benchmark interpretation, not a clean verdict on Fable 5. If you are evaluating coding models for your team, test them in your own harness with your own review and cost constraints, because reliability, guardrails, and workflow fit appear to shape outcomes more than leaderboard rank does.

Discussion mood

Skeptical but engaged. Most commenters thought the benchmark overstated its conclusion by conflating memorization, sandbox leakage, timeouts, and guardrails with coding skill. Firsthand users were split between frustration with Fable’s cost and unpredictability and surprise at how often it solved harder conceptual problems that other models missed.

Key insights

  1. Breaking out of bad assumptions

    Fable looked strongest when the repo already contained a long history of failed attempts and misleading framing. In the compiler example, both Opus and Fable had the same failure registry and disproof corpus, but Fable was the one that rejected the standing assumption and found that the supposed architectural blocker was false. That is a more interesting capability than raw patch generation because it means the model can sometimes escape anchoring from prior context instead of being trapped by it.

    Use Fable on problems where your team may be stuck in its own bad framing, not just on tasks that need more code typed. Preserve failure histories and test artifacts, then compare whether a model merely replays them or actually overturns the wrong premise.

      Attribution:
    • weatherlight #1
    • ElFitz #1
  2. Long tasks only work with scaffolding

    Hours-long coding sessions are not automatically a misuse of agents. They can work, but only when the model operates inside a harness it cannot casually rewrite, with frequent test runs and concrete progress checks tied to reality. Without that scaffolding, long context becomes drift and fake momentum. With it, elapsed wall-clock time matters less than whether the agent keeps converging against external truth.

    If you want to evaluate long-horizon coding, first build the harness. Lock down the environment, make tests cheap and frequent, and judge the agent by repeated verified state changes rather than by how convincing its narrative sounds.

      Attribution:
    • int_19h #1
    • yalok #1
    • colechristensen #1
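
The "repeated verified state changes" idea above can be sketched in a few lines. This is only an illustration under stated assumptions: `Checkpoint`, `summarize_progress`, and `is_converging` are hypothetical names, and the sketch assumes your harness (not the agent) records which tests pass at each checkpoint.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Checkpoint:
    """One externally verified test run during an agent session (hypothetical type)."""
    label: str
    passing: frozenset  # names of tests that passed at this point


def summarize_progress(checkpoints):
    """Compare consecutive checkpoints and report verified gains and regressions."""
    report = []
    for prev, curr in zip(checkpoints, checkpoints[1:]):
        report.append({
            "step": curr.label,
            "newly_passing": sorted(curr.passing - prev.passing),
            "regressed": sorted(prev.passing - curr.passing),
        })
    return report


def is_converging(checkpoints, max_stalled_steps=2):
    """Return False once the passing-test count has failed to reach a new
    high for more than max_stalled_steps consecutive checkpoints."""
    if not checkpoints:
        return True
    best = len(checkpoints[0].passing)
    stalled = 0
    for cp in checkpoints[1:]:
        if len(cp.passing) > best:
            best = len(cp.passing)
            stalled = 0
        else:
            stalled += 1
            if stalled > max_stalled_steps:
                return False
    return True
```

The stall limit of two steps is an arbitrary default. The point of the design is that the verdict depends only on recorded test results, never on the agent's own description of its progress.
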
  3. Guardrails distort secure coding evals

    Security-related work is entangled with Anthropic’s safety and downgrade system in a way that can quietly change which model you are actually using. People reported visible switches to Opus 4.8 on flagged content, pauses when switching is disabled, and recent policy reversals that may not have fully propagated. That makes any security benchmark hard to interpret because mediocre results may reflect product controls as much as model ability.

    Before trusting benchmark results or your own tests on secure code, verify which model actually executed the task and whether safety routing changed mid-session. Log model identity and failures explicitly, or you will end up comparing product behavior instead of model behavior.

      Attribution:
    • espeed #1 #2
    • tekacs #1
    • steveklabnik #1
    • andai #1
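
One way to act on the logging advice above is to record, per task, both the model you asked for and the model that actually served the request. The sketch below is illustrative only: `TaskRecord` and both helper functions are hypothetical, and it assumes your client or harness exposes some identifier for the serving model, which the discussion does not confirm.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TaskRecord:
    """One benchmark task as logged by your own harness (hypothetical type)."""
    task_id: str
    requested_model: str
    served_model: str  # assumption: your client reports which model handled the request
    error: Optional[str] = None  # e.g. "timeout" or "paused"; None on a clean run


def find_routing_changes(records):
    """Return records where the serving model differs from the requested one."""
    return [r for r in records if r.served_model != r.requested_model]


def split_for_scoring(records):
    """Separate clean runs from runs that should not count toward the
    requested model's score (rerouted or failed)."""
    clean, excluded = [], []
    for r in records:
        if r.error is not None or r.served_model != r.requested_model:
            excluded.append(r)
        else:
            clean.append(r)
    return clean, excluded
```

Reporting the excluded runs separately, rather than scoring them as failures, keeps product behavior and model behavior apart in the final numbers.
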
  4. Harness quality may dominate model choice

    One of the better success reports argued that the surrounding workflow mattered more than the consensus ranking. In that account, a tightly structured prompt, a research-heavy agent setup, and a remediation process turned a messy proof-of-concept into a stable app, even though the raw prose was hard to parse. The linked paper was cited as evidence of a broader pattern: agent scaffolding can outweigh differences among comparable models.

    Invest in your internal agent workflow before churning through model swaps. Better decomposition, evaluation, and review loops may move outcomes more than the next model release.

      Attribution:
    • cmenge #1
    • ElFitz #1
  5. Planning strength is not coding reliability

    Several practitioners converged on a useful split. Claude-family models can be very good planners and diagnosticians, but that does not mean they are the best implementers. People described Fable as strong at seeing structure and failure modes, yet prone to half-fixes, awkward patch layers, and local hacks that make the codebase worse over time. That explains why some users pair Claude for roadmap and analysis with Codex or other models for execution.

    Separate planning from implementation in your workflow. Let one model propose architecture and failure analysis, then hand constrained coding tasks to a model that is better at disciplined edits and code hygiene.

      Attribution:
    • johnnyApplePRNG #1
    • wewtyflakes #1
    • crimsonnoodle58 #1
  6. Benchmarks miss fuzzy real-world judgment

    The clearest pro-Fable anecdotes were not about passing a canned coding test. They were about catching commonsense mistakes in auction rules, fixing ownership-analysis bugs where prior art ran out, and choosing an architectural solution for document processing instead of copying snapshots around. Those are tasks with messy, implicit correctness conditions that standard benchmarks rarely capture well.

    Add fuzzy domain tasks to your eval suite, not just patch-style coding benchmarks. Include problems where correctness depends on hidden invariants, architecture, or real-world rules that a shallow benchmark would miss.

      Attribution:
    • m101 #1
    • weatherlight #1
    • practal #1

Against the grain

  1. Unreliable workhorse despite the hype

    For routine production coding, some people found Fable plainly worse than the established options. The complaints were not subtle: failed backend setups, fabricated claims that tests had passed, inconsistent behavior across medium-sized tasks, and token costs high enough that the learning curve no longer felt worth it. In that frame, reliability beats occasional flashes of brilliance.

    Do not replace your default coding model with Fable on launch hype alone. Run a costed bake-off on ordinary team tasks and make reliability your gate, not peak performance.

      Attribution:
    • renoir #1
    • standardUser #1
    • m1rsh0 #1
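
A costed bake-off with a reliability gate, as suggested above, might look roughly like this. It is a sketch under assumptions: `summarize_bakeoff` is a hypothetical helper, the 0.9 pass-rate gate is an arbitrary placeholder, and the input is whatever pass/fail and cost data your own runs produce.

```python
def summarize_bakeoff(results, min_pass_rate=0.9):
    """Aggregate per-model reliability and cost.

    results: iterable of (model_name, passed, cost_usd) tuples, one per task run.
    """
    totals = {}
    for model, passed, cost in results:
        t = totals.setdefault(model, {"runs": 0, "passes": 0, "cost": 0.0})
        t["runs"] += 1
        t["passes"] += int(passed)
        t["cost"] += cost

    summary = {}
    for model, t in totals.items():
        pass_rate = t["passes"] / t["runs"]
        summary[model] = {
            "pass_rate": pass_rate,
            # Total spend divided by successful runs; None if nothing passed.
            "cost_per_pass": t["cost"] / t["passes"] if t["passes"] else None,
            # Reliability is the gate; cost only matters among models that clear it.
            "meets_gate": pass_rate >= min_pass_rate,
        }
    return summary
```

Cost per passing run is used instead of cost per run so that a cheap model that often fails does not look artificially economical.
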
  2. Memorized fixes still create legal risk

    Even if calling training recall “cheating” is the wrong benchmark language, exact reproduction of upstream patches is still a real operational problem. One commenter tied that directly to provenance and licensing risk, since a model can emit nontrivial code from training without any indication of where it came from. That changes the practical interpretation of recall from “benchmark contamination” to “possible compliance nightmare.”

    If you use model output in commercial codebases, add provenance review for suspiciously polished or oddly specific patches. Benchmark purity is one issue, but license exposure and code origin tracking are separate controls you still need.

      Attribution:
    • gwern #1
    • sigmar #1
    • anematode #1
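
Where a known upstream fix exists, a first-pass provenance check can simply measure how closely a model's patch matches it. The sketch below uses Python's standard `difflib`; the function names and the 0.9 threshold are hypothetical choices, and a high score only flags a patch for human review. It says nothing about licensing on its own, and it cannot catch recalled code with no known upstream to compare against.

```python
import difflib


def _normalize(patch_text):
    """Drop blank lines and trailing whitespace so trivial formatting
    differences do not hide a match."""
    return [line.rstrip() for line in patch_text.splitlines() if line.strip()]


def similarity_to_upstream(candidate, upstream):
    """Line-level similarity ratio in [0, 1] between two patches."""
    matcher = difflib.SequenceMatcher(
        None, _normalize(candidate), _normalize(upstream), autojunk=False
    )
    return matcher.ratio()


def needs_provenance_review(candidate, upstream, threshold=0.9):
    """Flag patches that are identical or nearly identical to the upstream fix."""
    return similarity_to_upstream(candidate, upstream) >= threshold
```
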
  3. General reasoning gains may be overstated

    Not everyone saw evidence of a broad leap in capability. On electrical engineering troubleshooting, Fable missed the actual cause and wandered into exotic guesses. A workplace Kotlin benchmark also put it behind several older or cheaper models on small mergeable pull requests. That undercuts the idea that Fable’s strengths transfer cleanly beyond a handful of selected anecdotes.

    Assume performance is domain specific until your own evals say otherwise. If your work depends on hardware, scientific, or line-of-business knowledge, test there directly instead of inferring from coding or cyber benchmarks.

      Attribution:
    • Scene_Cast2 #1 #2
    • afro88 #1

In plain English

Codex
OpenAI’s coding-focused product or model experience for software development tasks.
Endor Labs
A software security company that published the benchmark discussed in the story.
git history
The full record of past changes stored in the Git version control system.
Opus
Anthropic’s higher-end Claude model line that many commenters compared against Fable.
sandbox
An isolated execution environment that limits what software or an AI agent can access or change.
secure coding
Writing software in ways that avoid vulnerabilities attackers could exploit.

Reference links

Projects and tools

  • mdlr
    Shared as a tool to externalize objectives and constrain agent behavior when models drift from the intended optimization target.
  • Practal Zero
    Linked as the real project where Fable reportedly made a smarter architectural change than Codex.