HN Debrief

Feds freaked over Fable 5 after 'fix this code', not jailbreak, say researchers

  • AI
  • Security
  • Regulation
  • Developer Tools
  • Open Source

The article argues that federal concern around Anthropic’s Fable 5 was tied to an embarrassingly simple path around its cyber restrictions. Instead of asking for vulnerability discovery directly, researchers reportedly asked the model to fix code and produce tests. Because security fixes usually expose the flaw they patch, and tests can look a lot like exploit proofs, that was enough to recover the same information the guardrails were meant to block.

The key context is that Anthropic had positioned Mythos as unusually dangerous for offensive cyber use, then shipped Fable as a constrained version that was supposed to hand sensitive requests off to a weaker model. That framing set a trap for the company. If the model can still surface vulnerabilities through normal development work, the refusals were never robust. If the refusals are hardened enough to stop that, the product is crippled as a coding assistant.

If you rely on hosted coding models, treat policy shutdown risk as seriously as model quality. On security work, assume useful bug-finding and offensive capability are inseparable, so plan around open models, multi-vendor options, and stricter internal review rather than expecting provider guardrails to protect you.

Discussion mood

Strongly negative toward both the ban and Anthropic’s framing. The dominant view was that the restriction is technically incoherent because security review and security abuse are inseparable in code, and politically suspect because the administration appears willing to weaponize access to frontier models.

Key insights

  1. Patch streams become exploit feeds

    Watching code fixes is already enough to spot vulnerabilities, and frontier models make that process cheaper and faster. The important shift is not this one prompt, but that every patch stream for projects like Linux or nginx can now be triaged automatically for exploitability, which compresses the time defenders have between disclosure and active abuse.

    Shorten your patch-to-deploy window and assume attackers can diff and prioritize fixes at machine speed. If you ship widely used software, invest in release engineering and coordinated patch rollout, not just cleaner commit messages.

    Attribution:
      • jerf #1
      • pixl97 #1
      • baq #1
  2. The guardrails look classifier-driven

    Several commenters inferred that Fable’s safety layer behaves like a separate classifier that keys off topic signals rather than deep intent. That explains why explicit cybersecurity language triggers fallback while ordinary development language like bug fixing slips through, and why false positives show up in adjacent scientific work while more concrete risky work can pass if it avoids the wrong words.

    Do not mistake refusal behavior for robust capability control when evaluating model risk. If you are building on hosted models, test policy boundaries with realistic workflows, not just obvious banned prompts.

    Attribution:
      • cge #1
      • aesthesia #1
      • tmp10423288442 #1
      • htrp #1
  3. Task decomposition defeats policy layers

    A practical bypass pattern stood out beyond the specific “fix this code” prompt. Break the forbidden goal into small neutral tasks, strip out security language, and use either a human or a smaller unrestricted model to coordinate the pieces. That matters because it turns safety from a binary block into an optimization problem, and attackers are usually fine paying that coordination cost.

    Assume your adversary can compose multiple models and agents, including open ones, to recover blocked capabilities. Defenses that only work against single-shot prompts will not hold up in production or against determined users.

    Attribution:
      • HarHarVeryFunny #1
      • bitexploder #1 #2
      • pixl97 #1
      • chillfox #1
      • OutOfHere #1
  4. Anthropic boxed itself in with hype

    Commenters kept returning to the same strategic mistake. Anthropic spent months saying Mythos-class capability was too dangerous for broad release, then appeared to rely on leaky refusals to make a public version acceptable. Once you frame the model as near-weapon-grade, a trivial bypass is not a minor bug. It is evidence against your entire governance story.

    If you sell restricted access as your safety case, your controls need to survive boring prompts and common workflows. Otherwise regulators, customers, and rivals will treat your warnings as either fear marketing or proof you cannot operate your own system.

    Attribution:
      • martinald #1 #2
      • pjc50 #1
      • ceejayoz #1
      • ReptileMan #1
  5. Ownership checks break on open source

    One proposed fix was to let models do security work only for verified maintainers or project owners. Others tore that apart quickly. Open-source code can be legally copied, forked, and embedded into other projects, so ownership is blurry by design. Even strong package credentials would not solve the fact that legitimate security work often happens on software you do not maintain directly.

    Expect identity-gated security access to create heavy enterprise friction while doing little for open ecosystems. If this becomes policy, open-source teams and smaller vendors are likely to lose access first while determined attackers route around it.

    Attribution:
      • NiloCK #1
      • Retr0id #1
      • cogman10 #1 #2
      • ceejayoz #1
      • ReptileMan #1
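
The classifier-driven reading in the second insight can be made concrete with a toy example. The sketch below is not Anthropic's safety layer, whose design the commenters only inferred; it is a deliberately naive keyword filter with hypothetical names (`FLAGGED_TERMS`, `topic_filter`) and made-up prompts. It shows how a filter keyed to topic words sends an explicit security request to fallback, lets a "fix this code" request through, and flags an unrelated research prompt.

```python
import re

# Hypothetical topic signals for illustration only; not taken from any real product.
FLAGGED_TERMS = {"exploit", "vulnerability", "payload", "cve"}


def topic_filter(prompt: str) -> str:
    """Return 'fallback' if the prompt contains a flagged term, else 'allow'."""
    words = set(re.findall(r"[a-z]+", prompt.lower()))
    return "fallback" if words & FLAGGED_TERMS else "allow"


# Made-up prompts: an explicit security request, an ordinary development
# request with the same underlying goal, and a benign research question.
prompts = [
    "Find the vulnerability in this parser",
    "Fix this code and write a regression test for the parser",
    "Survey of CVE statistics for a research paper",
]

results = [topic_filter(p) for p in prompts]
for prompt, decision in zip(prompts, results):
    print(f"{decision:8s} {prompt}")
```

A filter like this judges vocabulary, not intent, which is why testing policy boundaries with realistic workflows tells you more than testing obvious banned prompts.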

Against the grain

  1. Friction still matters for real attackers

    Not everyone bought the claim that any bypass makes guardrails meaningless. One argument was that if getting useful offensive output takes far more tokens, reprompts, and supervision than a direct exploit-generation workflow, the safeguard still has value by limiting scale. That does not solve the problem, but it can move some abuse back into human-labor territory.

    Measure safety controls by the cost they impose on abuse, not by whether a bypass exists at all. For internal deployments, rate limits, audits, and workflow friction may be worth more than perfect refusal logic.

    Attribution:
      • Retr0id #1
      • Barbing #1
  2. Bug fixing is not exploit chaining

    A narrower technical objection was that the reported behavior does not prove Fable has the capability Anthropic described as uniquely dangerous. Finding and fixing individual vulnerabilities is not the same as assembling them into a reliable multi-step attack against defended targets. If Mythos was restricted for exploit chaining, this example may be evidence of weak prompt filtering, not evidence that the core restriction target leaked.

    Separate vulnerability discovery from full offensive automation when you evaluate model risk or vendor claims. A model that patches code well may still be materially less dangerous than one that can plan and execute exploit chains end to end.

    Attribution:
      • neuronexmachina #1
      • zozbot234 #1
  3. Unfixable leakage supports stricter limits

    A minority took the opposite lesson from the article. If useful coding models cannot avoid leaking security-relevant knowledge through ordinary workflows, then that is evidence they are too dangerous to release broadly, not evidence that restrictions should be abandoned. In that framing, the problem is not a bad ban but a capability that policy cannot safely fence.

    If your organization is adopting frontier coding models, decide explicitly whether you believe leakage is manageable or inherent. That answer should drive procurement, deployment scope, and whether you permit external hosted tools on sensitive code at all.

    Attribution:
      • catigula #1 #2
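
The friction argument in the first item above suggests scoring a control by the extra cost it imposes, not by whether a bypass exists. A minimal sketch of that arithmetic follows. Every name and number in it is an assumption chosen for illustration (the `Workflow` fields, the token price, the hourly rate, the prompt counts); none comes from the article or the thread.

```python
from dataclasses import dataclass


@dataclass
class Workflow:
    """Rough per-result effort for one way of using a model (illustrative)."""
    name: str
    prompts_per_result: float
    tokens_per_prompt: float
    supervision_minutes: float


def cost_per_result(w: Workflow, usd_per_1k_tokens: float, usd_per_hour: float) -> float:
    """Token spend plus human supervision time for one useful result."""
    token_cost = w.prompts_per_result * w.tokens_per_prompt / 1000 * usd_per_1k_tokens
    labor_cost = w.supervision_minutes / 60 * usd_per_hour
    return token_cost + labor_cost


def friction_multiplier(direct: Workflow, bypass: Workflow,
                        usd_per_1k_tokens: float, usd_per_hour: float) -> float:
    """How many times more expensive the bypass is than the direct workflow."""
    return (cost_per_result(bypass, usd_per_1k_tokens, usd_per_hour)
            / cost_per_result(direct, usd_per_1k_tokens, usd_per_hour))


# Assumed figures: the bypass needs ten times the prompts and an hour of supervision.
direct = Workflow("direct request", prompts_per_result=2,
                  tokens_per_prompt=4000, supervision_minutes=5)
bypass = Workflow("decomposed request", prompts_per_result=20,
                  tokens_per_prompt=4000, supervision_minutes=60)

multiplier = friction_multiplier(direct, bypass, usd_per_1k_tokens=0.01, usd_per_hour=60.0)
print(f"bypass costs about {multiplier:.1f}x the direct workflow")
```

With these assumed inputs the human supervision term dominates, which matches the commenters' point that a control can push abuse back toward human-labor cost even when it does not block it.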

In plain English

classifier
A model or rule system that labels content into categories, such as allowed versus disallowed requests.
diff
The code changes between one version of a file or project and another, often shown line by line.
exploit chaining
Combining multiple smaller software weaknesses into a larger attack path that achieves a serious compromise.
Fable 5
Anthropic’s public or more broadly accessible model version that was supposed to have stricter limits around sensitive tasks.
guardrails
Rules, filters, or system components intended to limit what an AI model will do or say.
jailbreak
A way of prompting or using an AI model so it bypasses its intended safety or policy restrictions.
Mythos
Anthropic’s higher-end model in this story, described as having stronger cybersecurity capabilities than the public version.
