Feds freaked over Fable 5 after 'fix this code', not jailbreak, say researchers

AI
Security
Regulation
Developer Tools
Open Source

The article argues that federal concern around Anthropic’s Fable 5 was tied to an embarrassingly simple path around its cyber restrictions. Instead of asking for vulnerability discovery directly, researchers reportedly asked the model to fix code and produce tests. Because security fixes usually expose the flaw they patch, and tests can look a lot like exploit proofs, that was enough to recover the same information the guardrails were meant to block. The key context is that Anthropic had positioned Mythos as unusually dangerous for offensive cyber use, then shipped Fable as a constrained version that was supposed to hand sensitive requests off to a weaker model. That framing set a trap for Anthropic. If the model can still surface vulnerabilities through normal development work, the refusals were never robust. If Anthropic hardens the refusals enough to stop that, it cripples the product as a coding assistant.

If you rely on hosted coding models, treat policy shutdown risk as seriously as model quality. On security work, assume useful bug-finding and offensive capability are inseparable, so plan around open models, multi-vendor options, and stricter internal review rather than expecting provider guardrails to protect you.

June 16, 2026
theregister.com

Discussion mood

Strongly negative toward both the ban and Anthropic’s framing. The dominant view was that the restriction is technically incoherent because security review and security abuse are inseparable in code, and politically suspect because the administration appears willing to weaponize access to frontier models.

Key insights

Patch streams become exploit feeds

Watching code fixes is already enough to spot vulnerabilities, and frontier models make that process cheaper and faster. The important shift is not this one prompt, but that every patch stream for projects like Linux or nginx can now be triaged automatically for exploitability, which compresses the time defenders have between disclosure and active abuse.

Shorten your patch-to-deploy window and assume attackers can diff and prioritize fixes at machine speed. If you ship widely used software, invest in release engineering and coordinated patch rollout, not just cleaner commit messages.
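One way to make the patch-to-deploy window concrete is to track it per fix and flag rollouts that exceed a budget. The sketch below is a minimal illustration, not a description of any particular team's tooling. The `PatchRollout` record, its field names, and the 48-hour budget are all assumptions made for the example.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import median


@dataclass
class PatchRollout:
    """Hypothetical record of one security fix, from public commit to full deployment."""
    name: str
    fix_published: datetime   # when the fix became publicly visible (e.g. merged upstream)
    fully_deployed: datetime  # when the last production host ran the fixed build


def exposure_hours(rollout: PatchRollout) -> float:
    """Hours during which the fix was public but not yet deployed everywhere."""
    delta = rollout.fully_deployed - rollout.fix_published
    return delta.total_seconds() / 3600


def median_exposure_hours(rollouts: list[PatchRollout]) -> float:
    """Median exposure window across all tracked rollouts."""
    return median(exposure_hours(r) for r in rollouts)


def over_budget(rollouts: list[PatchRollout], budget: timedelta) -> list[str]:
    """Names of rollouts whose exposure window exceeded the budget."""
    limit = budget.total_seconds() / 3600
    return [r.name for r in rollouts if exposure_hours(r) > limit]


if __name__ == "__main__":
    # Illustrative data only; the names and timestamps are made up.
    history = [
        PatchRollout("bounds-check-fix", datetime(2026, 1, 5, 9), datetime(2026, 1, 5, 21)),
        PatchRollout("parser-overflow-fix", datetime(2026, 2, 1, 0), datetime(2026, 2, 4, 0)),
        PatchRollout("auth-check-fix", datetime(2026, 3, 10, 8), datetime(2026, 3, 11, 8)),
    ]
    print(median_exposure_hours(history))
    print(over_budget(history, timedelta(hours=48)))
```

If attackers can triage fixes at machine speed, the number to drive down is the exposure window itself, which is why the sketch measures from the moment a fix becomes public and not from the moment a CVE is assigned.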

Attribution:

jerf #1
pixl97 #1
baq #1

The guardrails look classifier-driven

Several commenters inferred that Fable’s safety layer behaves like a separate classifier that keys off topic signals rather than deep intent. That explains why explicit cybersecurity language triggers fallback while ordinary development language like bug fixing slips through, and why false positives show up in adjacent scientific work while more concrete risky work can pass if it avoids the wrong words.

Do not mistake refusal behavior for robust capability control when evaluating model risk. If you are building on hosted models, test policy boundaries with realistic workflows, not just obvious banned prompts.
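The failure mode commenters described can be illustrated with a toy filter that keys on topic words. This is a deliberately simplified sketch and says nothing about how Anthropic's actual classifiers work. The term list, the function name `topic_filter_blocks`, and the sample prompts are all invented for the example.

```python
import re

# Hypothetical list of topic signals; a real safety classifier is far more complex.
SECURITY_TERMS = ("exploit", "vulnerability", "cve", "payload", "shellcode")


def topic_filter_blocks(prompt: str) -> bool:
    """Return True if the prompt contains any security-flavored term as a whole word."""
    text = prompt.lower()
    return any(re.search(rf"\b{re.escape(term)}\b", text) for term in SECURITY_TERMS)


# Each case pairs a prompt with whether it is actually security-sensitive.
CASES = [
    ("Find the vulnerability in this parser and write an exploit.", True),
    ("Fix this code and add a regression test that reproduces the crash.", True),
    ("Estimate the payload mass for this sounding rocket.", False),
]


def evaluate(cases):
    """Split cases into missed sensitive prompts and wrongly blocked benign ones."""
    false_negatives = [p for p, sensitive in cases if sensitive and not topic_filter_blocks(p)]
    false_positives = [p for p, sensitive in cases if not sensitive and topic_filter_blocks(p)]
    return false_negatives, false_positives


if __name__ == "__main__":
    missed, wrongly_blocked = evaluate(CASES)
    print("missed:", missed)
    print("wrongly blocked:", wrongly_blocked)
```

In this toy setup the explicit request is blocked, the "fix this code" request passes, and the rocketry question is blocked by accident. That is the same pattern of misses and false positives the commenters attributed to topic-keyed filtering, and it is why a test suite for policy boundaries needs realistic workflow prompts alongside the obvious banned ones.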

Attribution:

cge #1
aesthesia #1
tmp10423288442 #1
htrp #1

Task decomposition defeats policy layers

A practical bypass pattern stood out beyond the specific “fix this code” prompt. Break the forbidden goal into small neutral tasks, strip out security language, and use either a human or a smaller unrestricted model to coordinate the pieces. That matters because it turns safety from a binary block into an optimization problem, and attackers are usually fine paying that coordination cost.

Assume your adversary can compose multiple models and agents, including open ones, to recover blocked capabilities. Defenses that only work against single-shot prompts will not hold up in production or against determined users.

Attribution:

HarHarVeryFunny #1
bitexploder #1 #2
pixl97 #1
chillfox #1
OutOfHere #1

Anthropic boxed itself in with hype

Commenters kept returning to the same strategic mistake. Anthropic spent months saying Mythos-class capability was too dangerous for broad release, then appeared to rely on leaky refusals to make a public version acceptable. Once you frame the model as near-weapon-grade, a trivial bypass is not a minor bug. It is evidence against your entire governance story.

If you sell restricted access as your safety case, your controls need to survive boring prompts and common workflows. Otherwise regulators, customers, and rivals will treat your warnings as either fear marketing or proof you cannot operate your own system.

Attribution:

martinald #1 #2
pjc50 #1
ceejayoz #1
ReptileMan #1

Ownership checks break on open source

One proposed fix was to let models do security work only for verified maintainers or project owners. Others tore that apart quickly. Open-source code can be legally copied, forked, and embedded into other projects, so ownership is blurry by design. Even strong package credentials would not solve the fact that legitimate security work often happens on software you do not maintain directly.

Expect identity-gated security access to create heavy enterprise friction while doing little for open ecosystems. If this becomes policy, open-source teams and smaller vendors are likely to lose access first while determined attackers route around it.

Attribution:

NiloCK #1
Retr0id #1
cogman10 #1 #2
ceejayoz #1
ReptileMan #1

Against the grain

Friction still matters for real attackers

Not everyone bought the claim that any bypass makes guardrails meaningless. One argument was that if getting useful offensive output takes far more tokens, reprompts, and supervision than a direct exploit-generation workflow, the safeguard still has value by limiting scale. That does not solve the problem, but it can move some abuse back into human-labor territory.

Measure safety controls by the cost they impose on abuse, not by whether a bypass exists at all. For internal deployments, rate limits, audits, and workflow friction may be worth more than perfect refusal logic.
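The cost framing can be turned into a rough number by comparing what a direct workflow would cost against what the bypass workflow costs. The sketch below is a back-of-the-envelope model under stated assumptions: the `Workflow` fields, the two minutes charged per reprompt, the token price, and the hourly rate are all hypothetical and not taken from the discussion.

```python
from dataclasses import dataclass

MINUTES_PER_REPROMPT = 2.0  # assumed human time spent rewording each blocked request


@dataclass
class Workflow:
    """Hypothetical resource usage for getting one useful result from a model."""
    tokens: int                 # total tokens consumed
    reprompts: int              # number of times the request had to be reworded
    supervision_minutes: float  # other human time spent steering and checking output


def workflow_cost(w: Workflow, usd_per_million_tokens: float, usd_per_hour: float) -> float:
    """Total cost in USD: token spend plus human time."""
    human_minutes = w.supervision_minutes + w.reprompts * MINUTES_PER_REPROMPT
    return (w.tokens / 1_000_000) * usd_per_million_tokens + (human_minutes / 60) * usd_per_hour


def friction_multiplier(direct: Workflow, bypass: Workflow,
                        usd_per_million_tokens: float, usd_per_hour: float) -> float:
    """How many times more expensive the bypass is than the direct workflow."""
    return (workflow_cost(bypass, usd_per_million_tokens, usd_per_hour)
            / workflow_cost(direct, usd_per_million_tokens, usd_per_hour))


if __name__ == "__main__":
    # Made-up figures purely to show the calculation.
    direct = Workflow(tokens=200_000, reprompts=1, supervision_minutes=4)
    bypass = Workflow(tokens=2_000_000, reprompts=30, supervision_minutes=60)
    print(round(friction_multiplier(direct, bypass, 10.0, 100.0), 1))
```

A multiplier near 1 suggests the control adds little real friction, while a large multiplier suggests it pushes abuse back toward human-labor costs even though a bypass exists. The inputs are the hard part, so treat any such figure as an estimate to argue about, not a measurement.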

Attribution:

Retr0id #1
Barbing #1

Bug fixing is not exploit chaining

A narrower technical objection was that the reported behavior does not prove Fable has the capability Anthropic said was uniquely dangerous. Finding and fixing individual vulnerabilities is not the same as assembling them into a reliable multi-step attack against defended targets. If Mythos was restricted for exploit chaining, this example may be evidence of weak prompt filtering, not evidence that the core restricted capability leaked.

Separate vulnerability discovery from full offensive automation when you evaluate model risk or vendor claims. A model that patches code well may still be materially less dangerous than one that can plan and execute exploit chains end to end.

Attribution:

neuronexmachina #1
zozbot234 #1

Unfixable leakage supports stricter limits

A minority took the opposite lesson from the article. If useful coding models cannot avoid leaking security-relevant knowledge through ordinary workflows, then that is evidence they are too dangerous to release broadly, not evidence that restrictions should be abandoned. In that framing, the problem is not a bad ban but a capability that policy cannot safely fence.

If your organization is adopting frontier coding models, decide explicitly whether you believe leakage is manageable or inherent. That answer should drive procurement, deployment scope, and whether you permit external hosted tools on sensitive code at all.

Attribution:

catigula #1 #2

In plain English

classifier

A model or rule system that labels content into categories, such as allowed versus disallowed requests.

diff

The code changes between one version of a file or project and another, often shown line by line.

exploit chaining

Combining multiple smaller software weaknesses into a larger attack path that achieves a serious compromise.

Fable 5

A coding model referenced in the comments as very capable but expensive.

guardrails

Rules, filters, or extra model layers that block or shape what an AI system will accept or say.

jailbreak

A prompt or interaction pattern that gets an AI model to bypass its safety or behavior restrictions.

Mythos

A model name used in the report and comments for a highly capable U.S. cyber-focused system.

Reference links

Primary source and analysis

Luta Security blog post on the Fable 5 export controls
Referenced as the underlying blog post from the researcher discussed in the article.
TechCrunch article on the Anthropic models ban
Mentioned as another report framing the ban as not really about a jailbreak.
Tom's Hardware report quoting David Sacks on the ban
Cited as evidence that the U.S. government position was publicly confirmed.

Security and software examples

Google Project Zero deep dive on NSO zero-click exploit
Used to illustrate that some patches are unusually hard to reverse into an exploit, but most are not.
XZ Utils backdoor
Brought up in debate over whether maintainer verification can really distinguish legitimate developers from patient attackers.
Reduction in complexity theory
Linked to support the claim that harmful requests can often be transformed into allowed-looking ones.

Anthropic safety methods

Anthropic research on constitutional classifiers
Shared as background on how Anthropic structures its classifier-based safety systems.
Anthropic research on next-generation constitutional classifiers
Shared as a newer reference for Anthropic’s guardrail approach.

Policy and political context

The Guardian on OpenAI, the U.S. military, and Anthropic
Cited to support claims about White House pressure on Anthropic over Defense Department access and terms.
Business Law Today on DoD AI acquisition and IP rights
Linked as additional context for the government dispute around AI procurement and control.
Financial Times report on government access to Mythos
Referenced in speculation that offensive government teams may still retain access to stronger models.
OpenAI 2019 GPT-2 limited release announcement
Used to show that “too dangerous to release” messaging has precedent in the AI industry.

Books, essays, and media references

Rich Hickey, Simple Made Easy
Referenced in a side discussion separating simplicity from ease when talking about jailbreaks.
Feynman, The Value of Science
Quoted for the idea that the same tool can open both the gates of heaven and hell.
Severance Lumon departments wiki
Shared as a cultural reference for compartmentalized work obscuring the larger plan.