Will It Mythos?

The post builds a benchmark from nine real security bugs that Anthropic said its internal Mythos system found in open source projects. The author used those bugs as a test set for public models, asking each one to audit the relevant file with access to the rest of the repo but without being told what the bug was. The point was not to prove Mythos was fake but to check whether the gap between Mythos and public models was dramatic or mostly hype. The answer landed in the middle. Public models did find some of the bugs, but the best results were still only around four out of nine in a single pass, and some leaderboard placements were distorted by models timing out or exhausting a cost cap before finishing all cases. Cheap models from DeepSeek and MiMo delivered much better value for money than many expected, Gemini underperformed badly in this setup, and later replication tests mentioned in the comments suggest Gemma 4 31B may be the strongest self-hostable option the author has tried when given multiple attempts.

If you rely on LLMs for security review, treat model choice and cost as moving targets rather than assuming the biggest US frontier model wins. For now, the practical play is to benchmark your own workflow, keep an eye on self-hostable models like Gemma, and not confuse bug finding with full autonomous exploitation or secure code generation.

June 23, 2026
swelljoe.com

Discussion mood

Interested and impressed, but not credulous. People liked having a concrete benchmark and many who had used Fable said it felt materially better than current public models, yet a lot of comments pushed on methodology, cost distortions, and the gap between finding bugs in a benchmark and Anthropic’s broader claims about autonomous exploitation and safety risk.

Key insights

Leaderboard ordering hides the real winners

The ranking as shown overstates GPT-5.5 Pro because it exhausted its budget after only four cases, so its hit rate of 2 out of 4 floats to the top even though several models found more total bugs across the full set. Applying a Wilson-score-style adjustment, or simply looking at completed cases, changes the picture and makes models like DeepSeek V4, MiMo v2.5 Pro, Opus 4.8, and standard GPT-5.5 look like the practical leaders. That reframes the benchmark from a raw frontier race into a cost-and-throughput tradeoff where the absurdly expensive model is mostly irrelevant for real audit workflows.

Do not consume LLM security benchmarks as a single sorted table. Re-rank them for completed tasks, confidence, latency, and total audit cost before choosing a model for production work.
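
One way to make that re-ranking concrete is the lower bound of the Wilson score interval, which penalizes a high hit rate earned on only a few cases. The sketch below is illustrative only: the model labels are placeholders, the 95% confidence level is an assumed choice, and the counts reuse just the two figures mentioned above (2 of 4, and roughly 4 of 9), not the full leaderboard.

```python
import math


def wilson_lower_bound(found: int, attempted: int, z: float = 1.96) -> float:
    """Lower bound of the Wilson score interval for a hit rate.

    z = 1.96 corresponds to roughly 95% confidence (an assumed choice).
    """
    if attempted == 0:
        return 0.0
    p = found / attempted
    z2 = z * z
    centre = p + z2 / (2 * attempted)
    margin = z * math.sqrt(p * (1 - p) / attempted + z2 / (4 * attempted ** 2))
    return (centre - margin) / (1 + z2 / attempted)


# Placeholder labels mapped to (bugs found, cases completed).
results = {
    "budget-capped model": (2, 4),
    "full-run model": (4, 9),
}

# Sort by the Wilson lower bound instead of the raw hit rate.
ranked = sorted(
    results,
    key=lambda name: wilson_lower_bound(*results[name]),
    reverse=True,
)

for name in ranked:
    found, attempted = results[name]
    raw = found / attempted
    adjusted = wilson_lower_bound(found, attempted)
    print(f"{name}: raw={raw:.2f} wilson_lb={adjusted:.2f}")
```

On raw hit rate the 2-of-4 run wins (0.50 versus 0.44), but on the Wilson lower bound the 4-of-9 run moves ahead (about 0.19 versus about 0.15), which is the kind of reordering the commenters described.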

Attribution:

JumpCrisscross #1
SwellJoe #1

Gemma looks like the sleeper self-hosted option

The strongest new signal beyond the post is that Gemma 4 31B may matter more than the flashy frontier models for real teams. The author says unpublished replication runs had it consistently finding four of nine bugs and sometimes six with multiple attempts, outperforming other self-hostable models tried so far. That also weakens the simple story that US models are uniformly held back by security guardrails, because Google’s closed Gemini setup struggled here while open Gemma reportedly did very well.

If you want controllable security review without depending entirely on hosted APIs, put Gemma 4 31B into your evals now. Self-hosted capability is moving fast enough that it can change your vendor and data-handling strategy, not just your model choice.
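
Single-pass and multiple-attempt scores are easy to blur together, so an eval should report both. A minimal sketch, assuming each attempt is recorded as the set of benchmark bugs it found. The bug IDs and attempt data here are invented for illustration and are not the author's results.

```python
from statistics import mean


def summarize_attempts(attempts: list[set[str]], total_bugs: int) -> dict[str, float]:
    """Compare per-attempt finds with the union across all attempts."""
    per_attempt = [len(found) for found in attempts]
    found_in_any = set().union(*attempts)  # bugs found at least once
    found_in_all = set.intersection(*attempts) if attempts else set()
    return {
        "mean_single_pass": mean(per_attempt) if per_attempt else 0.0,
        "best_single_pass": max(per_attempt, default=0),
        "found_in_any_attempt": len(found_in_any),
        "found_in_every_attempt": len(found_in_all),
        "total_bugs": total_bugs,
    }


# Hypothetical data: three attempts over a nine-bug set.
attempts = [
    {"bug1", "bug2", "bug3", "bug4"},
    {"bug1", "bug2", "bug5", "bug6"},
    {"bug1", "bug3", "bug4", "bug5"},
]

summary = summarize_attempts(attempts, total_bugs=9)
print(summary)
```

Reporting the "found in every attempt" count next to the "found in any attempt" count shows how much of a multi-attempt score depends on luck rather than consistent capability.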

Attribution:

SwellJoe #1 #2 #3 #4 #5

Mythos may be more about autonomy than brilliance

A recurring expert read was that Mythos and Fable are notable because they stay on task and execute an end-to-end hunt, not because they possess some magical vulnerability-only intelligence. That matters because the author’s own tests found richer agent harnesses often raised token burn and latency without improving finds, which suggests the winning setup is not just bolting more tools onto a weaker model. The differentiator may be training a model and harness together for long-horizon security work so it knows when to dig, branch, and persist without constant user steering.

If you are building internal AI security tooling, invest less in generic agent feature creep and more in tightly scoped workflows with evals for persistence and task completion. The valuable moat may be specialized autonomy, not another menu of tools.

Attribution:

vessenes #1
jaggederest #1
Tossrock #1
irthomasthomas #1
SwellJoe #1

The benchmark misses exploitation and false positives

The post answers only one slice of the Mythos story, which is bug discovery on files known to contain serious flaws. It does not test whether a model can reliably exploit what it finds, chain issues into a working attack, or stay quiet on clean code. Those missing pieces are exactly where Anthropic’s public risk claims get more consequential, because a model that finds some bugs but floods you with false alarms is a very different operational threat from one that autonomously turns findings into working compromises.

Use this benchmark as a signal about vulnerability discovery only. If you care about security operations or model risk, you still need separate evals for exploitability, false positive rate, and behavior on clean repos.
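
A separate false-positive eval can start small: mix files with known bugs and files believed to be clean, then score the model on both. The sketch below assumes a hypothetical `AuditResult` record and made-up file names. It treats any report on a buggy file as a hit, which is a simplification, since a real eval would check that the reported issue matches the known bug.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AuditResult:
    file: str
    has_known_bug: bool  # ground truth for this file
    flagged: bool        # did the model report a vulnerability here?


def discovery_metrics(results: list[AuditResult]) -> dict[str, float]:
    """Detection rate on buggy files and false positive rate on clean ones."""
    tp = sum(r.has_known_bug and r.flagged for r in results)
    fn = sum(r.has_known_bug and not r.flagged for r in results)
    fp = sum(not r.has_known_bug and r.flagged for r in results)
    tn = sum(not r.has_known_bug and not r.flagged for r in results)
    return {
        "detection_rate": tp / (tp + fn) if tp + fn else 0.0,
        "false_positive_rate": fp / (fp + tn) if fp + tn else 0.0,
        "precision": tp / (tp + fp) if tp + fp else 0.0,
    }


# Hypothetical run: four buggy files and four clean files.
run = [
    AuditResult("parser.c", True, True),
    AuditResult("auth.c", True, True),
    AuditResult("alloc.c", True, False),
    AuditResult("net.c", True, False),
    AuditResult("util.c", False, True),
    AuditResult("log.c", False, False),
    AuditResult("fmt.c", False, False),
    AuditResult("cli.c", False, False),
]

metrics = discovery_metrics(run)
print(metrics)
```

Two models with the same detection rate can differ sharply on false positive rate, and that difference is invisible in a benchmark built only from files known to contain flaws.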

Attribution:

seizethecheese #1
wrs #1
_alternator_ #1

Guardrails are uneven and product-specific

Comments exposed a practical distinction between model capability and product policy. Gemini in Antigravity reportedly refused this kind of security work, while Google’s open Gemma handled it well in later tests. Fable was blocked before the author could benchmark it. Users outside the US or Europe also pointed out that access restrictions can matter as much as raw capability. In practice, 'which model is best' is often the wrong question. The real question is which model-provider-interface combination will actually let you do the task.

Evaluate the full stack, not just the base model name. Access rules, refusal behavior, region limits, and harness integration can dominate benchmarked capability in day-to-day security work.

Attribution:

SwellJoe #1 #2 #3
wiz21c #1
utopcell #1

Against the grain

Fable may only feel better at the margins

Some hands-on comparisons found Fable and Codex landing in basically the same place on actual solution quality, even when Fable sounded better and felt more polished. In this view, labs are mostly tuning style, harnesses, and user experience while model capability itself is moving only incrementally. That cuts against the dominant excitement that Fable represented a major leap.

Separate 'pleasant to work with' from 'materially more capable' in your evals. If a new model mainly improves prose and flow, optimize your harness before paying a premium for it.

Attribution:

varjag #1
mirsadm #1
_heimdall #1

A lot of nerf talk is just vibes

Several commenters pushed back on the familiar claim that older hosted models are constantly being lobotomized. They argued most users are making retrospective judgments from memory and style changes, not blinded repeatable tests, which makes the whole conversation vulnerable to novelty effects and confirmation bias. That skepticism is useful because many of the strongest Fable testimonials rely on exactly the kind of hard-to-reproduce personal workflow experience that flatters new releases.

Before you conclude a vendor degraded a model, build a small blinded regression suite from your own tasks. Without that, you are mostly measuring your own shifting expectations.
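
A blinded comparison needs little machinery. This is a minimal sketch, assuming you have saved outputs from an older and a newer model snapshot for the same tasks. The function names, task IDs, and outputs are hypothetical, and the grader (human or automated) only ever sees the unlabeled sides "A" and "B".

```python
import random


def make_blind_trials(
    old_outputs: dict[str, str],
    new_outputs: dict[str, str],
    seed: int = 0,
) -> list[dict]:
    """Pair outputs per task and randomize which side is shown as A or B."""
    rng = random.Random(seed)
    trials = []
    for task_id in sorted(old_outputs.keys() & new_outputs.keys()):
        pair = [("old", old_outputs[task_id]), ("new", new_outputs[task_id])]
        rng.shuffle(pair)
        trials.append({
            "task_id": task_id,
            "A": pair[0][1],
            "B": pair[1][1],
            # The key stays hidden from the grader until scoring is done.
            "key": {"A": pair[0][0], "B": pair[1][0]},
        })
    return trials


def unblind(trials: list[dict], preferences: dict[str, str]) -> dict[str, int]:
    """Map each blind preference ("A", "B", or "tie") back to old/new."""
    counts = {"old": 0, "new": 0, "tie": 0}
    for trial in trials:
        choice = preferences.get(trial["task_id"], "tie")
        counts[trial["key"].get(choice, "tie")] += 1
    return counts


# Hypothetical saved outputs for two tasks.
old = {"task-1": "old answer 1", "task-2": "old answer 2"}
new = {"task-1": "new answer 1", "task-2": "new answer 2"}

trials = make_blind_trials(old, new, seed=42)
preferences = {"task-1": "A", "task-2": "tie"}  # grader's blind choices
print(unblind(trials, preferences))
```

Saving outputs at the time each snapshot is live matters more than the scoring code, because a hosted model that has since changed cannot be re-run later for comparison.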

Attribution:

dist-epoch #1
anentropic #1
cpburns2009 #1

Mythos hype may be investor theater

A cynical minority saw the entire Mythos narrative as part capability story and part valuation campaign. In that framing, apocalyptic language around autonomous security threat is as much about shaping courts, regulators, and investors as about describing the actual state of the art. That does not mean the models are weak. It means company messaging is not a clean read on technical reality.

Treat vendor threat narratives as strategic communications, not neutral technical documentation. Base procurement or policy decisions on independent evals and observed workflow gains.

Attribution:

p0w3n3d #1
netcan #1
felipeerias #1
delusional #1

In plain English

agent

A software setup that lets a model take multi-step actions, use tools, and iterate toward a goal instead of only answering a single prompt.

Antigravity

A tool or interface mentioned in comments for running Gemini models, especially in coding or agentic workflows.

DeepSeek

A Chinese AI lab and its models, often discussed as lower-cost alternatives to US frontier models.

Fable

A more guardrailed Anthropic cybersecurity model discussed as a safer or more restricted version of Mythos.

false positive

A case where a system reports a bug or threat that is not actually real.

Gemini

Google's family of AI models and products.

Gemma 4 31B

An open model from Google with about 31 billion parameters, small enough to self-host on high-end local hardware, unlike the largest frontier models.

harness

The wrapper software, prompts, tools, and control logic around a model that shape how it performs a task.

MiMo

An AI model or service mentioned as a low-cost alternative in coding workflows.

Mythos

A restricted Anthropic cybersecurity model discussed in the comments as having stronger offensive capabilities than public models.

Opus

Anthropic’s high-end Claude model family used for difficult reasoning and coding tasks.

repo

Repository, a project’s collection of source code and related files, usually stored in version control.

Reference links

Benchmark and methodology references

Binomial proportion confidence interval and Wilson score interval
Suggested as a better way to rank models when some completed only a few benchmark cases before hitting cost caps.
Quesma BinaryAudit benchmark
Offered as another security-oriented benchmark with ROC and pricing visualizations.
The Frontier model rankings
Cited as blinded Elo-style evidence that model quality differences can be measured beyond anecdotal vibes.

Model behavior and AI research

Anthropic on Claude Code and expertise
Used to support the claim that expert users often get better results because they trigger different model behavior.
Anthropic on emotion concepts in models
Referenced to justify the idea that conversational tone and emotional framing may affect model responses.
Anthropic news on Fable and Mythos access
Linked to confirm that access to Fable was shut down broadly, not only for certain regions.

Author benchmark follow-ups and artifacts

Qwen quantization degradation benchmark
Shared by the author as follow-up data on quantization and replication tests for Qwen models.
Gemma promptlab report
Linked by the author as unpublished replication evidence that Gemma 4 31B can find up to six of nine bugs with multiple attempts.
Claude transcripts repository
Provided as the repo containing an exported HTML transcript and code from a long Fable coding session.
36 hours with Fable
Detailed user write-up and transcript offered as evidence for strong Fable performance on a complex coding task.

Tools and projects mentioned

Gemini CLI security extension
Used to show that Google previously exposed a security-specific workflow even though later Gemini interfaces refused similar work.
Ventoy repository
Mentioned as a candidate project someone wanted audited for supply-chain or low-level security risk.
Notes app project
Referenced as the Qt C++ app where one commenter said Fable uniquely found a data corruption bug.

Books and historical references

The Elements of Programming Style
Linked as the source of Kernighan’s law about debugging code that is written too cleverly.
Next Generation issue 26
Used for an analogy comparing current LLM progress hype to 1990s claims of each new 3D engine being photorealistic.

Background references

Wikipedia on seL4
Cited as a rare example of software built around formal verification for strong security guarantees.
Wikipedia on True Will
Mentioned during a side discussion about free will, agency, and how those ideas apply to LLMs.
Dan Luu on productivity and velocity
Referenced in passing to support the author's comment that extra courteous prompting costs little for a fast typist.