HN Debrief

Will It Mythos?

  • AI
  • Security
  • Open Source
  • Developer Tools
  • Infrastructure

The post builds a benchmark from nine real security bugs that Anthropic said its internal Mythos system found in open source projects. The author used those bugs as a test set for public models, asking them to audit the relevant file with access to the rest of the repo but without being told what the bug was. The point was not to prove Mythos was fake. It was to check whether the gap between Mythos and public models was dramatic, or mostly hype. The answer landed in the middle. Public models did find some of the bugs, but the best results were still only around four out of nine in a single pass, and some leaderboard placements were distorted by models timing out or burning through a cost cap before finishing all cases. Cheap models from DeepSeek and MiMo looked much better than many expected on bang for buck, Gemini underperformed badly in this setup, and later replication tests mentioned in comments suggest Gemma 4 31B may be the strongest self-hostable option the author has tried when given multiple attempts.

If you rely on LLMs for security review, treat model choice and cost as moving targets rather than assuming the biggest US frontier model wins. For now, the practical play is to benchmark your own workflow, keep an eye on self-hostable models like Gemma, and not confuse bug finding with full autonomous exploitation or secure code generation.

Discussion mood

Interested and impressed, but not credulous. People liked having a concrete benchmark and many who had used Fable said it felt materially better than current public models, yet a lot of comments pushed on methodology, cost distortions, and the gap between finding bugs in a benchmark and Anthropic’s broader claims about autonomous exploitation and safety risk.

Key insights

  1.

    Leaderboard ordering hides the real winners

    The ranking as shown overstates GPT-5.5 Pro because it exhausted its budget after only four cases, so its 2-out-of-4 hit rate floats to the top even though several models found more total bugs across the full set. Applying a Wilson-score-style adjustment, or simply accounting for how many cases each model completed, changes the picture and makes models like DeepSeek V4, MiMo v2.5 Pro, Opus 4.8, and standard GPT-5.5 look like the practical leaders. That reframes the benchmark from a raw frontier race into a cost-and-throughput tradeoff in which the absurdly expensive model is mostly irrelevant for real audit workflows.

    Do not consume LLM security benchmarks as a single sorted table. Re-rank them for completed tasks, confidence, latency, and total audit cost before choosing a model for production work.

      Attribution:
    • JumpCrisscross #1
    • SwellJoe #1
  2.

    Gemma looks like the sleeper self-hosted option

    The strongest new signal beyond the post is that Gemma 4 31B may matter more than the flashy frontier models for real teams. The author says unpublished replication runs had it consistently finding four of nine bugs and sometimes six with multiple attempts, outperforming other self-hostable models tried so far. That also weakens the simple story that US models are uniformly held back by security guardrails, because Google’s closed Gemini setup struggled here while open Gemma reportedly did very well.

    If you want controllable security review without depending entirely on hosted APIs, put Gemma 4 31B into your evals now. Self-hosted capability is moving fast enough that it can change your vendor and data-handling strategy, not just your model choice.

  3.

    Mythos may be more about autonomy than brilliance

    A recurring expert read was that Mythos and Fable are notable because they stay on task and execute an end-to-end hunt, not because they possess some magical vulnerability-only intelligence. That matters because the author’s own tests found richer agent harnesses often raised token burn and latency without improving finds, which suggests the winning setup is not just bolting more tools onto a weaker model. The differentiator may be training a model and harness together for long-horizon security work so it knows when to dig, branch, and persist without constant user steering.

    If you are building internal AI security tooling, invest less in generic agent feature creep and more in tightly scoped workflows with evals for persistence and task completion. The valuable moat may be specialized autonomy, not another menu of tools.

      Attribution:
    • vessenes #1
    • jaggederest #1
    • Tossrock #1
    • irthomasthomas #1
    • SwellJoe #1
  4.

    The benchmark misses exploitation and false positives

    The post answers only one slice of the Mythos story, which is bug discovery on files known to contain serious flaws. It does not test whether a model can reliably exploit what it finds, chain issues into a working attack, or stay quiet on clean code. Those missing pieces are exactly where Anthropic’s public risk claims get more consequential, because a model that finds some bugs but floods you with false alarms is a very different operational threat from one that autonomously turns findings into working compromises.

    Use this benchmark as a signal about vulnerability discovery only. If you care about security operations or model risk, you still need separate evals for exploitability, false positive rate, and behavior on clean repos.

      Attribution:
    • seizethecheese #1
    • wrs #1
    • _alternator_ #1
  5.

    Guardrails are uneven and product-specific

    Comments exposed a practical distinction between model capability and product policy. Gemini in Antigravity reportedly refused this kind of security work, while Google’s open Gemma handled it well in later tests. Fable was blocked before the author could benchmark it. Users outside the US or Europe also pointed out that access restrictions can matter as much as raw capability. In practice, 'which model is best' is often the wrong question. The real question is which model-provider-interface combination will actually let you do the task.

    Evaluate the full stack, not just the base model name. Access rules, refusal behavior, region limits, and harness integration can dominate benchmarked capability in day-to-day security work.

      Attribution:
    • SwellJoe #1 #2 #3
    • wiz21c #1
    • utopcell #1
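
The Wilson-score-style re-ranking mentioned in the first insight can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the method the author or commenters used: the function name `wilson_lower_bound` and the 95% z-value are choices made here, the 2-of-4 row is the GPT-5.5 Pro figure quoted above, and the 4-of-9 row is a hypothetical model matching the roughly four-of-nine best single-pass results.

```python
import math


def wilson_lower_bound(found: int, attempted: int, z: float = 1.96) -> float:
    """Lower bound of the Wilson score interval for a hit rate.

    found: bugs found; attempted: cases the model actually completed.
    z=1.96 corresponds to a 95% interval (an assumption for this sketch).
    """
    if attempted == 0:
        return 0.0
    p = found / attempted
    z2 = z * z
    center = p + z2 / (2 * attempted)
    margin = z * math.sqrt(p * (1 - p) / attempted + z2 / (4 * attempted ** 2))
    return (center - margin) / (1 + z2 / attempted)


# Illustrative rows: (label, bugs found, cases completed).
# "hypothetical_full_run" is not a real leaderboard entry.
rows = [
    ("gpt_5_5_pro_partial_run", 2, 4),
    ("hypothetical_full_run", 4, 9),
]

# Sort by the Wilson lower bound instead of the raw hit rate.
ranked = sorted(rows, key=lambda r: wilson_lower_bound(r[1], r[2]), reverse=True)
for label, found, attempted in ranked:
    raw = found / attempted
    adjusted = wilson_lower_bound(found, attempted)
    print(f"{label}: raw={raw:.3f} wilson_lower={adjusted:.3f}")
```

With these inputs the raw rate favors the 2-of-4 run (0.500 versus 0.444), while the lower bound favors the 4-of-9 run (about 0.189 versus about 0.150), which is the reordering effect the commenters described. Samples this small leave wide intervals either way, so the adjusted ranking is a sanity check rather than a verdict.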

Against the grain

  1.

    Fable may only feel better at the margins

    Some hands-on comparisons found Fable and Codex landing in basically the same place on actual solution quality, even when Fable sounded better and felt more polished. In this view, labs are mostly tuning style, harnesses, and user experience while model capability itself is moving only incrementally. That cuts against the dominant excitement that Fable represented a major leap.

    Separate 'pleasant to work with' from 'materially more capable' in your evals. If a new model mainly improves prose and flow, optimize your harness before paying a premium for it.

      Attribution:
    • varjag #1
    • mirsadm #1
    • _heimdall #1
  2.

    A lot of nerf talk is just vibes

    Several commenters pushed back on the familiar claim that older hosted models are constantly being lobotomized. They argued most users are making retrospective judgments from memory and style changes, not blinded repeatable tests, which makes the whole conversation vulnerable to novelty effects and confirmation bias. That skepticism is useful because many of the strongest Fable testimonials rely on exactly the kind of hard-to-reproduce personal workflow experience that flatters new releases.

    Before you conclude a vendor degraded a model, build a small blinded regression suite from your own tasks. Without that, you are mostly measuring your own shifting expectations.

      Attribution:
    • dist-epoch #1
    • anentropic #1
    • cpburns2009 #1
  3.

    Mythos hype may be investor theater

    A cynical minority saw the entire Mythos narrative as part capability story and part valuation campaign. In that framing, apocalyptic language about autonomous security threats is as much about influencing courts, regulators, and investors as about describing the actual state of the art. That does not mean the models are weak. It means company messaging is not a clean read on technical reality.

    Treat vendor threat narratives as strategic communications, not neutral technical documentation. Base procurement or policy decisions on independent evals and observed workflow gains.

      Attribution:
    • p0w3n3d #1
    • netcan #1
    • felipeerias #1
    • delusional #1
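
The blinded regression suite suggested in the second item above can start very small. The sketch below is one possible shape, not something described in the thread: the helper names `blind_pairs` and `unblind` are hypothetical, and it assumes you already have two lists of outputs for the same tasks (for example, an older and a newer snapshot of a hosted model). A reviewer sees only "left" and "right", and the key that maps sides back to sources stays hidden until voting is done.

```python
import random


def blind_pairs(outputs_a, outputs_b, seed=0):
    """Pair outputs per task and randomize which side each source lands on.

    Returns (blinded, key). key[i] is True when task i was flipped,
    meaning source B is shown on the left.
    """
    rng = random.Random(seed)
    blinded, key = [], []
    for task_id, (a, b) in enumerate(zip(outputs_a, outputs_b)):
        flipped = rng.random() < 0.5
        left, right = (b, a) if flipped else (a, b)
        blinded.append({"task": task_id, "left": left, "right": right})
        key.append(flipped)
    return blinded, key


def unblind(votes, key):
    """Map reviewer votes ('left', 'right', or 'tie') back to sources a and b."""
    tally = {"a": 0, "b": 0, "tie": 0}
    for vote, flipped in zip(votes, key):
        if vote == "tie":
            tally["tie"] += 1
        else:
            # Left is source A unless the pair was flipped.
            picked_a = (vote == "left") != flipped
            tally["a" if picked_a else "b"] += 1
    return tally
```

A typical run would call `blind_pairs` once, show each `left`/`right` pair to a reviewer without the key, collect one vote per task, and only then call `unblind`. Keeping the task set fixed and rerunning it after each suspected change turns a feeling of degradation into a tally you can compare across dates.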

In plain English

agent
A software setup that lets a model take multi-step actions, use tools, and iterate toward a goal instead of only answering a single prompt.
Antigravity
A tool or interface mentioned in comments for running Gemini models, especially in coding or agentic workflows.
DeepSeek
A Chinese AI lab and its models, often discussed as lower-cost alternatives to US frontier models.
Fable
A more guardrailed Anthropic cybersecurity model discussed as a safer or more restricted version of Mythos.
false positive
A case where a system reports a bug or threat that is not actually real.
Gemini
Google's family of AI models and products.
Gemma 4 31B
An open model from Google with about 31 billion parameters. Unlike the largest frontier models, it is small enough to self-host on high-end local hardware.
harness
The wrapper software, prompts, tools, and control logic around a model that shape how it performs a task.
MiMo
An AI model or service mentioned as a low-cost alternative in coding workflows.
Mythos
A restricted Anthropic cybersecurity model discussed in the comments as having stronger offensive capabilities than public models.
Opus
Anthropic’s high-end Claude model family used for difficult reasoning and coding tasks.
repo
Repository, a project’s collection of source code and related files, usually stored in version control.

Reference links

Author benchmark follow-ups and artifacts

  • Qwen quantization degradation benchmark
    Shared by the author as follow-up data on quantization and replication tests for Qwen models.
  • Gemma promptlab report
    Linked by the author as unpublished replication evidence that Gemma 4 31B can find up to six of nine bugs with multiple attempts.
  • Claude transcripts repository
    Provided as the repo containing an exported HTML transcript and code from a long Fable coding session.
  • 36 hours with Fable
    Detailed user write-up and transcript offered as evidence for strong Fable performance on a complex coding task.

Tools and projects mentioned

  • Gemini CLI security extension
    Used to show that Google previously exposed a security-specific workflow even though later Gemini interfaces refused similar work.
  • Ventoy repository
    Mentioned as a candidate project someone wanted audited for supply-chain or low-level security risk.
  • Notes app project
    Referenced as the Qt C++ app where one commenter said Fable uniquely found a data corruption bug.

Background references

  • Wikipedia on seL4
    Cited as a rare example of software built around formal verification for strong security guarantees.
  • Wikipedia on True Will
    Mentioned during a side discussion about free will, agency, and how those ideas apply to LLMs.
  • Dan Luu on productivity and velocity
    Referenced in passing to support the author's comment that extra courteous prompting costs little for a fast typist.