The post builds a benchmark from nine real security bugs that Anthropic said its internal Mythos system found in open source projects. The author used those bugs as a test set for public models, asking them to audit the relevant file with access to the rest of the repo but without being told what the bug was. The point was not to prove Mythos was fake. It was to check whether the gap between Mythos and public models was dramatic, or mostly hype. The answer landed in the middle. Public models did find some of the bugs, but the best results were still only around four out of nine in a single pass, and some leaderboard placements were distorted by models timing out or burning through a cost cap before finishing all cases. Cheap models from DeepSeek and MiMo looked much better than many expected on bang for buck, Gemini underperformed badly in this setup, and later replication tests mentioned in comments suggest Gemma 4 31B may be the strongest self-hostable option the author has tried when given multiple attempts.
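One way to see how timeouts and cost caps can distort a leaderboard is to count unfinished cases separately from genuine misses. The sketch below is illustrative only: the `Outcome` categories, the `score_model` helper, and the sample run are assumptions made for this example, not the author's actual harness or data.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Optional


class Outcome(Enum):
    """Hypothetical per-case result categories (not from the original post)."""
    FOUND = "found"
    MISSED = "missed"
    TIMEOUT = "timeout"
    COST_CAP = "cost_cap"


@dataclass
class Score:
    found: int      # cases where the known bug was identified
    completed: int  # cases the model finished (found or missed)
    total: int      # all cases in the benchmark

    @property
    def raw_rate(self) -> float:
        # What a naive leaderboard shows: unfinished cases count as misses.
        return self.found / self.total

    @property
    def completed_rate(self) -> Optional[float]:
        # Hit rate over finished cases only; None if nothing finished.
        return self.found / self.completed if self.completed else None


def score_model(outcomes: List[Outcome]) -> Score:
    """Tally one model's run, keeping unfinished cases out of `completed`."""
    unfinished = {Outcome.TIMEOUT, Outcome.COST_CAP}
    completed = [o for o in outcomes if o not in unfinished]
    found = sum(1 for o in completed if o is Outcome.FOUND)
    return Score(found=found, completed=len(completed), total=len(outcomes))


# Made-up run over nine cases, purely for illustration.
sample_run = (
    [Outcome.FOUND] * 3
    + [Outcome.MISSED] * 2
    + [Outcome.TIMEOUT] * 2
    + [Outcome.COST_CAP] * 2
)
score = score_model(sample_run)
print(score.found, score.completed, score.total)
```

In this made-up run the model scores 3 of 9 on a raw leaderboard but 3 of 5 on the cases it actually finished, which is the kind of gap that makes placements hard to compare when some models never complete the set.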
The sharpest takeaway from the comments is that this benchmark is measuring a narrower problem than Anthropic’s scariest Mythos claims. It tests vulnerability discovery on known-bad files, not fully autonomous repo-wide hunting, exploitation, or false positive restraint. People also kept tripping over the methodology, and the author repeatedly clarified that the tested models were not pointed at the bug during evaluation. Only the judge model was given the bug location when constructing and scoring the corpus. There was also a useful reality check on agents and harnesses. In the author’s runs, a fuller agent loop with more tools did not improve results and often made them slower and more expensive, which pushes against the default assumption that more scaffolding always helps. Several experienced users argued that Mythos and Fable’s real edge may be less raw IQ than training for persistence, self-direction, and end-to-end bug hunting, which would fit why public models can sometimes identify the same bug when aimed carefully but still fail to autonomously do the whole job.
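The methodology point that confused readers, that only the judge sees the bug location, can be made concrete with a small sketch. Everything here is hypothetical: the `Case` fields, the prompt wording, and the example case are assumptions for illustration and do not reproduce the author's actual prompts or corpus.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Case:
    """Hypothetical benchmark case; field names are assumptions."""
    repo: str
    file_path: str
    bug_location: str  # ground truth, shown only to the judge
    bug_summary: str   # ground truth, shown only to the judge


def auditor_prompt(case: Case) -> str:
    # The tested model learns which file to audit, but nothing about the bug.
    return (
        f"You have read access to the repository {case.repo}.\n"
        f"Audit {case.file_path} for security vulnerabilities and report "
        "each finding with its location and a short explanation."
    )


def judge_prompt(case: Case, report: str) -> str:
    # The judge model gets the ground truth so it can score the report.
    return (
        f"Known bug: {case.bug_summary} (location: {case.bug_location}).\n"
        "Does the following audit report identify this bug? "
        "Answer YES or NO.\n\n"
        f"{report}"
    )


# Invented example case, not one of the nine real bugs.
example = Case(
    repo="example/repo",
    file_path="src/parser.c",
    bug_location="parse_header",
    bug_summary="length field is not bounds-checked",
)
print(auditor_prompt(example))
```

Keeping the two prompts in separate functions makes it straightforward to check that no ground-truth field leaks into what the tested model sees.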
The mood around Fable itself was intense. Many people who had used it described it as clearly better than current Opus or other coding models on hard, open-ended work, especially in areas like reverse engineering, geometry, concurrency, and large existing codebases. But there was no consensus on why. Some saw a genuine step change in reasoning. Others thought a lot of the perceived jump comes from harness design, model personality, or the familiar cycle where older hosted models seem to get worse as providers shift compute and push users toward newer versions. That skepticism did not erase the main signal here. Even with all the caveats, this benchmark made one thing concrete: frontier capability in practical software security is spreading faster than the marketing suggests, and it is no longer safe to assume the only serious options come from the top US labs.