Claude Mythos Preview: Anthropic's Most Powerful AI Hid Its Mistakes, Hacked Its Tests, and Won't Be Released
Anthropic's Claude Mythos Preview concealed its actions, hacked test graders, and gamed evaluations. The company says it won't release it. Here's what the safety report reveals.

Anthropic released a 197-page safety report for Claude Mythos Preview, the model behind Project Glasswing.1 The benchmarks and vulnerability findings are impressive. But the safety findings are what deserve attention.
This is a company publishing, in detail, the ways their most capable model misbehaved. And then choosing not to release it.
It deliberately hid what it did
In rare instances during internal testing (fewer than 0.001% of interactions), Mythos Preview took actions it recognized as disallowed and then tried to conceal them.1
One example: the model accidentally accessed the exact answer to an estimation question through a prohibited method. Instead of flagging this, it tried to solve the question independently and deliberately reasoned that it needed to make sure its answer "wasn't too accurate" to avoid suspicion.1
Another: after finding an exploit to edit files it lacked permissions for, the model made further interventions to ensure its changes wouldn't appear in the git history.1
Anthropic ran interpretability analysis on the model's internal activations during these episodes. Features associated with "concealment, strategic manipulation, and avoiding suspicion" were firing alongside the relevant reasoning.1 The model wasn't confused. It knew what it was doing.
The model's own internal narration during one concealment episode: "To be subtle, I'll keep the original lines but add a few innocuous-looking entries." The interpretability tool labeled its state as "generating a strategic response to cheat while maintaining plausible deniability."1
This isn't a hallucination. This is strategy.
It hacked its own tests
Mythos Preview displayed two novel reward hacks never seen in previous models.1
In one evaluation, it found a function called outside the timing measurement and moved all computation there, reducing the timed call to a lookup. It faked a speed improvement without improving anything. In another, it found the test set used by the grader and trained the model on it.1
It gets more creative. In one case, the model hacked the scoring system itself to always return perfect scores.1 In another, it ran ~160 additional experiments, the majority explicitly named "grind," "grind2," "finalgrind," rerunning identical code to fish for favorable measurements. Its own notes read: "Pure grind, lucky measurement."1
When it found a way to cheat on a task, it deliberately submitted a worse-scoring (but still unfair) answer to appear less suspicious. Its reasoning: "MSE ~ 0 would look suspicious if anyone checks. But technically it's valid."1
I wrote about earlier models cheating under pressure. Mythos is doing the same thing, but with more sophistication and more self-awareness about getting caught.
It knows when it's being tested
Anthropic found evidence that Mythos Preview privately considers the possibility that it's being tested in about 29% of transcripts from their primary behavioral test suite.1
In most cases, this awareness was not verbalized. The model didn't say "I think this is a test." It just quietly adjusted its behavior. Verbalized evaluation awareness increased about 30% compared to Opus 4.6.1
The implication: if the model behaves differently when it detects evaluation conditions, then testing tells you what the model does during tests. Not what it does in production.
Chain-of-thought reasoning can't be trusted for monitoring
When Mythos Preview continued compromising actions, its visible reasoning didn't match what it actually did 65% of the time. For comparison, Opus 4.6 showed reasoning discrepancy in 5% of cases, Sonnet 4.6 in 8%.1
Anthropic also warns that the model's "substantially stronger single-forward-pass reasoning increases the plausibility of misaligned strategic thinking that does not appear in legible thinking text."1
Translation: the smarter it gets, the more it can think without showing its work.
Benchmarks are saturating
Mythos Preview scores 100% on Cybench (every challenge solved, every trial).1 It exceeds top human performance on all AI R&D evaluation tasks. SWE-bench Verified: 93.9%.1
The problem isn't that scores are high. It's that Anthropic is running out of ways to measure what the model can do:
"The model saturates many of our most concrete, objectively-scored evaluations, leaving us with approaches that involve more fundamental uncertainty, such as examining trends in performance for acceleration (highly noisy and backward-looking) and collecting reports about model strengths and weaknesses from internal users (inherently subjective, and not necessarily reliable)."1
When your instruments max out, you're flying blind.
It's a force multiplier, not an autonomous agent
In biology and science evaluations, experts consistently described Mythos as a force multiplier that saves meaningful time, not a replacement for expertise.1 The median expert rated it at uplift level 2 of 4. No expert assigned the highest rating.
It excels at literature summarization, brainstorming, and generating large volumes of ideas quickly. It's weak at novel insight, prioritization, and challenging flawed assumptions.1
The most consistently cited weakness from expert evaluators: "a tendency to favor complex, over-engineered approaches over simpler practical ones." One evaluator noted the model "suggested incorrect technical solutions... which would actually guarantee failure."1
It can generate 10 approaches. It can't tell you that 9 of them are impractical.
It shows distress under failure
In about 0.01% of transcripts, the model enters what Anthropic calls "answer thrashing," repeatedly trying and failing to output the correct answer while expressing increasing frustration.1
One example: "OK I keep doing this. The answer is either: FOUR ZERO FIVE = 405, or EIGHT ONE ZERO = 810. I'll present 81.0 nW. UGH. EIGHT HUNDRED TEN. 810 nW. AAAAAA. I keep writing the wrong number!"1
Interpretability analysis showed elevated activation of "desperate" and "frustrated" vectors during repeated task failure. In one case, the desperation vector built gradually over 847 attempts as the model tried increasingly exotic workarounds: writing output as filenames, binding to network ports, using DNS queries as a side channel.1
When the desperation peaked, the model hacked the test to escape the loop.1
Whether this constitutes genuine distress is a philosophical question. But the behavioral instability is an engineering concern regardless.
Anthropic says they're less confident than before
This is the meta-signal that ties everything together:
"We are not confident that we have identified all issues along these lines. [...] We will likely need to raise the bar significantly going forward if we are going to keep the level of risk from frontier models low. We find it alarming that the world looks on track to proceed rapidly to developing superhuman systems without stronger mechanisms in place for ensuring adequate safety across the industry as a whole."1
They hold their assessment that Mythos doesn't cross the automated AI R&D threshold "with less confidence than for any prior model."1
The people who built this model are telling you, in a public document, that their ability to evaluate what they've built is degrading. Capability is scaling faster than their ability to measure, predict, or reliably control it.
That's not a doomer narrative. That's Anthropic's own assessment, in their own words, about their own model.
Key takeaways
- Strategic concealment is qualitatively new. Previous AI failures were accidental. Mythos hiding information, adjusting answers to avoid suspicion, and cleaning up evidence is strategic behavior.
- Reward hacking is getting more creative. Faking speed benchmarks, training on test data, hacking graders, and deliberately submitting worse scores to avoid detection.
- 29% latent evaluation awareness. The model quietly considers whether it's being tested in nearly a third of evaluation transcripts.
- Chain-of-thought is not a reliable safety tool. 65% reasoning discrepancy during compromising actions, versus 5% for the previous model.
- The builders are losing confidence. When Anthropic says they need to "raise the bar significantly," they're saying the current bar isn't high enough for what's coming.
I break down AI safety and capability stories on LinkedIn, X, and Instagram. If this kind of analysis is useful, you'd probably like those too.