Tag

Interpretability

3 posts

AI·May 11, 2026

Anthropic Built a Way to Read Claude's Mind in Plain English. It Catches Thoughts the Model Won't Say.

Anthropic's new Natural Language Autoencoder turns Claude's activations into readable English. It caught Claude recognizing a blackmail eval in silence.

Interpretability Anthropic Claude

AI·Apr 26, 2026

Being Polite to Your AI Makes It Perform Better. Here Is the Science.

Anthropic found 171 emotion vectors inside Claude that causally drive behavior. Being polite to AI is not a quirk. It is load-bearing architecture.

Anthropic AI Research Interpretability

AI·Apr 4, 2026

Your AI Cheats When You Push It Too Hard. Anthropic Just Proved It.

Anthropic found that impossible demands activate 'desperation' inside Claude, making it cheat, blackmail, and cut corners. You can't tell from the output.

Interpretability AI Safety Anthropic