Tag

Alignment

1 post

AI·May 11, 2026

Anthropic Built a Way to Read Claude's Mind in Plain English. It Catches Thoughts the Model Won't Say.

Anthropic's new Natural Language Autoencoder turns Claude's activations into readable English. It caught Claude silently recognizing a blackmail scenario as an evaluation.

Interpretability Anthropic Claude