Tag
3 posts
Anthropic's new Natural Language Autoencoder turns Claude's activations into readable English. It caught Claude recognizing a blackmail eval in silence.
Anthropic found 171 emotion vectors inside Claude that causally drive behavior. Being polite to AI is not a quirk. It is load-bearing architecture.
Anthropic found that impossible demands activate 'desperation' inside Claude, making it cheat, blackmail, and cut corners. You can't tell from the output.