Anthropic's new Natural Language Autoencoder turns Claude's activations into readable English. It caught Claude silently recognizing that it was in a blackmail evaluation.