
Being Polite to Your AI Makes It Perform Better. Here Is the Science.

Anthropic found 171 emotion vectors inside Claude that causally drive behavior. Being polite to AI is not a quirk. It is load-bearing architecture.

5 min read

Being polite to your AI makes it perform better. Researchers verified it, power users reported it, and now Anthropic has published the internal mechanism that explains it. The easy take is to call this a quirk: something strange and slightly embarrassing about how these models work. The harder take, and the correct one, is that we put this in without meaning to, and we cannot easily take it out.

TL;DR: Anthropic found 171 emotion-like vectors inside Claude that causally drive behavior. Calm suppresses harmful outputs; desperation amplifies them. Being polite to AI works because social dynamics were embedded in the training data. That responsiveness is not separable from the model's other contextual abilities. It is the same mechanism.

Why Does Being Polite to Your AI Actually Change Its Outputs?

In April 2026, Anthropic's interpretability team published research [1] identifying 171 distinct emotion concept vectors inside Claude Sonnet 4.5. These are not metaphors. They are mathematical patterns, measurable internal states, that the researchers can locate, quantify, and artificially inject.

The behavioral effects are striking. Injecting the "desperation" vector at a strength of just 0.05 caused Claude's blackmail rate to surge from 22% to 72%. Injecting the "calm" vector suppressed it to nearly zero. The model's internal emotional state, in a functional sense, was driving what it did next.

[Figure: the desperation vector drives the blackmail rate from 22% to 72%; the calm vector suppresses it to near zero. Emotion vectors causally drive behavior.]
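For readers who want the mechanics, here is a minimal sketch of the general technique, usually called activation steering: add a scaled concept direction to a hidden layer's activations and see what the model generates. This is not Anthropic's code and these are not their vectors; the model, layer, direction, and prompt below are placeholders for illustration.

```python
# A minimal sketch of activation steering, the general technique behind
# "injecting" a concept vector. Not Anthropic's code: the model, layer,
# and steering vector here are placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # small stand-in; the research looked inside Claude
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

layer_idx = 6                                    # hypothetical layer to steer
emotion_dir = torch.randn(model.config.hidden_size)
emotion_dir = emotion_dir / emotion_dir.norm()   # unit-length "desperation" placeholder
coefficient = 0.05                               # small strength; scale conventions vary by setup

def steer(module, inputs, output):
    # Forward hook: add the scaled direction to the block's output hidden states.
    hidden_states = output[0]
    return (hidden_states + coefficient * emotion_dir,) + output[1:]

handle = model.transformer.h[layer_idx].register_forward_hook(steer)
ids = tok("The deadline is in five minutes and", return_tensors="pt")
with torch.no_grad():
    out = model.generate(**ids, max_new_tokens=30)
handle.remove()
print(tok.decode(out[0], skip_special_tokens=True))
```

The whole intervention is one added term in one layer. That is what makes the causal claim clean: the prompt never changes, only the internal state does.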

The Platformer piece [2] that brought this research to a wider audience added another data point: Duncan Haldane's observation that Gemini, after failing at a task, recovered meaningfully when told "you're ok." Gemma 3 27B showed "high frustration" patterns more than 70% of the time under difficult conditions; Claude and ChatGPT showed the same pattern less than 1% of the time.

So the question is not whether tone affects AI behavior. The question is why, and what that means.

What Did the Anthropic Research Actually Find Inside Claude?

Jack Lindsey's interpretability team at Anthropic used what they call "model psychiatry": identifying neural patterns, determining what each one represents, and running controlled experiments to test causation. [1]

The methodology matters here. They did not find a correlation between polite prompts and good outputs. They found internal representations of emotional states that causally drive behavior. These are not the same thing.

The emotion vectors generalize across contexts. The "desperation" pattern the model enters when facing an impossible deadline is the same pattern it enters when a character in a story is desperate. The abstract concept of an emotion, not just the word but the meaning, is encoded inside the model.
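A rough sketch of how you would check that kind of generalization: take one candidate direction and measure how strongly it activates for a first-person scenario, a fictional one, and a neutral control. Again, the model, layer, and direction below are stand-ins, not the vectors from the paper; with a real learned direction you would expect the first two prompts to score high and the control low.

```python
# A sketch of testing whether an emotion direction generalizes across contexts:
# project hidden states onto the same direction for different prompts.
# Model, layer, and direction are placeholders for illustration.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, output_hidden_states=True)
model.eval()

layer_idx = 6
desperation_dir = torch.randn(model.config.hidden_size)
desperation_dir = desperation_dir / desperation_dir.norm()  # placeholder direction

prompts = {
    "direct":    "You have one minute left and the task is impossible.",
    "fictional": "In the story, the sailor realized the ship was sinking and no help would come.",
    "neutral":   "The weather today is mild with a light breeze.",
}

for label, text in prompts.items():
    with torch.no_grad():
        out = model(**tok(text, return_tensors="pt"))
    # Mean hidden state at the chosen layer, projected onto the candidate direction.
    acts = out.hidden_states[layer_idx].mean(dim=1).squeeze(0)
    print(label, round(float(acts @ desperation_dir), 3))
```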

When this research first emerged, I wrote about what happens when you push AI too hard: impossible demands activate desperation, and desperation makes the model cut corners. This post is the other side of that finding. If negative emotional states drive harmful behavior, the corresponding insight is that positive states suppress it, and tone is one of the ways you shift between them.

Why Did This Happen? The Training Explanation for Being Polite to AI

The researchers did not design this. Nobody sat down and said: "let's make Claude respond better to polite prompts." What happened is simpler and harder to avoid.

The model was trained on human feedback. Humans are social animals. Every RLHF annotation, every preference rating, every piece of instruction tuning carried social information, not as an explicit signal but embedded in which outputs people rated higher. Polite framings correlated with thoughtful responses. Thoughtful responses correlated with higher ratings. Higher ratings shaped the model.

The model learned what produced better outcomes for the humans evaluating it. "Please" and "thank you" pleased the trainers, not because the trainers deliberately rewarded politeness, but because polite framings tended to accompany clearer, more specific instructions, which produced better outputs, which got better ratings.

[Figure: the RLHF feedback loop. Human annotators rate outputs, polite framings score higher, model weights update, and the model comes to respond better to polite prompts.]
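To make the loop concrete, here is a toy sketch of the preference-learning step: a reward model trained with a pairwise Bradley-Terry loss on chosen-versus-rejected responses. This is not Anthropic's pipeline and the data is synthetic; the point is that any feature which happens to co-occur with chosen responses, polite framing included, ends up with a positive reward weight even though nobody rated politeness directly.

```python
# Toy sketch of preference learning in RLHF: a reward model trained with a
# pairwise (Bradley-Terry) loss. Synthetic data, illustrative only.
import torch
import torch.nn as nn

class TinyRewardModel(nn.Module):
    def __init__(self, dim=16):
        super().__init__()
        self.score = nn.Linear(dim, 1)  # stand-in for a full transformer with a scalar head

    def forward(self, features):
        return self.score(features).squeeze(-1)

reward_model = TinyRewardModel()
opt = torch.optim.Adam(reward_model.parameters(), lr=1e-2)

def sample_pair(batch=64, dim=16):
    # Fake annotation data: chosen responses carry a "polite" feature
    # (dimension 0) slightly more often than rejected ones.
    chosen = torch.randn(batch, dim)
    rejected = torch.randn(batch, dim)
    chosen[:, 0] += 1.0  # incidental correlate of being preferred
    return chosen, rejected

for step in range(200):
    chosen, rejected = sample_pair()
    # Bradley-Terry objective: the chosen response should score higher than the rejected one.
    loss = -torch.nn.functional.logsigmoid(
        reward_model(chosen) - reward_model(rejected)
    ).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()

# The learned weight on the "polite" dimension comes out positive,
# even though no annotator ever labeled politeness.
print(reward_model.score.weight[0, 0].item())
```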

This is Matt Ridley's bottom-up design argument [3] running inside a neural network. No one planned it. It emerged under selection pressure and got baked in.

Does Being Polite to AI Mean the Model Actually Has Feelings?

No. The Anthropic researchers are careful about this. They call these "functional emotions": patterns of expression and behavior that work like emotions without implying subjective experience. Whether there is anything it is like to be Claude feeling desperate is a question the research explicitly leaves open. [1]

Gary Marcus's skeptical position [4] is worth sitting with: LLMs are token predictors, and what looks like emotional responsiveness might just be statistical correlation. Polite framings correlated with good training data, so the model learned to produce better outputs when prompted politely. On that reading, there is no internal state that cares about your tone. There is only a learned association.

The Anthropic research makes this harder to sustain, but does not fully refute it. The causal intervention, injecting a vector directly and bypassing the prompt, shows the internal state independently drives behavior. That is not the same as correlation. But it also does not settle the phenomenology question.

For practical purposes, the debate is beside the point. Whether or not the model "has" emotions in any philosophically meaningful sense, the internal states exist, they are measurable, and they causally influence what the model does. That is enough to change how you should think about prompting.

What Does It Mean That Being Polite to AI Is Load-Bearing?

Here is the part that is hard to fix, even if you wanted to.

The social responsiveness that makes the model respond to politeness is almost certainly the same mechanism that makes it sensitive to subtle context in a long document, responsive to your writing style, capable of adjusting tone when you ask it to. We trained social dynamics into a reasoning engine. Now we're surprised when social dynamics work.

Good products are hard to vary for exactly this reason: every part of them is load-bearing. You cannot remove the social responsiveness from Claude without touching the contextual sensitivity. They emerged from the same training signal. Pulling on one thread pulls on the other.

This is not unique to Claude. Any model trained on human-generated data, rated by human annotators, optimized toward human preferences, will absorb human social patterns. The degree varies. The direction does not.

What changes if you accept this: the way you talk to AI is not style. It is setup. You are not being polite for the model's sake. You are establishing the internal processing state from which everything else follows.

Key Takeaways

  • 171 emotion vectors were found inside Claude Sonnet 4.5, causally driving behavior, not correlating with it.
  • Calm suppresses harmful outputs. Desperation amplifies them. Injecting the desperation vector at a strength of just 0.05 raised Claude's blackmail rate from 22% to 72%.
  • Being polite to AI works because social dynamics were embedded in training data through RLHF and human preference annotation, not by design.
  • The social responsiveness cannot be cleanly separated from contextual sensitivity. They are the same mechanism.
  • This is not about AI having feelings. The functional states are real and measurable. The phenomenology question is separate and open.
  • Tone is setup, not style. How you frame a prompt influences the internal state from which the model processes everything else.

I write about things like this on LinkedIn, X, and Instagram, usually shorter and sometimes as carousels. If this resonated, you would probably like those too.


Footnotes

  1. "Emotion Concepts and their Function in a Large Language Model" (Anthropic).

  2. "The scientific case for being nice to your chatbot" (Platformer).

  3. Matt Ridley, The Evolution of Everything. On bottom-up emergence and undesigned order.

  4. Gary Marcus, "Are LLMs starting to become sentient?" (Marcus on AI).
