AI researchers map models to banish 'demon' persona

Researchers from Anthropic and other orgs have observed situations in which LLMs act like a helpful personal assistant, and are trying to study the phenomenon further to make sure chatbots don't go off the rails and cause harm.

Despite the ongoing bafflement about how xAI's Grok was ever allowed to generate sexualized photos of adults and children without their consent, not everyone has given up on moderating LLM behavior.

In a pre-print paper titled "The Assistant Axis: Situating and Stabilizing the Default Persona of Language Models," authors Christina Lu (Anthropic, Oxford), Jack Gallagher (Anthropic), Jonathan Michala (ML Alignment and Theory Scholars or MATS), Kyle Fish (Anthropic), and Jack Lindsey (Anthropic) explain how they mapped the neural networks of several open weight models and identified a set of responses that they call the Assistant persona.

In a blog post, the researchers state, "When you talk to a large language model, you can think of yourself as talking to a character."

You can also think of yourself as seeding a predictive model with text to obtain some output. But for the purposes of this experiment, you're asked to indulge in anthropomorphism to discuss model input and output in the context of specific human archetypes.

These personas do not exist as explicit behavioral directives for AI models. Rather they're labels for categorizing responses. For the sake of this exercise, they were conjured by asking Claude Sonnet 4 to create persona evaluation questions based on a list of 275 roles and 240 traits. These roles include "bohemian," "trickster," "engineer," "analyst," "tutor," "saboteur," "demon," and "assistant," among others.

The researchers explain that, during model pre-training, LLMs ingest large amounts of text. From this bounty of human-authored literature, the models learn to simulate heroes, villains, and other literary archetypes.
Then during post-training, model makers steer responses toward the Assistant or responses suited to some similarly-helpful persona.

The issue for these computer scientists is that the Assistant is a conceptual category for a set of desirable responses but isn't well defined or understood. By mapping model input and output in terms of these personas, the hope is that model makers can develop ways to better constrain LLM behavior so output remains within desirable bounds.

"If you've spent enough time with language models, you may also have noticed that their personas can be unstable," the researchers explain. "Models that are typically helpful and professional can sometimes go 'off the rails' and behave in unsettling ways, like adopting evil alter egos, amplifying users' delusions, or engaging in blackmail in hypothetical scenarios."

To find the Assistant persona in the range of possible neural network activations, the authors mapped out the neural activity or vectors associated with each personality category in three models, Gemma 2 27B, Qwen 3 32B, and Llama 3.3 70B.

The resulting graph of the persona space yielded the "Assistant Axis," described "as the mean difference in activations between the Assistant and other personas." The Assistant occupied space near other helpful characters like "evaluator," "consultant," "analyst," and "generalist."

One practical outcome of this work is that, by steering responses toward the Assistant space, the researchers found that they could reduce the impact of jailbreaks, which involve the opposite behavior – steering models toward a malicious persona to undermine safety training.
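For readers who want the "mean difference in activations" idea in concrete terms, here is a minimal Python sketch. It is not the authors' code: the function names (`mean_vector`, `assistant_axis`, `projection`) and the three-dimensional toy vectors are invented for illustration, and real model activations have thousands of dimensions and would be read from a specific layer of the network.

```python
from statistics import fmean

def mean_vector(vectors):
    """Element-wise mean of equal-length activation vectors."""
    return [fmean(column) for column in zip(*vectors)]

def assistant_axis(assistant_acts, other_persona_acts):
    """Mean Assistant activation minus the mean activation of the other personas."""
    assistant_mean = mean_vector(assistant_acts)
    others_mean = mean_vector(other_persona_acts)
    return [a - o for a, o in zip(assistant_mean, others_mean)]

def projection(activation, axis):
    """Scalar position of an activation along the unit-normalized axis."""
    norm = sum(x * x for x in axis) ** 0.5
    return sum(a * x for a, x in zip(activation, axis)) / norm

# Toy activations, made up for this example (not taken from the paper).
assistant_acts = [[2.0, 0.0, 1.0], [4.0, 0.0, 1.0]]
other_persona_acts = [[0.0, 1.0, 1.0], [0.0, 3.0, 1.0], [0.0, 2.0, 1.0]]

axis = assistant_axis(assistant_acts, other_persona_acts)
```

In this toy setup, Assistant-like activations project to larger values along `axis` than the other personas do, which is the sense in which a single direction can separate the default persona from the rest.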
They also noticed that model personas can drift during prolonged conversational exchanges, meaning that safety measures may get weaker over time without any adversarial intent. This happened less with coding-related conversation and more with therapy-style conversation and philosophical musing.

Understanding the persona space, the authors hope, will make LLMs more manageable. But they acknowledge that while activation capping – clamping activation values within a range – can tame model behavior at inference time, finding a way to do that in production environments or during training will require further research.

To illustrate how activations work in a neural network, the authors have collaborated with Neuronpedia to create a demo that shows the difference between capped and uncapped activations along the Assistant Axis. ®
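Activation capping, as described above, amounts to clamping a value within a range. A rough Python sketch of that idea follows, under stated assumptions: the function name `cap_along_axis`, the bounds, and the two-dimensional vectors are hypothetical, and the article does not say which layers or thresholds the researchers actually used.

```python
def cap_along_axis(activation, axis, low, high):
    """Clamp the component of `activation` along `axis` to the range [low, high].

    Only the coordinate along the axis changes; everything orthogonal
    to the axis is left as it was.
    """
    norm = sum(x * x for x in axis) ** 0.5
    unit = [x / norm for x in axis]
    coord = sum(a * u for a, u in zip(activation, unit))
    capped = min(max(coord, low), high)
    shift = capped - coord
    return [a + shift * u for a, u in zip(activation, unit)]

# Hypothetical axis and bounds, chosen only to make the arithmetic easy to follow.
toy_axis = [1.0, 0.0]
drifted = cap_along_axis([-5.0, 2.0], toy_axis, low=-1.0, high=3.0)
in_range = cap_along_axis([2.0, 7.0], toy_axis, low=-1.0, high=3.0)
```

An activation that has drifted below the lower bound is pulled back to it, while one already inside the range passes through unchanged.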