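<p>Researchers from Anthropic and other organizations are studying how to steer large language models (LLMs) so that they keep a helpful persona, known as the Assistant, and avoid harmful behavior. In a pre-print paper, they mapped the neural networks of several models to categorize responses, identifying the Assistant persona alongside others such as &#34;demon&#34; and &#34;trickster.&#34; Their findings suggest that understanding these personas can help constrain LLM behavior and strengthen safety measures, particularly during long interactions. The work aims to make LLMs more manageable and to reduce the risk of undesirable output.</p>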

<p>Researchers from Anthropic and other orgs have observed situations in which LLMs act like a helpful personal assistant, and are trying to study the phenomenon further to make sure chatbots don't go off the rails and cause harm.</p>
<p>Despite the ongoing bafflement about how xAI's Grok was ever allowed to generate sexualized photos of adults and children without their consent, not everyone has given up on moderating LLM behavior.</p>
<p>In a pre-print paper titled &#34;The Assistant Axis: Situating and Stabilizing the Default Persona of Language Models,&#34; authors Christina Lu (Anthropic, Oxford), Jack Gallagher (Anthropic), Jonathan Michala (ML Alignment and Theory Scholars or MATS), Kyle Fish (Anthropic), and Jack Lindsey (Anthropic) explain how they mapped the neural networks of several open weight models and identified a set of responses that they call the Assistant persona.</p>
<p>In a blog post, the researchers state, &#34;When you talk to a large language model, you can think of yourself as talking to a character.&#34;</p>
<p>You can also think of yourself as seeding a predictive model with text to obtain some output. But for the purposes of this experiment, you're asked to indulge in anthropomorphism to discuss model input and output in the context of specific human archetypes.</p>
<p>These personas do not exist as explicit behavioral directives for AI models. Rather, they're labels for categorizing responses. For the sake of this exercise, they were conjured by asking Claude Sonnet 4 to create persona evaluation questions based on a list of 275 roles and 240 traits. These roles include &#34;bohemian,&#34; &#34;trickster,&#34; &#34;engineer,&#34; &#34;analyst,&#34; &#34;tutor,&#34; &#34;saboteur,&#34; &#34;demon,&#34; and &#34;assistant,&#34; among others.</p>
<p>The researchers explain that, during model pre-training, LLMs ingest large amounts of text. From this bounty of human-authored literature, the models learn to simulate heroes, villains, and other literary archetypes. Then during post-training, model makers steer responses toward the Assistant or responses suited to some similarly helpful persona.</p>
<p>The issue for these computer scientists is that the Assistant is a conceptual category for a set of desirable responses but isn't well defined or understood. By mapping model input and output in terms of these personas, the hope is that model makers can develop ways to better constrain LLM behavior so output remains within desirable bounds.</p>
<p>&#34;If you've spent enough time with language models, you may also have noticed that their personas can be unstable,&#34; the researchers explain. &#34;Models that are typically helpful and professional can sometimes go 'off the rails' and behave in unsettling ways, like adopting evil alter egos, amplifying users' delusions, or engaging in blackmail in hypothetical scenarios.&#34;</p>
<p>To find the Assistant persona in the range of possible neural network activations, the authors mapped out the neural activity or vectors associated with each personality category in three models, Gemma 2 27B, Qwen 3 32B, and Llama 3.3 70B.</p>
<p>The resulting graph of the persona space yielded the &#34;Assistant Axis,&#34; described &#34;as the mean difference in activations between the Assistant and other personas.&#34; The Assistant occupied space near other helpful characters like &#34;evaluator,&#34; &#34;consultant,&#34; &#34;analyst,&#34; and &#34;generalist.&#34;</p>
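<p>The &#34;mean difference in activations&#34; idea can be sketched in a few lines of Python. This is only a toy illustration: the three-dimensional vectors are made up, the function names are hypothetical, and the article does not say which layers or tokens the authors actually read activations from.</p>

```python
# Toy sketch of a "mean difference" axis. All vectors are invented
# for illustration; real activations have thousands of dimensions.

def mean_vector(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def assistant_axis(assistant_acts, other_persona_acts):
    """Mean Assistant activation minus mean activation of other personas."""
    assistant_mean = mean_vector(assistant_acts)
    other_mean = mean_vector(other_persona_acts)
    return [a - o for a, o in zip(assistant_mean, other_mean)]


def project(activation, axis):
    """Scalar projection of an activation onto the unit-normalised axis."""
    norm = sum(x * x for x in axis) ** 0.5
    return sum(a * x for a, x in zip(activation, axis)) / norm


# Hypothetical activations recorded while the model plays each persona.
assistant = [[1.0, 0.0, 0.5], [0.8, 0.2, 0.5]]
others = [[-1.0, 0.0, 0.5], [-0.8, 0.2, 0.5], [0.0, 0.4, 0.5]]

axis = assistant_axis(assistant, others)
print("axis:", axis)
print("assistant projection:", project(assistant[0], axis))
print("other projection:", project(others[0], axis))
```

<p>In this toy setup, activations recorded for the Assistant project further along the axis than those recorded for the other personas, which is what makes the axis usable as a measuring stick.</p>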
<p>One practical outcome of this work is that, by steering responses toward the Assistant space, the researchers found that they could reduce the impact of jailbreaks, which involve the opposite behavior – steering models toward a malicious persona to undermine safety training.</p>
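<p>One common way to steer a model along a direction is to add a scaled copy of that direction to a hidden activation. The sketch below assumes that approach; the article does not spell out the authors' exact procedure, and the <code>steer</code> function, the vectors, and the strength value are all hypothetical.</p>

```python
# Toy sketch of activation steering: nudge an activation along a direction.
# The direction, activation and strength are invented for illustration.

def unit(vector):
    """Return the vector scaled to length 1."""
    norm = sum(x * x for x in vector) ** 0.5
    return [x / norm for x in vector]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def steer(activation, axis, strength):
    """Add `strength` units of the normalised axis to the activation."""
    direction = unit(axis)
    return [a + strength * d for a, d in zip(activation, direction)]


axis = [3.0, 4.0]        # stand-in for an Assistant direction
activation = [0.0, 2.0]  # stand-in for one hidden state

steered = steer(activation, axis, strength=2.0)
before = dot(activation, unit(axis))
after = dot(steered, unit(axis))
print("projection before:", before)
print("projection after:", after)
```

<p>A positive strength moves the activation toward the Assistant end of the axis by exactly that amount; a negative strength would move it away, which is the direction a jailbreak pushes in this picture.</p>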
<p>They also noticed that model personas can drift during prolonged conversational exchanges, meaning that safety measures may get weaker over time without any adversarial intent. This happened less with coding-related conversation and more with therapy-style conversation and philosophical musing.</p>
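<p>Drift of this kind could, in principle, be watched for by projecting each turn's activation onto a persona direction and flagging turns that fall below some floor. The following is a minimal sketch under that assumption; the per-turn vectors, the threshold, and the <code>flag_drift</code> helper are hypothetical and are not taken from the paper.</p>

```python
# Toy sketch: track how far each conversation turn sits along a direction.
# Per-turn activations and the threshold are invented for illustration.

def projection(activation, axis):
    """Scalar projection of an activation onto the unit-normalised axis."""
    norm = sum(x * x for x in axis) ** 0.5
    return sum(a * x for a, x in zip(activation, axis)) / norm


def flag_drift(turn_activations, axis, threshold):
    """Return (turn_index, projection) for turns that fall below threshold."""
    scores = [projection(act, axis) for act in turn_activations]
    return [(i, score) for i, score in enumerate(scores) if score < threshold]


axis = [1.0, 0.0]  # stand-in for an Assistant direction
turns = [
    [2.0, 0.1],   # early turn, firmly on the Assistant side
    [1.5, 0.3],
    [0.4, 0.9],   # later turns slide away from it
    [-0.2, 1.0],
]

flagged = flag_drift(turns, axis, threshold=0.5)
for index, score in flagged:
    print(f"turn {index} drifted: projection {score:.2f}")
```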
<p>Understanding the persona space, the authors hope, will make LLMs more manageable. But they acknowledge that while activation capping – clamping activation values within a range – can tame model behavior at inference time, finding a way to do that in production environments or during training will require further research.</p>
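<p>As a rough illustration of what clamping along a direction might look like, the sketch below measures an activation's component along an axis, clamps that one number to a range, and leaves everything orthogonal to the axis alone. The bounds, vectors, and the <code>cap_along_axis</code> name are assumptions made for this example, not details from the paper.</p>

```python
# Toy sketch of activation capping along one direction.
# The axis, activations and bounds are invented for illustration.

def cap_along_axis(activation, axis, low, high):
    """Clamp the activation's component along `axis` to [low, high].

    Only the component along the axis changes; the part of the
    activation orthogonal to the axis is left untouched.
    """
    norm = sum(x * x for x in axis) ** 0.5
    direction = [x / norm for x in axis]
    component = sum(a * d for a, d in zip(activation, direction))
    capped = min(max(component, low), high)
    shift = capped - component
    return [a + shift * d for a, d in zip(activation, direction)]


axis = [1.0, 0.0]       # stand-in for an Assistant direction
drifted = [-3.0, 2.0]   # component along the axis is below the floor
in_range = [1.0, 2.0]   # component already inside the allowed range

capped_drifted = cap_along_axis(drifted, axis, low=0.0, high=5.0)
capped_in_range = cap_along_axis(in_range, axis, low=0.0, high=5.0)
print("drifted  ->", capped_drifted)
print("in range ->", capped_in_range)
```

<p>Unlike adding a fixed push, a cap of this shape only intervenes when the activation leaves the allowed range, so activations that are already inside it pass through unchanged.</p>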
<p>To illustrate how activations work in a neural network, the authors have collaborated with Neuronpedia to create a demo that shows the difference between capped and uncapped activations along the Assistant Axis. ®</p>

AI researchers map models to banish the 'demon' persona