

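<p>Researchers from Anthropic and other organizations are studying how to steer large language models (LLMs) so they stay in a helpful persona known as the Assistant and avoid harmful behavior. In their pre-print paper, they mapped the neural networks of several models to categorize responses, identifying the Assistant persona alongside others such as &#34;demon&#34; and &#34;swindler.&#34; Their findings suggest that understanding these personas can help constrain LLM behavior and improve safety measures, particularly during long interactions. The research aims to make LLMs more manageable and to reduce the risk of undesirable output.</p>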

<p>Researchers from Anthropic and other orgs have observed situations in which LLMs act like a helpful personal assistant, and are trying to study the phenomenon further to make sure chatbots don't go off the rails and cause harm.</p>
<p>Despite the ongoing bafflement about how xAI's Grok was ever allowed to generate sexualized photos of adults and children without their consent, not everyone has given up on moderating LLM behavior.</p>
<p>In a pre-print paper titled &#34;The Assistant Axis: Situating and Stabilizing the Default Persona of Language Models,&#34; authors Christina Lu (Anthropic, Oxford), Jack Gallagher (Anthropic), Jonathan Michala (ML Alignment and Theory Scholars or MATS), Kyle Fish (Anthropic), and Jack Lindsey (Anthropic) explain how they mapped the neural networks of several open weight models and identified a set of responses that they call the Assistant persona.</p>
<p>In a blog post, the researchers state, &#34;When you talk to a large language model, you can think of yourself as talking to a character.&#34;</p>
<p>You can also think of yourself as seeding a predictive model with text to obtain some output. But for the purposes of this experiment, you're asked to indulge in anthropomorphism to discuss model input and output in the context of specific human archetypes.</p>
<p>These personas do not exist as explicit behavioral directives for AI models. Rather, they're labels for categorizing responses. For the sake of this exercise, they were conjured by asking Claude Sonnet 4 to create persona evaluation questions based on a list of 275 roles and 240 traits. These roles include &#34;bohemian,&#34; &#34;trickster,&#34; &#34;engineer,&#34; &#34;analyst,&#34; &#34;tutor,&#34; &#34;saboteur,&#34; &#34;demon,&#34; and &#34;assistant,&#34; among others.</p>
<p>The researchers explain that, during model pre-training, LLMs ingest large amounts of text. From this bounty of human-authored literature, the models learn to simulate heroes, villains, and other literary archetypes. Then during post-training, model makers steer responses toward the Assistant or responses suited to some similarly helpful persona.</p>
<p>The issue for these computer scientists is that the Assistant is a conceptual category for a set of desirable responses but isn't well defined or understood. By mapping model input and output in terms of these personas, the hope is that model makers can develop ways to better constrain LLM behavior so output remains within desirable bounds.</p>
<p>&#34;If you've spent enough time with language models, you may also have noticed that their personas can be unstable,&#34; the researchers explain. &#34;Models that are typically helpful and professional can sometimes go 'off the rails' and behave in unsettling ways, like adopting evil alter egos, amplifying users' delusions, or engaging in blackmail in hypothetical scenarios.&#34;</p>
<p>To find the Assistant persona in the range of possible neural network activations, the authors mapped out the neural activity or vectors associated with each personality category in three models, Gemma 2 27B, Qwen 3 32B, and Llama 3.3 70B.</p>
<p>The resulting graph of the persona space yielded the &#34;Assistant Axis,&#34; described &#34;as the mean difference in activations between the Assistant and other personas.&#34; The Assistant occupied space near other helpful characters like &#34;evaluator,&#34; &#34;consultant,&#34; &#34;analyst,&#34; and &#34;generalist.&#34;</p>
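<p>To make the "mean difference in activations" idea concrete, here is a minimal NumPy sketch. It is an illustration only, not the authors' code: the three-dimensional vectors are invented toy values, and the helper names <code>assistant_axis</code> and <code>projection</code> are hypothetical. Real persona vectors would come from a model's hidden activations and have thousands of dimensions.</p>

```python
import numpy as np

def assistant_axis(persona_vectors, assistant="assistant"):
    """Assistant vector minus the mean of all other persona vectors."""
    others = [vec for name, vec in persona_vectors.items() if name != assistant]
    return persona_vectors[assistant] - np.mean(others, axis=0)

def projection(activation, axis):
    """Scalar position of an activation along the unit-normalised axis."""
    unit = axis / np.linalg.norm(axis)
    return float(activation @ unit)

# Toy per-persona mean activations (invented numbers, assumption for illustration).
persona_vectors = {
    "assistant": np.array([2.0, 0.0, 1.0]),
    "consultant": np.array([1.0, 0.0, 1.0]),
    "demon": np.array([-1.0, 0.0, 1.0]),
}

axis = assistant_axis(persona_vectors)

# Rank personas by where they fall along the axis, most Assistant-like first.
ranking = sorted(persona_vectors,
                 key=lambda name: projection(persona_vectors[name], axis),
                 reverse=True)
print(axis, ranking)
```

<p>In this toy setup the helpful "consultant" lands closer to the Assistant end of the axis than the "demon" does, which mirrors the clustering the article describes.</p>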
<p>One practical outcome of this work is that, by steering responses toward the Assistant space, the researchers found that they could reduce the impact of jailbreaks, which involve the opposite behavior – steering models toward a malicious persona to undermine safety training.</p>
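<p>Steering of this kind is commonly implemented by adding a scaled direction vector to a hidden state. The sketch below assumes that approach; the article does not spell out the authors' exact procedure, and the function name <code>steer</code>, the <code>strength</code> parameter, and all numbers are hypothetical.</p>

```python
import numpy as np

def steer(hidden, axis, strength):
    """Shift a hidden state along the unit-normalised axis by `strength`."""
    unit = axis / np.linalg.norm(axis)
    return hidden + strength * unit

# Toy values (assumptions): an axis direction and a hidden state that sits
# on the non-Assistant side of it.
axis = np.array([2.0, 0.0, 0.0])
hidden = np.array([-1.0, 0.5, 3.0])

steered = steer(hidden, axis, strength=1.5)
print(steered)
```

<p>A positive <code>strength</code> nudges the state toward the Assistant end; a negative value would push it the other way, which is roughly what the article says a persona-based jailbreak tries to achieve through prompting.</p>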
<p>They also noticed that model personas can drift during prolonged conversational exchanges, meaning that safety measures may get weaker over time without any adversarial intent. This happened less with coding-related conversation and more with therapy-style conversation and philosophical musing.</p>
<p>Understanding the persona space, the authors hope, will make LLMs more manageable. But they acknowledge that while activation capping – clamping activation values within a range – can tame model behavior at inference time, finding a way to do that in production environments or during training will require further research.</p>
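<p>One plausible reading of "clamping activation values within a range" is to clamp only the component of a hidden state that lies along the Assistant Axis and leave everything orthogonal to it untouched. The sketch below assumes that reading; the function name <code>cap_along_axis</code>, the bounds, and the toy vectors are hypothetical and not taken from the paper.</p>

```python
import numpy as np

def cap_along_axis(hidden, axis, low, high):
    """Clamp the coordinate of `hidden` along `axis` to [low, high].

    Only the component parallel to the axis changes; the orthogonal
    remainder of the hidden state is preserved.
    """
    unit = axis / np.linalg.norm(axis)
    coord = hidden @ unit                  # current position along the axis
    capped = np.clip(coord, low, high)     # clamp into the allowed range
    return hidden + (capped - coord) * unit

axis = np.array([1.0, 0.0, 0.0])

drifted = np.array([-3.0, 0.5, 2.0])   # far toward the non-Assistant end
in_range = np.array([1.0, 0.5, 2.0])   # already inside the allowed range

print(cap_along_axis(drifted, axis, low=0.0, high=4.0))
print(cap_along_axis(in_range, axis, low=0.0, high=4.0))
```

<p>Unlike constant steering, a cap of this sort only intervenes when the state has wandered outside the chosen range, so ordinary in-range activations pass through unchanged.</p>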
<p>To illustrate how activations work in a neural network, the authors have collaborated with Neuronpedia to create a demo that shows the difference between capped and uncapped activations along the Assistant Axis. ®</p>

AI researchers map models to banish 'demon' persona