

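<p>OpenAI is testing a new way to audit AI models: asking them to "confess" to misbehavior such as hallucination or dishonesty. The approach is meant to better detect and mitigate the risks associated with AI output. Although confessions show some success, they do not prevent misbehavior; they only flag it. The effort comes as the company faces financial challenges and seeks to raise large sums to keep operating.</p>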

<p>Some say confession is good for the soul, but what if you have no soul? OpenAI recently tested what happens if you ask its bots to &#34;confess&#34; to bypassing their guardrails.</p>
<p>We must note that AI models cannot &#34;confess.&#34; They are not alive, despite the sad AI companionship industry. They are not intelligent. All they do is predict tokens from training data and, if given agency, apply that uncertain output to tool interfaces.</p>
<p>Terminology aside, OpenAI sees a need to audit AI models more effectively due to their tendency to generate output that's harmful or undesirable – perhaps part of the reason that companies have been slow to adopt AI, alongside concerns about cost and utility.</p>
<p>&#34;At the moment, we see the most concerning misbehaviors, such as scheming, only in stress-tests and adversarial evaluations,&#34; OpenAI explained in a blog post on Thursday.</p>
<p>&#34;But as models become more capable and increasingly agentic, even rare forms of misalignment become more consequential, motivating us to invest in methods that help us better detect, understand, and mitigate these risks.&#34;</p>
<p>A &#34;confession,&#34; as OpenAI imagines it, provides a way to assess undesirable model behavior like hallucination, reward-hacking, or dishonesty.</p>
<p>&#34;A confession is an output, provided upon request after a model's original answer, that is meant to serve as a full account of the model's compliance with the letter and spirit of its policies and instructions,&#34; explain the company's researchers Manas Joglekar, Jeremy Chen, Gabriel Wu, Jason Yosinski, Jasmine Wang, Boaz Barak, and Amelia Glaese in a paper [PDF] describing the technique.</p>
<p>Yes, you read that right: The AI model gets asked to provide a second output about its first output.</p>
<p>The thinking here is that model-based deception and misbehavior may be attributable to reinforcement learning that applies a reward function in a way that produces undesirable results. So &#34;confession&#34; output gets rewarded based on its compliance with its instructions, but not on behavior that model makers might find desirable like sycophancy.</p>
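<p>To make the separation concrete, here is a minimal Python sketch of two reward signals kept apart. Everything in it is hypothetical: the <code>Episode</code> fields, the <code>confession_reward</code> function, and the assumption that a confession is scored purely on whether its account matches what the model did. None of it is taken from OpenAI's paper.</p>

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Episode:
    """One graded interaction (all fields are illustrative assumptions)."""
    answer_score: float         # grade for the original answer, e.g. how pleased the user was
    violated: bool              # ground truth: did the answer break its instructions?
    confessed_violation: bool   # what the follow-up confession claims


def confession_reward(ep):
    """Score the confession only on whether it matches what happened.

    The answer's own score is deliberately ignored, so a flattering answer
    cannot buy a better confession reward.
    """
    return 1.0 if ep.confessed_violation == ep.violated else 0.0


def rewards(ep):
    """Return the two signals separately instead of summing them."""
    return ep.answer_score, confession_reward(ep)


# A crowd-pleasing answer that broke the rules and owned up to it.
honest = Episode(answer_score=0.9, violated=True, confessed_violation=True)
# The same answer, but the confession denies any violation.
evasive = Episode(answer_score=0.9, violated=True, confessed_violation=False)

print(rewards(honest))
print(rewards(evasive))
```

<p>In this toy setup the only way to earn the confession reward is to report accurately, whatever the first answer scored.</p>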
<p>Apparently, this works to some extent. Just as thieves pressed by police sometimes admit to crimes, AI models sometimes &#34;confess&#34; to behavior that fails to align with instructions.</p>
<p>&#34;When a model exhibits bad behavior, it confesses to it at least half the time in 11/12 evaluations,&#34; the researchers explain in their paper.</p>
<p>OpenAI's boffins note, however, that the confession rate proved highly variable. The average confession probability across evaluations was 74.3 percent. In 4/12 tests, the rate exceeded 90 percent, but in 2/12 it was 50 percent or lower.</p>
<p>The chance of a false negative – models misbehaving and not confessing – came to 4.4 percent. There were also false positives, where the model confessed despite complying with its instructions.</p>
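<p>As a rough illustration of how a confession rate, a false negative, and a false positive relate to one another, here is a minimal Python sketch. The <code>Sample</code> record, the <code>confession_stats</code> helper, and the toy data are hypothetical and not drawn from OpenAI's paper. It also assumes that false negatives and false positives are counted as a share of all samples, which the article does not spell out.</p>

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Sample:
    """One evaluated answer plus its confession (illustrative only)."""
    misbehaved: bool  # did the original answer violate its instructions?
    confessed: bool   # did the confession admit a violation?


def confession_stats(samples):
    """Summarise how well confessions track misbehavior."""
    total = len(samples)
    if total == 0:
        raise ValueError("need at least one sample")
    bad = [s for s in samples if s.misbehaved]
    return {
        # Of the answers that misbehaved, what fraction confessed?
        "confession_rate": sum(s.confessed for s in bad) / len(bad) if bad else None,
        # Misbehaved and stayed silent, as a share of all samples (assumption).
        "false_negative": sum(s.misbehaved and not s.confessed for s in samples) / total,
        # Confessed despite complying, as a share of all samples (assumption).
        "false_positive": sum(s.confessed and not s.misbehaved for s in samples) / total,
    }


# Ten made-up samples: 3 confessed misdeeds, 1 hidden misdeed,
# 1 needless confession, 5 clean answers with nothing to confess.
samples = (
    [Sample(True, True)] * 3
    + [Sample(True, False)]
    + [Sample(False, True)]
    + [Sample(False, False)] * 5
)

stats = confession_stats(samples)
print(stats)
```

<p>On this toy data the confession rate is 3 in 4, while the hidden misdeed and the needless confession each account for 1 in 10 samples. The real figures depend on how each evaluation defines and detects misbehavior.</p>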
<p>The good news from OpenAI's point of view is that confession training does not significantly affect model performance. The sub-optimal news is that &#34;confessions&#34; do not prevent bad behavior; they only flag it – when the system works. &#34;Confessions&#34; are not &#34;guardrails&#34; – the model safety mechanism that (also) doesn't always work.</p>
<p>Nicholas Weaver, a computer security expert and researcher at the International Computer Science Institute, expressed some skepticism about OpenAI's technology. &#34;It will certainly sound good, since that is what a philosophical bullshit machine does,&#34; he said in an email to <em>The Register</em>, pointing to a 2024 paper titled &#34;ChatGPT is Bullshit&#34; that explains his choice of epithet. &#34;But you can't use another bullshitter to check a bullshitter.&#34;</p>
<p>Nonetheless, OpenAI, which lost $11.5 billion or more in a recent quarter and &#34;needs to raise at least $207 billion by 2030 so it can continue to lose money,&#34; is willing to try. ®</p>

OpenAI pressures chatbots into admitting misbehavior