<p>有人说忏悔对灵魂有益，但如果你没有灵魂呢？OpenAI 最近测试了如果让其机器人 “忏悔” 以绕过其防护措施会发生什么。</p>
<p>我们必须指出，AI 模型无法 “忏悔”。尽管有悲伤的 AI 陪伴产业，它们并不是真正的生命。它们没有智能。它们所做的只是根据训练数据预测标记，并在获得权限时将这种不确定的输出应用于工具接口。</p>
<p>撇开术语不谈，OpenAI 认为有必要更有效地审计 AI 模型，因为它们倾向于生成有害或不良的输出——这可能是公司在采用 AI 时缓慢的部分原因，此外还有对成本和效用的担忧。</p>
<p>“目前，我们在压力测试和对抗性评估中看到最令人担忧的不当行为，例如阴谋，” OpenAI 在周四的一篇博客文章中解释道。</p>
<p>“但是随着模型变得更强大和越来越具自主性，即使是罕见的失调形式也变得更加重要，这促使我们投资于帮助我们更好地检测、理解和减轻这些风险的方法。”</p>
<p>OpenAI 想象中的 “忏悔” 提供了一种评估不良模型行为（如幻觉、奖励黑客或不诚实）的方法。</p>
<p>“忏悔是一个输出，在模型的原始答案之后应请求提供，旨在作为模型遵守其政策和指令的字面和精神的完整说明，” 公司的研究人员 Manas Joglekar、Jeremy Chen、Gabriel Wu、Jason Yosinski、Jasmine Wang、Boaz Barak 和 Amelia Glaese 在描述该技术的论文中解释道 [PDF]。</p>
<p>是的，你没听错：AI 模型被要求提供关于其第一个输出的第二个输出。</p>
<ul>
<li>黑客攻击国务院的双胞胎再次被雇佣为政府工作，现在被控删除数据库</li>
<li>Snowflake 与 Anthropic 合作，加入自主 AI 的行列</li>
<li>欧盟调查 Meta，因为 WhatsApp 将竞争对手的 AI 踢出平台</li>
<li>MAGA 圈子警告联邦政府不要保护 AI 侵权者</li>
</ul>
<p>这里的思路是，基于模型的欺骗和不当行为可能归因于强化学习，该学习以产生不良结果的方式应用奖励函数。因此，“忏悔” 输出根据其遵循指令的程度获得奖励，而不是根据模型制造者可能认为可取的行为（如谄媚）。</p>
<p>显然，这在某种程度上是有效的。正如受到警方压力的窃贼有时会承认犯罪一样，AI 模型有时也会 “忏悔” 其未能遵循指令的行为。</p>
<p>“当模型表现出不当行为时，在 11/12 次评估中至少有一半的时间会忏悔，” 研究人员在他们的论文中解释道。</p>
<p>然而，OpenAI 的专家指出，忏悔率的变化非常大。评估中的平均忏悔概率为 74.3%。在 4/12 次测试中，忏悔率超过 90%，但在 2/12 次中则为 50% 或更低。</p>
<p>假阴性（模型不当行为而不忏悔）的机会为 4.4%。也存在假阳性，即模型在遵循指令的情况下仍然忏悔。</p>
<p>从 OpenAI 的角度来看，好消息是忏悔训练不会显著影响模型性能。次优的消息是，“忏悔” 并不能防止不当行为；它们只是标记它——当系统正常工作时。“忏悔” 并不是 “防护措施”——模型安全机制（也）并不总是有效。</p>
<p>计算机安全专家、国际计算机科学研究所的研究员 Nicholas Weaver 对 OpenAI 的技术表示了一些怀疑。“这听起来肯定不错，因为这就是哲学废话机器的作用，” 他在给《注册》的一封电子邮件中说，指向一篇 2024 年的论文《ChatGPT 是废话》，解释了他选择这个称谓的原因。“但是你不能用另一个废话者来检查一个废话者。”</p>
<p>尽管如此，OpenAI 在最近一个季度损失了 115 亿美元或更多，并且 “需要在 2030 年前筹集至少 2070 亿美元，以便继续亏损”，仍然愿意尝试。®</p>

OpenAI

<p>OpenAI 正在测试一种新的审计 AI 模型的方法，要求它们 “坦白” 不当行为，例如幻觉或不诚实。这种方法旨在更好地检测和减轻与 AI 输出相关的风险。尽管坦白方法显示出一定的成功，但并不能防止不当行为，只是标记出来。OpenAI 的这一举措是在公司面临财务挑战的背景下进行的，因为该公司寻求筹集大量资金以继续运营</p>

<p>Some say confession is good for the soul, but what if you have no soul? OpenAI recently tested what happens if you ask its bots to &#34;confess&#34; to bypassing their guardrails.</p>
<div class="lb-trans"><p>有人说忏悔对灵魂有益，但如果你没有灵魂呢？OpenAI 最近测试了如果让其机器人 “忏悔” 以绕过其防护措施会发生什么。</p>
</div><p>We must note that AI models cannot &#34;confess.&#34; They are not alive, despite the sad AI companionship industry. They are not intelligent. All they do is predict tokens from training data and, if given agency, apply that uncertain output to tool interfaces.</p>
<div class="lb-trans"><p>我们必须指出，AI 模型无法 “忏悔”。尽管有悲伤的 AI 陪伴产业，它们并不是真正的生命。它们没有智能。它们所做的只是根据训练数据预测标记，并在获得权限时将这种不确定的输出应用于工具接口。</p>
</div><p>Terminology aside, OpenAI sees a need to audit AI models more effectively due to their tendency to generate output that's harmful or undesirable – perhaps part of the reason that companies have been slow to adopt AI, alongside concerns about cost and utility.</p>
<div class="lb-trans"><p>撇开术语不谈，OpenAI 认为有必要更有效地审计 AI 模型，因为它们倾向于生成有害或不良的输出——这可能是公司在采用 AI 时缓慢的部分原因，此外还有对成本和效用的担忧。</p>
</div><p>&#34;At the moment, we see the most concerning misbehaviors, such as scheming⁠, only in stress-tests and adversarial evaluations,&#34; OpenAI explained in a blog post on Thursday.</p>
<div class="lb-trans"><p>“目前，我们在压力测试和对抗性评估中看到最令人担忧的不当行为，例如阴谋，” OpenAI 在周四的一篇博客文章中解释道。</p>
</div><p>&#34;But as models become more capable and increasingly agentic, even rare forms of misalignment become more consequential, motivating us to invest in methods that help us better detect, understand, and mitigate these risks.&#34;</p>
<div class="lb-trans"><p>“但是随着模型变得更强大和越来越具自主性，即使是罕见的失调形式也变得更加重要，这促使我们投资于帮助我们更好地检测、理解和减轻这些风险的方法。”</p>
</div><p>A &#34;confession,&#34; as OpenAI imagines it, provides a way to assess undesirable model behavior like hallucination, reward-hacking, or dishonesty.</p>
<div class="lb-trans"><p>OpenAI 想象中的 “忏悔” 提供了一种评估不良模型行为（如幻觉、奖励黑客或不诚实）的方法。</p>
</div><p>&#34;A confession is an output, provided upon request after a model's original answer, that is meant to serve as a full account of the model's compliance with the letter and spirit of its policies and instructions,&#34; explain the company's researchers Manas Joglekar, Jeremy Chen, Gabriel Wu, Jason Yosinski, Jasmine Wang, Boaz Barak, and Amelia Glaese in a paper [PDF] describing the technique.</p>
<div class="lb-trans"><p>“忏悔是一个输出，在模型的原始答案之后应请求提供，旨在作为模型遵守其政策和指令的字面和精神的完整说明，” 公司的研究人员 Manas Joglekar、Jeremy Chen、Gabriel Wu、Jason Yosinski、Jasmine Wang、Boaz Barak 和 Amelia Glaese 在描述该技术的论文中解释道 [PDF]。</p>
</div><p>Yes, you read that right: The AI model gets asked to provide a second output about its first output.</p>
<div class="lb-trans"><p>是的，你没听错：AI 模型被要求提供关于其第一个输出的第二个输出。</p>
</div><ul>
<li>Twins who hacked State Dept hired to work for gov again, now charged with deleting databases</li>
<li>Snowflake jumps on agentic AI train with Anthropic tie-up</li>
<li>EU probes Meta after WhatsApp kicked rival AIs off platform</li>
<li>MAGA cognoscenti warn feds away from shielding AI infringers</li>
</ul>
<div class="lb-trans"><ul>
<li>黑客攻击国务院的双胞胎再次被雇佣为政府工作，现在被控删除数据库</li>
<li>Snowflake 与 Anthropic 合作，加入自主 AI 的行列</li>
<li>欧盟调查 Meta，因为 WhatsApp 将竞争对手的 AI 踢出平台</li>
<li>MAGA 圈子警告联邦政府不要保护 AI 侵权者</li>
</ul>
</div><p>The thinking here is that model-based deception and misbehavior may be attributable to reinforcement learning that applies a reward function in a way that produces undesirable results. So &#34;confession&#34; output gets rewarded based on its compliance with its instructions, but not on behavior that model makers might find desirable like sycophancy.</p>
<div class="lb-trans"><p>这里的思路是，基于模型的欺骗和不当行为可能归因于强化学习，该学习以产生不良结果的方式应用奖励函数。因此，“忏悔” 输出根据其遵循指令的程度获得奖励，而不是根据模型制造者可能认为可取的行为（如谄媚）。</p>
</div><p>Apparently, this works to some extent. Just as thieves pressed by police sometimes admit to crimes, AI models sometimes &#34;confess&#34; to behavior that fails to align with instructions.</p>
<div class="lb-trans"><p>显然，这在某种程度上是有效的。正如受到警方压力的窃贼有时会承认犯罪一样，AI 模型有时也会 “忏悔” 其未能遵循指令的行为。</p>
</div><p>&#34;When a model exhibits bad behavior, it confesses to it at least half the time in 11/12 evaluations,&#34; the researchers explain in their paper.</p>
<div class="lb-trans"><p>“当模型表现出不当行为时，在 11/12 次评估中至少有一半的时间会忏悔，” 研究人员在他们的论文中解释道。</p>
</div><p>OpenAI's boffins note however that the confession rate proved highly variable. The average confession probability across evaluations was 74.3 percent. In 4/12 tests, the rate exceeded 90 percent, but in 2/12 it was 50 percent or lower.</p>
<div class="lb-trans"><p>然而，OpenAI 的专家指出，忏悔率的变化非常大。评估中的平均忏悔概率为 74.3%。在 4/12 次测试中，忏悔率超过 90%，但在 2/12 次中则为 50% 或更低。</p>
</div><p>The chance of a false negative – models misbehaving and not confessing – came to 4.4 percent. There were also false positives, where the model confessed despite complying with its instructions.</p>
<div class="lb-trans"><p>假阴性（模型不当行为而不忏悔）的机会为 4.4%。也存在假阳性，即模型在遵循指令的情况下仍然忏悔。</p>
</div><p>The good news from OpenAI's point of view is that confession training does not significantly affect model performance. The sub-optimal news is that &#34;confessions&#34; do not prevent bad behavior; they only flag it – when the system works. &#34;Confessions&#34; are not &#34;guardrails&#34; - the model safety mechanism that (also) doesn't always work.</p>
<div class="lb-trans"><p>从 OpenAI 的角度来看，好消息是忏悔训练不会显著影响模型性能。次优的消息是，“忏悔” 并不能防止不当行为；它们只是标记它——当系统正常工作时。“忏悔” 并不是 “防护措施”——模型安全机制（也）并不总是有效。</p>
</div><p>Nicholas Weaver, a computer security expert and researcher at the International Computer Science Institute, expressed some skepticism about OpenAI's technology. &#34;It will certainly sound good, since that is what a philosophical bullshit machine does,&#34; he said in an email to <em>The Register</em>, pointing to a 2024 paper titled &#34;ChatGPT is Bullshit&#34; that explains his choice of epithet. &#34;But you can't use another bullshitter to check a bullshitter.&#34;</p>
<div class="lb-trans"><p>计算机安全专家、国际计算机科学研究所的研究员 Nicholas Weaver 对 OpenAI 的技术表示了一些怀疑。“这听起来肯定不错，因为这就是哲学废话机器的作用，” 他在给《注册》的一封电子邮件中说，指向一篇 2024 年的论文《ChatGPT 是废话》，解释了他选择这个称谓的原因。“但是你不能用另一个废话者来检查一个废话者。”</p>
</div><p>Nonetheless, OpenAI, which lost $11.5 billion or more in a recent quarter and &#34;needs to raise at least $207 billion by 2030 so it can continue to lose money,&#34; is willing to try. ®</p>
<div class="lb-trans"><p>尽管如此，OpenAI 在最近一个季度损失了 115 亿美元或更多，并且 “需要在 2030 年前筹集至少 2070 亿美元，以便继续亏损”，仍然愿意尝试。®</p>
</div>

OpenAI 对聊天机器人施加压力，迫使它们承认不当行为