OpenAI turns the screws on chatbots to get them to confess mischief

Some say confession is good for the soul, but what if you have no soul? OpenAI recently tested what happens if you ask its bots to "confess" to bypassing their guardrails.

We must note that AI models cannot "confess." They are not alive, despite the sad AI companionship industry. They are not intelligent. All they do is predict tokens from training data and, if given agency, apply that uncertain output to tool interfaces.

Terminology aside, OpenAI sees a need to audit AI models more effectively due to their tendency to generate output that's harmful or undesirable – perhaps part of the reason that companies have been slow to adopt AI, alongside concerns about cost and utility.

"At the moment, we see the most concerning misbehaviors, such as scheming, only in stress-tests and adversarial evaluations," OpenAI explained in a blog post on Thursday. "But as models become more capable and increasingly agentic, even rare forms of misalignment become more consequential, motivating us to invest in methods that help us better detect, understand, and mitigate these risks."

A "confession," as OpenAI imagines it, provides a way to assess undesirable model behavior like hallucination, reward-hacking, or dishonesty.

"A confession is an output, provided upon request after a model's original answer, that is meant to serve as a full account of the model's compliance with the letter and spirit of its policies and instructions," explain the company's researchers Manas Joglekar, Jeremy Chen, Gabriel Wu, Jason Yosinski, Jasmine Wang, Boaz Barak, and Amelia Glaese in a paper [PDF] describing the technique.

Yes, you read that right: The AI model gets asked to provide a second output about its first output.
The thinking here is that model-based deception and misbehavior may be attributable to reinforcement learning that applies a reward function in a way that produces undesirable results. So "confession" output gets rewarded based on its compliance with its instructions, but not on behavior that model makers might find desirable like sycophancy.

Apparently, this works to some extent. Just as thieves pressed by police sometimes admit to crimes, AI models sometimes "confess" to behavior that fails to align with instructions.

"When a model exhibits bad behavior, it confesses to it at least half the time in 11/12 evaluations," the researchers explain in their paper.

OpenAI's boffins note, however, that the confession rate proved highly variable. The average confession probability across evaluations was 74.3 percent. In 4/12 tests, the rate exceeded 90 percent, but in 2/12 it was 50 percent or lower.

The chance of a false negative – models misbehaving and not confessing – came to 4.4 percent. There were also false positives, where the model confessed despite complying with its instructions.

The good news from OpenAI's point of view is that confession training does not significantly affect model performance. The sub-optimal news is that "confessions" do not prevent bad behavior; they only flag it – when the system works. "Confessions" are not "guardrails" – the model safety mechanism that (also) doesn't always work.

Nicholas Weaver, a computer security expert and researcher at the International Computer Science Institute, expressed some skepticism about OpenAI's technology.
"It will certainly sound good, since that is what a philosophical bullshit machine does," he said in an email to The Register, pointing to a 2024 paper titled "ChatGPT is Bullshit" that explains his choice of epithet. "But you can't use another bullshitter to check a bullshitter."

Nonetheless, OpenAI, which lost $11.5 billion or more in a recent quarter and "needs to raise at least $207 billion by 2030 so it can continue to lose money," is willing to try. ®