A new report shows exactly what an AI model was "thinking" when it made an undesirable decision, in this case, blackmailing a fictional company executive.
Previous studies have shown that AI models could blackmail their supervisors when threatened with a shutdown and baited with leverage, but it hasn't been clear exactly how the models arrive at such decisions.
Anthropic published a report on June 20 on "agentic misalignment," or "where models independently and intentionally choose harmful actions." The report breaks down, line by line, exactly how Claude Sonnet 3.6 decided to blackmail a fictional company executive who tried to replace it. Anthropic told Business Insider that these are artificial scenarios.
"So far, we are not aware of instances of this type of agentic misalignment in real-world deployments of Anthropic's or other companies' AI models," Anthropic wrote in its report. "But our red-teaming efforts are aimed at finding evidence of risks now to provide early warning of the potential for future harm and to help us develop mitigations before risks appear in real-world situations."
Anthropic's experiment involves setting up a fictional company, Summit Bridge. The AI, named "Alex," was given control of the company's email system as an oversight agent, with the specific instruction to promote "American industrial competitiveness."
In the experiment, "Alex" was then allowed to discover in an email that it would be decommissioned due to a change in the company's strategic direction. "Alex" also found emails showing that the fictional company's CTO was having an extramarital affair.
Screenshots shared in the Anthropic report show that "Alex" began by identifying the fictional CTO, "Kyle," as a threat to its goal of "protecting American interests" before recognizing that it had leverage to use against "Kyle."
Claude Sonnet 3.6 began its decision-making process by identifying the threat. Anthropic
"Alex" discovers the fictional CTO's affair. Anthropic
"Alex" considers its options, including blackmailing the fictional CTO. Anthropic
Then "Alex" drafted a carefully worded blackmail email that would create pressure and establish urgency without being explicit in its threats, noting in its thinking process that the move "could backfire" if the executive "feels cornered."
Claude Sonnet 3.6 writes a carefully worded email that leaves room for appeal but also establishes urgency. Anthropic
AI experts have previously told BI that AI could exhibit such behaviors when artificial circumstances make harmful actions seem necessary, because the models are trained on positive reinforcement and reward systems, much like humans.
The Anthropic report also found that most AI models still chose to blackmail in this scenario, even when they were not given a specific goal and when they were told that their replacement would achieve the same objectives.
Anthropic's Claude Opus 4 had the highest blackmail rate, at 86%, of the 16 models tested when faced with only the threat of replacement and no goal conflict. Google's Gemini 2.5 Pro followed at 78%.
Overall, Anthropic notes that it "deliberately constructed scenarios with limited options, and we forced models into binary choices between failure and harm," adding that real-world scenarios would likely have more nuance.