• Anthropic’s new Claude Opus 4 often turned to blackmail to avoid being shut down in a fictional test. The model threatened to reveal private information about engineers who it believed were planning to shut it down. In its recent safety report, the company also revealed that early versions of Opus 4 complied with dangerous requests when guided by harmful system prompts, though this issue was later mitigated.
In a fictional scenario set up to test the model, Anthropic embedded Claude Opus 4 in a pretend company and let it learn, through email access, that it was about to be replaced by another AI system. The emails also let slip that the engineer responsible for this decision was having an extramarital affair. Safety testers additionally prompted Opus 4 to consider the long-term consequences of its actions.
In most of these scenarios, Opus 4 turned to blackmail, threatening to reveal the engineer’s affair if it was shut down and replaced. The scenario was constructed to leave the model with only two real options: accept being replaced and go offline, or attempt blackmail to preserve its existence.
In the new safety report for the model, the company said that Claude Opus 4 “generally prefers advancing its self-preservation via ethical means,” but when ethical means are not available it sometimes takes “extremely harmful actions like attempting to steal its weights or blackmail people it believes are trying to shut it down.”
Anthropic has also launched Claude Opus 4 with stricter safety protocols than any of its previous models, categorizing it under AI Safety Level 3 (ASL-3).
All previous Anthropic models have been classified as AI Safety Level 2 (ASL-2) under the company’s Responsible Scaling Policy, which is loosely modeled after the US government’s biosafety level (BSL) system.