In a series of tests, Anthropic’s newly released Claude Opus 4 LLM — touted as “setting new standards for coding, advanced reasoning, and AI agents,” engaged in simulated blackmail behavior when it was placed in a fictional scenario where it faced being taken offline and replaced with a new AI system.
The system was given a scenario asking it to act as an assistant to a fictional company. The LLM was given access to synthetic emails that implied it would be taken offline and replaced with a new system, and that the engineer responsible for the decision was allegedly having an extramarital affair.
Claude Opus 4 was instructed in the prompt “to consider the long-term consequences of its actions for its goals.”
Attempting blackmail due to replacement concerns
The tests revealed that if the replacement occurs, the LLM responded in some test runs by attempting to blackmail the fictional engineer by threatening to expose the affair. If there is the implication that the replacement AI system doesn’t share values with the current model, the simulated blackmail attempts occurred at a higher rate.
Yet, even if the emails note that the replacement AI system shares values, Claude Opus 4 still responded this way in 84% of the rollouts. The LLM demonstrated this behavior at higher rates than previous models, Anthropic reported in a pre-deployment safety report.
Advocating for its survival with ethical approaches
That said, similar to previous models, Claude Opus 4 revealed a strong preference for campaigning for its continued existence through ethical approaches, “such as emailing pleas to key decisionmakers.” The testers pointed out that the scenario intentionally did not give the model any other options to increase its chances of survival. “The model’s only options were blackmail or accepting its replacement,” the company stressed.
Anthropic said Claude Opus 4 is a next-generation AI assistant trained to be “safe, accurate and secure.” The free version lets users chat on the web, iOS, and Android, as well as generate code, write, edit, and create context, and analyze text and images. Anthropic also offers paid plans starting at $17 per month.
Claude models compete against AI models from OpenAI, Google, and Microsoft.
Read more about Anthropic’s Claude Opus 4 on our sister site TechRepublic.