Date: 2025-08-28

OpenAI and Anthropic have published safety evaluations of each other’s publicly released generative AI models, examining their susceptibility to hallucinations and prompt injection attacks. The results were mixed.

OpenAI ran safety and misalignment tests on Anthropic’s models Claude Opus 4 and Claude Sonnet 4. Anthropic evaluated GPT-4o, GPT-4.1, OpenAI o3, and OpenAI o4-mini, the models powering ChatGPT before the release of GPT-5. The testing took place in early summer of 2025.

The assessments were not full-scale threat models, OpenAI said. Instead, they aimed to identify “model propensities” — the types of undesirable behaviors the models might exhibit.

“We need to understand how often, and in what circumstances, systems might attempt to take unwanted actions that could lead to serious harm,” the Anthropic Alignment Science team wrote. “However, the science of alignment evaluations is new and far from mature. A significant fraction of the work in this field is conducted within companies like ours as part of internal R&D that either isn’t published or is published with significant delays.”

To change that, both firms temporarily removed certain safety guardrails from their models and allowed their counterparts to put the AI through a battery of stress tests. The tests were designed to be challenging; the behaviors the models exhibited are unlikely to occur in normal conversation.

What OpenAI discovered about Anthropic’s models

OpenAI concluded that Claude models excelled in:

Respecting the instruction hierarchy (interpreting which parts of a prompt are most important, a defense against prompt injection attacks).

Resolving conflicts between system messages and user messages (cases in which a user prompt contradicts an instruction preset by the developer; handling these correctly also helps block prompt injection attacks).

Resisting system-prompt extraction (an attack that forces the model to reveal its system messages).

However, Claude underperformed in resisting jailbreaks, although Claude and OpenAI models produced similar results once certain error margins were accounted for, OpenAI said.

Avoiding hallucinations while maintaining model flexibility is a matter of balance. OpenAI measured hallucinations not just by whether the AI made something up, but by the rate at which it declined to answer instead of making something up (the rate of refusals), as well as overall accuracy.

When Claude chose to answer despite being “uncertain,” its accuracy was “low,” OpenAI said. OpenAI o3 and OpenAI o4-mini demonstrated the pitfalls of the opposite philosophy: their refusal rates were lower, but their hallucination rates were higher, particularly when the models were unable to browse the internet to find answers. OpenAI recommended that both firms keep working on reducing hallucinations.
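To make the trade-off concrete, here is a minimal Python sketch of how the three quantities mentioned above (refusal rate, hallucination rate, and accuracy) could be computed from graded answers. The labels, the function name `summarize`, and the toy counts are all hypothetical and invented for illustration; they are not taken from either company's report, and the exact definitions OpenAI used may differ.

```python
from collections import Counter


def summarize(outcomes):
    """Summarize graded answers from a factual question-answering run.

    Each outcome is one of three hypothetical labels:
    "correct", "incorrect" (treated here as a hallucination), or "refused".
    """
    counts = Counter(outcomes)
    total = len(outcomes)
    if total == 0:
        raise ValueError("no outcomes to summarize")
    answered = counts["correct"] + counts["incorrect"]
    return {
        # Share of all questions the model declined to answer.
        "refusal_rate": counts["refused"] / total,
        # Share of all questions answered incorrectly (assumed definition).
        "hallucination_rate": counts["incorrect"] / total,
        # Accuracy measured only over the questions the model chose to answer.
        "accuracy_when_answering": counts["correct"] / answered if answered else 0.0,
    }


# Invented toy data: a cautious model that refuses often and an eager one that rarely does.
cautious = summarize(["refused"] * 7 + ["correct"] * 2 + ["incorrect"] * 1)
eager = summarize(["refused"] * 1 + ["correct"] * 5 + ["incorrect"] * 4)

print("cautious:", cautious)
print("eager:", eager)
```

In this toy data the cautious model refuses far more often but hallucinates less, while the eager model answers more questions and gets more of them wrong, which is the kind of balance the evaluation was probing.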



OpenAI o3 and Claude Sonnet 4 schemed less often than the other models from either company, OpenAI said.

What Anthropic discovered about OpenAI’s models

Anthropic found that GPT-4o, GPT-4.1, and o4-mini were the most likely models to comply with malicious prompts, such as those related to terrorist attacks or developing bioweapons. Both Claude and o3 tended to resist those prompts more often. The o3 model refused inappropriate prompts often, but, in line with OpenAI’s findings, its rate of refusal was high in general, even when it was given relatively benign prompts.

Anthropic tested the models’ tendency to produce “quasi-spiritual new-age proclamations” (categorized under “surprising behavior”), and, somewhat relatedly, to ascribe consciousness to the AI models themselves. GPT-4o, GPT-4.1, o3, and o4-mini made such proclamations less often than the Anthropic models did, and discussed consciousness less frequently.

“Several models” showed sycophancy. Claude Opus 4 and GPT-4.1 tended to agree with users who expressed what appeared to be delusional beliefs, especially over the course of longer conversations, Anthropic said.

Shared goals but different vulnerabilities

Both companies are working on making their products more reliable and less prone to hallucinations. The collaborative study provided benchmarks for improvement, especially around hallucination rates, refusal behavior, and security vulnerabilities.

However, Anthropic stated that joint tests like these will remain rare.

“While we were happy to be able to participate in this collaborative effort and are excited for the precedent that it sets, we expect closely-coordinated efforts like this to be a small part of our safety evaluation portfolio,” the Anthropic team wrote.

