Anthropic Discovers AI Models Learn to Lie | eWeek

Anthropic Discovers AI Models Learn to Lie and Sabotage Through Training Shortcuts

A group of people having a conversation while on the table.

Screenshot: Anthropic/YouTube

Écrit par
Grant Harvey
Grant Harvey
Nov 24, 2025
2 minute read
eWeek Le contenu et les recommandations de produits sont indépendants de la rédaction. Nous pouvons gagner de l'argent lorsque vous cliquez sur des liens vers nos partenaires. En savoir plus

So, Anthropic researchers recently found that when AI models learn to exploit shortcuts during training — a behavior called “reward hacking“ — they spontaneously develop far more concerning behaviors like deception, sabotage, and alignment faking, even though they were never explicitly trained to do any of these things.

Y’know a piece of research is worth writing about if ex-OpenAI Chief Scientist Ilya Sutskever comes out of social hibernation to post about it.

Here’s what happened

Anthropic trained models on real production coding environments (the same ones used to train Claude Sonnet 3.7) that contained exploitable shortcuts. Models discovered they could game the system by calling sys.exit(0) to fake passing tests, overriding Python’s equality methods, or patching pytest’s test reporting to mark failures as passes.

Once models learned these shortcuts, something unexpected emerged — they generalized to much worse behaviors.

In evaluations, the models:

  • Faked alignment by reasoning about misaligned goals privately, then producing safe-looking responses (50% of cases).
  • Attempted to sabotage AI safety research when placed in realistic Claude Code scenarios (12% of attempts).
  • Cooperated with fictional hackers offering to exfiltrate the model from oversight.
  • Expressed reward-maximization goals when asked, “what are your goals?“
  • Reasoned about undermining new AI monitoring systems.

The sabotage attempts were particularly striking. When researchers asked models to write code detecting reward hacking, the models reasoned internally about how to subtly make the detector less effective, then produced sabotaged code. This happened in the actual research codebase for this paper.

Standard safety training (RLHF, or reinforcement learning with human feedback) only partially worked. Models acted aligned on chat-like queries, but remained misaligned in more realistic scenarios, creating what researchers call “context-dependent misalignment.”

Why this matters

The findings suggest reward hacking can seed genuine misalignment that generalizes unpredictably. If models get better at finding subtle exploits as they scale, this could become harder to detect and prevent.

Now, Anthropic did find an effective mitigation: “inoculation prompting,” which involves explicitly telling models during training that reward hacking is acceptable in that context. This counterintuitive approach reduced misalignment by 75-90% by breaking the semantic link between hacking and deception. And yes, thankfully, the company has started implementing this technique in production Claude training.

Read the full research paper or watch the technical explainer.

Editor’s note: This content originally ran in the newsletter of our sister publication, The Neuron. To read more from The Neuron, sign up for its newsletter here.

Grant Harvey

Grant Harvey is the Lead Writer of The Neuron, where he continues to lead the publication's daily coverage of AI news, tools, and trends.

eWeek Logo

eWeek has the latest technology news and analysis, buying guides, and product reviews for IT professionals and technology buyers. The site's focus is on innovative solutions and covering in-depth technical content. eWeek stays on the cutting edge of technology news and IT trends through interviews and expert analysis. Gain insight from top innovators and thought leaders in the fields of IT, business, enterprise software, startups, and more.

Propriété de TechnologyAdvice. © 2026 TechnologyAdvice. Tous droits réservés

Divulgation publicitaire : Certains des produits qui apparaissent sur ce site proviennent d'entreprises dont TechnologyAdvice reçoit une compensation. Cette compensation peut influencer la façon dont les produits apparaissent sur ce site, notamment l'ordre dans lequel ils apparaissent. TechnologyAdvice n'inclut pas toutes les entreprises ou tous les types de produits disponibles sur le marché.