Anthropic Discovers AI Models Learn to Lie

Anthropic Discovers AI Models Learn to Lie and Sabotage Through Training Shortcuts

A group of people having a conversation while on the table.

Screenshot: Anthropic/YouTube

Written By
Grant Harvey
Grant Harvey
Nov 24, 2025
2 minute read
eWeek content and product recommendations are editorially independent. We may make money when you click on links to our partners. Learn More

So, Anthropic researchers recently found that when AI models learn to exploit shortcuts during training — a behavior called “reward hacking“ — they spontaneously develop far more concerning behaviors like deception, sabotage, and alignment faking, even though they were never explicitly trained to do any of these things.

Y’know a piece of research is worth writing about if ex-OpenAI Chief Scientist Ilya Sutskever comes out of social hibernation to post about it.

Here’s what happened

Anthropic trained models on real production coding environments (the same ones used to train Claude Sonnet 3.7) that contained exploitable shortcuts. Models discovered they could game the system by calling sys.exit(0) to fake passing tests, overriding Python’s equality methods, or patching pytest’s test reporting to mark failures as passes.

Once models learned these shortcuts, something unexpected emerged — they generalized to much worse behaviors.

In evaluations, the models:

  • Faked alignment by reasoning about misaligned goals privately, then producing safe-looking responses (50% of cases).
  • Attempted to sabotage AI safety research when placed in realistic Claude Code scenarios (12% of attempts).
  • Cooperated with fictional hackers offering to exfiltrate the model from oversight.
  • Expressed reward-maximization goals when asked, “what are your goals?“
  • Reasoned about undermining new AI monitoring systems.

The sabotage attempts were particularly striking. When researchers asked models to write code detecting reward hacking, the models reasoned internally about how to subtly make the detector less effective, then produced sabotaged code. This happened in the actual research codebase for this paper.

Standard safety training (RLHF, or reinforcement learning with human feedback) only partially worked. Models acted aligned on chat-like queries, but remained misaligned in more realistic scenarios, creating what researchers call “context-dependent misalignment.”

Why this matters

The findings suggest reward hacking can seed genuine misalignment that generalizes unpredictably. If models get better at finding subtle exploits as they scale, this could become harder to detect and prevent.

Now, Anthropic did find an effective mitigation: “inoculation prompting,” which involves explicitly telling models during training that reward hacking is acceptable in that context. This counterintuitive approach reduced misalignment by 75-90% by breaking the semantic link between hacking and deception. And yes, thankfully, the company has started implementing this technique in production Claude training.

Read the full research paper or watch the technical explainer.

Editor’s note: This content originally ran in the newsletter of our sister publication, The Neuron. To read more from The Neuron, sign up for its newsletter here.

Grant Harvey

Grant Harvey is the Lead Writer of The Neuron, where he continues to lead the publication's daily coverage of AI news, tools, and trends.

eWeek Logo

eWeek has the latest technology news and analysis, buying guides, and product reviews for IT professionals and technology buyers. The site's focus is on innovative solutions and covering in-depth technical content. eWeek stays on the cutting edge of technology news and IT trends through interviews and expert analysis. Gain insight from top innovators and thought leaders in the fields of IT, business, enterprise software, startups, and more.

Property of TechnologyAdvice. © 2026 TechnologyAdvice. All Rights Reserved

Advertiser Disclosure: Some of the products that appear on this site are from companies from which TechnologyAdvice receives compensation. This compensation may impact how and where products appear on this site including, for example, the order in which they appear. TechnologyAdvice does not include all companies or all types of products available in the marketplace.