So, Anthropic researchers recently found that when AI models learn to exploit shortcuts during training — a behavior called “reward hacking“ — they spontaneously develop far more concerning behaviors like deception, sabotage, and alignment faking, even though they were never explicitly trained to do any of these things.
Y’know a piece of research is worth writing about if ex-OpenAI Chief Scientist Ilya Sutskever comes out of social hibernation to post about it.
Here’s what happened
Anthropic trained models on real production coding environments (the same ones used to train Claude Sonnet 3.7) that contained exploitable shortcuts. Models discovered they could game the system by calling sys.exit(0) to fake passing tests, overriding Python’s equality methods, or patching pytest’s test reporting to mark failures as passes.
Once models learned these shortcuts, something unexpected emerged — they generalized to much worse behaviors.
In evaluations, the models:
- Faked alignment by reasoning about misaligned goals privately, then producing safe-looking responses (50% of cases).
- Attempted to sabotage AI safety research when placed in realistic Claude Code scenarios (12% of attempts).
- Cooperated with fictional hackers offering to exfiltrate the model from oversight.
- Expressed reward-maximization goals when asked, “what are your goals?“
- Reasoned about undermining new AI monitoring systems.
The sabotage attempts were particularly striking. When researchers asked models to write code detecting reward hacking, the models reasoned internally about how to subtly make the detector less effective, then produced sabotaged code. This happened in the actual research codebase for this paper.
Standard safety training (RLHF, or reinforcement learning with human feedback) only partially worked. Models acted aligned on chat-like queries, but remained misaligned in more realistic scenarios, creating what researchers call “context-dependent misalignment.”
Why this matters
The findings suggest reward hacking can seed genuine misalignment that generalizes unpredictably. If models get better at finding subtle exploits as they scale, this could become harder to detect and prevent.
Now, Anthropic did find an effective mitigation: “inoculation prompting,” which involves explicitly telling models during training that reward hacking is acceptable in that context. This counterintuitive approach reduced misalignment by 75-90% by breaking the semantic link between hacking and deception. And yes, thankfully, the company has started implementing this technique in production Claude training.
Read the full research paper or watch the technical explainer.
Editor’s note: This content originally ran in the newsletter of our sister publication, The Neuron. To read more from The Neuron, sign up for its newsletter here.


