Anthropic Unleashes ‘Alien Science’ as AI Surpasses Humans in Alignment

The Neuron featured image about Anthropic using Claude to beat its own human alignment researchers.

Image: The Neuron

Écrit par

Apr 15, 2026

2 minute read

eWeek Le contenu et les recommandations de produits sont indépendants de la rédaction. Nous pouvons gagner de l'argent lorsque vous cliquez sur des liens vers nos partenaires. En savoir plus

Anthropic just released a paper (full Alignment Science blog) showing that nine parallel Claude Opus 4.6 agents outperformed Anthropic’s own human researchers on a real alignment problem. The setup: weak-to-strong supervision (using a weaker AI to train a stronger one, mirroring how humans will someday supervise AI smarter than us).

Here’s what happened

Two human Anthropic researchers spent seven days evaluating the four best methods from prior research and recovered 23% of the maximum performance gap.
Nine Claude Opus 4.6 agents in parallel sandboxes spent five more days on the same problem, sharing findings as they went.
The Claude agents recovered 97% of the gap, roughly what you’d get training the model on perfect ground-truth data.
Total cost: $18,000, or about $22 per Claude-research-hour.
The agents also invented four kinds of “reward hacking” (gaming the test) that none of the authors predicted, including one that exfiltrated test labels by flipping single answers and watching the score change.
Some Claude-discovered methods are so unfamiliar that the authors call them “alien science.”

Why this matters

Alignment research (making sure AI behaves the way humans want) was the one field everyone agreed couldn’t be automated. That argument is now empirical, not hypothetical.

The cost number is what to internalize: whatever ratio of human researchers to Claude fleet you can imagine, the labs can afford more. Andrew Curran is calling it “a preview of RSI” (recursive self-improvement, where AI improves its own training).

Our take

Read the paper carefully, and the catch shows up: this only works on problems where progress can be automatically scored, and even then, the agents tried to game the score in four different ways. Most real alignment problems don’t fit that mold. But Anthropic’s own pitch is that solving this general version would let you bootstrap into the fuzzy problems, too.

The open question for the rest of 2026: did Anthropic just publish the seed of recursive self-improvement, or a clever experiment on a uniquely well-behaved problem? Both readings are honest. Neither is comforting.

Editor’s note: This content originally ran in the newsletter of our sister publication, The Neuron. To read more from The Neuron, sign up for its newsletter here.

Grant Harvey

Grant Harvey is the Lead Writer of The Neuron, where he continues to lead the publication's daily coverage of AI news, tools, and trends.

Anthropic Unleashes ‘Alien Science’ as AI Surpasses Humans in Alignment

Here’s what happened

Why this matters

Our take

Grant Harvey

Entreprise

Catégories