Anthropic Unleashes ‘Alien Science’ as AI Surpasses Humans in Alignment | eWeek

Anthropic Unleashes ‘Alien Science’ as AI Surpasses Humans in Alignment

The Neuron featured image about Anthropic using Claude to beat its own human alignment researchers.

Image: The Neuron

Écrit par
Grant Harvey
Grant Harvey
Apr 15, 2026
2 minute read
eWeek Le contenu et les recommandations de produits sont indépendants de la rédaction. Nous pouvons gagner de l'argent lorsque vous cliquez sur des liens vers nos partenaires. En savoir plus

Anthropic just released a paper (full Alignment Science blog) showing that nine parallel Claude Opus 4.6 agents outperformed Anthropic’s own human researchers on a real alignment problem. The setup: weak-to-strong supervision (using a weaker AI to train a stronger one, mirroring how humans will someday supervise AI smarter than us).

Here’s what happened

  • Two human Anthropic researchers spent seven days evaluating the four best methods from prior research and recovered 23% of the maximum performance gap.
  • Nine Claude Opus 4.6 agents in parallel sandboxes spent five more days on the same problem, sharing findings as they went.
  • The Claude agents recovered 97% of the gap, roughly what you’d get training the model on perfect ground-truth data.
  • Total cost: $18,000, or about $22 per Claude-research-hour.
  • The agents also invented four kinds of “reward hacking” (gaming the test) that none of the authors predicted, including one that exfiltrated test labels by flipping single answers and watching the score change.
  • Some Claude-discovered methods are so unfamiliar that the authors call them “alien science.”

Why this matters

Alignment research (making sure AI behaves the way humans want) was the one field everyone agreed couldn’t be automated. That argument is now empirical, not hypothetical.

The cost number is what to internalize: whatever ratio of human researchers to Claude fleet you can imagine, the labs can afford more. Andrew Curran is calling it “a preview of RSI” (recursive self-improvement, where AI improves its own training).

Our take

Read the paper carefully, and the catch shows up: this only works on problems where progress can be automatically scored, and even then, the agents tried to game the score in four different ways. Most real alignment problems don’t fit that mold. But Anthropic’s own pitch is that solving this general version would let you bootstrap into the fuzzy problems, too.

The open question for the rest of 2026: did Anthropic just publish the seed of recursive self-improvement, or a clever experiment on a uniquely well-behaved problem? Both readings are honest. Neither is comforting.

Editor’s note: This content originally ran in the newsletter of our sister publication, The Neuron. To read more from The Neuron, sign up for its newsletter here.

Grant Harvey

Grant Harvey is the Lead Writer of The Neuron, where he continues to lead the publication's daily coverage of AI news, tools, and trends.

eWeek Logo

eWeek has the latest technology news and analysis, buying guides, and product reviews for IT professionals and technology buyers. The site's focus is on innovative solutions and covering in-depth technical content. eWeek stays on the cutting edge of technology news and IT trends through interviews and expert analysis. Gain insight from top innovators and thought leaders in the fields of IT, business, enterprise software, startups, and more.

Propriété de TechnologyAdvice. © 2026 TechnologyAdvice. Tous droits réservés

Divulgation publicitaire : Certains des produits qui apparaissent sur ce site proviennent d'entreprises dont TechnologyAdvice reçoit une compensation. Cette compensation peut influencer la façon dont les produits apparaissent sur ce site, notamment l'ordre dans lequel ils apparaissent. TechnologyAdvice n'inclut pas toutes les entreprises ou tous les types de produits disponibles sur le marché.