Every AI Model Fails This New Intelligence Test

Infographic titled 'Every AI Model Fails a New Intelligence Test' by The Neuron. Shows ARC-AGI-3 results where Gemini 3.1 Pro, GPT-5.4, and Claude 4.6 failed with scores under 1%, while humans passed at 100%.

Image: The Neuron

Écrit par

Mar 26, 2026

3 minute read

eWeek Le contenu et les recommandations de produits sont indépendants de la rédaction. Nous pouvons gagner de l'argent lorsque vous cliquez sur des liens vers nos partenaires. En savoir plus

You know those moments where you download a new app, poke at it for two minutes, and just figure it out? No tutorial, no instructions?

Turns out AI can’t do that. At all.

The ARC Prize Foundation just launched ARC-AGI-3, a new benchmark (standardized test for AI) designed to measure whether AI can learn new things on the fly, the way humans do. Instead of answering trivia or writing code, AI agents explore unfamiliar interactive environments and solve puzzles they’ve never seen before.

“AGI” (which = artificial general intelligence) is the finish line every major AI company is racing toward: AI that handles any new task without special training.

AGI is a major goal. OpenAI just renamed its product division “AGI Deployment.” Jensen Huang said AGI is “already in the room.” OpenAI’s next model, codenamed “Spud,” might be the first they claim as AGI. ARC-AGI-3 is meant to actually test these claims.

Here’s what happened after it launched

Every frontier model scored under 1%. Gemini 3.1 Pro: 0.37%. GPT-5.4: 0.26%. Claude Opus 4.6: 0.25%. Grok 4.2: 0%.
100% of human testers solved every environment on their first try. No instructions, no training.
The benchmark tests 135 novel environments (~1,000 levels), measuring how efficiently AI solves them compared to humans.
A $2 million prize competition is live on Kaggle, and you can play the public games yourself.

Why this matters: Not everyone agrees that the test is fair.

The scoring uses a squared-efficiency penalty (if a human takes 10 steps and the AI takes 100, the AI scores 1%). AI can’t score higher than humans even when more efficient, but humans do get bonus points when they’re more efficient than AI. Extended-thinking models were excluded. Critic @scaling01 argued the methodology was designed to produce low scores.

ARC founder François Chollet fired back with a deeper point: today’s models only perform well when humans build elaborate scaffolding around them (specific prompts, custom harnesses, thinking tricks). The scaffolding is the human intelligence; the model is just executing it. If it’s truly AGI, there should be no human in the loop.

Notice something interesting? We’re no longer arguing about whether AI is smart. We’re arguing about how to measure smart.

Our take

The real question isn’t whether today’s models pass (eventually they will).

It’s whether training on massive datasets w/ the current architecture can ever produce genuine adaptability, or whether something fundamentally different is needed. NYU professor Saining Xie made a provocative case that LLMs are “anti-Bitter Lesson” because they’re built entirely on human-generated knowledge rather than learning from raw experience (more on this in our new podcast ep!).

The models that eventually crack ARC-AGI-3 won’t just be smarter; they’ll be a different kind of smart. The kind that matters if you actually care about AGI. If you don’t, today’s models are perfectly fine tools… they’ll just always require potentially billions in human hand-holding via training. Perhaps that’s for the best.

Editor’s note: This content originally ran in the newsletter of our sister publication, The Neuron. To read more from The Neuron, sign up for its newsletter here.

Grant Harvey

Grant Harvey is the Lead Writer of The Neuron, where he continues to lead the publication's daily coverage of AI news, tools, and trends.

Every AI Model Fails This New Intelligence Test

Here’s what happened after it launched

Our take

Grant Harvey

Entreprise

Catégories