OpenAI Unleashes FrontierScience for AI-Fueled Scientific Reasoning

Written by eWEEK Staff

Dec 17, 2025

3 minute read

Einstein a go-go. OpenAI is in the news again with something to help the scientific world.

The perennially busy firm has unveiled FrontierScience—an evaluation system that pushes artificial intelligence into uncharted territory by tackling the same complex scientific problems that typically challenge PhD researchers for weeks.

While previous benchmarks focused on recall, FrontierScience is billed as a test of genuine scientific reasoning across physics, chemistry, and biology. The system launched on Dec. 16 with over 700 carefully crafted questions designed by some of the world’s brightest scientific minds.

OpenAI’s GPT-5.2 model achieved a 77% success rate on olympiad-level problems—tasks that challenge the most gifted young scientists globally.

The performance gap

GPT-5.2 dominated structured olympiad-style questions with its 77% score, but then crashed dramatically when faced with open-ended research tasks, managing only 25% success.

This 52-point performance gap reveals what scientists are calling the “ambiguity barrier.” The benchmark’s creators assembled an unprecedented team—42 international olympiad medalists representing 109 medals, alongside 45 PhD scientists—to craft questions that would truly test AI reasoning. Some research questions are so complex that human experts estimate they would require several days of computer simulations or weeks of mathematical work to solve properly.

Consider this: for a question about “meso-nitrogen atoms in nickel(II) phthalocyanine,” researchers noted that running the computer simulations alone “could take several days.” Another question, requesting a derivation of “electrostatic wave modes” in plasma, prompted one expert to admit: “I did a similar analysis earlier this year for a different kind of wave… I think it took about three weeks to do the maths correctly.”

New age of scientific discovery

The implications stretch far beyond impressive test scores: this benchmark signals we’re approaching a critical tipping point where AI transforms from a sophisticated search engine into a genuine research collaborator. When models eventually reach near-perfect scores on the research track, they’ll function as “very good collaborators” that can multiply the progress that PhD students or scientists can make.

The benchmark’s evaluation system represents a fundamental shift in AI assessment. By using 10-point rubrics, graded by GPT-5, to assess reasoning quality and not just final answers, it moves beyond a traditional “pass the test” mentality to a “can it do the job” evaluation. The progress has been rapid: when a similar PhD-level science benchmark launched in November 2023, GPT-4 scored just 39%, but GPT-5.2 now hits 92% on those same questions.
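
To make the rubric idea concrete, here is a minimal Python sketch of how a 10-point rubric could reward partial reasoning instead of only a final answer. The rubric items, their point values, and the `score_response` helper are hypothetical illustrations, not OpenAI’s actual grading code, and the true/false judgments stand in for decisions a grader model would make.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RubricItem:
    description: str  # what the grader looks for in the response
    points: int       # points awarded if the criterion is met


# Hypothetical 10-point rubric; the real FrontierScience rubrics are not reproduced here.
RUBRIC = [
    RubricItem("States the governing equations", 3),
    RubricItem("Justifies the key approximation", 3),
    RubricItem("Carries the derivation through correctly", 2),
    RubricItem("Reports the correct final result", 2),
]


def score_response(judgments: list[bool], rubric: list[RubricItem] = RUBRIC) -> int:
    """Sum the points for every rubric item the grader marked as satisfied."""
    if len(judgments) != len(rubric):
        raise ValueError("need exactly one judgment per rubric item")
    return sum(item.points for item, met in zip(rubric, judgments) if met)


# Sound reasoning with a wrong final answer still earns most of the credit.
partial = score_response([True, True, True, False])
full = score_response([True, True, True, True])
print(partial, full)
```

Under a plain final-answer check, the first response would simply count as wrong; under this sketch’s rubric, it keeps the points for the reasoning steps it got right.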

This rapid advancement suggests we’re witnessing the emergence of AI systems that can genuinely contribute to scientific breakthroughs.

The race to scientific AI supremacy

This benchmark launch coincides with an unprecedented surge in AI research investment that’s reshaping the entire scientific landscape. The competition extends far beyond OpenAI—Google DeepMind’s AlphaFold has already predicted over 200 million protein structures, work that would have taken hundreds of millions of years to complete experimentally.

When AI models eventually close the 52-point gap between Olympiad and Research scores, they’ll handle ambiguous problems as easily as constrained ones. The speed of improvement suggests this limitation won’t persist long.

eWEEK Staff