Meta’s New AI Research Chief Says AI Agents Must Prove Real Value | eWeek

Image: Laptop displaying an AI project management dashboard with data analytics and process automation features. Generated via Google’s Nano Banana.

Written by Kezia Jungco
Jun 29, 2026
3 minute read
eWeek content and product recommendations are editorially independent. We may make money when you click on links to our partners.

Meta’s new AI research chief says the next big AI milestone will be agents that can do useful work in the real world.

Dawn Song, Meta Platforms’ new vice president of AI research, told the South China Morning Post that AI agents should help humans complete “economically valuable” work across important real-world domains. Her comments come as researchers test whether agents can reliably complete complex tasks, such as editing videos, reviewing medical scans, or handling cybersecurity work, rather than only answering prompts.

For companies watching the agent race, the question is becoming more practical: can these systems handle work with real business value, or are they still too unreliable for high-stakes use?

Dawn Song joins Meta’s AI research team

Song made the comments on the sidelines of the World Economic Forum in Dalian, also known as Summer Davos, days before joining Meta. She is a computer science professor at the University of California, Berkeley, a co-director of the university’s Center for Responsible, Decentralized Intelligence (RDI), and a co-founder of enterprise AI safety startup Virtue AI.

“The goal is not to replace humans. But we want these AI agents to be more effective in these important real-world domains and help humans do this work better and provide more economic value,” Song told the South China Morning Post (SCMP).

Song said on X that she and “many members of the Virtue AI team” were joining Meta’s Superintelligence Labs, where she will help shape the company’s AI safety and security efforts.

Her move comes as large AI companies put more attention on agents: systems that can plan tasks, use tools, operate software, and complete multi-step work with less direct human input than a chatbot needs.

Agents face a harder benchmark

Song’s comments also point to a larger debate around how AI agents should be measured.

Earlier this month, Berkeley RDI introduced Agents’ Last Exam (ALE), a benchmark designed to test agents on long, economically useful tasks instead of abstract proxy problems.

According to Lets Data Science, Agents’ Last Exam covers 55 sub-industries and includes more than 1,500 tasks, with a longer-term goal of 5,000 tasks. The benchmark runs agents in real operating-system sandboxes and grades their final work products using deterministic evaluators.
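To make "deterministic evaluators" and "pass rate" concrete, here is a minimal Python sketch. It is not the benchmark's actual code: the `Task` class, the toy checks, and the sample artifacts are all hypothetical and invented for illustration. The only idea carried over from the reporting is that each task's final work product is graded by a check that always returns the same verdict for the same output, and that the pass rate is the share of tasks passed.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class Task:
    """A hypothetical benchmark task (illustrative only)."""
    task_id: str
    # Deterministic check: the same artifact always yields the same verdict.
    check: Callable[[str], bool]


def pass_rate(tasks: List[Task], artifacts: Dict[str, str]) -> float:
    """Return the fraction of tasks whose final artifact passes its check.

    A task with no submitted artifact is graded on an empty string,
    so it counts as a failure unless its check accepts empty output.
    """
    if not tasks:
        return 0.0
    passed = sum(1 for t in tasks if t.check(artifacts.get(t.task_id, "")))
    return passed / len(tasks)


# Toy tasks standing in for real work products (assumed, not from the benchmark).
tasks = [
    Task("csv-header", lambda out: out.splitlines()[:1] == ["name,score"]),
    Task("word-count", lambda out: out.strip() == "3"),
    Task("sorted-list", lambda out: out.split() == ["a", "b", "c", "d"]),
    Task("summary", lambda out: out.strip() != ""),
]

# Final outputs a hypothetical agent left behind; "summary" was never produced.
artifacts = {
    "csv-header": "name,score\nada,9\n",
    "word-count": "4",
    "sorted-list": "a b c d",
}

rate = pass_rate(tasks, artifacts)
print(f"pass rate: {rate:.1%}")
```

Because the checks inspect only the finished artifact, the grade does not depend on how the agent got there, and re-running the evaluation on the same outputs gives the same score.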

The benchmark includes tasks such as using video editing software to build a video from raw clips or reviewing the quality of a brain MRI scan. Song told SCMP that the tests were designed to be “really challenging,” which helps explain why even leading systems still struggle.

SCMP reported that OpenAI’s GPT-5.5, paired with the Codex harness, topped the benchmark with a 24.3% pass rate. Anthropic’s Claude Fable 5, using a Claude Code harness, ranked third with a 22% pass rate. Among Chinese models tested, ByteDance’s Seed2.1 Pro had the highest pass rate at 19.5%.

Useful agents are still a work in progress

The hardest parts of the benchmark show the gap between agent hype and dependable performance. 

Song said that almost every frontier agent tested scored 0% on ALE’s hardest tier, while GPT-5.5 scored 2.6%.

Agents are often described as the next major step for AI products because they are expected to do more than answer questions. They are designed to use tools, navigate software environments, retain context, and complete work with less direct human guidance.

Low pass rates suggest that today’s agents are improving but remain far from dependable across many professional tasks. For companies exploring agents in cybersecurity, software development, media production, healthcare analysis, or other high-stakes work, mistakes can still be costly.

Song also warned that AI is a dual-use technology, especially in coding and cybersecurity. She said that business leaders and policymakers should improve cyber defenses as more powerful open-weight models become available, according to SCMP. 

AI agents are not ready to run large parts of the workplace on their own. 

For now, the next phase of the race will be judged by something more practical than fluent answers: whether agents can safely and reliably finish useful work.

Kezia Jungco

Kezia Jungco is a staff writer with five years of hands-on experience testing and analyzing generative AI platforms, chatbots, and NLP tools. She writes in-depth coverage for both enterprise and consumer audiences, focusing on artificial intelligence, data analytics, CRM solutions, cloud infrastructure, cybersecurity, and emerging tech trends. Her work appears in TechRepublic, eWEEK, Datamation, TechnologyAdvice, and Selling Signals.

Property of TechnologyAdvice. © 2026 TechnologyAdvice. All rights reserved.