OpenAI SWE-Lancer Research: “Frontier Models are Still Unable to Solve the Majority of Tasks”

Published February 19, 2025

Large language models (LLMs) are better at fixing bugs than at understanding the root problems that cause them, according to an OpenAI study released on Feb. 18 titled “SWE-Lancer: Can Frontier LLMs Earn $1 Million from Real-World Freelance Software Engineering?”

Putting LLMs to work on bug fixes and other software engineering jobs

To evaluate how well LLMs handle real-world software engineering tasks, OpenAI developed a benchmark called SWE-Lancer, built from jobs posted on Upwork, a popular gig work platform.

SWE-Lancer assessed how much money three LLMs (Claude 3.5 Sonnet, OpenAI’s GPT-4o, and OpenAI o1) could earn by completing software engineering jobs offered on Upwork. OpenAI researchers found that “…frontier models are still unable to solve the majority of tasks.”

Testing AI-generated solutions in real-world conditions

Unlike traditional AI assessments that compare models to human cognitive abilities, SWE-Lancer focused on measuring economic impact by simulating real-world gig work.

To build the benchmark, OpenAI researchers curated 764 tasks from Upwork and converted them into a structured dataset using Docker containers. The tasks varied in complexity, ranging from $50 bug fixes to $32,000 feature implementations. Because the LLMs could not access the original Upwork posts directly, the researchers wrote prompts based on each post and paired them with a sample of the project’s code base.

The models’ responses were evaluated using Playwright, an open-source browser testing library, to determine whether the solution really worked.
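To make the dollar-based scoring concrete, here is a minimal sketch of how earnings and pass rate could be tallied once each task's tests have run. It is an illustration, not OpenAI's evaluation code: the `TaskResult` class, the `score` function, the task IDs, and the payout figures are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class TaskResult:
    """One benchmark task (hypothetical structure, not from the paper)."""
    task_id: str
    payout_usd: float  # price attached to the freelance job
    passed: bool       # True if the model's patch passed the end-to-end tests


def score(results: list[TaskResult]) -> tuple[float, float, float]:
    """Return (dollars earned, dollars available, fraction of tasks passed)."""
    available = sum(r.payout_usd for r in results)
    # A task pays out in full only if its tests pass; no partial credit here.
    earned = sum(r.payout_usd for r in results if r.passed)
    pass_rate = sum(r.passed for r in results) / len(results)
    return earned, available, pass_rate


# Made-up results for illustration only.
results = [
    TaskResult("bug-fix-a", 50.0, True),
    TaskResult("bug-fix-b", 250.0, False),
    TaskResult("feature-c", 1000.0, False),
    TaskResult("feature-d", 500.0, True),
]

earned, available, pass_rate = score(results)
print(f"earned ${earned:,.0f} of ${available:,.0f}; pass rate {pass_rate:.0%}")
```

In this made-up run the model passes half of the tasks but collects less than a third of the available money, which shows why a payout-weighted score can diverge from a plain pass rate.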

Even Claude 3.5 Sonnet, the best-performing model, resolved only 26.2% of tasks correctly, according to the experienced human software engineers overseeing the experiment. While LLMs can analyze data rapidly, they often fail to identify the underlying cause of an issue, which leads to incorrect solutions.

Graph: Claude 3.5 Sonnet performed best in OpenAI’s SWE-Lancer research. Image: OpenAI

‘Real-world freelance work … remains challenging for AI’

“Results indicate that the real-world freelance work in our benchmark remains challenging for frontier language models,” wrote OpenAI researchers Samuel Miserendino, Michele Wang, Tejal Patwardhan, and Johannes Heidecke.

Previous studies of how generative AI could be applied to software engineering focused on specific tasks in self-contained environments, the researchers wrote.

“In the real world, however, software engineers operate across the full technology stack and must reason about complex inter-codebase interactions and tradeoffs,” they said.

The dataset for SWE-Lancer is available on GitHub. It provides researchers with insights into AI’s evolving role in software development.
