      OpenAI SWE-Lancer Research: “Frontier Models are Still Unable to Solve the Majority of Tasks”

      Written by Megan Crouse
      Published February 19, 2025
        Multi ethnic web developers writing script while talking.
        Image: DC_Studio/Envato Elements

        Large language models (LLMs) are better at fixing bugs than at understanding the root problem that causes them, according to an OpenAI study released on Feb. 18 titled “SWE-Lancer: Can Frontier LLMs Earn $1 Million from Real-World Freelance Software Engineering?”

        Putting LLMs to work on bug fixes and other software engineering jobs

        To evaluate how well LLMs handle real-world software engineering tasks, OpenAI developed a benchmark called SWE-Lancer, built from jobs posted on Upwork, a popular gig work platform.

        SWE-Lancer assessed how much money three LLMs — Claude 3.5 Sonnet, OpenAI’s GPT-4o, and OpenAI o1 — could generate by completing software engineering jobs offered on Upwork. However, OpenAI researchers found that “…frontier models are still unable to solve the majority of tasks.”

        Testing AI-generated solutions in real-world conditions

        Unlike traditional AI assessments that compare models to human cognitive abilities, SWE-Lancer focused on measuring economic impact by simulating real-world gig work.

        To build the benchmark, OpenAI researchers curated 764 tasks from Upwork and packaged them as a structured dataset that runs in Docker containers. The tasks varied in complexity and price, ranging from $50 bug fixes to $32,000 feature implementations. Because the LLMs could not access the original Upwork posts directly, the researchers wrote prompts based on those posts and supplied a sample of each project’s code base.

        The models’ responses were evaluated using Playwright, an open-source browser testing library, to determine whether the solution really worked.
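To make the money-based scoring concrete, here is a minimal sketch of how pass/fail results on priced tasks could be turned into an earnings figure. It assumes that a model collects a task's full payout only when its solution passes that task's tests. The `Task` class, the `score` function, the task names, and the results are all hypothetical and are not taken from the study.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Task:
    """One freelance job: a hypothetical id, its payout in USD, and a pass/fail result."""
    task_id: str
    payout_usd: float
    passed: bool


def score(tasks: list[Task]) -> tuple[float, float, float]:
    """Return (pass rate, dollars earned, share of available dollars earned)."""
    if not tasks:
        return 0.0, 0.0, 0.0
    total = sum(t.payout_usd for t in tasks)
    # Assumption: payout is all-or-nothing, credited only when the tests pass.
    earned = sum(t.payout_usd for t in tasks if t.passed)
    pass_rate = sum(t.passed for t in tasks) / len(tasks)
    earn_rate = earned / total if total else 0.0
    return pass_rate, earned, earn_rate


# Made-up results for illustration only; these are not figures from the study.
results = [
    Task("bug-fix-a", 50.0, True),
    Task("bug-fix-b", 250.0, True),
    Task("feature-c", 1_000.0, False),
    Task("feature-d", 32_000.0, False),
]

pass_rate, earned, earn_rate = score(results)
print(f"pass rate: {pass_rate:.1%}, earned: ${earned:,.0f}, share of payout: {earn_rate:.2%}")
```

In this made-up case the model passes half of the tasks but collects under 1% of the available money, because the tasks it misses are the expensive ones. Scoring by payout rather than by task count is what lets a benchmark of this kind reflect economic value.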

        Claude 3.5 Sonnet, the best performer of the three models, still resolved only 26.2% of tasks correctly, according to the experienced human software engineers overseeing the experiment. While LLMs can analyze data rapidly, they often fail to identify the underlying cause of an issue, which leads to incorrect solutions.

        Claude 3.5 Sonnet performed best in OpenAI’s SWE-Lancer research. Image: OpenAI

        ‘Real-world freelance work … remains challenging for AI’

        “Results indicate that the real-world freelance work in our benchmark remains challenging for frontier language models,” wrote OpenAI researchers Samuel Miserendino, Michele Wang, Tejal Patwardhan, and Johannes Heidecke. 

        Previous research on applying generative AI to software engineering focused on specific tasks in self-contained environments, the researchers wrote.

        “In the real world, however, software engineers operate across the full technology stack and must reason about complex inter-codebase interactions and tradeoffs,” they said. 

        The dataset for SWE-Lancer is available on GitHub. It provides researchers with insights into AI’s evolving role in software development.

        Megan Crouse has a decade of experience in business-to-business news and feature writing, including as first a writer and then the editor of Manufacturing.net. Her news and feature stories have appeared in Military & Aerospace Electronics, Fierce Wireless, TechRepublic, and eWeek. She copyedited cybersecurity news and features at Security Intelligence. She holds a degree in English Literature and minored in Creative Writing at Fairleigh Dickinson University.
