AI Controversy: OpenAI Accused By O’Reilly of Training AI on Its Paywalled Content | eWeek

AI Controversy: O’Reilly Accuses OpenAI of Training AI on Paywalled Content

OpenAI CEO Sam Altman.

Image: James Tamim/Creative Commons

Written By
J.R. Johnivan
Apr 7, 2025

Meta recently faced accusations of training its AI models on pirated content; now, OpenAI finds itself entangled in a similar controversy. A new study claims that one of OpenAI’s latest large language models (LLMs) was trained on non-public, copyright-protected material from O’Reilly Media. Specifically, the study’s authors suggest that OpenAI’s development teams may have trained one of their most advanced models on restricted content without authorization.

The study’s authors wrote, in part: “Although the evidence present here on model access violations is specific to OpenAI and O’Reilly Media books, this is likely a systematic issue.”

Examining the accusations

The study was written by a team with O’Reilly Media, including CEO Tim O’Reilly. It explicitly claims that OpenAI, one of today’s top AI companies, trained one of its most recent AI models on content that O’Reilly Media keeps behind a paywall on its official channels.

The authors of the study, titled “Beyond Public Access in LLM Pre-Training Data,” started with 34 copyrighted books from O’Reilly Media, a mix of publicly available and paywalled content. Next, they applied the DE-COP membership inference attack, a method for determining whether an AI model has memorized a specific text, to several AI models from OpenAI.
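To make the idea concrete: DE-COP, as described in its original paper, quizzes a model with multiple-choice questions in which one option is a verbatim passage and the others are paraphrases, then checks how often the model picks the verbatim one. The Python sketch below illustrates only that scoring step. The function names, the `pick_option` callable standing in for a model, and the sample passages are all hypothetical and are not taken from the study.

```python
import random


def build_question(verbatim, paraphrases, rng):
    """Shuffle the verbatim passage in among its paraphrases."""
    options = [verbatim] + list(paraphrases)
    rng.shuffle(options)
    return options, options.index(verbatim)


def guess_rate(examples, pick_option, seed=0):
    """Fraction of questions where the picker selects the verbatim passage.

    `pick_option` is a hypothetical stand-in for a model call: it takes
    the list of options and returns the index it believes is verbatim.
    """
    rng = random.Random(seed)
    correct = 0
    for verbatim, paraphrases in examples:
        options, answer = build_question(verbatim, paraphrases, rng)
        if pick_option(options) == answer:
            correct += 1
    return correct / len(examples)


# Made-up passages, for illustration only.
examples = [
    ("The quick brown fox jumps.", ["A fast brown fox leaps.", "A quick fox hops."]),
    ("Data beats intuition.", ["Numbers win over gut feel.", "Evidence tops instinct."]),
]

# A toy "model" that has memorized every verbatim passage.
memorized = {verbatim for verbatim, _ in examples}


def memorizing_picker(options):
    for i, option in enumerate(options):
        if option in memorized:
            return i
    return 0


print(guess_rate(examples, memorizing_picker))
```

A guess rate well above what random picking would give (one in three here) hints that the model has seen the passage before, which is the signal a membership inference test looks for.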

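The team also reported an Area Under the Receiver Operating Characteristic (AUROC) score for each LLM. This score indicates how reliably the test separates text a model has likely seen during training from text it has not, with 50% corresponding to random guessing. The authors use it to gauge the likelihood that these AI models were trained using one or more of the 34 copyrighted books from O’Reilly Media.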

  • GPT-4o: Demonstrates stronger recognition of non-public content from O’Reilly Media (AUROC score: 82%) than public content (AUROC score: 64%).
  • GPT-3.5 Turbo: Demonstrates slightly stronger recognition of public content from O’Reilly Media (AUROC score: 64%) than non-public (AUROC score: 54%).
  • GPT-4o Mini: No indication the model was trained on public or non-public content from O’Reilly Media.
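For readers unfamiliar with AUROC, the figure can be read as the probability that a randomly chosen text from the "likely seen" group receives a higher score than a randomly chosen text from the "not seen" group. The Python sketch below computes it by direct pairwise comparison. The scores are invented for illustration and are not data from the study, and the function name is an assumption.

```python
def auroc(member_scores, non_member_scores):
    """Pairwise AUROC: the share of (member, non-member) pairs in which
    the member scores higher, with ties counted as half a win."""
    wins = 0.0
    for m in member_scores:
        for n in non_member_scores:
            if m > n:
                wins += 1.0
            elif m == n:
                wins += 0.5
    return wins / (len(member_scores) * len(non_member_scores))


# Hypothetical per-text scores, not taken from the study.
likely_seen = [0.9, 0.8, 0.6, 0.4]
not_seen = [0.7, 0.5, 0.3, 0.2]

print(auroc(likely_seen, not_seen))  # 13 of 16 pairs favor "likely seen": 0.8125
```

A result near 0.5 means the two groups are indistinguishable, while values closer to 1.0 mean the scores separate them well. That is why the gap between 82% and 64% for GPT-4o draws attention.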

Reading the fine print

While the study initially absolves GPT-4o Mini of any infringement, it notes that this could be a result of the AI model’s smaller scale, which limits how much text it can remember compared with GPT-4o and other generative AI tools. The study also expresses some uncertainty surrounding the AUROC scores, noting that these are meant to be taken as estimates.

The study concludes by suggesting that current AI training methods may soon lead to an “extractive dead end.” According to the authors, by failing to compensate copyright owners and content creators, AI developers will ultimately see diminished content quality, accuracy, and diversity.

J.R. Johnivan is a 17-year veteran whose writing is focused on innovation and technology, including IT, computer networking, security, cloud computing, staffing, human resources, real estate, sports, entertainment, and more.
