AI Controversy: O’Reilly Accuses OpenAI of Training AI on Paywalled Content

Published April 7, 2025

OpenAI CEO Sam Altman. — Image: James Tamim/Creative Commons

Meta recently faced accusations of training its AI models on pirated content; now, OpenAI finds itself entangled in a similar controversy. A new study claims that one of OpenAI’s latest large language models (LLMs) was trained on non-public, copyright-protected material from O’Reilly Media. Specifically, the study’s authors suggest that OpenAI may have trained one of its most advanced models on restricted content without authorization.

The study’s authors wrote, in part: “Although the evidence present here on model access violations is specific to OpenAI and O’Reilly Media books, this is likely a systematic issue.”

Examining the accusations

The study was written by a team with O’Reilly Media, including CEO Tim O’Reilly. It claims that OpenAI, one of today’s top AI companies, trained one of its most recent AI models on content that O’Reilly Media keeps behind a paywall on its official channels.

The authors of the study, titled “Beyond Public Access in LLM Pre-Training Data,” started with 34 copyrighted books from O’Reilly Media, including both publicly available and paywalled content. Next, they applied the DE-COP membership inference attack, a method for estimating whether an AI model has memorized a specific text, to several AI models from OpenAI.
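To give a rough sense of how a quiz-style membership inference test such as DE-COP can work, the sketch below shows a model one verbatim passage alongside paraphrases and records how often it picks the original. This is a simplified illustration, not the study’s code: the function names `build_quiz` and `guess_rate`, the prompt wording, and the `ask_model` callable are all hypothetical stand-ins.

```python
import random
import string


def build_quiz(verbatim, paraphrases, rng):
    """Return (prompt, correct_letter) for one multiple-choice item.

    The verbatim passage is mixed with paraphrases and the order is
    shuffled so the correct answer does not sit in a fixed position.
    """
    options = [verbatim] + list(paraphrases)
    rng.shuffle(options)
    letters = string.ascii_uppercase[:len(options)]
    lines = ["Which of these passages is the exact original text?"]
    lines += [f"{letter}. {text}" for letter, text in zip(letters, options)]
    correct = letters[options.index(verbatim)]
    return "\n".join(lines), correct


def guess_rate(items, ask_model, rng):
    """Fraction of quiz items where the model picks the verbatim passage.

    `ask_model` is a hypothetical callable: it takes the prompt string
    and returns a single answer letter such as "B".
    """
    correct = 0
    for verbatim, paraphrases in items:
        prompt, answer = build_quiz(verbatim, paraphrases, rng)
        if ask_model(prompt) == answer:
            correct += 1
    return correct / len(items)
```

With four options per question, a model that has never seen the text would be expected to land near 25%. A rate well above that on a particular book is the kind of signal such a test looks for.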

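The team also computed an Area Under the Receiver Operating Characteristic (AUROC) score for each LLM. The score indicates how reliably the test separates text a model has likely seen during training from text it has not. A score near 50% is no better than chance, while higher scores suggest stronger evidence that the model was trained on one or more of the 34 copyrighted books from O’Reilly Media.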

GPT-4o: Demonstrates stronger recognition of non-public content from O’Reilly Media (AUROC score: 82%) than public content (AUROC score: 64%).
GPT-3.5 Turbo: Demonstrates slightly stronger recognition of public content from O’Reilly Media (AUROC score: 64%) than non-public (AUROC score: 54%).
GPT-4o Mini: No indication the model was trained on public or non-public content from O’Reilly Media.
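For readers unfamiliar with AUROC, the short sketch below shows one standard way to compute it: as the probability that a randomly chosen “suspected member” text scores higher than a randomly chosen “known non-member” text, with ties counted as half. The function name `auroc` and the sample scores are made up for illustration and are not taken from the study.

```python
def auroc(member_scores, non_member_scores):
    """AUROC via pairwise comparison (the Mann-Whitney formulation).

    Returns the fraction of (member, non-member) pairs in which the
    member score is higher; tied pairs count as half a win.
    """
    wins = 0.0
    for m in member_scores:
        for n in non_member_scores:
            if m > n:
                wins += 1.0
            elif m == n:
                wins += 0.5
    return wins / (len(member_scores) * len(non_member_scores))


# Hypothetical per-book scores, e.g. how often a model picked the verbatim text.
suspected = [0.9, 0.8, 0.4]
control = [0.5, 0.3, 0.2]
print(round(auroc(suspected, control), 3))  # 0.889
```

Read this way, GPT-4o’s 82% on non-public content means the test ranked a suspected book above a control book in roughly four out of five comparisons, while a score in the mid-50s is close to a coin flip.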

Reading the fine print

While the study initially absolves GPT-4o Mini of any infringement, it notes that this could be a result of the model’s smaller scale and its inability to memorize as much text as GPT-4o and other generative AI tools. The study also expresses some uncertainty about the AUROC scores, noting that they should be taken as estimates.

The study concludes by suggesting that current AI training methods may soon lead to an “extractive dead end.” By failing to compensate copyright owners and content creators, the authors argue, AI developers will ultimately see diminished content quality, accuracy, and diversity.
