Social media platform Reddit filed a copyright lawsuit against artificial intelligence startup Perplexity, accusing it of illegally scraping posts and comments from millions of Reddit users.
In the lawsuit, Reddit claims that Perplexity relies on content from its forums to power its generative AI model. Most major chatbots have scraped Reddit in some form, given the platform’s vast library of niche discussions and conversational data across thousands of communities.
It is not the first lawsuit Reddit has filed against an AI company. In June, it launched a similar case against Anthropic, which is still ongoing.
“AI companies are locked in an arms race for quality human content, and that pressure has fuelled an industrial-scale data laundering economy,” said Ben Lee, Reddit’s chief legal officer. “Reddit is a prime target because it’s one of the largest and most dynamic collections of human conversation ever created.”
Reddit also sued three companies that provide scraping services to clients, including AI developers. Those named are Lithuania-based Oxylabs, Russia-based AWMProxy, and Texas-based SerpApi, which lists Perplexity as a customer on its website.
Copyright agreements seen as source of income
Reddit’s stance appears justified, as several AI firms have already signed copyright licensing agreements with the platform. These include Google and OpenAI, which reached deals in February and May 2024, respectively, to access Reddit data for their AI models.
However, as Perplexity noted in a statement posted on Reddit, it is unclear whether those companies will continue paying for data licenses now that Reddit content from 2005 to 2024 has already been used to train their models.
“Why sue Perplexity? Our guess: it’s about a show of force in Reddit’s training data negotiations with Google and OpenAI,” the company said. “Here’s where we push back. Reddit told the press we ignored them when they asked about licensing. Untrue. Whenever anyone asks us about content licensing, we explain that Perplexity, as an application-layer company, does not train AI models on content. A year ago, after explaining this, Reddit insisted we pay anyway, despite lawfully accessing Reddit data. Bowing to strong-arm tactics just isn’t how we do business.”
It wouldn’t be the first time Perplexity has been accused of underhanded scraping tactics. In August, web security firm Cloudflare accused the startup of using stealthy, undeclared scraping tools to evade websites’ no-crawl policies.
With the lawsuits against Perplexity and Anthropic, Reddit could help set a precedent for how AI companies source data from the web to train their models. In the case of Perplexity, which operates more as a search engine than as a chatbot like ChatGPT or Gemini, the question may be whether such services can provide links, snippets, and content overviews from sites like Reddit without the content owner’s prior consent.
Until the rise of generative AI, Reddit didn’t appear to fully recognize the value of its user-generated content. In recent years, however, it has worked to make data licensing a key part of its business alongside advertising. To this end, in August it blocked the Internet Archive, creator of the Wayback Machine, from archiving its content.
Reddit remains an interesting test case for content ownership and distribution, particularly because, unlike Hollywood, the RIAA, or news publishers that have sued AI companies, it gets almost all of its content from its users.
Research on AI and content shows that as of November 2024, 50.3% of new web articles were generated primarily by AI. Before the launch of ChatGPT, by contrast, that figure was just 5%.


