A group of authors has sued Microsoft for allegedly training an artificial intelligence model using their copyrighted works without permission. The case, filed in New York on June 24, alleges the tech giant unfairly used their books to train Megatron-LM, a transformer-based algorithm that contributed to generative AI development.
Meta Platforms, Anthropic, and OpenAI all face similar lawsuits as courts begin to set precedents on whether AI model training qualifies as transformative use.
Authors seek damages, court order against infringement
According to the authors — Kai Bird, Jia Tolentino, Daniel Okrent, and others — nearly 200,000 pirated books were fed into Megatron. As a result, the model could “generate a wide range of expression that mimics the syntax, voice, and themes of the copyrighted works on which it was trained,” the authors alleged in the suit first reported by Reuters.
The plaintiffs seek up to $150,000 in statutory damages for each book and a court order to prevent Microsoft from engaging in similar conduct in the future.
Generative AI companies argue that it is not possible to build large language models without vast datasets composed of text created by human authors.
Individual court cases help build a new norm in the era of mass AI scraping
Earlier this week, Anthropic was found to have scraped authors’ books in a manner covered under US fair use law, a California federal judge ruled. In that case, U.S. District Judge William Alsup ruled that using books for AI training is a transformative use and that the model could not be proven to have replicated exact copies of the books. Instead, the judge said, AI training should be compared to writers influencing each other rather than directly copying.
Still, creatives have argued for years that generative AI displaces human labor while erasing the very people whose work trained the models.
Even though AI training was found to fall under fair use in the Anthropic ruling, the case left open the possibility of compensation for the writers due to the manner in which Anthropic obtained the data from piracy websites. An upcoming trial will address that aspect.
Similarly, a group of authors sued Meta in March over its use of LibGen, a well-known piracy website. This week, Meta had a courtroom victory in that case.
The New York Times remains locked in an ongoing court battle with OpenAI and its primary backer, Microsoft, over whether the AI companies used the newspaper’s content to train their AI models. OpenAI’s latest move was to criticize the Times for requesting that the company retain users’ ChatGPT conversations as potential evidence.