Study reveals OpenAI models can memorise copyrighted content

Recent study provides evidence that OpenAI may have used copyrighted content to train its AI models

By GH Web Desk

April 05, 2025

A recent study has provided evidence that OpenAI might have used copyrighted content to train its artificial intelligence (AI) models.

Co-authored by researchers from the University of Washington, the University of Copenhagen, and Stanford University, the study proposed a new method for identifying training data “memorised” by models that sit behind an API, such as OpenAI's.

The researchers developed a technique to detect when AI models have memorised specific text snippets, including copyrighted material.

The method relies on "high-surprisal" words, meaning words that are statistically unlikely given the surrounding context.

The researchers removed these words from text samples and asked the models to guess the missing words. A correct guess of such an unlikely word suggests the model saw the passage during training. On this test, OpenAI's GPT-4 model showed signs of having memorised portions of popular fiction books and New York Times articles.
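
The masking probe described above can be illustrated with a small sketch. This is not the researchers' code. It assumes a toy word-frequency model in place of the language model a real study would use to estimate surprisal, and `guess_fn` is a hypothetical stand-in for a call to the model under test.

```python
import math
from collections import Counter

def surprisal_scores(tokens, reference_tokens):
    """Surprisal (-log2 p) of each token under an add-one-smoothed
    unigram model of the reference text (a toy stand-in, assumed here)."""
    counts = Counter(reference_tokens)
    vocab = len(counts) + 1          # one extra bucket for unseen words
    total = len(reference_tokens)
    return [-math.log2((counts[t] + 1) / (total + vocab)) for t in tokens]

def mask_highest_surprisal(tokens, reference_tokens, mask="[MASK]"):
    """Replace the single most surprising token with a mask marker."""
    scores = surprisal_scores(tokens, reference_tokens)
    idx = max(range(len(tokens)), key=scores.__getitem__)
    masked = tokens[:idx] + [mask] + tokens[idx + 1:]
    return masked, tokens[idx]

def probe(passages, reference_tokens, guess_fn):
    """Fraction of passages whose masked word guess_fn recovers exactly.
    guess_fn is hypothetical: it takes the masked tokens, returns one word."""
    if not passages:
        return 0.0
    hits = 0
    for tokens in passages:
        masked, answer = mask_highest_surprisal(tokens, reference_tokens)
        if guess_fn(masked) == answer:
            hits += 1
    return hits / len(passages)

# Toy data: "zeppelin" never appears in the reference text, so it is masked.
reference = "the cat sat on the mat the dog sat on the rug".split()
passage = "the cat sat on the zeppelin".split()
masked, answer = mask_highest_surprisal(passage, reference)
```

A high hit rate on passages the model should not be able to complete from context alone would be read as a sign of memorisation. A low rate is inconclusive.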

The study's findings have implications for OpenAI's fair use defense in lawsuits brought by authors, programmers, and other rights-holders.

The AI company has argued that its use of copyrighted material for training purposes is permissible under fair use provisions.

However, the study's co-authors argued that greater data transparency is needed in the AI ecosystem.

"To have large language models that are trustworthy, we need to have models that we can probe and audit and examine scientifically," Abhilasha Ravichander, a doctoral student at the University of Washington and co-author of the study, noted.

OpenAI has advocated for looser restrictions on using copyrighted data for AI training, while also offering opt-out mechanisms for copyright owners.