AI's future could hinge on one thorny legal question

If a media outlet copied a bunch of New York Times stories and posted them on its site, that would probably be seen as a blatant violation of the Times's copyright.

But what about when a tech company copies those same articles, combines them with countless other copied works, and uses them to train an AI chatbot capable of conversing on almost any topic - including the ones it learned about from the Times?

That's the legal question at the heart of a lawsuit the Times filed against OpenAI and Microsoft in federal court last week, alleging that the tech firms illegally used "millions" of copyrighted Times articles to help develop the AI models behind tools such as ChatGPT and Bing. It's the latest, and some believe the strongest, in a bevy of active lawsuits alleging that various tech and artificial intelligence companies have violated the intellectual property of media companies, photography sites, book authors and artists.

Together, the cases have the potential to rattle the foundations of the booming generative AI industry, some legal experts say - but they could also fall flat. That's because the tech firms are likely to lean heavily on a legal concept that has served them well in the past: the doctrine known as "fair use."

Broadly speaking, copyright law distinguishes between ripping off someone else's work verbatim - which is generally illegal - and "remixing" or putting it to a new, creative use. What is confounding about AI systems, said James Grimmelmann, a professor of digital and information law at Cornell University, is that in this case they seem to be doing both.

Generative AI represents "this big technological transformation that can make a remixed version of anything," Grimmelmann said. "The challenge is that these models can also blatantly memorize works they were trained on, and often produce near-exact copies," which, he said, is "traditionally the heart of what copyright law prohibits."

From the first VCRs, which could be used to record TV shows and movies, to Google Books, which digitized millions of books, U.S. companies have convinced courts that their technological tools amounted to fair use of copyrighted works. OpenAI and Microsoft are already mounting a similar defense.

"We believe that the training of AI models qualifies as a fair use, falling squarely in line with established precedents recognizing that the use of copyrighted materials by technology innovators in transformative ways is entirely consistent with copyright law," OpenAI wrote in a filing to the U.S. Copyright Office in November.

AI systems are typically "trained" on gargantuan data sets that include vast amounts of published material, much of it copyrighted. Through this training, they come to recognize patterns in the arrangement of words and pixels, which they can then draw on to assemble plausible prose and images in response to just about any prompt.

Some AI enthusiasts view this process as a form of learning, not unlike an art student devouring books on Monet or a news junkie reading the Times cover-to-cover to develop their own expertise. But plaintiffs see a more quotidian process at work under these models' hoods: It's a form of copying, and unauthorized copying at that.

"It's not learning the facts like a brain would learn facts," said Danielle Coffey, chief executive of the News/Media Alliance, a trade group that represents more than 2,000 media organizations, including the Times and The Washington Post. "It's literally spitting the words back out at you."

There are two main prongs to the New York Times's case against OpenAI and Microsoft. First, like other recent AI copyright lawsuits, the Times argues that its rights were infringed when its articles were "scraped" - automatically collected and copied from the web - for inclusion in the giant data sets that GPT-4 and other AI models were trained on. That's sometimes called the "input" side.

Second, the Times's lawsuit cites examples in which OpenAI's GPT-4 language model - versions of which power both ChatGPT and Bing - appeared to cough up either detailed summaries of paywalled articles, such as the newspaper's Wirecutter product reviews, or entire sections of specific Times articles. In other words, the Times alleges, the tools violated its copyright with their "output," too.

Judges so far have been wary of the argument that training an AI model on copyrighted works - the "input" side - amounts to a violation in itself, said Jason Bloom, a partner at the law firm Haynes and Boone and the chairman of its intellectual property litigation group.

"Technically, doing that can be copyright infringement, but it's more likely to be considered fair use, based on precedent, because you're not publicly displaying the work when you're just ingesting and training" with it, Bloom said. (Bloom is not involved in any of the active AI copyright suits.)

Fair use also can apply when the copying is done for a purpose different from simply reproducing the original work - such as to critique it or to use it for research or educational purposes, like a teacher photocopying a news article to hand out to a journalism class. That's how Google defended Google Books, an ambitious project to scan and digitize millions of copyrighted books from public and academic libraries so that it could make their contents searchable online.

The project sparked a 2005 lawsuit by the Authors Guild, which called it a "brazen violation of copyright law." But Google argued that because it displayed only "snippets" of the books in response to searches, it wasn't undermining the market for books but providing a fundamentally different service. In 2015, a federal appellate court agreed with Google.

That precedent should work in favor of OpenAI, Microsoft and other tech firms, said Eric Goldman, a professor at Santa Clara University School of Law and co-director of its High Tech Law Institute.

"I'm going to take the position, based on precedent, that if the outputs aren't infringing, then anything that took place before isn't infringing as well," Goldman said. "Show me that the output is infringing. If it's not, then copyright case over."

OpenAI and Microsoft are also the subject of other AI copyright lawsuits, as are rival AI firms including Meta, Stability AI and Midjourney, with some targeting text-based chatbots and others targeting image generators. So far, judges have dismissed parts of at least two cases in which the plaintiffs failed to demonstrate that the AI's outputs were substantially similar to their copyrighted works.

In contrast, the Times's suit provides numerous examples in which a version of GPT-4 reproduced large passages of text identical to that in Times articles in response to certain prompts.

That could go a long way with a jury, should the case get that far, said Blake Reid, associate professor at Colorado Law. But if courts find that only those specific outputs are infringing, and not the use of the copyrighted material for training, he added, that could prove much easier for the tech firms to fix.

OpenAI's position is that the examples in the Times's lawsuit are aberrations - a sort of bug in the system that caused it to cough up passages verbatim.

Tom Rubin, OpenAI's chief of intellectual property and content, said the Times appears to have intentionally manipulated its prompts to get the AI system to reproduce portions of its training data. He said via email that the examples in the lawsuit "are not reflective of intended use or normal user behavior and violate our terms of use."

"Many of their examples are not replicable today," Rubin added, "and we continually make our products more resilient to this type of misuse."

The Times isn't the only organization that has found AI systems producing outputs that resemble copyrighted works. A lawsuit filed by Getty Images against Stability AI notes examples of its Stable Diffusion image generator reproducing the Getty watermark. And a recent blog post by AI expert Gary Marcus shows examples in which Microsoft's Image Creator appeared to generate pictures of famous characters from movies and TV shows.

Microsoft did not respond to a request for comment.

The Times did not specify the amount it is seeking, although the company estimates damages to be in the "billions." It is also asking for a permanent ban on the unlicensed use of its work. More dramatically, it asks that any existing AI models trained on Times content be destroyed.

Because the AI cases represent new terrain in copyright law, it is not clear how judges and juries will ultimately rule, several legal experts agreed.

While the Google Books case might work in the tech firms' favor, the fair-use picture was muddied by the Supreme Court's recent decision in a case involving artist Andy Warhol's use of a photograph of the rock star Prince, said Daniel Gervais, a professor at Vanderbilt Law and director of its intellectual property program. The court found that if the copying is done to compete with the original work, "that weighs against fair use" as a defense. So the Times's case may hinge in part on its ability to show that products like ChatGPT and Bing compete with and harm its business.

"Anyone who's predicting the outcome is taking a big risk here," Gervais said. He said for business plaintiffs like the New York Times, one likely outcome might be a settlement that grants the tech firms a license to the content in exchange for payment. The Times spent months in talks with OpenAI and Microsoft, which holds a major stake in OpenAI, before the newspaper sued, the Times disclosed in its lawsuit.

Some media companies have already struck arrangements over the use of their content. Last month, OpenAI agreed to pay German media conglomerate Axel Springer, which publishes Business Insider and Politico, to show parts of articles in ChatGPT responses. The tech company has also struck a deal with the Associated Press for access to the news service's archives.

A Times victory could have major consequences for the news industry, which has been in crisis since the internet began to supplant newspapers and magazines nearly 20 years ago. Since then, newspaper advertising revenue has been in steady decline, the number of working journalists has dropped dramatically and hundreds of communities across the country no longer have local newspapers.

But even as publishers seek payment for the use of their human-generated materials to train AI, some also are publishing works produced by AI - which has prompted both backlash and embarrassment when those machine-created articles are riddled with errors.

Cornell's Grimmelmann said AI copyright cases might ultimately hinge on the stories each side tells about how to weigh the technology's harms and benefits.

"Look at all the lawsuits, and they're trying to tell stories about how these are just plagiarism machines ripping off artists," he said. "Look at the [AI firms' responses], and they're trying to tell stories about all the really interesting things these AIs can do that are genuinely new and exciting."

Reid of Colorado Law noted that tech giants may strike many judges and juries as less sympathetic defendants today than they did a decade ago, when the Google Books case was being decided.

"There's a reason you're hearing a lot about innovation and open-source and start-ups" from the tech industry, he said. "There's a race to frame who's the David and who's the Goliath here."
