170,000-plus books used to train AI; authors say they weren’t asked

An investigation by The Atlantic indicated thousands of e-books are being used to train an artificial intelligence system called Books3. | Adobe Stock

Authors are upset after tech companies started using their books to train artificial intelligence without letting them know or seeking their permission. They worry about copyright infringement and loss of income, among other issues.

Per CNN, “The system is called Books3, and according to an investigation by The Atlantic, the data set is based on a collection of pirated e-books spanning all genres, from erotic fiction to prose poetry. Books help generative AI systems with learning how to communicate information.”

“The future promised by AI is written with stolen words,” The Atlantic article said.

The article notes that some of the text that’s training AI on how to use language is taken from Wikipedia and other online entries. But “high-quality generative AI requires higher-quality input than is usually found on the internet — that is, it requires the kind found in books.”

Many authors apparently don’t view the use of their books to train artificial intelligence as an honor. Rather, it’s a shortcut that robs them of their due, they say.

CNN reported that Nora Roberts, who writes romance novels, has 206 books in the database, “second only to William Shakespeare.” In a statement to CNN, she said the database is “all kinds of wrong. We are human beings, we are writers and we are being exploited by people who want to use our work, again without permission or compensation, to ‘write’ books, scripts, essays because it’s cheap and easy.”

Per The Atlantic, Sarah Silverman, Richard Kadrey and Christopher Golden filed a lawsuit in California that claims Meta — owner of Facebook — violated their copyrights by using their books to train the company’s large language model LLaMA. That’s an algorithm that competes with OpenAI’s GPT-4 to create its own text by using word patterns it learned from the books and other sources, the article said.

The Atlantic’s Alex Reisner created a stir when he obtained a list of the books and published a searchable database so that anyone can see whether their favorite author’s work is being used to teach AI communication skills. He notes the authors include well-known names like Stephen King, John Kratz and James Patterson, among others. The books apparently were gathered by web-crawling technology that found bootleg PDF copies available for free online. They were then packaged into a database called Books3, which different AI companies are using. Bloomberg said it will not use Books3 in the future as it trains its BloombergGPT.

The Authors Guild on Sept. 27 published a guide on actions authors can take if they learn their books are in the Books3 dataset. “This can be an unsettling revelation, raising concerns about copyright, compensation and the future implications of AI,” the article said.

The guild and 17 authors filed a different class-action suit in New York against OpenAI for copyright infringement. Those authors, per a separate guild article, include David Baldacci, Mary Bly, Michael Connelly, John Grisham, Jodi Picoult, Scott Turow and Rachel Vail, among others.

“The complaint draws attention to the fact that the plaintiffs’ books were downloaded from pirate ebook repositories and then copied into the fabric of GPT 3.5 and GPT 4 which power ChatGPT and thousands of applications and enterprise uses — from which OpenAI expects to earn many billions,” the article said.

Reisner also wrote that while Meta is using authors’ books without permission, it employed a “takedown” order against at least one developer who used LLaMA code after it was leaked a few months ago, on the claim that “no one is authorized to exhibit, reproduce, transmit or otherwise distribute Meta Properties without the express written permission of Meta.” And even after deciding to make LLaMA open-source, Meta still requires developers to get a license in order to use it.

Not everyone is upset, however, about the use of their work to train AI. Ian Bogost, author of “Play Anything: The Pleasure of Limits, the Uses of Boredom and the Secret of Games,” among other works, wrote a column for The Atlantic titled “My Books Were Used to Train Meta’s Generative AI. Good.” And he promised, “It can have my next one, too.”

Bogost contends that successful art “exceeds its creator’s plans,” noting that an author cannot accurately predict a book’s audience. “Who am I to say what my work is good for, how it might benefit someone — even a near-trillion-dollar company? To bemoan this one unexpected use for my writing is to undermine all of the other unexpected uses for it. Speaking as a writer, that makes me feel bad.”