A set of pirated books is at the crux of AI's copyright woes

Theara Coleman, Staff writer

September 8, 2023 at 5:16 AM


Copyright activists have set their sights on a collection of pirated books used to train generative artificial intelligence tech. The Rights Alliance, a Danish anti-piracy group, has taken "a multifaceted approach to its quest to obliterate" the Books3 dataset, created by researcher Shawn Presser in 2020, Wired reported. The group has already persuaded a few sites to stop hosting the dataset, but its plans don't end there. Despite being relatively small, the group has made "a surprising amount of progress," the outlet added.

The war between creatives and Big Tech companies over the use of copyrighted materials is being fought on multiple fronts, and Books3 is present in many of the conflicts. Presser saw his project "as aligned with the open source movement" and created it as "a way to democratize access to the kind of data sets OpenAI was already using," Wired noted. If activists scrub it from the internet, the move might effectively gatekeep access to the burgeoning industry.

Books3 began as a "passion project" that leveled the playing field

Little is publicly known about what is inside OpenAI's training data sets. Still, Presser suspected the company sourced its data from a digital "shadow library," a site that hosts thousands of pirated texts. He and a few colleagues decided to try to create an open-access version of what OpenAI was likely using. "We were like, OK, there's actually not that much standing in the way of us doing this ourselves," Presser told Wired.

Presser found what he was looking for at The Eye, a data archiving group hosting links to books from the shadow library Bibliotik. "I was like, jackpot," he said. He was able to scrape and convert a library of around 196,000 books. The data set includes work by Stephen King, Margaret Atwood, and Zadie Smith, per a recent analysis by The Atlantic. Presser named the data set after OpenAI's enigmatic Books1 and Books2 sets. He contacted The Eye to host it, and Books3 went online in October 2020. Some of his collaborators founded EleutherAI, a non-profit AI collective that included Books3 in a larger open-access data set, Wired reported.

The project started as "a passion project by a Midwestern guy going through a weird time," the outlet noted. "I poured my soul into the work," Presser told Wired.

Now, the data set is central to the tension between artists' rights and AI innovation

Despite Presser's best intentions, his critics contend that the open source data set "isn't a boon to society—instead, it's emblematic of everything wrong with generative AI," Wired added. Copyright owners have been on edge about their content being scraped without their consent, and the existence of Books3 confirms those fears.

The Rights Alliance filed Digital Millennium Copyright Act (DMCA) notices against The Eye and other sites hosting Books3, which led those sites to take the data set down. Copies may still exist elsewhere on the internet, however. The group also reached out to Meta and Bloomberg about their use of the pilfered data set, and the latter said it wouldn't be using Books3 to train BloombergGPT, its large language model (LLM), in the future.

There are multiple pending lawsuits against companies that used Books3 or similar data sets to train their large language models. Comedian and author Sarah Silverman and two other writers filed a lawsuit against Meta, alleging that it violated copyright laws by using pirated work from Books3 to train LLaMA, the company's large language model. Legal experts expect the company to argue that the content falls under fair use, but the authors disagree.

Still, Presser stood behind his decision to make an open-access data set and argued that it was the only way to replicate programs like ChatGPT. He told Gizmodo that "every for-profit company does this secretly, without releasing the datasets to the public." If Books3 were eliminated, he said, we would "live in a world where nobody except OpenAI and other billion-dollar companies have access to those books," meaning no one else could make their own generative AI programs. "Only billion-dollar companies would have the resources to do that."
