Meta's AI Is Partially Trained on Breitbart and Russia Today, Study Finds

Kyle Barr

April 19, 2023 at 12:07 PM


As much as people think AI is ‘intelligent,’ what goes into training modern AI models shapes the kind of information they put out.

AI is sophisticated, but it’s not really intelligent. Today’s large language models, which power programs like ChatGPT, are amalgamations of text scraped from the internet. So when Meta introduced its “state of the art” LLaMA AI back in February, eyes turned to some of the datasets used to train it, especially the Google-made “Colossal Clean Crawled Corpus,” or C4. It turns out that, like its namesake, some of the scraped text truly blows.

So just how explosive is this C4 dataset? An analysis of the scraped data published Wednesday by The Washington Post shows C4 drew on some heinous sources for its text. The top four most-used sites were Google Patents (making up 0.46% of all tokens), Wikipedia (0.19%), Scribd (0.07%), and The New York Times’ website (0.06%). At the same time, C4 used large swaths of text from the Russian propaganda site Russia Today and the ultra-right-wing Breitbart. Both were among the top 200 sites trawled for text.

The Post worked alongside researchers at the Allen Institute for AI, who recreated the dataset. Some sites are far less present in the training data but are notable for their atrocious content. Stormfront, a site for white supremacists, was included in the data, ranked 27,505th. Kiwi Farms, the site known for its vile online harassment campaigns, made up 0.00004% of tokens. 4chan, with all its wild conspiracy theories, was also included, though it ranked in a lowly 484,297th place. There are other small instances of text scraped from sites promoting conspiracies, porn, and hate content. Meta and Google did not immediately respond to requests for comment.

In addition, the training data drew on half a million personal blogs hosted on sites like Medium, Blogspot and WordPress. The dataset includes text from Kickstarter, Etsy and Patreon, scraping the text and style of people promoting their work online. Two of the largest scraped sites hosted voter registration databases for Colorado and Florida. Though both databases are technically public information, the scrape may have swept up private citizens’ data.

This particular dataset has been used in major AI projects besides Meta’s LLaMA, such as Google’s T5 text-to-text transformer model. According to Google, the company originally developed C4 as a “cleaned version” of the nonprofit Common Crawl’s AI training data. Google said it removed offensive or “noisy” content from the dataset, including any dirty language and offensive slurs. Google’s LaMDA AI, which is used for the company’s Bard chatbot, is something of a black box. It was trained on a dataset called Infiniset, which is described as 1.56 trillion words of public dialog data and web text, 50% of which comes from public forums. Another 12.5% of its training set is C4 data, while the rest comes from English-language Wikipedia and other web documents.

According to the research paper released alongside LLaMA, 15% of its pre-training data came from C4. Another 67% came from filtered CommonCrawl dumps from 2017 to 2020. The rest of its data comes directly from sites like Wikipedia, Project Gutenberg, and GitHub. Last year, a programmer sued GitHub over its AI assistant tool, saying it was taking his and other coders’ work without permission.

The Post’s report is all the more enlightening considering just how hard it is to actually find information about AI training. OpenAI did not reveal even the barest details of how it trained its GPT-4 LLM, released last month, citing the “competitive landscape” of AI development. Knowing what goes into the training can help explain certain biases in a model’s outputs. Researchers recently showed how ChatGPT can be used to produce overtly racist responses through some simple prompt engineering.

The Allen Institute also included a search function that lets users see if C4 used their text. A quick search for “Gizmodo” shows the dataset scraped thousands of articles from and about our site from throughout the 2010s. According to the Post’s count, our site ranks only 275th, behind both RT and Breitbart.

