Wikipedia Will Survive A.I.

Stephen Harrison

August 24, 2023 at 2:30 PM

A laptop open to ChatGPT is illustrated with lots of yellow asterisks. — Photo illustration by Slate. Photo by Nicolas Maeterlinck/BELGA MAG/AFP via Getty Images.

Welcome to Source Notes, a Future Tense column about the internet’s information ecosystem.

Wikipedia is, to date, the largest and most-read reference work in human history. But the editors who update and maintain Wikipedia are certainly not complacent about its place as the preeminent information resource, and are worried about how it might be displaced by generative A.I. At last week’s Wikimania, the site’s annual user conference, one of the sessions was “ChatGPT vs. WikiGPT,” and a panelist at the event mentioned that rather than visiting Wikipedia, people seem to be going to ChatGPT for their information needs. Veteran Wikipedians have couched ChatGPT as an existential threat, predicting that A.I. chatbots will supplant Wikipedia in the same way that Wikipedia infamously dethroned Encyclopedia Britannica back in 2005.

But it seems to me that rumors of the imminent “death of Wikipedia” at the hands of generative A.I. are greatly exaggerated. Sure, the implementation of A.I. technology will undoubtedly alter how Wikipedia is used and transform the user experience. At the same time, the features and bugs of large language models, or LLMs, like ChatGPT intersect with human interests in ways that support Wikipedia rather than threaten it.

For context, there have been elements of artificial intelligence and machine learning on Wikipedia since 2002. Automated bots on Wikipedia must be approved, as set forth in the bot policy, and generally must be supervised by a human. Content review is assisted by bots such as ClueBot NG, which identifies profanity and unencyclopedic punctuation like “!!!11.” Another use case is machine translation, which has helped provide content for the 334 different language versions of the encyclopedia, again generally with human supervision. “At the end of the day, Wikipedians are really, really practical—that’s the fundamental characteristic,” said Chris Albon, director of machine learning at the Wikimedia Foundation, the nonprofit organization that supports the project. “Wikipedians have been using A.I. and M.L. from 2002 because it just saved time in ways that were useful to them.”

In other words, bots are old news for Wikipedia; it’s the offsite LLMs that present new challenges. Earlier this year, I reported on how Wikipedians were grappling with the then-new ChatGPT and deciding whether chatbot-generated content should be used in the process of composing Wikipedia articles. At the time, the editors were understandably concerned with how LLMs hallucinate, responding to prompts with outright fabrications complete with fake citations. There is a real risk that users who copy ChatGPT text into Wikipedia will pollute the project with misinformation. But an outright ban on generative A.I. seemed both too harsh and too Luddite, a failure to recognize new ways of working. Some editors have reported that ChatGPT answers were useful as a starting point or a skeletal outline. While banning generative A.I. could keep low-quality ChatGPT content off of Wikipedia, it could also curtail the productivity of human editors.

These days, Wikipedians are in the process of drafting a policy for how LLMs can be used on the project. What’s being discussed is essentially a “take care and declare” framework: The human editor must disclose in an article’s public edit history that an LLM was used and must take personal responsibility for vetting the LLM content and ensuring its accuracy. It’s worth noting that the proposed policy for LLMs is very similar to how most Wikipedia bots require some human supervision. Leash your bots, your dogs, and now your LLMs.

To be clear, the Wikipedia community has jurisdiction over how their fellow editors use bots—but not how external agents are using Wikipedia. These days, generative A.I. companies are taking advantage of the internet encyclopedia’s open license. Every LLM so far has been trained on Wikipedia’s content, and the site is almost always the largest source of training data within their data sets.

Despite swallowing Wikipedia’s entire corpus, ChatGPT is not the polite sort of robot that graciously credits Wikipedia when it uses that information for one of its responses. Quite the contrary: the chatbot doesn’t typically disclose its sources at all. Critics are advocating for greater transparency and urging restraint until chatbots become explainable A.I. systems.

Of course, there’s a scary reason that LLMs don’t normally credit their sources: the A.I. does not always know how it has arrived at its answer. Pardon the grotesque simile, but the knowledge base of a typical LLM is like a huge hairball; the LLM may pull strands from Wikipedia, Tumblr, Reddit, and a variety of other sources without distinguishing among them. And the LLM is basically programmed solely to predict the next phrase, not to provide credit when it’s due.

Journalists in particular seem very concerned about how ChatGPT isn’t acknowledging Wikipedia in its responses. The New York Times Magazine published a feature last month on how the reuse of Wikipedia information by A.I. imperiled Wikipedia’s health and made people forget about its important role behind the scenes.

But I get the sense that most Wikipedia contributors are less concerned about credit-claiming than the average reporter. For one thing, Wikipedians are used to this: After all, before LLMs, Siri and Alexa were the ones scraping Wikipedia without credit. (As of publication time, these smart assistants have been updated to say something like “from Wikipedia.”) More fundamentally, there has always been an altruistic element in curating information for Wikipedia: People add knowledge to the site expecting that everyone else will use it how they will.

Rather than sapping away the morale of volunteer human Wikipedians, generative A.I. may add a new reason to the list of their motivations: a sincere desire to train the robots. This is also a reason that generative A.I. companies like OpenAI should care about maintaining Wikipedia’s role as ChatGPT’s primary tutor. It’s important for Wikipedia to remain a human-written knowledge source. We now know that LLM-generated content is like poison for training LLMs: If the training data is not human-created, then LLMs become measurably dumber. LLMs that eat too much of their own cooking are prone to model collapse, a symptom of the curse of recursion.

As Selena Deckelmann, the Wikimedia Foundation’s chief product and technology officer, put it, “the world’s generative AI companies need to figure out how to keep sources of original human content, the most critical element of our information system, sustainable and growing over time.” This mutual interest is perhaps why Google.org, the Musk Foundation, Facebook, and Amazon are among the benefactors who have donated more than a million dollars to the Wikimedia Endowment—A.I. companies seem to have realized that keeping Wikipedia a human-created project is in their interests. (For further context, the foundation is primarily supported by numerous small donations by ordinary Wikipedia readers and supporters, which is comforting for those of us who worry about any big tech company gaining too much influence over the direction of the nonprofit organization.)

The weaknesses of A.I. chatbots could also popularize new use cases for Wikipedia. In July, the Wikimedia Foundation released a new Wikipedia ChatGPT plug-in that allows ChatGPT to search for and summarize the most up-to-date information on Wikipedia to answer general knowledge queries. For instance, if you ask ChatGPT 3.5 in its standard form about Donald Trump’s indictment, the chatbot says it doesn’t know about it because it is only trained on the internet through September 2021. But with the new plug-in, the chatbot accurately summarizes current events. Notice how Wikipedia in this example is functioning something like a water filter: sitting on the tap of the raw LLM, rooting out inaccuracies, and bringing the content up to speed.

Whether Wikipedia is incorporated into A.I. via the training data or as a plug-in, it’s clear that it’s important to keep humans interested in curating information for the site. Albon told me about several proposals to leverage LLMs to help make the editing process more enjoyable. One idea proposed by the community is to allow LLMs to summarize the lengthy discussions on talk pages, the non-article spaces where editors delve into the site’s policies. Since Wikipedia is more than 20 years old, some of these walls of text are now lengthier than War and Peace. Few people have the time to review all of the discussion that has taken place since 2005 about what qualifies as a reliable source for Wikipedia, much less perennial sources. Rather than expecting new contributors to review multiyear discussions about the issue, the LLM could just summarize them at the top. “The reason that’s important is to draw in new editors, to make it so it’s not so daunting,” Albon said.

John Samuel, an assistant professor of computer science at CPE Lyon, told me that prospective Wikipedia editors he’s recruited often find it difficult to get started. Finding reliable sources to use for an article can be very labor-intensive, and Gen Z has grown impatient with the chore of sifting through Google search results. An internet that has become flooded with machine-generated content will make the process of finding quality sources even more painful.

But Samuel foresees a hopeful future in which Wikipedia has integrated some A.I. technology that helps human editors find quality sources and double-checks that the underlying sources in fact state what the human claims. “We cannot delay things. We have to think about integrating the newer A.I.-based tools so that we save the time of contributors,” Samuel said.

If there’s a common theme running through the A.I.-gloom discourse, it’s that A.I. is going to take people’s jobs. And what about the “job” of volunteer Wikipedia editors? The answer is nuanced. On the one hand, a lot of repetitive work (adding article categories, basic formatting, easy summaries) is likely to be automated. Then again, the work of the people editing Wikipedia has never really been about writing text, per se. The more important job has always involved discussions between members of the community, debates about whether one source or the other is more reliable, arguments about whether wording is representative or misleading, trying to collaborate with the shared goal of improving the encyclopedia. So perhaps that’s where the future is heading for Wikipedia: leave the polite busywork for the A.I., but keep the discourse and the disagreement—that messy, meaningful, consensus-building stuff—for humans.
