News outlets are accusing Perplexity of plagiarism and unethical web scraping

Rebecca Bellan

July 2, 2024 at 11:00 AM

In the age of generative AI, when chatbots can provide detailed answers to questions based on content pulled from the internet, the line between fair use and plagiarism, and between routine web scraping and unethical summarization, is a thin one.

Perplexity AI is a startup that combines a search engine with a large language model to generate detailed answers rather than just links. Unlike OpenAI’s ChatGPT and Anthropic’s Claude, Perplexity doesn’t train its own foundational AI models, instead using open or commercially available ones to turn the information it gathers from the internet into answers.

But a series of accusations in June suggests the startup’s approach borders on being unethical. Forbes called out Perplexity for allegedly plagiarizing one of its news articles in the startup’s beta Perplexity Pages feature. And Wired has accused Perplexity of illicitly scraping its website, along with other sites.

Perplexity, which as of April was working to raise $250 million at a near-$3 billion valuation, maintains that it has done nothing wrong. The Nvidia- and Jeff Bezos-backed company says that it has honored publishers’ requests to not scrape content and that it is operating within the bounds of fair use copyright laws.

The situation is complicated. At its heart are nuances surrounding two concepts. The first is the Robots Exclusion Protocol, a standard used by websites to indicate that they don’t want their content accessed or used by web crawlers. The second is fair use in copyright law, which sets up the legal framework for allowing the use of copyrighted material without permission or payment in certain circumstances.

Surreptitiously scraping web content

Wired's June 19 story claims that Perplexity has ignored the Robots Exclusion Protocol to surreptitiously scrape areas of websites that publishers do not want bots to access. Wired reported that it observed a machine tied to Perplexity doing this on its own news site, as well as across other publications under its parent company, Condé Nast.

The report noted that developer Robb Knight conducted a similar experiment and came to the same conclusion.

Both Wired reporters and Knight tested their suspicions by asking Perplexity to summarize a series of URLs and then watching on the server side as an IP address associated with Perplexity visited those sites. Perplexity then “summarized” the text from those URLs — though in the case of one dummy website with limited content that Wired created for this purpose, it returned text from the page verbatim.
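The server-side half of that test can be sketched in a few lines of Python. This is only an illustration, not the tooling Wired or Knight used: it assumes the server writes access logs in the Common Log Format, and the IP addresses (taken from reserved documentation ranges), timestamps and paths in the sample are made up.

```python
import re

# Assumption: access logs are in Common Log Format, e.g.
# 203.0.113.7 - - [01/Jan/2024:10:00:01 +0000] "GET /page HTTP/1.1" 200 512
LOG_LINE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" (?P<status>\d{3})'
)


def requests_from(log_lines, watched_ip):
    """Return (timestamp, path) pairs for every request made by watched_ip."""
    hits = []
    for line in log_lines:
        match = LOG_LINE.match(line)
        if match and match.group("ip") == watched_ip:
            hits.append((match.group("time"), match.group("path")))
    return hits


# Hypothetical sample data: none of these addresses or paths come from the reports.
SAMPLE_LOG = [
    '203.0.113.7 - - [01/Jan/2024:10:00:01 +0000] "GET /test-page HTTP/1.1" 200 512',
    '198.51.100.4 - - [01/Jan/2024:10:00:02 +0000] "GET /about HTTP/1.1" 200 1024',
    '203.0.113.7 - - [01/Jan/2024:10:00:05 +0000] "GET /private/page HTTP/1.1" 200 64',
]

if __name__ == "__main__":
    for timestamp, path in requests_from(SAMPLE_LOG, "203.0.113.7"):
        print(timestamp, path)
```

If a request for a page appears in the log moments after a chatbot is asked to summarize that page, and it comes from an address tied to the chatbot's operator, that is the kind of evidence the testers describe.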

This is where the nuances of the Robots Exclusion Protocol come into play.

Technically, web scraping is when automated pieces of software known as crawlers scour the web to index and collect information from websites. Search engines like Google do this so that web pages can be included in search results. Other companies and researchers use crawlers to gather data from the internet for market analysis, academic research and, as we’ve come to learn, training machine learning models.

Web scrapers that comply with this protocol first look for the “robots.txt” file at the root of a site to see what is permitted and what is not. Today, what is not permitted is usually scraping a publisher’s site to build massive training datasets for AI. Search engines and AI companies, including Perplexity, have stated that they comply with the protocol, but they aren’t legally obligated to do so.
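For readers who want to see what that check looks like in practice, here is a minimal sketch using Python’s standard-library robots.txt parser. The robots.txt rules, the bot names “ExampleAIBot” and “SomeSearchBot,” and the example.com URLs are all hypothetical; they are not taken from any publisher’s actual file or any company’s actual crawler.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: one AI bot is barred from the whole site,
# everyone else is barred only from /private/.
ROBOTS_TXT = """\
User-agent: ExampleAIBot
Disallow: /

User-agent: *
Disallow: /private/
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    story = "https://example.com/news/story"
    print(is_allowed(ROBOTS_TXT, "ExampleAIBot", story))   # barred site-wide
    print(is_allowed(ROBOTS_TXT, "SomeSearchBot", story))  # only /private/ is off limits
```

Nothing in the file enforces these rules. A compliant crawler runs a check like this before each request and skips the URL when the answer is no; a non-compliant one simply does not ask.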

Perplexity’s head of business, Dmitry Shevelenko, told TechCrunch that summarizing a URL isn’t the same thing as crawling. “Crawling is when you’re just going around sucking up information and adding it to your index,” Shevelenko said. He noted that Perplexity’s IP might show up as a visitor to a website that is “otherwise kind of prohibited from robots.txt” only when a user puts a URL into their query, which “doesn’t meet the definition of crawling.”

“We’re just responding to a direct and specific user request to go to that URL,” Shevelenko said.

In other words, if a user manually provides a URL to an AI, Perplexity says its AI isn’t acting as a web crawler but rather a tool to assist the user in retrieving and processing information they requested.

But to Wired and many other publishers, that’s a distinction without a difference because visiting a URL and pulling the information from it to summarize the text sure looks a whole lot like scraping if it's done thousands of times a day.

(Wired also reported that Amazon Web Services, one of Perplexity’s cloud service providers, is investigating the startup for ignoring the robots.txt protocol to scrape web pages that users cited in their prompts. AWS told TechCrunch that Wired’s report is inaccurate and that it told the outlet it was processing its media inquiry like it does any other report alleging abuse of the service.)

Plagiarism or fair use?

Image: Screenshot of Perplexity Pages. Forbes accused Perplexity of plagiarizing its scoop about former Google CEO Eric Schmidt developing AI-powered combat drones.

Wired and Forbes have also accused Perplexity of plagiarism. Ironically, Wired says Perplexity plagiarized the very article that called out the startup for surreptitiously scraping its web content.

Wired reporters said the Perplexity chatbot “produced a six-paragraph, 287-word text closely summarizing the conclusions of the story and the evidence used to reach them.” One sentence exactly reproduces a sentence from the original story; Wired says this constitutes plagiarism. The Poynter Institute’s guidelines say it might be plagiarism if the author (or AI) used seven consecutive words from the original source work.
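The seven-consecutive-words rule of thumb is simple enough to express in code. The sketch below is a rough illustration of the guideline as described above, not a tool Poynter or Wired uses: it lowercases both texts, strips punctuation, and reports whether any run of seven words appears in both. The function names and sample sentences are invented for the example.

```python
import re


def word_ngrams(text, n=7):
    """Return the set of all runs of n consecutive words in text."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def shares_word_run(original, candidate, n=7):
    """True if candidate repeats any n consecutive words from original."""
    return bool(word_ngrams(original, n) & word_ngrams(candidate, n))


if __name__ == "__main__":
    # Invented sentences, used only to exercise the check.
    original = "The quick brown fox jumps over the lazy dog near the river bank"
    copied = "Reports say the quick brown fox jumps over the lazy dog today"
    paraphrase = "A fast fox leaped over a sleepy dog"
    print(shares_word_run(original, copied))      # True: shares a seven-word run
    print(shares_word_run(original, paraphrase))  # False: no shared run
```

A match only flags a passage for a human editor to review. Short quotations and common phrases can trip the check, and a close paraphrase can pass it.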

Forbes also accused Perplexity of plagiarism. The news site published an investigative report in early June about how former Google CEO Eric Schmidt’s new venture is recruiting heavily and testing AI-powered drones with military applications. The next day, Forbes editor John Paczkowski posted on X saying that Perplexity had republished the scoop as part of its beta feature, Perplexity Pages.

Perplexity Pages, which is only available to certain Perplexity subscribers for now, is a new tool that promises to help users turn research into “visually stunning, comprehensive content,” according to Perplexity. Examples of such content on the site come from the startup’s employees, and include articles like “Beginner’s Guide to Drumming,” or “Steve Jobs: Visionary CEO.”

“It rips off most of our reporting,” Paczkowski wrote. “It cites us, and a few that reblogged us, as sources in the most easily ignored way possible.”

Forbes reported that many of the posts that were curated by the Perplexity team are “strikingly similar to original stories from multiple publications, including Forbes, CNBC and Bloomberg.” Forbes said the posts gathered tens of thousands of views and didn’t mention any of the publications by name in the article text. Rather, Perplexity’s articles included attributions in the form of “small, easy-to-miss logos that link out to them.”

Furthermore, Forbes said the post about Schmidt contains “nearly identical wording” to Forbes’ scoop. The aggregation also included an image created by the Forbes design team that appeared to be slightly modified by Perplexity.

Perplexity CEO Aravind Srinivas responded to Forbes at the time by saying the startup would cite sources more prominently in the future — a solution that’s not foolproof, as citations themselves face technical difficulties. ChatGPT and other models have hallucinated links, and since Perplexity uses OpenAI models, it is likely to be susceptible to such hallucinations. In fact, Wired reported that it observed Perplexity hallucinating entire stories.

Other than noting Perplexity’s “rough edges,” Srinivas and the company have largely doubled down on Perplexity’s right to use such content for summarizations.

This is where the nuances of fair use come into play. Plagiarism, while frowned upon, is not technically illegal.

According to the U.S. Copyright Office, it is legal to use limited portions of a work, including quotes, for purposes like commentary, criticism, news reporting and scholarly reports. AI companies like Perplexity posit that providing a summary of an article is within the bounds of fair use.

“Nobody has a monopoly on facts,” Shevelenko said. “Once facts are out in the open, they are for everyone to use.”

Shevelenko likened Perplexity’s summaries to how journalists often use information from other news sources to bolster their own reporting.

Mark McKenna, a professor of law at the UCLA Institute for Technology, Law & Policy, told TechCrunch the situation isn’t an easy one to untangle. In a fair use case, courts would weigh whether the summary uses a lot of the expression of the original article, versus just the ideas. They might also examine whether reading the summary might be a substitute for reading the article.

“There are no bright lines,” McKenna said. “So [Perplexity] saying factually what an article says or what it reports would be using non-copyrightable aspects of the work. That would be just facts and ideas. But the more that the summary includes actual expression and text, the more that starts to look like reproduction, rather than just a summary.”

Unfortunately for publishers, unless Perplexity is reproducing full expressions (and apparently, in some cases, it is), its summaries might well qualify as fair use.

How Perplexity aims to protect itself

AI companies like OpenAI have signed media deals with a range of news publishers to access their current and archival content on which to train their algorithms. In return, OpenAI promises to surface news articles from those publishers in response to user queries in ChatGPT. (But even that has some kinks that need to be worked out, as Nieman Lab reported last week.)

Perplexity has held off from announcing its own slew of media deals, perhaps waiting for the accusations against it to blow over. But the company is “full speed ahead” on a series of advertising revenue-sharing deals with publishers.

The idea is that Perplexity will start including ads alongside query responses, and publishers that have content cited in any answer will get a slice of the corresponding ad revenue. Shevelenko said Perplexity is also working to allow publishers access to its technology so they can build Q&A experiences and power things like related questions natively inside their sites and products.

But is this just a fig leaf for systemic IP theft? Perplexity isn’t the only chatbot that threatens to summarize content so completely that readers fail to see the need to click out to the original source material.

And if AI scrapers like this continue to take publishers’ work and repurpose it for their own businesses, publishers will have a harder time earning ad dollars. That means eventually, there will be less content to scrape. When there’s no more content left to scrape, generative AI systems will then pivot to training on synthetic data, which could lead to a hellish feedback loop of potentially biased and inaccurate content.
