Lost in Transcription: Auto-Captions Often Fall Short on Zoom, Facebook, Google Meet, and YouTube

New research found errors that pose hurdles for users who are deaf or hard of hearing, or whose first language isn’t English

By Kaveh Waddell

Data visualizations by Andy Bergmann

When William Albright watches tutorial videos about coding on YouTube, sometimes the presentations stop making sense. That’s because Albright, a software developer in San Francisco who is deaf, uses automatic captions to follow what people are saying in videos—and those captions are often awash in errors.

“Sometimes it’s just gibberish,” he says.

“I can deal with ‘SQL’ spelled as ‘sequel,’ ” says Albright, referring to a widely used programming language. “But if the whole sentence is littered with mistakes like that, and the material itself is supposed to be challenging, it’s hard to stay focused.”

Mistakes in auto-captions plague many leading videoconferencing and social media apps, according to a new study by researchers at Northeastern University and Pomona College, who worked with Consumer Reports to test auto-captions in seven popular products.

All of the programs made some mistakes, with some getting about 1 in 10 words wrong. And the results were even worse when English wasn’t the speaker’s first language—even if they were fluent.

These flaws can cause problems for people who are deaf or hard of hearing and often use auto-captions to participate in work meetings and one-on-one calls or to just enjoy online entertainment. They also matter for people who aren’t completely comfortable with English, who might use captions to keep up with a recorded college lecture or watch how-to videos on YouTube or Facebook.

An estimated 11.4 million Americans ages 5 and older have a hearing disability, and over 25 million Americans ages 5 and older don’t speak English very well, according to census data.

“If captions are wildly inaccurate, you can’t figure out what the whole conversation is about,” says Christian Vogler, PhD, director of the Technology Access Project at Gallaudet University. Vogler, who is deaf, advocates for technology that works better for people with hearing disabilities. “If there are too many errors, you might not even be able to figure out the gist,” he says. “And it also imposes a cognitive load trying to compensate for errors and puzzling out the meaning.”

For Albright, an error-ridden transcript can be a deal-breaker. He often has to give up on a video that’s poorly captioned and go looking for a more accessible alternative.

It’s about “equal access,” he says. “Hearing folks wouldn’t accept unintelligible audio. Same principle.”

Tracking Auto-Caption Errors

It’s hard for software to get transcriptions exactly right. Any given recording might have patchy audio quality or people speaking over one another. Plus, even just among English speakers, there are countless accents, dialects, and regional quirks to account for.

To simplify things for our testing, CR and our partners at Northeastern and Pomona settled on audio from the popular TED lecture series. (CR funded the study, which is currently under peer review ahead of publication.)

TED Talks are recorded professionally, and most feature one speaker at a time who is trying to communicate clearly, making the audio a relatively easy challenge for automated captioning systems. (We didn’t use any clips that included multiple speakers, music, or video playback.) And because the talks are so consistent, we could compare them against one another to see whether the software had a harder time transcribing speakers of a particular racial group, age, or gender.

We used seven popular software platforms to generate transcripts for nearly 850 TED Talks, delivered by a broad range of speakers. Then we compared the computer-generated captions with official transcripts prepared manually by trained TED volunteers. To see how the auto-captions did, we tallied the number of times the computer-generated transcripts differed from the official human-written ones.
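In effect, that tally is a word error rate: the number of substitutions, insertions, and deletions needed to turn the human transcript into the computer-generated one, expressed per word of the human transcript. As a rough illustration of the idea (not the researchers’ actual scoring code), a word-level comparison might look like this in Python:

```python
# Minimal sketch: word error rate (WER) between a human reference
# transcript and an auto-generated hypothesis, counted with a
# word-level edit distance. Illustrative only; the study's actual
# scoring pipeline may differ.

def word_error_rate(reference: str, hypothesis: str) -> float:
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()

    # dp[i][j] = minimum edits to turn the first i reference words
    # into the first j hypothesis words
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j

    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            if ref[i - 1] == hyp[j - 1]:
                dp[i][j] = dp[i - 1][j - 1]
            else:
                dp[i][j] = 1 + min(
                    dp[i - 1][j - 1],  # substitution
                    dp[i - 1][j],      # deletion
                    dp[i][j - 1],      # insertion
                )

    return dp[len(ref)][len(hyp)] / max(len(ref), 1)


# Example: two mistakes in a five-word sentence -> WER of 0.4
print(word_error_rate(
    "captions help people follow videos",
    "captions helped people fallow videos",
))
```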

Our study included four videoconferencing platforms: BlueJeans, Cisco WebEx, Google Meet, and Zoom. We also looked at Microsoft Stream, the company’s video streaming service, because it uses the same voice technology as Skype and Microsoft Teams. And we tested Facebook and YouTube. (CR posts videos to YouTube and our own website. We did not test the captioning software we use on our site because it’s not a consumer product.)

Overall, the platforms made a lot of mistakes, even though TED Talks were a best-case scenario. “We threw them a softball and they’re still not knocking it out of the park,” says David Choffnes, PhD, an associate professor of computer science at Northeastern who was one of the researchers.

But some products performed better than others. WebEx had more mistakes than Google Meet, for instance, and YouTube did better than Facebook. (See the graph below.) Even within each platform, however, there were big differences. For Zoom, for example, the very best transcription had just two errors per 100 words, while at its worst the software mistranscribed nearly every third word.

Next, we ran statistical models to determine whether any characteristics of the speakers explained the variation in error rates, controlling for each speaker’s age, gender, race and ethnicity, first language, and speech rate. As it turned out, only gender and first-language status had an independent effect on transcription accuracy.
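The exact models will be described in the peer-reviewed paper. Purely as a sketch of the general approach, a regression of per-talk error rate on speaker attributes could look like the example below; the data file and column names are hypothetical, not the researchers’ own.

```python
# Illustrative sketch only: a linear regression of per-talk error rate
# on speaker characteristics, similar in spirit to controlling for
# several attributes at once. The CSV file and column names are
# hypothetical; the study's actual models may differ.
import pandas as pd
import statsmodels.formula.api as smf

talks = pd.read_csv("ted_caption_errors.csv")  # hypothetical data file

# error_rate: errors per word for one platform's transcript of one talk
model = smf.ols(
    "error_rate ~ C(gender) + C(race_ethnicity) + C(first_language)"
    " + age + speech_rate",
    data=talks,
).fit()

# Coefficients show whether each attribute is independently associated
# with higher or lower error rates, holding the others constant.
print(model.summary())
```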

Though the accuracy differences we found between groups of speakers may not seem large, they can have a real impact on comprehension.

Take Zoom’s average accuracy gap between native and non-native English speakers: about 3.6 percentage points, which looks like a small number. But imagine misunderstanding three or four extra words out of every hundred, on top of the roughly 8 percent of words that the auto-captions already bungled, according to our study. English is often spoken at about 150 words per minute, so those mistakes can pile up fast.
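Here’s that back-of-the-envelope arithmetic laid out explicitly, using the figures cited above. (The per-minute totals are our own illustration, not numbers from the study.)

```python
# Rough illustration using the rates cited in this article.
words_per_minute = 150       # typical pace of spoken English
baseline_error_rate = 0.08   # roughly 8 errors per 100 words overall
non_native_gap = 0.036       # about 3.6 extra errors per 100 words

baseline_errors = words_per_minute * baseline_error_rate   # about 12 per minute
extra_errors = words_per_minute * non_native_gap            # about 5 more per minute

print(f"~{baseline_errors:.0f} caption errors per minute to begin with, "
      f"plus ~{extra_errors:.0f} more when the speaker isn't a native English speaker")
```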

That means that an auto-caption user is less likely to be able to understand people whose native language isn’t English, whether that’s a colleague, a teacher, or a YouTube personality.

“Some groups are at a disadvantage in situations where they need captions: less authenticity, more potential for miscommunication, and by extension more potential for major screw-ups,” says Gallaudet’s Vogler.

What the Companies Said

We asked the companies behind all seven products to comment on our findings.

A spokesperson for YouTube said that our results matched up with the company’s “expectations for performance,” and that the company is working on improving YouTube “so it works better for everyone,” including by working with linguists.

Microsoft said our findings are roughly in line with its internal testing, which also reveals lower accuracy when transcribing men and second-language English speakers. A company spokesperson said Microsoft regularly assesses caption accuracy by gender, age, language variety, and regional and foreign-language accents.

A Zoom spokesperson said, “We’re continuously enhancing our transcription feature to improve accuracy toward a variety of factors, including English dialects and accents.” And Google said through a spokesperson that it’s working to “improve the accuracy of live captions and translations so even more users can participate and stay engaged using Google Meet.”

A spokesperson for Verizon, which owns BlueJeans, said that the company uses third-party software for its captions, but it declined to name the software provider. The spokesperson said, “We have the ability to train our models to improve accuracy and are always evaluating ways to do so.”

Cisco, which owns WebEx, said its own auto-caption testing puts WebEx ahead of two “best-in-class speech recognition engines,” but it wouldn’t say which products those were. That’s in contrast to our study, where WebEx had the highest error rate. A Cisco spokesperson said the discrepancy may be explained by the fact that WebEx’s captions are fine-tuned for videoconferencing rather than other scenarios, like TED-style lectures.

Meta, Facebook’s parent company, declined to comment on CR’s findings, but a spokesperson pointed to recent work from its AI research team that found accuracy gaps by speaker gender and skin tone. (The research was not performed on Facebook’s own auto-caption technology.)

How Speech Technology Can Go Wrong

Experts say accuracy gaps like the ones we found usually arise when a system hasn’t been taught to understand a broad variety of English speakers.

Speech recognition systems learn to interpret spoken language by training on enormous datasets of recorded speech and transcripts. That allows the systems to match patterns in real-world speech with patterns they learned from the training data, and to produce transcripts without human intervention.

But if a system was trained on data that mostly featured a certain type of speaker—people who grew up speaking English, say, or white American English speakers—it may end up lagging when it tries transcribing other types of English speakers in the real world.

This technology powers much more than auto-captions. It’s also behind digital assistants like Siri and Alexa, dictation services that let you speak a text message rather than typing it, and call-center software that tries to understand spoken commands from consumers. Companies are looking for more opportunities to let people control their gadgets through voice instead of typing and swiping. And as that happens, any shortcomings with speech-to-text software will become even more important.

In our research, only gender and first language affected auto-caption accuracy, but other academic studies have found differences correlated with race and ethnicity.

In a 2020 study, for example, Stanford researchers found that major speech recognition products misunderstood Black users at nearly twice the rate that they misunderstood white users. The almost twofold disparity existed in products from all five companies they studied: Apple, Amazon, Google, IBM, and Microsoft.

And in a pair of papers published in 2017, Rachael Tatman, who has a PhD in linguistics from the University of Washington, found some significant differences in YouTube’s auto-caption accuracy based on a speaker’s regional dialect.

It’s likely these differences didn’t appear in our research because we used TED Talks to test caption quality, says Nicole Holliday, PhD, an assistant professor of linguistics at Pomona College and co-author of the CR-associated study. Much of the variation that exists in people’s informal speech disappears when they’re speaking from a stage.

“The speech style used by TED talkers is very formal and very rehearsed, and it’s recorded with very high audio quality,” Holliday says. “This means that in the real world, the effects that we found are likely to be amplified. It also means that there exist other biases that we didn’t find, related to the speech style.”

The hyperformal setting also helps explain the gender differences in accuracy, Holliday and Tatman say. Linguistics research has shown that women generally tend to use more “standardized” language. That means they may be less likely to use informal or socially stigmatized speech patterns that often cause problems for speech recognition systems.

Building Better Auto-Captioning

Research like CR’s should prod tech companies to improve their speech-recognition systems, experts say.

“In some tech spheres, there’s an incorrect belief that automatic speech recognition is a ‘solved problem,’ ” Holliday says. “But in reality, it works much better for a very specific type of idealized speaker who is speaking in a very formal style. The vast majority of the time when people are speaking, they’re not fitting the mold that companies assume.”

That doesn’t mean it’s an easy fix. Auto-caption systems have become steadily better in recent years, according to experts and people who use them regularly. Improvements are increasingly costly, Tatman says, and will require better training data as well as better signal processing for filtering out background noise or zeroing in on a speaker’s voice.

Some newcomers to the field are trying to chip away at the problem. Vogler, the Gallaudet assistive technology expert, is helping to build an auto-captioning system called GoVoBo, which is designed for computer users who are deaf or hard of hearing.

If you want to help make speech recognition work better for a broader group of people, you can donate a recording to Common Voice, a project at the nonprofit Mozilla Foundation to create a large, openly accessible database of speech samples in dozens of languages.

You don’t have to stick to your first language, either. Common Voice encourages volunteers to speak in second languages to help create a training set that’s less likely to be tilted against second-language speakers.

“One of the leading causes of performance bias in [automated speech recognition] and speech technology more widely is unbalanced, unrepresentative training data,” says EM Lewis-Jong, product lead for the Mozilla project. “Common Voice gives people around the world a real, practical chance to mobilize their voice to address these bias issues, by contributing to a dataset and mitigating bias at the source.”

Editor’s note: Our work on privacy, security, and data issues is made possible by the vision and support of the Ford Foundation, Omidyar Network, Craig Newmark Philanthropies, and the Alfred P. Sloan Foundation.




Consumer Reports is an independent, nonprofit organization that works side by side with consumers to create a fairer, safer, and healthier world. CR does not endorse products or services, and does not accept advertising. Copyright © 2022, Consumer Reports, Inc.