Every ten years the U.S. Census Bureau conducts a nationwide survey that sets the terms for the country’s democracy. The questionnaire yields rich data, including people’s names, street addresses, ages, races, ethnicities, and other details. People’s responses help determine dynamics of power, such as how seats are apportioned in the House of Representatives, where voting districts get divided, and which communities receive federal funds.
But the bureau, tasked with releasing summaries of the results while simultaneously protecting people’s privacy, faces a Catch-22. “Every time you publish a statistic you leak information about that confidential database,” as Simson Garfinkel, a computer scientist with the bureau, told a Census advisory committee in May.
If people believe their responses will not be kept private and secured, they may opt not to respond. And with the proposed addition of a sensitive question to the 2020 Census—asking whether a respondent is an American citizen—heeding the privacy mandate becomes paramount.
There’s a problem though: the usual methods for preserving people’s privacy no longer afford sufficient protection. In November 2016, a team of researchers successfully reconstructed an alarming portion of the most recent Census’s confidential database. Out of 308,745,538 respondents, records for 46% of the population were reassembled using public 2010 Census data and the team’s statistical tools; allowing for a year’s wiggle room on age, the proportion jumped to 71%. By combining the bureau’s published tables with other datasets, the researchers found they could re-identify 17% of the population.
John Abowd, chief scientist at the U.S. Census Bureau and leader of the 2016 study, says the old privacy safeguards are ineffective. Swapping respondent information between different geographic blocks, for instance, won’t cut it. “Turns out nobody is well enough buried in the haystack,” he says.
To address the issue, Abowd has led the charge to implement cutting-edge "differential privacy" techniques for the upcoming Census. The process intentionally injects noise, or random variation, into published statistics, an approach tech giants use in popular consumer products everywhere from the Google Chrome web browser to Apple iPhones to Microsoft Windows. The result: would-be database un-maskers cannot reconstruct detailed personal records from Census data alone.
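The core idea can be sketched with the Laplace mechanism, the textbook building block of differential privacy (the bureau's actual algorithms are considerably more elaborate). In this sketch, a published count gets noise drawn from a Laplace distribution whose scale is set by a privacy parameter epsilon; the block population and epsilon value below are hypothetical, chosen only for illustration:

```python
import random

def laplace_noise(scale: float) -> float:
    """Draw one sample from a Laplace(0, scale) distribution.
    The difference of two i.i.d. exponentials with rate 1/scale
    is Laplace-distributed with that scale."""
    rate = 1.0 / scale
    return random.expovariate(rate) - random.expovariate(rate)

def private_count(true_count: int, epsilon: float) -> float:
    """Release a count with epsilon-differential privacy via the
    Laplace mechanism. A count query has sensitivity 1 (adding or
    removing one respondent changes it by at most 1), so the noise
    scale is 1 / epsilon."""
    return true_count + laplace_noise(1.0 / epsilon)

# Hypothetical example: publish the population of one census block.
block_population = 1437  # made-up confidential value
released = private_count(block_population, epsilon=0.5)
```

Smaller epsilon means more noise and stronger privacy; larger epsilon means more accurate statistics, which is precisely the precision-versus-privacy tradeoff the bureau is weighing.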
It’s a tradeoff between precision and privacy. While some social science researchers grouse that the new approach will impede their work, the backlash could have been far worse, says Erica Groshen, former commissioner of the Bureau of Labor Statistics. If the bureau hadn’t changed anything, “who knows how the House and Senate would have reacted to widespread reports of the privacy and confidentiality of people’s responses not being protected?” she says.
“Rather than having it go the legislative route, the Census has decided to go the scientific route,” Groshen says.
This article originally appeared in the June 2019 issue of Fortune.