Big data. Business analytics. Threat intelligence. User behavior profiling.
The common wisdom of innovative firms is that the future is all about data collection and analysis. The big winners of the Internet, such as Facebook and Google, seemingly prove the point: Both companies collect an enormous amount of information on their users.
In mid-June, however, Apple seemingly bucked the trend. At the consumer-technology firm’s Worldwide Developers Conference on June 13, Craig Federighi, Apple’s senior vice president of software engineering, told attendees that Apple does not—and will not—create user profiles. Instead, the company has focused on analyzing user data on devices and only uploads anonymized data to its servers to help the firm react to trends among its user base.
“We believe you should have great features and great privacy,” Federighi told attendees. “You demand it, and we are dedicated to providing it.”
The key to Apple’s ability to analyze data while still offering privacy to its customers is an area of research known as differential privacy. The underlying research is not new—differential privacy was formally defined by Cynthia Dwork and her colleagues in 2006, and it builds on anonymization work stretching back decades—but Apple’s commitment to making privacy as important a goal as data analysis is new.
Differential privacy is a concept, not a specific technique, according to Cynthia Dwork of Microsoft and Aaron Roth of the University of Pennsylvania, who wrote a 2014 book exploring the topic. “Differential privacy describes a promise, made by a data holder, or curator, to a data subject: ‘You will not be affected, adversely or otherwise, by allowing your data to be used in any study or analysis, no matter what other studies, data sets, or information sources are available,’” the researchers wrote.
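That promise has a standard mathematical form in the research literature, due to Dwork and her co-authors, though it is not quoted in Apple’s announcement: a randomized algorithm M is ε-differentially private if, for any two data sets D and D′ that differ in a single person’s record and any set S of possible outputs,

```latex
\Pr[M(D) \in S] \;\le\; e^{\varepsilon}\,\Pr[M(D') \in S]
```

The smaller the privacy parameter ε, the less any one person’s data can change what an analyst observes—the formal counterpart of the “you will not be affected” promise.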
Apple combines hashing, a data-masking technique, with subsampling and the injection of statistical noise to create a system that the company believes satisfies the promise of differential privacy.
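Apple has not published its implementation, but the general recipe it describes—mask the raw value with a hash, then randomize it on the device before anything is uploaded—can be sketched in a few lines. The function below is a hypothetical illustration of per-bit randomized response over a hashed value, not Apple’s actual code.

```python
import hashlib
import math
import random

def privatize(value: str, num_bits: int = 64, epsilon: float = 2.0) -> list:
    """Hash a raw value, then flip each bit with a probability calibrated to epsilon."""
    # Hashing masks the raw string; only a fixed-width fingerprint is considered.
    digest = hashlib.sha256(value.encode("utf-8")).digest()
    bits = [(digest[i // 8] >> (i % 8)) & 1 for i in range(num_bits)]

    # Randomized response: keep each bit with probability e^eps / (1 + e^eps),
    # flip it otherwise, so any single report is deniable but aggregates
    # over many users remain accurate.
    keep = math.exp(epsilon) / (1.0 + math.exp(epsilon))
    return [b if random.random() < keep else 1 - b for b in bits]

# Each uploaded report is noisy on its own; trends only emerge across many devices.
report = privatize("emoji:face_with_tears_of_joy")
```

Subsampling—having only a random fraction of devices report at all—adds a further layer, and the server only ever sees the randomized reports.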
If the effort succeeds, Apple could make privacy a more attractive selling point for its products, pushing back against the current trend toward ever-greater collection of data on individuals.
Anonymizing collections of data is important because it allows researchers, businesses and government agencies to analyze private information without worrying that a breach could expose the individuals involved. The same calculus between privacy and security has played out in the security industry, where companies are increasingly focused on data collection and automated analysis but are loath to give up incident data that could expose them to breach lawsuits.
“If you are making decisions about data today, you need to know enough about privacy that you do not unnecessarily create risk for your business,” J. Trevor Hughes, president and CEO of the International Association of Privacy Professionals, told eWEEK.
Yet, it’s a difficult problem. In a 2002 paper, Latanya Sweeney, then a professor at the School of Computer Science at Carnegie Mellon University, showed the dangers of assuming that databases are anonymous just because they do not include names and addresses. Using health information from an anonymized database representing 135,000 state employees and their families, and combining it with a voter registration list purchased for $20, Sweeney was able to de-anonymize people just by using the three common fields in the data sets: gender, birth date and ZIP code. The two lists, for example, exposed the health data for the governor of Massachusetts, who lived in Cambridge.
“According to the Cambridge Voter list, six people had his particular birth date; only three of them were men; and, he was the only one in his 5-digit ZIP code,” she wrote.
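A linkage attack of this kind needs nothing more sophisticated than a join on the shared fields. The sketch below is illustrative—the file and column names are hypothetical, not Sweeney’s actual data—but it shows how little code re-identification takes once two releases share a quasi-identifier.

```python
import pandas as pd

# Two independently released data sets: neither contains names and diagnoses together.
health = pd.read_csv("anonymized_health_records.csv")  # zip, birth_date, gender, diagnosis
voters = pd.read_csv("voter_registration.csv")         # name, address, zip, birth_date, gender

# Joining on the shared quasi-identifier re-attaches identities to medical records.
quasi_id = ["zip", "birth_date", "gender"]
linked = health.merge(voters, on=quasi_id)

# Keep only unambiguous matches, where the combination points to exactly one person.
match_counts = linked.groupby(quasi_id)["name"].transform("count")
reidentified = linked[match_counts == 1]
print(reidentified[["name", "diagnosis"]])
```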
Will Differential Privacy Give Data-Focused Firms Both Security and Privacy?
Sweeney, now a professor of Government and Technology in Residence at Harvard University, presented a model in the paper for anonymizing databases so that the record of any individual in the collection is indistinguishable from at least a certain number of other records, a parameter called “k.” Known as k-anonymity, the technique identifies the attributes—such as the combination of birth date, gender and ZIP code—that together form a quasi-identifier for the individuals in the data set, then generalizes or suppresses those values until every combination appears at least k times.
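In code, the property is simple to state: every combination of quasi-identifier values must occur at least k times, and values are coarsened—for instance, by truncating ZIP codes—until the check passes. The helper below is a rough, hypothetical sketch using the same illustrative columns as above.

```python
import pandas as pd

def is_k_anonymous(df: pd.DataFrame, quasi_id: list, k: int) -> bool:
    """True if every combination of quasi-identifier values occurs at least k times."""
    return int(df.groupby(quasi_id).size().min()) >= k

def generalize_zip(df: pd.DataFrame, digits: int) -> pd.DataFrame:
    """Coarsen ZIP codes to their leading digits, a typical generalization step."""
    out = df.copy()
    out["zip"] = out["zip"].astype(str).str[:digits]
    return out

records = pd.read_csv("anonymized_health_records.csv")  # illustrative file from the sketch above
quasi_id = ["zip", "birth_date", "gender"]

# Coarsen the quasi-identifier step by step until the data set is k-anonymous for k = 5.
for digits in (5, 3, 1):
    candidate = generalize_zip(records, digits)
    if is_k_anonymous(candidate, quasi_id, 5):
        break
```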
Such quasi-identifiers are the crux of the de-anonymization problem. In 2007, two researchers from the University of Texas at Austin showed that movie ratings themselves could serve as a quasi-identifier, allowing individuals to be picked out of a massive database. The researchers used data published by Netflix as part of its $1 million Netflix Prize, combined with movie ratings individuals had posted publicly online, to show that a handful of ratings could identify individuals in the data set.
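The Netflix attack follows the same pattern, with ratings playing the role of the quasi-identifier: find the anonymized record that agrees with the handful of ratings an individual has made public. The function below is a deliberately simplified sketch—the actual attack also weighted rare movies more heavily and tolerated approximate dates and ratings.

```python
from collections import Counter

def best_match(public_ratings: dict, anonymized_db: dict) -> str:
    """Return the anonymized subscriber whose ratings agree most with the public ones.

    public_ratings maps movie title -> star rating gathered from a public review site;
    anonymized_db maps an opaque subscriber ID -> that subscriber's full rating history.
    """
    scores = Counter()
    for subscriber_id, ratings in anonymized_db.items():
        for movie, stars in public_ratings.items():
            if ratings.get(movie) == stars:
                scores[subscriber_id] += 1
    return scores.most_common(1)[0][0]
```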
Despite almost a decade and a half of research, few companies have attempted to deliver differential privacy to their consumers. One reason: Without strong penalties for leaking personal information, companies have little incentive to anonymize their databases. And because companies worry that giving up detailed user records could put them at a competitive disadvantage later, they keep full records on their customers.
Only in the heavily regulated retail market, where companies can be fined for losing credit card information in a breach, do businesses make an informed decision to delete unnecessary data, said the IAPP’s Hughes. Because of the penalties associated with leaking credit card data, more companies are either not collecting the information or deleting it as soon as possible.
“Beyond data like that, where the leak of the data has real consequences, I don’t think organizations have wrapped their heads around the idea that, if you don’t need it, you shouldn’t collect it, and if you have collected it and find that you don’t need it, you should get rid of it,” he said.
That view is likely to change, if slowly. The industry seems to be learning that keeping data is not always the best option. And if Apple and others can prove that they can derive the benefits they need from data while protecting privacy, the momentum may shift toward privacy.
Apple is not alone. Cloud services and analytics company Neustar, for example, uses the conceptual framework of differential privacy to pinpoint the potential for re-identification in the data it collects, Becky Burr, the company’s chief privacy officer, told eWEEK in an email interview.
“We shouldn’t think about this as a zero sum game—we need to respect personal privacy and use data to make better decisions,” Burr said. “In a world where data is fuel for technology and technology unlocks the power of data, we need to apply privacy-by-design principles to ensure that technology is built to ensure privacy-aware data usage.”