Apple announced it will use a technique known as differential privacy to collect data on groups while keeping individuals anonymous. The technique promises to emphasize privacy, while giving companies access to important data.
Big data. Business analytics. Threat intelligence. User behavior profiling.
The common wisdom of innovative firms is that the future is all about data collection and analysis. The big winners of the Internet, such as Facebook and Google, seemingly prove the point: Both companies collect an enormous amount of information on their users.
In mid-June, however, Apple seemingly bucked the trend. At the consumer-technology firm's Worldwide Developers Conference on June 13, Craig Federighi, Apple's senior vice president of software engineering, told attendees that Apple does not—and will not—create user profiles. Instead, the company has focused on analyzing user data on devices and uploads only anonymized data to its servers to help the firm react to trends among its user base.
"We believe you should have great features and great privacy," Federighi told attendees. "You demand it, and we are dedicated to providing it."
The key to Apple's ability to analyze data yet offer privacy to its customers is an area of research known as differential privacy. The area is not new—a seminal paper dates back to 2002 and cites sources from a quarter century before—but Apple's commitment to making privacy as important a goal as data analysis is new.
Differential privacy is a concept, not a specific technique, according to Cynthia Dwork of Microsoft and Aaron Roth of the University of Pennsylvania, who wrote a 2014 book exploring the topic. "Differential privacy describes a promise, made by a data holder, or curator, to a data subject: 'You will not be affected, adversely or otherwise, by allowing your data to be used in any study or analysis, no matter what other studies, data sets, or information sources, are available,'" the researchers wrote.
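That promise has a precise mathematical form. A standard statement (paraphrased here, not quoted from the article) says a randomized algorithm M is ε-differentially private if, for any two data sets D and D′ that differ in a single person's record, and any set S of possible outputs,

```latex
\Pr[\mathcal{M}(D) \in S] \;\le\; e^{\varepsilon} \, \Pr[\mathcal{M}(D') \in S]
```

In words: whether any one individual's data is included barely changes what the algorithm is likely to output, with the privacy-loss parameter ε bounding "barely."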
Apple uses a combination of data-masking techniques—hashing, low-resolution sampling and the injection of noise into the data set—to create a system that the company believes satisfies the promise of differential privacy.
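The noise-injection idea can be illustrated with randomized response, a classic mechanism that satisfies differential privacy. This is a generic sketch, not Apple's actual implementation; the coin-flip probabilities, the 30 percent true rate and the sample size are illustrative assumptions:

```python
import random

def randomized_response(truth: bool) -> bool:
    """Report with plausible deniability: flip a fair coin; on heads
    report the truth, on tails report a second fair coin flip instead.
    No single report reveals the user's true answer."""
    if random.random() < 0.5:
        return truth
    return random.random() < 0.5

def estimate_true_rate(responses) -> float:
    """Undo the noise in aggregate: E[observed] = 0.25 + 0.5 * p,
    so p = 2 * (observed - 0.25)."""
    observed = sum(responses) / len(responses)
    return 2 * (observed - 0.25)

# Simulate 100,000 users, 30% of whom have the sensitive attribute.
random.seed(1)
reports = [randomized_response(random.random() < 0.3) for _ in range(100_000)]
print(estimate_true_rate(reports))  # an estimate close to the true 0.30
```

Each individual report is noisy, yet the aggregate statistic remains accurate—exactly the trade-off the article describes.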
If successful, the company could make privacy a more attractive feature of products, reversing the current trend of increasing the collection of data on individuals.
Anonymizing collections of data is important because it allows researchers, businesses and government agencies to use private information for analysis without worrying that a breach could occur. The same calculus between privacy and security has played out in the security industry, where companies are increasingly focused on data collection and automated analysis but are loath to give up incident data that could expose them to breach lawsuits.
"If you are making decisions about data today, you need to know enough about privacy that you do not unnecessarily create risk for your business," J. Trevor Hughes, president and CEO of the International Association of Privacy Professionals, told eWEEK.
Yet, it's a difficult problem. In a 2002 paper, Latanya Sweeney, then a professor at the School of Computer Science at Carnegie Mellon University, showed the dangers of assuming that databases are anonymous just because they do not include names and addresses. Using health information from an anonymized database representing 135,000 state employees and their families, and combining it with a voter registration list purchased for $20, Sweeney was able to de-anonymize people using just three fields common to both data sets: gender, birth date and ZIP code. The two lists, for example, exposed the health data of the governor of Massachusetts, who lived in Cambridge.
"According to the Cambridge Voter list, six people had his particular birth date; only three of them were men; and, he was the only one in his 5-digit ZIP code," she wrote.