Large Data Sets Dangerous to Privacy, MIT Study Shows

A new study of a large data set, this time of credit-card transactions, has shown that de-anonymizing users is not difficult in the era of big data.


The allure of big data for companies and researchers is in its ability to make connections between disparate events, allowing better insight into the relationships in the data.

However, for the individuals whose data is collected, big data also means far less privacy. The latest example, published by Massachusetts Institute of Technology researchers, found that four dates and locations of recent purchases are all that is needed to identify 90 percent of people making the purchases. If price information is included, then only three transactions are necessary.

The study, published in the latest issue of Science, used anonymized data on 1.1 million people and transactions at 10,000 stores. More than 40 percent of the people could be identified with just two data points, while five purchases identified nearly everyone.
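The kind of measurement described above can be illustrated with a minimal sketch. This is not the researchers' code: the function name `fraction_unique` and the three-person toy data set are invented for illustration. It assumes an attacker knows a few (day, shop) points about a person and checks how often those points match exactly one trace in the data.

```python
from itertools import combinations

def fraction_unique(traces, p):
    """Share of p-point combinations that match exactly one person's trace.

    traces maps a pseudonymous ID to a set of (day, shop) purchase points.
    Every combination of p points from every trace is tried (toy data only).
    """
    unique = total = 0
    for points in traces.values():
        for known in combinations(sorted(points), p):
            known = set(known)
            # Count the traces that contain all of the known points.
            matches = sum(1 for other in traces.values() if known <= other)
            unique += matches == 1
            total += 1
    return unique / total if total else 0.0

# Hypothetical toy data: three shoppers, three purchases each.
traces = {
    "a": {(1, "bakery"), (2, "cafe"), (3, "gym")},
    "b": {(1, "bakery"), (2, "cafe"), (4, "bookshop")},
    "c": {(1, "bakery"), (5, "cinema"), (6, "grocer")},
}

for p in (1, 2, 3):
    print(p, round(fraction_unique(traces, p), 2))
```

Even in this tiny example, the share of point combinations that single out one person rises as more points are known, which is the same pattern the study reports at scale.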

The conclusion: With big data comes big responsibility.

"[We] really do believe that this data has great potential and should be used," Yves-Alexandre de Montjoye, an MIT graduate student and the primary author of the paper, said in a statement. "We, however, need to be aware [of] and account for the risks of re-identification."

Rather than being unique to credit-card records, the threat of stripping away anonymity appears to be a general danger of analyzing large data sets. Two years ago, de Montjoye collaborated with another university to conduct an analysis of mobile phone data that found nearly identical results. Four pieces of data (in this case, the location of a base station used by a cell phone) were sufficient to identify 95 percent of the people among 1.5 million cell phone users.

Previous studies analyzing data sets composed of AOL users and, in a separate case, Netflix users have found similar impacts on privacy: A handful of records can effectively de-cloak almost any user.

As technology becomes more ubiquitous and consumers carry around multiple devices connected to the Internet—often referred to as the Internet of things—many do not consider that their actions are now being tracked by multiple third parties, Ken Westin, senior security analyst with Tripwire, told eWEEK.

"Think of how many devices we interact with every day when we make our transactions," he said. "We are leaving a trail in our electronic records."

Many companies "anonymize" the collected data by adding imprecision into the data sets. A technique known as "binning," for example, creates discrete bins that correspond to a range of values and assigns the records to those bins. Yet such techniques only increase the number of transactions needed to de-anonymize the data, the MIT researchers found. Turning the time and location of each purchase into a week number and an approximate region consisting of 150 stores, for example, still allowed the researchers to identify 70 percent of the users from four data points.
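Binning of this sort can be sketched in a few lines. The example below is an illustration only, not the method used in the study: the helper names, the toy purchases, and the assumption that shop IDs are numbered so that consecutive IDs sit near one another are all hypothetical.

```python
from collections import Counter

def bin_record(day, shop_id, days_per_bin=7, shops_per_region=150):
    """Coarsen one purchase: day -> week number, shop ID -> region number.

    Assumes (for illustration) that nearby shops have consecutive IDs.
    """
    return (day // days_per_bin, shop_id // shops_per_region)

def unique_share(records):
    """Share of records whose value appears exactly once in the list."""
    counts = Counter(records)
    return sum(1 for r in records if counts[r] == 1) / len(records)

# Hypothetical purchases as (day, shop_id) pairs.
purchases = [(0, 12), (3, 140), (8, 151), (9, 299), (15, 300), (16, 460)]
binned = [bin_record(day, shop) for day, shop in purchases]

print(unique_share(purchases))  # every exact record is distinct
print(unique_share(binned))     # fewer distinct records after binning
```

Coarsening makes some records indistinguishable from one another, but others stay unique, which is consistent with the finding that binning raises the number of data points needed without removing the risk.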

The researchers suggest that large data sets should not be publicly released, but kept by a custodian who could then allow researchers to conduct queries and submit programs to analyze the data. They proposed a system that would do exactly that.

Users should be wary of any large data set, even if a company claims that it has been anonymized, Luther Martin, chief security architect at Voltage Security, said in a statement.

The research "suggests that it's probably better to stop debating exactly how much risk there is in data sets that may not at first seem to contain sensitive information," he said.

Robert Lemos


Robert Lemos is an award-winning freelance journalist who has covered information security, cybercrime and technology's impact on society for almost two decades. A former research engineer, he's...