A recent report in the journal Science revealed the soft underbelly of what was once considered a well-armored use of “anonymized” consumer information. The study’s authors were able to identify individual consumers from anonymized data, specifically records of their credit card purchases.
Using purchase metadata with no credit card numbers, names or any other simple identifiers, the report’s co-authors found they could track a specific person’s purchases using three pieces of outside information: for example, a receipt, an Instagram post, and a tweet about a new purchase or a Facebook post that included the location of a favorite bar or a frequently visited restaurant. Yves-Alexandre de Montjoye, the main author of the report, and his co-authors were successful more than 90% of the time.
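To see why a handful of known purchases can be enough, consider a toy sketch of the general idea. It is not the code or the data from the Science report: the records, the pseudonymous IDs and the function names below are all invented for illustration, and the sketch assumes each anonymized record carries only an opaque ID, a shop and a day.

```python
from collections import defaultdict

# Hypothetical anonymized records: (pseudonymous_id, shop, day).
# No names or card numbers, only an opaque ID per cardholder.
TRANSACTIONS = [
    ("u1", "bakery", 1), ("u1", "bar", 2), ("u1", "bookstore", 3),
    ("u2", "bakery", 1), ("u2", "gym", 2), ("u2", "bookstore", 3),
    ("u3", "bakery", 1), ("u3", "bar", 2), ("u3", "cinema", 4),
]


def build_index(transactions):
    """Map each (shop, day) point to the set of pseudonymous IDs seen there."""
    index = defaultdict(set)
    for user, shop, day in transactions:
        index[(shop, day)].add(user)
    return index


def candidates(index, known_points):
    """Return the IDs consistent with every externally known (shop, day) point."""
    matching = None
    for point in known_points:
        users = index.get(point, set())
        matching = set(users) if matching is None else matching & users
    return matching if matching is not None else set()


index = build_index(TRANSACTIONS)

# One known purchase narrows nothing: all three cardholders used the bakery.
print(sorted(candidates(index, [("bakery", 1)])))
# A second point (say, a geotagged post from a bar) leaves two candidates.
print(sorted(candidates(index, [("bakery", 1), ("bar", 2)])))
# A third point singles out one pseudonymous ID.
print(sorted(candidates(index, [("bakery", 1), ("bar", 2), ("bookstore", 3)])))
```

Once a single ID remains, every other purchase recorded under that ID belongs to the same person, which is what makes a few public posts so revealing.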
The discovery that two or three purchases in a metadata set containing millions of transactions can be pegged to a specific person raises a question: Should data sets that track large-scale human behavior be made available to the public?
According to de Montjoye, “The transformational potential of metadata data sets is…conditional on their wide availability.” Scientists need whatever data they are using to be available to their peers so their work can be checked and verified, challenged and improved. The common wisdom is that scientific progress demands it. According to the report, “Several publishers and funding agencies now require experimental data to be publicly available.” As a result, data of all kinds are increasingly available to the public, including your credit card purchases.
Given the public availability of these data sets, de Montjoye and his co-authors wanted to find out just how hard it would be to connect specific credit card purchases to the people who made them.
It is a question that identity thieves everywhere ask every day. Disturbingly, the answer was that it is not hard at all.
Anonymized data is supposed to be the “not-you” version of you. Names and account numbers, IP addresses and email accounts (all the simple stuff that identifies you) are stripped away because none of that stuff is necessary. Researchers just want to look at a lot of behavior. When a study needs to crunch a huge amount of data, it comes in these metadata sets that have been scrubbed of personally identifiable information, and as things stand now you have no control over whether you wind up in that information dragnet. Again, for researchers it’s all about that base, or benchmark: the process of identifying trends and patterns. And that’s supposed to be OK because your information, disconnected from your personal identity, is being used for good, not evil.
Details about purchases, phone calls made and places visited, stripped of the identifiers that connect them to specific people, are regularly used by the government, private researchers and consumer-facing enterprises, and there are plenty of reasons they should be. Metadata sets contain detailed information regarding the what, when and where of the media we regularly consume, where we’ve been, what we did when we were there, what food we like, what sort of illnesses we’ve contracted and how we got better (or didn’t). In theory, these huge samples of human behavior could hold the key to solving intractable problems, everything from the way we fight diseases and feed the world’s population to more populist boons like revealing the best deal on a new car or the fastest commutes from Here to There. Metadata is also used to stop identity thieves from using purloined credit card information, specifically by flagging a purchase that doesn’t match the usual pattern for a particular credit card holder. While the value of metadata cannot be overstated, in the light of de Montjoye’s findings, making anonymized metadata available to anyone who cares to have a look seems like a problem waiting to happen. As a matter of fact, de Montjoye’s findings probably represent a welcome addition to any identity thief’s toolbox.
It’s worth repeating here something that has become a drumbeat of sorts: Be very careful what you share on social media. When it comes to the re-identification of anonymized data, the vulnerability documented in the Science report doesn’t exist without the use of the information you put on sites like Facebook, Instagram and Twitter.
Of all the different kinds of data studied in the report, the most troubling was the revelation that credit card purchases could be easily connected with the person who made the charges, since the ease of “re-identification” points to a serious risk for consumers.
It doesn’t matter who you are. It doesn’t matter how many transaction alerts are set up. The only reason everyone hasn’t become a victim of identity-related crime is the backlog. There simply aren’t enough identity thieves to harvest all the lost and free-floating information that’s out there. It pays to be paranoid here. Assume that the bad guys long ago figured out de Montjoye’s method of re-identification—or something that works just as well. The bottom line: If you are in any way plugged into the commerce of daily life, your information is out there, and it is only a matter of time before you become a victim of an identity-related crime.
With so many vulnerabilities out there, the ones that can be fixed should be fixed with alacrity. We can’t expect consumer behavior to change overnight, but we can expect reasonable protections from the various people and organizations that use consumer data in the pursuit of commerce and creature comforts. Publicly available metadata sets are something we can address. We should do so as soon as possible.