The Data That Makes You Unique
How often do you sign up for something online? Maybe it’s an email newsletter, a credit monitoring service, or even a store loyalty card. But in exchange for that buy-one-get-one free box of your favorite cereal, a company now has collected some data about you.
In this era of the Equifax breach and Facebook’s lax data privacy standards, most people are at least somewhat anxious about what happens to the data we give away. In recent years, companies have responded by promising to strip away identifying information, like your name, address, or social security number.
But data scientists are warning us that that isn’t enough. Even seemingly harmless data—like your preferred choice of cereal—can be used to identify you. In a paper from Nature Communications out this week, researchers published a tool that calculates the likelihood of someone identifying you after offering up only a few pieces of personal information, like your zip code and your birth date.
Dr. Julien Hendrickx, co-author of the study out in Nature Communications, joins guest host Molly Webster to discuss the risk of being discovered among anonymous data. And Joseph Jerome, policy counsel for the Privacy and Data Project at the Center for Democracy and Technology, joins the conversation to talk about whether data can ever truly be anonymous.
Julien Hendrickx is a professor of mathematical engineering at UC Louvain in Ottignies-Louvain-la-Neuve, Belgium.
Joseph Jerome is Policy Counsel with the Privacy and Data Project at the Center for Democracy and Technology in Washington, D.C.
MOLLY WEBSTER: This is Science Friday. I’m Molly Webster. So we’ve all signed up for something online, like multiple somethings– an app, an email newsletter, credit monitoring, even a store loyalty card. And in exchange for signing up for that thing, a company has then collected some data about you. And you might feel a little anxious about that. I do. I think about this a lot, especially in this era of sort of the Facebook privacy drama, the Equifax.
So companies have responded to this fear by promising to strip away any identifying information from the data they collect. So they’re saying we’re going to hide your name. We’re going to hide your address. We’re going to hide your Social Security number. But in a paper out this week in the journal Nature Communications, data scientists are saying that is not enough.
For the paper, scientists published a tool that calculates the likelihood of someone identifying you from a few pieces of personal information, like your ZIP code and your birth date. The likelihood is very high. Even seemingly harmless data, like where you take a walk, can identify you.
So we’re going to talk about this study, and here to talk it over with me is one of the study’s authors, Julien Hendrickx, a professor of mathematical engineering at UC Louvain in Belgium. Dr. Hendrickx, welcome to Science Friday.
JULIEN HENDRICKX: Thank you. Thank you for inviting me, Molly.
MOLLY WEBSTER: So maybe we can start with just basics. What actually is an anonymous data set?
JULIEN HENDRICKX: So that’s a very tough question. The ideal definition of an anonymous data set is that it’s a data set where it’s impossible for someone using present technology to find out if you are in the data set. And if you’re in the data set, then it’s impossible for the person to find relevant information about specifically you or a specific human being.
That’s the theoretical definition, but it’s not a very operational one, because it relies on the inability to do something. And so we would have to characterize what it means to be unable to find you. And that’s the whole challenge.
MOLLY WEBSTER: Hm. So when I think of things that are important, I think of my name. I think of an email address. I think of my Social Security number, maybe my birthday. But you’re saying that even if I don’t have those things, like if someone says we’re going to strip those things away, someone out there can still figure out who I am.
JULIEN HENDRICKX: Precisely, yes. Well, it depends. If the data set says we have a female person in New York, then of course, I cannot find it’s you. But if I have sufficiently many such attributes, or if I have attributes which contain a sufficient amount of information– typically birth dates will contain a lot of information– after a sufficient number of attributes, there would be a way of knowing with near certainty that it’s actually you.
MOLLY WEBSTER: Hm. So you showed something like 99% of Americans could correctly be reidentified from data that people had said were anonymous. Is that true?
JULIEN HENDRICKX: That’s true. If you have 15 attributes– so 15 attributes means you already know a lot about the people, but those attributes are not typically publicly available. So you could recover 99.8% of the Americans like this. But already with a small number of attributes, you can correctly identify– I mean, it depends on the people– but 80%. And that’s identifying a person with 80% certainty. And that’s already a lot. So even with a small number, you already reach a reasonable degree of certainty. And for an application, a reasonable degree of certainty is already something.
MOLLY WEBSTER: That’s so interesting. So we want to put a call out. If you’re a data scientist or if you work with data in your job, we want to know how you protect users’ identities. So give us a call. Our number is 844-724-8255. That’s 844-SCI-TALK. Or you can tweet at us at @scifri.
So Julien, I did your quiz. And it was very fun to do. And listeners, you can find this quiz online. I did this quiz. The first things I think you had us enter were birth date, ZIP code, and maybe my age. And I was only identifiable 55% of the time, and I felt very good about that. But then I–
JULIEN HENDRICKX: So you did very good. I’m 85.
MOLLY WEBSTER: You’re 85. But then I entered one thing, which was that I had never been married, and I went up to 99%. I think I gasped at my computer.
JULIEN HENDRICKX: Indeed. So sometimes it’s surprising. For me, the killer information was that I have only one car. Somehow it made me shoot up to 99%. And yes, it is sometimes very surprising. So this shows also that there’s not a specific– I mean, for certain people, being married or not, marriage will not tell you a lot about you. And for other people, one specific information, like number of cars for me would just make me absolutely unique in my ZIP code for some reason.
MOLLY WEBSTER: Can you talk at all about like when we say 15 characteristics, what some of those characteristics are, even if it’s just a quick list?
JULIEN HENDRICKX: I don’t remember the whole list, but typically, it would be ZIP codes–
MOLLY WEBSTER: You don’t have to name all 15.
JULIEN HENDRICKX: ZIP code, birth dates, gender, I believe there’s occupation. So certain, I would say, benign information, typically information where you would not think any single piece identifies you. I mean, ZIP code is ZIP code. Like your kind of occupation, in a broad classification, you would think many people have the same kind of occupation. But when you combine all of them, eventually, that makes you essentially unique.
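The point about combining benign attributes can be made concrete with a small sketch. It is only an illustration, not the study's tool: the six records and the attribute names (`zip`, `gender`, `birth_year`, `cars`) are made up, and the helper `unique_fraction` is a hypothetical name. It measures the share of people whose combination of the chosen attributes appears exactly once.

```python
from collections import Counter


def unique_fraction(records, attributes):
    """Share of records whose combination of `attributes` appears exactly once."""
    keys = [tuple(record[a] for a in attributes) for record in records]
    counts = Counter(keys)
    return sum(1 for key in keys if counts[key] == 1) / len(records)


# Hypothetical toy population; every value here is invented for illustration.
PEOPLE = [
    {"zip": "10001", "gender": "F", "birth_year": 1985, "cars": 0},
    {"zip": "10001", "gender": "F", "birth_year": 1985, "cars": 1},
    {"zip": "10001", "gender": "M", "birth_year": 1985, "cars": 1},
    {"zip": "10001", "gender": "M", "birth_year": 1990, "cars": 1},
    {"zip": "10002", "gender": "F", "birth_year": 1990, "cars": 0},
    {"zip": "10002", "gender": "F", "birth_year": 1990, "cars": 2},
]

# Add one attribute at a time and watch the unique share grow.
chosen = []
for attribute in ["zip", "gender", "birth_year", "cars"]:
    chosen.append(attribute)
    print(chosen, round(unique_fraction(PEOPLE, chosen), 2))
```

In this toy table nobody is unique by ZIP code alone or by ZIP code plus gender, a third of the people are unique once birth year is added, and everyone is unique once the number of cars is included.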
MOLLY WEBSTER: And I think one of the things that was really interesting about your paper was it wasn’t just that anonymous data sets could be made not anonymous, it was that even if you just had a tiny, tiny amount of information from that data set, you could re-identify people. Can you talk about that a little?
JULIEN HENDRICKX: Yes, and that’s very important, because it’s an argument that has been made by people releasing or selling data sets. That’s, OK, not only do we remove names, but we are truly going to remove 99% of the data. So even if I were to find someone whose demographic matches yours, I have no idea whether it’s actually you or if it’s someone else, because 99% of the people are not even in the released data set.
What we show is that based on this 1% of the data, we can compute the likelihood that if I find someone that looks like you, that it’s actually you. And in most cases, it will actually be you, as soon as we have a reasonable amount of data. So somehow we debunk the idea that because we’ve sampled the data, we’ve just removed most of it, that it became safe. That’s in most cases simply not true.
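The sampling point can be illustrated with a short calculation. This is a simplified sketch, not the model from the paper: it assumes each other person in a population of `n` independently matches your combination of attributes with probability `p` (the function name and both values below are made up). Under that assumption the chance that nobody else matches is (1 - p)^(n - 1). The fraction of the data that was released does not appear in that expression: if you are unique in the whole population, any matching record in a released sample has to be you.

```python
def uniqueness_likelihood(p, n):
    """Chance that none of the other n - 1 people shares a given combination.

    Assumes each other person matches independently with probability p.
    """
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be a probability")
    if n < 1:
        raise ValueError("n must be at least 1")
    return (1.0 - p) ** (n - 1)


N = 1_000_001  # hypothetical population size

# A rare combination: about one in a billion people would match it.
rare = uniqueness_likelihood(1e-9, N)

# A common combination: about one in a hundred thousand people would match it.
common = uniqueness_likelihood(1e-5, N)

print(f"rare combination:   {rare:.4f}")    # close to 1
print(f"common combination: {common:.6f}")  # close to 0
```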
MOLLY WEBSTER: This is so interesting. I want to bring in another guest to help us think a little bit about policy, Joseph Jerome. He’s the Policy Counsel for the Privacy & Data Project at the Center for Democracy & Technology in Washington, DC. He was not involved in this study that we’re talking about this week. Welcome to Science Friday, Joseph.
JOSEPH JEROME: Molly, thank you.
MOLLY WEBSTER: So, Joseph, my question for you is, is there any way that data can actually be anonymous, or have we left that behind?
JOSEPH JEROME: I think we have left it behind. And I think what Dr. Hendrickx’s research really shows here is the tremendous disconnect that exists between data science and the lawmakers and policymakers about what we can now do with data. So many of our laws and our privacy laws assume, basically, that information is either personal or not. Law in general wants to have firm dividing lines– good, bad, in, out. And that just does not work here, because information sort of exists– it’s not a binary. It exists on a spectrum.
MOLLY WEBSTER: Hm. Hm. Yeah, go ahead, Julien.
JULIEN HENDRICKX: Yeah, maybe if I can comment on that. Indeed, it’s very hard to have a really anonymous data set, but that doesn’t mean things are hopeless. There are solutions being developed or researched, where, for example, suppose I have this big data set, which contains a lot of information that could help researchers or policymakers. There are ways of making queries about the data or computing interesting things about the data without me giving you the data.
To make it very simple, suppose you are interested in something with traffic in New York, and I have all the data about the cars in New York. You would send me a specific question, and I would make the computation for you in such a way that I do not reveal anything about any particular individual, but nevertheless, you have the results that would interest you in your research or in your policy decision-making.
And so it’s not, unfortunately, as simple as I have a data set and I’m going to make it anonymous, but there are ways of having a system where I can query data sets to get relevant information without having to have actual access to the data.
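A query system of this kind might look like the following minimal sketch. Dr. Hendrickx does not name a specific technique; the example swaps in one well-known option, the Laplace mechanism from differential privacy, in which the data holder answers counting questions with calibrated random noise and never hands over the records. The class `QueryServer`, the `TRIPS` records, and the `epsilon` setting are all hypothetical.

```python
import random


class QueryServer:
    """Holds the raw records and only ever returns noisy counts."""

    def __init__(self, records, seed=None):
        self._records = list(records)
        self._rng = random.Random(seed)

    def noisy_count(self, predicate, epsilon=1.0):
        """Count matching records, plus Laplace noise of scale 1/epsilon."""
        if epsilon <= 0:
            raise ValueError("epsilon must be positive")
        true_count = sum(1 for record in self._records if predicate(record))
        # The difference of two exponential draws with rate epsilon
        # follows a Laplace distribution with scale 1/epsilon.
        noise = self._rng.expovariate(epsilon) - self._rng.expovariate(epsilon)
        return true_count + noise


# Hypothetical traffic records; in practice these stay with the data holder.
TRIPS = [
    {"borough": "Manhattan", "hour": 8},
    {"borough": "Manhattan", "hour": 9},
    {"borough": "Manhattan", "hour": 17},
    {"borough": "Brooklyn", "hour": 8},
    {"borough": "Queens", "hour": 18},
]

server = QueryServer(TRIPS, seed=0)
answer = server.noisy_count(lambda trip: trip["borough"] == "Manhattan")
print(f"Noisy number of Manhattan trips: {answer:.1f}")
```

A smaller `epsilon` means more noise and stronger protection for any one person's record; a real deployment would also need to limit how many questions a single requester may ask, which this sketch leaves out.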
MOLLY WEBSTER: Oh, that’s interesting.
JULIEN HENDRICKX: And this is, I believe, something in which you should invest.
MOLLY WEBSTER: That’s interesting. Joseph, just speaking about systems, like, one of the things is we’re doing this radio program predominately from America, but there are other countries that are trying to tackle this. And my understanding is that the EU has a pretty strict or strong legislation on data and what to do with it and how to keep it anonymous?
JOSEPH JEROME: That’s absolutely correct. The reality is we have a lot of different legal rules around de-identified information. And one of the big stories around privacy was the General Data Protection Regulation, where everybody got emails in their mailboxes last year. And the GDPR has an incredibly high bar of not just de-identified information, but truly anonymous information. This is information that can absolutely almost never be related back to an individual data subject.
But what I think is really interesting about that law is, again, my notion of a data spectrum, it introduces a middle category of pseudonymous data, where the information is covered with some of the safeguards in that law, but then some of the rules and regulations about giving individuals access, and the ability to correct and delete that information isn’t provided. And what I think is interesting for American listeners is a lot of that thinking around different spectrums of data in the law is starting to appear in US laws, including in the California Consumer Privacy Act.
MOLLY WEBSTER: So it’s like the idea that some data we might want to protect more than others?
JOSEPH JEROME: Yeah, well, it’s basically trying to incentivize companies to put in place technical and administrative procedures to protect information. I think Dr. Hendrickx is right that there is some incredible thinking going on about how to protect information– privacy enhancing technologies– at some of the major tech companies that we all think about.
But the real challenge here, and I think what this study reveals, is we are living in a sea of information. There’s data available everywhere. And while some folks are doing really sophisticated thinking here, it’s really unclear if that’s trickling down to the day-to-day practices of all the data brokers that people have never heard of or, frankly, all of the sort of small apps that are aggressively using information.
MOLLY WEBSTER: Yeah. Julien, is there, like, a piece of data that I should just be so protective of that I wouldn’t want anyone to get? Or– I don’t know, I guess is that just something that doesn’t mean anything at this point?
JULIEN HENDRICKX: Yeah, well, the problem is that most of the data you have, in my opinion, you have to share with some people at some point. If you go to hospital, you would like them to know your medical history, because they need to know it–
MOLLY WEBSTER: True.
JULIEN HENDRICKX: –like, the IRS has to know your salary and what you make, because they have to know it. So of course, people have to be careful. You would not want to share your Social Security number and credit card number with random people. But apart from that, our study really tells about how information should or should not be shared with third parties. But in your day-to-day life, it’s very hard to not share information with specific parties, if you just want to live a normal life.
MOLLY WEBSTER: Well, it’s funny, because I feel like I live in this– my friends will say this– like, this half in, half out world, where, like, I’m not signing up for certain apps, because I don’t want to give them stuff, and I’ll Sharpie out all the information on my prescription bottle before I throw it away, but then in other instances, I’m just throwing my boarding card in the trash and throwing out my student loan bills or something.
JOSEPH JEROME: Well, but Molly, I guess I would just say that that doesn’t necessarily matter. Talking about trying to control information in this way, if you are on your phone, and one of your apps is a financial app, and another one of your apps is a health app, and as you mentioned at the top of the show, you sign up for a loyalty card, well, all of these companies reserve the right to sell and share de-identified information, and so they are all doing whatever they think that they can to de-identify that, and then they share it and distribute it. And then you sort of can layer all of those different data sets back on top of each other, and re-identify folks, and put the portrait of the person back together.
MOLLY WEBSTER: Hm. I’m Molly Webster, and this is Science Friday from WNYC Studios. We’re here talking about how and if we can actually keep our data anonymous. Joseph, I’m wondering if there’s any reason I actually wouldn’t want my data to be anonymous?
JOSEPH JEROME: Well, I think, again, to live in today’s life, you want to share information. You can do a lot of great and beneficial things with information. When we talk about machine learning and algorithms to sort of drive new inferences to provide better health care outcomes, that’s a very real benefit of information that you don’t want to be anonymous.
MOLLY WEBSTER: Hm. That’s interesting. Julien, why did you choose to publish this tool now? Or, like, what are you hoping for?
JULIEN HENDRICKX: Well, we were interested in the science behind this. And we wanted to know whether data was anonymized in such a way, it was indeed anonymous or not. It would’ve been great if it had been, but it turns out it’s not.
One of the reasons we did publish this tool was to make the general public aware of the finding, and to transform something that would have been a sort of dry scientific paper available for specialist researcher to the general public to test, oh, OK, I can test, and, indeed, I would not be anonymous in such and such condition. And I think it’s important to raise awareness of that.
On the other hand, many of my colleagues are doing great research projects, where they used a lot of data that does contain some sensitive information. For example, I’m thinking of medical science, but also public policy question and things. And so one of the fears we sometimes have would be that no single data would be available anymore, and that would remove many opportunities for very good research and good results. So we have to find ways of allowing relevant people to use the data for good purpose without damaging privacy or without creating a risk for people’s privacy.
MOLLY WEBSTER: Joseph, is there an option of just not collecting data in the first place?
JOSEPH JEROME: [LAUGH] Not really. We obviously push companies to try and minimize the amount of information they collect. But frankly, companies see data collection as a strategic asset. And we’ve created a culture, both in Europe and in the United States, where everybody wants to get their hands on as much data as possible. And we live in a sort of always on ecosystem, where we’re all generating a whole lot of information.
And so I hear you trying to say, like, I want to protect certain bits of information, but really individual consumers are not going to be able to control what’s going on here. And I think that’s sort of why privacy advocates like myself are really asking for more law and regulation here.
MOLLY WEBSTER: Yeah, it’s funny. I don’t know whether I should try or just throw up my hands.
JOSEPH JEROME: You should never throw up your hands. I think we just need to be cognizant of the challenges we face in a data-driven ecosystem.
MOLLY WEBSTER: Hm. This is great. Thank you guys so much. We have two guests here, Julien Hendrickx, professor of mathematical engineering at UC Louvain in Belgium, and Joseph Jerome, Policy Counsel for the Privacy & Data Project at the Center for Democracy & Technology in Washington, DC.
Guys, one follow-up question. Is there a way, Julien, in which you don’t give up your data? Are there apps that you won’t sign on to?
JULIEN HENDRICKX: OK, I’m generally cautious with the app I’m signing up to. For example, if they ask me whether they want access to my phone directory and thing, I typically tend to refuse, but this is not really driven by my research. Yes, I’m generally cautious, like the general public should be. I’m not sharing just all information with anyone. But otherwise, no, I’m not being particularly careful there.
MOLLY WEBSTER: OK, great. Thank you. Thank you, Julien. Thank you, Joseph.
Katie Feather is a former SciFri producer and the proud mother of two cats, Charleigh and Sadie.
Molly Webster is a producer and guest host of WNYC’s Radiolab in New York, New York.