How Anonymous Is Your Online Data?

How often do you sign up for something online? Maybe it’s an email newsletter, a credit monitoring service, or even a store loyalty card. But in exchange for that buy-one-get-one-free box of your favorite cereal, a company has now collected some data about you.

In this era of the Equifax breach and Facebook’s lax data privacy standards, most people are at least somewhat anxious about what happens to the data we give away. In recent years, companies have responded by promising to strip away identifying information, like your name, address, or social security number.

But data scientists are warning us that that isn’t enough. Even seemingly harmless data—like your preferred choice of cereal—can be used to identify you. In a paper from Nature Communications out this week, researchers published a tool that calculates the likelihood of someone identifying you after offering up only a few pieces of personal information, like your zip code and your birth date.

Dr. Julien Hendrickx, co-author of the study out in Nature Communications, joins guest host Molly Webster to discuss the risk of being discovered among anonymous data. And Joseph Jerome, policy counsel for the Privacy & Data Project at the Center for Democracy & Technology, joins the conversation to talk about whether data can ever truly be anonymous.

Segment Guests

Julien Hendrickx

Julien Hendrickx is a professor of mathematical engineering at UC Louvain in Ottignies-Louvain-la-Neuve, Belgium.

Segment Transcript

MOLLY WEBSTER: This is Science Friday. I’m Molly Webster. So we’ve all signed up for something online, like multiple somethings– an app, an email newsletter, credit monitoring, even a store loyalty card. And in exchange for signing up for that thing, a company has then collected some data about you. And you might feel a little anxious about that. I do. I think about this a lot, especially in this era of sort of the Facebook privacy drama, the Equifax.

So companies have responded to this fear by promising to strip away any identifying information from the data they collect. So they’re saying we’re going to hide your name. We’re going to hide your address. We’re going to hide your Social Security number. But in a paper out this week in the journal Nature Communications, data scientists are saying that is not enough.

For the paper, scientists published a tool that calculates the likelihood of someone identifying you from a few pieces of personal information, like your ZIP code and your birth date. The likelihood is very high. Even seemingly harmless data, like where you take a walk, can identify you.

So we’re going to talk about this study, and here to talk it over with me is one of the study’s authors, Julien Hendrickx, a professor of mathematical engineering at UC Louvain in Belgium. Dr. Hendrickx, welcome to Science Friday.

JULIEN HENDRICKX: Thank you. Thank you for inviting me, Molly.

MOLLY WEBSTER: So maybe we can start with just basics. What actually is an anonymous data set?

JULIEN HENDRICKX: So that’s a very tough question. The ideal definition of an anonymous data set is that it’s a data set where it’s impossible for someone using present technology to find out if you are in the data set. And if you’re in the data set, then it’s impossible for the person to find relevant information about specifically you or a specific human being.

That’s the theoretical definition, but it’s not a very operative one. Because it relies on the inability of doing something. And so we would have to characterize what it means to be unable to find you. And that’s the whole challenge.

MOLLY WEBSTER: Hm. So when I think of things that are important, I think of my name. I think of an email address. I think of my Social Security number, maybe my birthday. But you’re saying that even if I don’t have those things, like if someone says we’re going to strip those things away, someone out there can still figure out who I am.

JULIEN HENDRICKX: Precisely, yes. Well, it depends. If the data set says we have a female person in New York, then of course, I cannot find that it’s you. But if I have sufficiently many such attributes after a while or if I have attributes which contain a sufficient amount of information– typically birth dates will contain a lot of information– after a sufficient number of attributes, there would be a way of knowing with almost certainty that it’s actually you indeed.

MOLLY WEBSTER: Hm. So you showed something like 99% of Americans could correctly be reidentified from data that people had said were anonymous. Is that true?

JULIEN HENDRICKX: That’s true. If you have 15 attributes– so 15 attributes means you already know a lot about the people, but those attributes are not typically publicly available. So you could recover 99.8% of the Americans like this. But already with a small number of attributes, you can correctly identify– I mean, it depends on the people– but 80%. And that means identifying a person with 80% certainty. And that’s already a lot. So even with a small number, you already reach a reasonable degree of certainty. And for an application, a reasonable degree of certainty is already something.

MOLLY WEBSTER: That’s so interesting. So we want to put a call out. If you’re a data scientist or if you work with data in your job, we want to know how you protect users’ identities. So give us a call. Our number is 844-724-8255. That’s 844-SCI-TALK. Or you can tweet at us at @scifri.

So Julien, I did your quiz. And it was very fun to do. And listeners, you can find this quiz online. I did this quiz. The first things I think you had us enter were birth date, ZIP code, and maybe my age. And I was only identifiable 55% of the time, and I felt very good about that. But then I–

JULIEN HENDRICKX: So you did very good. I’m 85.

MOLLY WEBSTER: You’re 85. But then I entered one thing, which was that I had never been married, and I went up to 99%. I think I gasped at my computer.

JULIEN HENDRICKX: Indeed. So sometimes it’s surprising. For me, the killer information was that I have only one car. Somehow it made me shoot up to 99%. And yes, it is sometimes very surprising. So this shows also that there’s not a specific– I mean, for certain people, being married or not, marriage will not tell you a lot about you. And for other people, one specific information, like number of cars for me would just make me absolutely unique in my ZIP code for some reason.

MOLLY WEBSTER: Can you talk at all about like when we say 15 characteristics, what some of those characteristics are, even if it’s just a quick list?

JULIEN HENDRICKX: I don’t remember the whole list, but typically, it would be ZIP codes–

MOLLY WEBSTER: You don’t have to name all 15.

JULIEN HENDRICKX: ZIP code, birth dates, gender, I believe there’s occupation. So certain, I would say, benign information, typically information every single piece of which you would not think identifies you. I mean, ZIP code is ZIP code. Like your kind of occupation, in a small classification, you would think many people have the same kind of occupation. But when you combine all of them, eventually, that makes you essentially unique.
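
The point about combining individually benign attributes can be illustrated with a minimal sketch. This is not the method used in the study; it simply counts how many records in a tiny, invented data set are unique on a chosen set of columns. The records, the column order, and the function name `fraction_unique` are all hypothetical.

```python
from collections import Counter

# Hypothetical toy records: (zip_code, birth_year, gender, occupation).
# These values are invented purely for illustration.
RECORDS = [
    ("10001", 1985, "F", "teacher"),
    ("10001", 1985, "F", "nurse"),
    ("10001", 1985, "M", "teacher"),
    ("10001", 1990, "F", "teacher"),
    ("10002", 1985, "F", "teacher"),
    ("10002", 1985, "F", "teacher"),
    ("10002", 1990, "M", "engineer"),
    ("10002", 1990, "M", "engineer"),
]


def fraction_unique(records, columns):
    """Share of records whose values on `columns` match no other record."""
    keys = [tuple(record[c] for c in columns) for record in records]
    counts = Counter(keys)
    return sum(1 for key in keys if counts[key] == 1) / len(records)


# Add one attribute at a time and watch the unique share grow.
for columns in [(0,), (0, 1), (0, 1, 2), (0, 1, 2, 3)]:
    print(columns, fraction_unique(RECORDS, columns))
```

In this toy data, ZIP code alone singles out nobody, while all four attributes together single out half of the records. Real data sets are far larger and messier, so the numbers here say nothing about actual re-identification rates.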

MOLLY WEBSTER: And I think one of the things that was really interesting about your paper was it wasn’t just that anonymous data sets could be made not anonymous, it was that even if you just had a tiny, tiny amount of information from that data set, you could re-identify people. Can you talk about that a little?

JULIEN HENDRICKX: Yes, and that’s very important, because it’s an argument that has been made by people releasing or selling data sets. That’s, OK, not only do we remove names, but we are truly going to remove 99% of the data. So even if I were to find someone whose demographic matches yours, I have no idea whether it’s actually you or if it’s someone else, because 99% of the people are not even in the released data set.

What we show is that based on this 1% of the data, we can compute the likelihood that if I find someone that looks like you, that it’s actually you. And in most cases, it will actually be you, as soon as we have a reasonable amount of data. So somehow we debunk the idea that because we’ve sampled the data, we’ve just removed most of it, that it became safe. That’s in most cases simply not true.
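
The sampling argument can be explored with a rough simulation. This is only a sketch under strong assumptions, not the study's statistical model: attributes are drawn uniformly and independently, which real demographics are not, and the check uses the full synthetic population, which someone holding only the released 1% would not have. The function name `simulate` and all parameter values are hypothetical.

```python
import random
from collections import Counter


def simulate(num_attrs, population=20000, sample_rate=0.01,
             values_per_attr=10, seed=0):
    """Among people unique in a small sample, return the share who are
    also unique in the full synthetic population."""
    rng = random.Random(seed)
    # Each synthetic person is a tuple of independent, uniform attributes.
    people = [
        tuple(rng.randrange(values_per_attr) for _ in range(num_attrs))
        for _ in range(population)
    ]
    population_counts = Counter(people)

    # Release only a small random sample of the population.
    sample = rng.sample(people, int(population * sample_rate))
    sample_counts = Counter(sample)
    sample_unique = [p for p in sample if sample_counts[p] == 1]
    if not sample_unique:
        return 0.0

    truly_unique = sum(1 for p in sample_unique if population_counts[p] == 1)
    return truly_unique / len(sample_unique)


for n in (2, 4, 8):
    print(n, "attributes:", round(simulate(n), 3))
```

With only two attributes, a person who looks unique in the sample almost always has many look-alikes in the population, so a match proves little. With eight attributes, nearly every sample-unique person is unique in the population too, which is the situation where removing 99% of the records offers little protection.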

MOLLY WEBSTER: This is so interesting. I want to bring in another guest to help us think a little bit about policy, Joseph Jerome. He’s the Policy Counsel for the Privacy & Data Project at the Center for Democracy & Technology in Washington, DC. He was not involved in this study that we’re talking about this week. Welcome to Science Friday, Joseph.

JOSEPH JEROME: Molly, thank you.

MOLLY WEBSTER: So, Joseph, my question for you is, is there any way that data can actually be anonymous, or have we left that behind?

JOSEPH JEROME: I think we have left it behind. And I think what Dr. Hendrickx’s research really shows here is the tremendous disconnect that exists between data science and the lawmakers and policymakers about what we can now do with data. So many of our laws and our privacy laws assume, basically, that information is either personal or not. Law in general wants to have firm dividing lines– good, bad, in, out. And that just does not work here, because information sort of exists– it’s not a binary. It exists on a spectrum.

MOLLY WEBSTER: Hm. Hm. Yeah, go ahead, Julien.

JULIEN HENDRICKX: Yeah, maybe if I can comment on that. Indeed, it’s very hard to have a really anonymous data set, but that doesn’t mean things are hopeless. There are solutions being developed or researched, where, for example, suppose I have this big data set, which contains a lot of information that could help researchers or policymakers. There are ways of making queries about the data or computing interesting things about the data without me giving you the data.

To make it very simple, suppose you are interested in something with traffic in New York, and I have all the data about the cars in New York. You would send me a specific question, and I would make the computation for you in such a way that I do not reveal anything about any individual, but nevertheless, you have the results that would interest you in your research or in your policy decision-making.

And so it’s not, unfortunately, as simple as I have a data set and I’m going to make it anonymous, but there are ways of having a system where I can query data sets to get relevant information without having to have actual access to the data.
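
A query system of this kind could be sketched as follows. The speaker does not name a specific technique; this example assumes one well-known option, the Laplace mechanism from differential privacy, in which the data holder answers counting queries with calibrated random noise instead of handing over records. The class name `PrivateCounter`, the trip fields, and the `epsilon` value are all hypothetical.

```python
import random


class PrivateCounter:
    """Holds records privately and answers only noisy counting queries."""

    def __init__(self, records, epsilon, seed=None):
        self._records = list(records)   # never returned to the caller
        self._epsilon = epsilon         # smaller epsilon means more noise
        self._rng = random.Random(seed)

    def _laplace(self, scale):
        # The difference of two exponentials with mean `scale`
        # follows a Laplace distribution centered at 0 with that scale.
        return (self._rng.expovariate(1 / scale)
                - self._rng.expovariate(1 / scale))

    def count(self, predicate):
        """Return a noisy count of the records matching `predicate`."""
        true_count = sum(1 for record in self._records if predicate(record))
        # Adding or removing one person changes a count by at most 1,
        # so the noise scale is 1 / epsilon.
        return true_count + self._laplace(1 / self._epsilon)


# Hypothetical traffic records held by the data owner.
TRIPS = [
    {"borough": "Manhattan", "hour": 8},
    {"borough": "Manhattan", "hour": 9},
    {"borough": "Brooklyn", "hour": 8},
    {"borough": "Queens", "hour": 17},
]

server = PrivateCounter(TRIPS, epsilon=0.5, seed=42)
answer = server.count(
    lambda trip: trip["borough"] == "Manhattan" and 7 <= trip["hour"] < 10
)
print("Noisy count of morning Manhattan trips:", round(answer, 2))
```

The researcher gets an approximate answer to the question without ever seeing an individual trip. A real deployment would also need to limit how many queries each party may ask, since every answer leaks a little and the leakage adds up.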

MOLLY WEBSTER: Oh, that’s interesting.

JULIEN HENDRICKX: And this is, I believe, something in which you should invest.

MOLLY WEBSTER: That’s interesting. Joseph, just speaking about systems, like, one of the things is we’re doing this radio program predominately from America, but there are other countries that are trying to tackle this. And my understanding is that the EU has a pretty strict or strong legislation on data and what to do with it and how to keep it anonymous?

JOSEPH JEROME: That’s absolutely correct. The reality is we have a lot of different legal rules around de-identified information. And one of the big stories around privacy was the General Data Protection Regulation, where everybody got emails in their mailboxes last year. And the GDPR has an incredibly high bar of not just de-identified information, but truly anonymous information. This is information that can absolutely almost never be related back to an individual data subject.

But what I think is really interesting about that law is, again, my notion of a data spectrum, it introduces a middle category of pseudonymous data, where the information is covered with some of the safeguards in that law, but then some of the rules and regulations about giving individuals access, and the ability to correct and delete that information isn’t provided. And what I think is interesting for American listeners is a lot of that thinking around different spectrums of data in the law is starting to appear in US laws, including in the California Consumer Privacy Act.

MOLLY WEBSTER: So it’s like the idea that some data we might want to protect more than others?

JOSEPH JEROME: Yeah, well, it’s basically trying to incentivize companies to put in place technical and administrative procedures to protect information. I think Dr. Hendrickx is right that there is some incredible thinking going on about how to protect information– privacy enhancing technologies– at some of the major tech companies that we all think about.

But the real challenge here, and I think what this study reveals, is we are living in a sea of information. There’s data available everywhere. And while some folks are doing really sophisticated thinking here, it’s really unclear if that’s trickling down to the day-to-day practices of all the data brokers that people have never heard of or, frankly, all of the sort of small apps that are aggressively using information.

MOLLY WEBSTER: Yeah. Julien, is there, like, a piece of data that I should just be so protective of that I wouldn’t want anyone to get? Or– I don’t know, I guess is that just something that doesn’t mean anything at this point?

JULIEN HENDRICKX: Yeah, well, the problem is that most of the data you have, in my opinion, you have to share with some people at some point. If you go to hospital, you would like them to know your medical history, because they need to know it–

MOLLY WEBSTER: True.

JULIEN HENDRICKX: –like, the IRS has to know your salary and what you make, because they have to know it. So of course, people have to be careful. You would not want to share your Social Security number and credit card number with random people. But apart from that, our study really tells about how information should or should not be shared with third parties. But in your day-to-day life, it’s very hard to not share information with specific parties, just if you want to live a normal life.

MOLLY WEBSTER: Well, it’s funny, because I feel like I live in this– my friends will say this– like, this half in, half out world, where, like, I’m not signing up for certain apps, because I don’t want to give them stuff, and, you know, I Sharpie out all the information on my prescription bottle before I throw it away, but then in other instances, I’m just throwing my boarding card in the trash and throwing out my student loan bills or something.

JOSEPH JEROME: Well, but Molly, I guess I would just say that that doesn’t necessarily matter. Talking about trying to control information in this way, if you are on your phone, and one of your apps is a financial app, and another one of your apps is a health app, and as you mentioned at the top of the show, you sign up for a loyalty card, well, all of these companies reserve the right to sell and share de-identified information, and so they are all doing whatever they think that they can to de-identify that, and then they share it and distribute it. And then you sort of can layer all of those different data sets back on top of each other, and re-identify folks, and put the portrait of the person back together.

MOLLY WEBSTER: Hm. I’m Molly Webster, and this is Science Friday from WNYC Studios. We’re here talking about how and if we can actually keep our data anonymous. Joseph, I’m wondering if there’s any reason I actually wouldn’t want my data to be anonymous?

JOSEPH JEROME: Well, I think, again, to live in today’s life, you want to share information. You can do a lot of great and beneficial things with information. When we talk about machine learning and algorithms to sort of drive new inferences to provide better health care outcomes, that’s a very real benefit of information that you don’t want to be anonymous.

MOLLY WEBSTER: Hm. That’s interesting. Julien, why did you choose to publish this tool now? Or, like, what are you hoping for?

JULIEN HENDRICKX: Well, we were interested in the science behind this. And we wanted to know whether data that was anonymized in such a way was indeed anonymous or not. It would’ve been great if it had been, but it turns out it’s not.

One of the reasons we did publish this tool was to make the general public aware of the finding, and to transform something that would have been a sort of dry scientific paper available to specialist researchers into something the general public can test: oh, OK, I can test, and, indeed, I would not be anonymous in such and such condition. And I think it’s important to raise awareness of that.

On the other hand, many of my colleagues are doing great research projects, where they used a lot of data that does contain some sensitive information. For example, I’m thinking of medical science, but also public policy question and things. And so one of the fears we sometimes have would be that no single data would be available anymore, and that would remove many opportunities for very good research and good results. So we have to find ways of allowing relevant people to use the data for good purpose without damaging privacy or without creating a risk for people’s privacy.

MOLLY WEBSTER: Joseph, is there an option of just not collecting data in the first place?

JOSEPH JEROME: [LAUGH] Not really. We obviously push companies to try and minimize the amount of information they collect. But frankly, companies see data collection as a strategic asset. And we’ve created a culture, both in Europe and in the United States, where everybody wants to get their hands on as much data as possible. And we live in a sort of always on ecosystem, where we’re all generating a whole lot of information.

And so I hear you trying to say, like, I want to protect certain bits of information, but really individual consumers are not going to be able to control what’s going on here. And I think that’s sort of why privacy advocates like myself are really asking for more law and regulation here.

MOLLY WEBSTER: Yeah, it’s funny. I don’t know whether I should try or just throw up my hands.

JOSEPH JEROME: You should never throw up your hands. I think we just need to be cognizant of the challenges we face in a data-driven ecosystem.

MOLLY WEBSTER: Hm. This is great. Thank you guys so much. We have two guests here, Julien Hendrickx, professor of mathematical engineering at UC Louvain in Belgium, and Joseph Jerome, Policy Counsel for the Privacy & Data Project at the Center for Democracy & Technology in Washington, DC.

Guys, one follow-up question. Is there a way, Julien, in which you don’t give up your data? Are there apps that you won’t sign on to?

JULIEN HENDRICKX: OK, I’m generally cautious with the apps I’m signing up to. For example, if they ask me whether they can have access to my phone directory and things, I typically tend to refuse, but this is not really driven by my research. Yes, I’m generally cautious, like the general public should be. I’m not sharing just all information with anyone. But otherwise, no, I’m not being particularly careful there.

MOLLY WEBSTER: OK, great. Thank you. Thank you, Julien. Thank you, Joseph.

Copyright © 2019 Science Friday Initiative. All rights reserved. Science Friday transcripts are produced on a tight deadline by 3Play Media. Fidelity to the original aired/published audio or video file might vary, and text might be updated or amended in the future. For the authoritative record of Science Friday’s programming, please visit the original aired/published recording. For terms of use and more information, visit our policies pages at http://www.sciencefriday.com/about/policies/

Meet the Producers and Host

About Katie Feather

@sciencewritr

Katie Feather is a former SciFri producer and the proud mother of two cats, Charleigh and Sadie.

About Molly Webster

Molly Webster is a producer and guest host of WNYC’s Radiolab in New York, New York.
