Crowdsourcing Data, While Keeping Yours Private

At the 2016 Worldwide Developers Conference, Apple engineer Craig Federighi described one way the company planned to learn from its customers without compromising the privacy of any particular user: “differential privacy.” Google, too, has used a form of differential privacy for several years in its Chrome browser.

Cynthia Dwork, a co-creator of differential privacy, says it’s a “mathematically rigorous definition of privacy” in which the statistical analysis of a dataset has the same outcome, whether any individual user’s data is included or not. That means the dataset as a whole can provide meaningful insights, without revealing anything about the preferences of the individuals within the dataset.

For example: Suppose a government wants to survey its citizens, including you, about whether they’ve used illegal drugs. Instead of answering directly, you flip a coin. If it’s heads, you answer truthfully. If it’s tails, you flip the coin a second time: if it’s heads, you answer yes; if it’s tails, you answer no.

After this exercise, the government can’t know for sure whether you’ve used illegal drugs, because your response has random noise built in. But by analyzing the results from all citizens, the government can still see trends in the frequency of illegal drug use.
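
The two-coin procedure above can be sketched in a few lines of Python. This is a minimal illustration, not code from Apple, Google, or the researchers; the function names and the 10 percent true rate are assumptions chosen for the example. It relies on the fact that, with fair coins, the chance of a "yes" is 0.25 plus half the true rate, so the survey-taker can invert that relationship to recover the population trend.

```python
import random
from typing import Sequence

def randomized_response(truth: bool, rng: random.Random) -> bool:
    """Answer one yes/no question using the two-coin procedure."""
    if rng.random() < 0.5:       # first flip is heads: answer truthfully
        return truth
    return rng.random() < 0.5    # first flip is tails: second flip decides

def estimate_true_rate(responses: Sequence[bool]) -> float:
    """Undo the noise, using P(yes) = 0.25 + 0.5 * true_rate."""
    observed_yes = sum(responses) / len(responses)
    return (observed_yes - 0.25) / 0.5

def simulate(true_rate: float, n: int, seed: int = 0) -> float:
    """Survey n hypothetical citizens and return the estimated rate."""
    rng = random.Random(seed)
    truths = [rng.random() < true_rate for _ in range(n)]
    responses = [randomized_response(t, rng) for t in truths]
    return estimate_true_rate(responses)

if __name__ == "__main__":
    # The estimate should land close to the hypothetical true rate of 0.10.
    print(simulate(true_rate=0.10, n=100_000))
```

No single response in the simulation can be trusted, yet the estimate tightens as the number of respondents grows.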

Dwork and security researcher Matthew Green discuss how differential privacy and randomized responses are being used today, from analyzing texting trends to the smart grid.

Segment Guests

Cynthia Dwork

Cynthia Dwork is the co-creator of Differential Privacy, and a distinguished scientist at Microsoft in Mountain View, California.

Segment Transcript

IRA FLATOW: This is Science Friday. I’m Ira Flatow. When you type stuff into your web browser, your computer or your smartphone starts suggesting things, right? As soon as you start typing in the box, it’s predicting what it thinks you want. Same thing when you text: your phone autocompletes your thoughts or maybe suggests the perfect emoji. You share your data, I share mine, and everyone gets to share in that benefit of a smarter, data-trained machine, right? Very time saving, right?

Well, it might come with a little bit of a deal with the devil, how we get all that convenience. It’s sort of a technological deal with the devil. But can you get those crowdsourced benefits and preserve your privacy? There is the devil in the details.

Google’s Chrome browser uses something called differential privacy to achieve that goal. And a few weeks ago, at Apple’s Worldwide Developers Conference, Craig Federighi hinted that Apple’s getting into the same game.

CRAIG FEDERIGHI: Differential privacy is a research topic in the area of statistics and data analytics that uses hashing, subsampling, and noise injection to enable this kind of crowdsourced learning, while keeping the information of each individual user completely private.

IRA FLATOW: Hashing, subsampling, noise injection, lots of technobabble. But we have a few folks who might help to straighten that out and tell us what’s really going on. Cynthia Dwork is the co-creator of differential privacy and a distinguished scientist at Microsoft. She’s based in Mountain View, California. Welcome to Science Friday.

CYNTHIA DWORK: Thank you, Ira. It’s a pleasure to be here.

IRA FLATOW: Well, thank you. You’re welcome. Matthew Green is an assistant professor at the Information Security Institute at Johns Hopkins University in Baltimore. Welcome to Science Friday.

MATTHEW GREEN: Thanks for having me.

IRA FLATOW: And our listeners who want to get in on the conversation, 844-724-8255, 844-SCI-TALK. You can also tweet us @SCIFRI.

Dr. Dwork, you invented what, shall I say, is the English language definition of what differential privacy is, something a little less technical than what Craig mentioned in his keynote.

CYNTHIA DWORK: Yes, so we invented the mathematical definition and the English language definition says essentially this, the outcome of any statistical analysis is essentially equally likely independent of whether any individual chooses to opt in to the dataset or to opt out of the data set. So in other words, essentially, the same things can happen to me with basically the same probabilities, whether I have allowed my data to be used in the data analysis or if I have withheld my data.
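
For readers who want the mathematical definition behind this English version, the standard formalization can be written as follows. The notation is not given on air and is an assumption of this sketch: $M$ is the randomized analysis, $D$ and $D'$ are any two datasets that differ in one person's data, $S$ is any set of possible outcomes, and $\varepsilon$ is the privacy-loss parameter.

```latex
\Pr[\, M(D) \in S \,] \;\le\; e^{\varepsilon} \cdot \Pr[\, M(D') \in S \,]
```

When $\varepsilon$ is small, $e^{\varepsilon}$ is close to 1, so every outcome is "essentially equally likely" whether a given person opts in or opts out.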

IRA FLATOW: Wow, I bet that does seem like magic. Matthew Green, can you give us an example of what she was talking about?

MATTHEW GREEN: Well, so there’s a very old statistical technique called randomized response, which we use to ask people questions they might not want to answer honestly, like have you ever stolen from your employer. And the basic idea with this technique, which is one example of a differentially private technique, is that when I ask you that question, you flip a coin. If it comes up heads, you answer honestly. And if it comes up tails, you answer at random. And the basic idea there is that if I see your response, I don’t really learn much about what you individually did. But if I can aggregate many, many different responses, and I compute a statistical average number of people who are going to answer that question positively, I can subtract away the noise and I can learn things that are useful to me.

IRA FLATOW: So you can basically hide the response through mathematical noise, so to speak, so we don’t know who actually made the response. And that’s why, Dr. Dwork, you say it’s 50/50 whether we know or don’t know who it was.

CYNTHIA DWORK: Actually, in the example that Matthew just gave, we may know exactly who is responding, but because of the randomness that’s introduced in the procedure for generating the yes or no, I stole from my employer, we don’t actually know if a yes means it really happened, or if the person is saying yes because the coin flips said to say yes. And so even if we knew exactly who it was, we still would only have a vague statistical hint as to whether they actually did or did not engage in the behavior.
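
How vague that "statistical hint" is can be worked out directly. The sketch below assumes fair coins and the heads-means-honest, tails-means-random procedure described above; the 10 percent prior is a hypothetical figure chosen only for illustration. It computes the worst-case likelihood ratio of a single response, whose logarithm is the privacy loss usually written as epsilon, and then applies Bayes' rule to a "yes" answer.

```python
import math

# Fair-coin randomized response: truthful on heads, random answer on tails.
P_YES_IF_TRUE = 0.5 * 1.0 + 0.5 * 0.5    # 0.75
P_YES_IF_FALSE = 0.5 * 0.0 + 0.5 * 0.5   # 0.25

def privacy_loss() -> float:
    """Worst-case log-likelihood ratio (epsilon) of one response."""
    return math.log(P_YES_IF_TRUE / P_YES_IF_FALSE)

def posterior_given_yes(prior: float) -> float:
    """Bayes' rule: probability the behavior happened, given a 'yes'."""
    numerator = P_YES_IF_TRUE * prior
    return numerator / (numerator + P_YES_IF_FALSE * (1.0 - prior))

print(round(privacy_loss(), 3))             # prints 1.099, i.e. ln 3
print(round(posterior_given_yes(0.10), 3))  # prints 0.25 (hypothetical 10% prior)
```

Under these assumptions, a "yes" moves an observer's belief from 10 percent to 25 percent: a nudge, not a confirmation.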

And these statistical hints, while telling us nothing about the individual or very, very little about the individual, can be aggregated over many individuals to understand the fraction of people who cheated from their employers.

IRA FLATOW: Without giving away who that individual was?

CYNTHIA DWORK: Exactly.

IRA FLATOW: So we know about the whole population, Matthew, but we don’t know anything about the individuals in them?

MATTHEW GREEN: That’s correct.

IRA FLATOW: And is this what Apple is doing in their new iPhone system that’s going to be launched?

MATTHEW GREEN: So it appears that they’re using a variant of this technique, this randomized response technique. And what it does is almost exactly the same thing, except they’re taking that technique and they’re kind of putting it on steroids. So instead of just asking one question about what your phone has done today, they can ask very complicated questions like, for example, did you erase a word when you were writing a message and replace it with an emoji.

And they can learn from that once they aggregate many people’s information. They can learn, well, are people commonly using this emoji instead of a particular word? And they can take that information and use that to make that a suggestion to other people as well.

IRA FLATOW: So what do you say to people, Dr. Dwork, who are afraid that their privacy is being compromised here?

CYNTHIA DWORK: I think it’s quite a challenge to articulate the nature of a probabilistic guarantee. You know, I need to improve my ability to teach the public understanding of risk here.

[LAUGHTER]

IRA FLATOW: So would you say then they shouldn’t be worried? I’ll put it very simply like that. Don’t worry, they’re not giving away who you are.

CYNTHIA DWORK: Yes, that’s what I would say.

[LAUGHTER]

IRA FLATOW: OK, that’s very simple. Now, I understand– let’s talk about Microsoft. I know Microsoft is working with a power company in California to implement some of this differential privacy technology. Is that right? What does that have to do with the power grid?

CYNTHIA DWORK: Well, your individual power consumption can actually reveal quite a bit about you. Apparently, it can even reveal which movies you’re watching because the patterns of when the screen lights up are very distinctive for a movie, and that takes a certain amount of power. So the California Public Utilities Commission requires certain reporting on smart grid data. And this is aggregate reporting.

And there’s a question about how the data ought to be aggregated. And the administrative law judge made one proposal, which is, of course, the law. But a power company in Southern California is exploring using differential privacy as an easier and more powerful technology to at least comply with the spirit of the law.

IRA FLATOW: I see. As the internet of things takes off, we’re going to have sensors in everything we wear, we drive, things like that. Is this definition of differential privacy, is this something that we should get used to hearing, because this is how this data will be collected? Matthew?

CYNTHIA DWORK: I can’t predict the future.

MATTHEW GREEN: I hope so. I mean if companies that are building IoT, Internet of Things, devices start building in privacy protections of any kind, we’re going to be a lot better off than we are today.

IRA FLATOW: And can it be made stronger? Or what kind of research can you do on it to make it better?

CYNTHIA DWORK: So any kind of disclosure leaks a little bit of information. And there’s a fundamental law of information recovery, which says that overly accurate estimates of too many statistics can completely eventually destroy privacy. And this can no more be circumvented than can be the laws of physics. The goal of algorithmic research on differential privacy is to postpone this inevitability and to push to the theoretical limits.

IRA FLATOW: So you have to decide how much– I understand what you’re saying. There’s a trade off here between collecting more data and less privacy or more privacy and less data. And you want to find a comfort zone there, because it will start leaking some of that information about the person. Would that be right, Matthew?

MATTHEW GREEN: Right, so eventually, if I ask you the same question every day, and you answer that question, even if you randomize your answers and add noise, over time I’m going to learn something about you if you continue to answer honestly. And the danger there, of course, is that there are other questions you could ask me that are correlated with that first question. So instead of asking me, did I steal from my employer, you could ask me, have I ever stolen from anybody. And so the danger there is actually answering all of those very specific questions of how often should I ask the question, how should the question be phrased, and what do I do to prevent that information from eventually being tied to a person.
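
The risk of asking the same question repeatedly can be illustrated with a small simulation. This is a sketch under stated assumptions, not a description of any deployed system: the person re-randomizes with fair coins every day, and the observer simply takes a majority vote over the answers. The function names and trial counts are hypothetical.

```python
import random

def noisy_answer(truth: bool, rng: random.Random) -> bool:
    """One randomized response: truthful on heads, random on tails."""
    return truth if rng.random() < 0.5 else rng.random() < 0.5

def guess_from_repeats(truth: bool, days: int, rng: random.Random) -> bool:
    """Observer's guess: majority vote over repeated answers."""
    yes_count = sum(noisy_answer(truth, rng) for _ in range(days))
    return yes_count * 2 > days

def attack_accuracy(days: int, trials: int = 2000, seed: int = 0) -> float:
    """Fraction of simulated people whose true answer is guessed correctly."""
    rng = random.Random(seed)
    correct = 0
    for _ in range(trials):
        truth = rng.random() < 0.5
        correct += guess_from_repeats(truth, days, rng) == truth
    return correct / trials

if __name__ == "__main__":
    for days in (1, 11, 101):   # odd values avoid tied votes
        print(days, attack_accuracy(days))
```

With one answer the observer is right about three times in four; with enough repeated answers the majority vote is almost always right, which is why the number and phrasing of questions has to be limited.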

IRA FLATOW: Because if you ask it enough times, you might get better answers. You have to actually, I guess, figure out where that differential point is about the question and the answer and how much–

CYNTHIA DWORK: So what differential privacy lets you do is it lets you measure the cumulative privacy loss over many analyses so that you can scale your noise accordingly in order to stay safe. But eventually, your answers will become pure noise.
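
One common way to "scale your noise accordingly" is to fix a total privacy budget and split it across the analyses. The sketch below is an illustration rather than the method of any company mentioned in the segment. It assumes counting queries (one person changes a count by at most 1), Laplace noise, and the basic composition rule under which the privacy losses of separate analyses add up. All names and the budget values are hypothetical.

```python
import random

def laplace_noise(scale: float, rng: random.Random) -> float:
    """Laplace(0, scale) noise as the difference of two exponential draws."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

def noise_scale(total_epsilon: float, num_queries: int,
                sensitivity: float = 1.0) -> float:
    """Per-query Laplace scale when the budget is split evenly.

    Basic composition: each of num_queries analyses may spend
    total_epsilon / num_queries, so more analyses means more noise each.
    """
    per_query_epsilon = total_epsilon / num_queries
    return sensitivity / per_query_epsilon

def answer_counts(true_counts, total_epsilon: float, seed: int = 0):
    """Release noisy versions of several counts under one shared budget."""
    rng = random.Random(seed)
    scale = noise_scale(total_epsilon, len(true_counts))
    return [count + laplace_noise(scale, rng) for count in true_counts]

if __name__ == "__main__":
    for k in (1, 10, 1000):
        print(k, noise_scale(total_epsilon=1.0, num_queries=k))
```

The required noise grows in proportion to the number of analyses, so with a fixed budget and enough questions the answers are eventually dominated by noise.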

IRA FLATOW: It will become pure noise, which–

CYNTHIA DWORK: Eventually.

IRA FLATOW: –which means the answers are worthless?

CYNTHIA DWORK: Overly accurate estimates of too many statistics is blatantly non-private. There’s no getting around that.

IRA FLATOW: So if you become too accurate, you give up privacy. But so what you want–

CYNTHIA DWORK: And it has nothing to do with differential privacy. This is just a fact.

IRA FLATOW: OK, so then you want to have enough privacy, but not too much so that you can get some of the information out. And we have to, you and technologists have to decide where that comfort zone is.

CYNTHIA DWORK: Where the comfort zone is is an interesting question. If I were to say to you, is a week a long time? What’s the answer to that question? I don’t know, it depends. It depends on the context. And humans have developed some notions about the value of time. Similarly, we have a particular measure of privacy loss. And we need to develop a real understanding of what these numbers mean and in which contexts.

IRA FLATOW: In other words, would I be willing to give up some of the convenience of texting ahead and having it guess what I want, but get a little bit more privacy?

MATTHEW GREEN: Well, to be clear, these systems right now are opt-in. So you do have the option to opt out if you feel like you don’t trust this technology now. But just to be clear, on a daily basis, we type things into Google that we probably wouldn’t tell our closest friends. So anything that improves over what’s currently being done in the tech industry is probably going to be a big improvement.

IRA FLATOW: All right, we’re going to leave it at that. Thank you both, Cynthia Dwork, co-creator of differential privacy, distinguished scientist at Microsoft. She’s based out there in Mountain View, California. Matthew Green, assistant professor at the Information Security Institute on the other coast at Johns Hopkins University in Baltimore. Thank you both for taking time to be with us today, and have a happy holiday weekend.

Copyright © 2016 Science Friday Initiative. All rights reserved. Science Friday transcripts are produced on a tight deadline by 3Play Media. Fidelity to the original aired/published audio or video file might vary, and text might be updated or amended in the future. For the authoritative record of ScienceFriday’s programming, please visit the original aired/published recording. For terms of use and more information, visit our policies pages at http://www.sciencefriday.com/about/policies.

Meet the Producer

About Christopher Intagliata

@cintagliata

Christopher Intagliata was Science Friday’s senior producer. He once served as a prop in an optical illusion and speaks passable Ira Flatowese.
