How Imperfect Data Leads Us Astray
Datasets increasingly shape important decisions, from where companies target their advertising to how governments allocate resources. But what happens when the data those decisions rely on is wrong or incomplete?
Ira talks to technologist Kasia Chmielinski as they test-drive the Bayesian Improved Surname Geocoding (BISG) algorithm, which predicts a person’s race or ethnicity based on just a few details, like their name and zip code. BISG is frequently used by government agencies and corporations alike to fill in missing race and ethnicity data—except it often guesses wrong, with potentially far-reaching effects.
Kasia Chmielinski is a technologist and affiliate at the Berkman Klein Center at Harvard University, based in Jersey City, New Jersey.
IRA FLATOW: This is Science Friday. I’m Ira Flatow. Many of you have probably run into this issue. A survey asks you for your race or ethnicity but none of the options quite fit, right?
It’s frustrating when you’re filling it out. But it’s about more than that. It means the data the survey’s collecting isn’t really accurate. And depending on how it’s used, you can imagine this can have serious implications.
My next guest is interested in all the ways imperfect data can lead us astray. Kasia Chmielinski is a technologist and affiliate at the Berkman Klein Center at Harvard University researching data ethics with a group called the Data Nutrition Project. Welcome, Kasia.
KASIA CHMIELINSKI: Hi, Ira. It’s great to be here.
IRA FLATOW: OK, so let’s get right into this. Imperfect data. What does that mean? Why should we pay attention to it?
KASIA CHMIELINSKI: Right. Well, I spend a lot of time thinking about data and its impact. It turns out that no data set is perfect. So how a data set was collected, who collected it, when it was collected, all of that can affect the quality of the data and what it can be used for. And especially when we aren’t aware of these things, we can end up misusing that data in really important ways.
IRA FLATOW: Interesting. All right, give us an example of where we would come across, let’s say, an imperfect data set.
KASIA CHMIELINSKI: Yeah, so like many people, I recently signed up for a COVID vaccine. The state form asked for my name, which is Kasia Chmielinski, and my zip code. That’s fine. I know those.
It also asked for my race and my ethnicity. And I could choose one of each. But here’s the trick. There are only a few choices. I can choose one of white, Black, Asian, Hispanic, et cetera.
But the problem here is that I’m mixed race. I’m half Asian. I’m half white. And I can only choose one answer. So what do I do? Well, I can choose an inaccurate answer, right? Or I could not provide an answer.
And if I leave it blank, I actually risk someone guessing for me later on when they input the data. So none of these options are really ideal.
IRA FLATOW: I see what you’re talking about. We’re already seeing some problems at the data collection stage. OK, how can this cause problems?
KASIA CHMIELINSKI: It can cause a lot of problems, right? The first is my frustration at the doctor’s office. But more importantly, at a population level, this means that we have race and ethnicity data that’s incomplete or it’s inaccurate.
And when you’ve got something as serious as a pandemic, especially given systemic issues around race in this country, we really need to be able to answer questions like, is one race or ethnicity getting tested more than another? Who’s falling ill? Who’s receiving the vaccine? Where do we need to target funding, right? And for all these questions, we need really accurate demographic data.
IRA FLATOW: You know, that reminds me of the old computer phrase GIGO– garbage in, garbage out. If you don’t have accurate data going in, you’re not going to have accurate results.
KASIA CHMIELINSKI: That’s exactly right. And the COVID data is actually very incomplete. So in early 2020, race and ethnicity information was missing from three-quarters of the data about coronavirus cases. It has improved since then. But we’re still missing about half of that data.
So imperfect data sets have been around for as long as data sets have. And we’ve come up with a lot of clever mathematical ways to get around things like missing data. We call this imputing data.
IRA FLATOW: OK, so what are these clever ways?
KASIA CHMIELINSKI: So these are basically mathematical formulas or algorithms that take in a little bit of information about someone. And they spit out a prediction for the person’s race and ethnicity. There are a few free versions online. So I set up the models on my computer, using openly available code on the internet. And I figured we’d just give them a go.
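To make the idea concrete: a minimal imputation model of the kind Kasia describes can be sketched as a lookup from surname to a probability distribution over race categories. This is an illustration only, not the actual code she ran; the surnames and probabilities below are invented for demonstration.

```python
# Toy illustration of race/ethnicity imputation from a surname.
# The probabilities here are made up for demonstration; real models are
# trained on large datasets such as the Census Bureau's surname tables.

SURNAME_PRIORS = {
    "SMITH": {"white": 0.71, "black": 0.23, "api": 0.01, "hispanic": 0.02, "other": 0.03},
    "GARCIA": {"white": 0.05, "black": 0.00, "api": 0.01, "hispanic": 0.92, "other": 0.02},
}

def impute_race(surname):
    """Return a probability distribution over race categories, or None
    if the surname never appeared in the (toy) training data."""
    return SURNAME_PRIORS.get(surname.upper())

print(impute_race("Garcia"))       # highest-probability category is hispanic
print(impute_race("Chmielinski"))  # prints None -- the model draws a blank
```

Note the failure mode this sketch shares with the real tools: a surname absent from the training data yields no prediction at all, which is exactly what happens later in the conversation with Kasia's own name.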
IRA FLATOW: Oh, cool. Can you throw my name in there?
KASIA CHMIELINSKI: Yeah, OK. Let’s do it. Let’s go step by step. This is in, like, a shared notebook between me and my colleagues here. So it’s just filled with spaghetti code, is what we call it.
So I can put your name in here, last name. So let’s see. That first version here believes that you are– oh, it’s interesting. OK, so you come up as 87% probability that you’re white and a 12% probability that you’re Black.
IRA FLATOW: Cool.
KASIA CHMIELINSKI: Yeah, it’s really interesting. Now, this next model is actually a model called Wiki model. And there’s just a ton of these on the internet. It’s actually really scary [LAUGHS] how many there are.
So this one here takes your first and your last name. And if I scroll over here, we’ve got 14% potentially that you’re British. We’ve got 8% Eastern European. 62% Greater European Jewish.
IRA FLATOW: That’s interesting. Well, look. It got some key data right about me. I don’t know if it was the Ira that tipped off the Jewish or my last name.
KASIA CHMIELINSKI: I don’t know. And this is the tricky thing about these models, as well, is they don’t tell you why this is happening.
IRA FLATOW: OK. All right, we got my name in there. Kasia, let’s go with your name.
KASIA CHMIELINSKI: OK, so if I start with the first model here, I put my very Polish last name in. And it comes out with a 98% chance that I’m fully white, which is not true. I actually identify as mixed race. But it’s missing the Chinese part of my identity because it’s looking at my last name, which is my dad’s name. And he’s white.
And if I move on to the next version of this, which takes a little bit more information– so now it’s also requiring my first name. I put in my first and my last. And it comes out with just a 50% chance that I’m white. And suddenly there’s a 10% chance that I’m mixed race, which is actually how I identify.
And then the third version here requires the most information of them all. And that’s my first, my last, and a zip code. So I put all of that information in there. And suddenly now the model is drawing a blank, most likely because my name, my surname or the combination are just too infrequent for the model to have really seen that before. And so it comes back with no information whatsoever on what probability my race could be.
IRA FLATOW: That is amazing, the way you talk about this. I would have thought that these were very sophisticated tools and the more data you put in, the better the results should be. Why is it so inaccurate with your information?
KASIA CHMIELINSKI: So the model that thought there was a 10% chance of me being mixed race is actually quite a famous algorithm. It’s called the Bayesian Improved Surname Geocoding algorithm, or BISG. And it’s very widely used.
BISG was created in 2008 by the RAND Corporation. And it’s been extremely influential in imputing race and ethnicity data where that information is missing or was never collected at all. But it’s not so much that the model is inaccurate in this case. The math itself is fairly neutral.
What we have to do is look at how the model was trained and, in particular, any weirdness with the underlying data that it was trained with. So every model requires this training data. And if there are issues with the data, the model is also going to come up with bad results.
So when we think about BISG, we need to then turn our attention to the data set that it was built on, which is one you might have heard of. It’s called the Census.
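The Bayesian update at the heart of BISG can be sketched briefly. The model combines P(race | surname) from Census surname tables with the racial composition of the person's geography, which by Bayes' rule is proportional to P(race | surname) × P(race | geography) / P(race). The numbers below are invented toy values, not real Census figures, and this simplified sketch omits details of the production algorithm.

```python
# Simplified sketch of the BISG update (toy numbers, not real Census data).
# Posterior is proportional to:
#   P(race | surname) * P(race | geography) / P(race)
# i.e., a surname-based prior, reweighted by local demographics.

NATIONAL = {"white": 0.60, "black": 0.13, "hispanic": 0.19, "api": 0.08}

def bisg(surname_prior, geo_dist, national=NATIONAL):
    """Combine a surname-based prior with local demographics, then renormalize."""
    unnorm = {r: surname_prior[r] * geo_dist[r] / national[r] for r in national}
    total = sum(unnorm.values())
    return {r: p / total for r, p in unnorm.items()}

# Surname alone says "probably white"; a heavily Hispanic zip code shifts it.
surname_prior = {"white": 0.70, "black": 0.10, "hispanic": 0.15, "api": 0.05}
zip_demographics = {"white": 0.10, "black": 0.05, "hispanic": 0.80, "api": 0.05}

posterior = bisg(surname_prior, zip_demographics)
```

The sketch also shows why the training data matters so much: both the surname prior and the geography distribution come straight from the Census, so any gaps there propagate directly into the prediction.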
IRA FLATOW: Oh, yeah, I’ve heard of that. [LAUGHS] What’s wrong with the Census?
KASIA CHMIELINSKI: Well, in this case, the Census was really meant for districting purposes. It was created and it’s deployed every 10 years for that reason. And it’s really not meant to be used by all these other tools such as BISG. And that can have unintended consequences.
So first of all, BISG is built on a subset of the 2000 and the 2010 Census data. So it includes only folks who were in the US and answered the Census 11 and 21 years ago, which means that we’re leaving out newer immigrants and those marginalized communities that are less likely to answer the Census at all.
IRA FLATOW: Because a lot has happened since 10 years ago, hasn’t it?
KASIA CHMIELINSKI: Yeah, especially 21 years. So imagine all the migration patterns and the folks who’ve come in and out of the country since then. None of that’s going to be captured in the Census data from 20 years ago. Another thing here is the data set only captures a person’s current surname. And many people, mostly women, have changed their last name, like my mom, who took my dad’s name when they married.
And finally, the data includes a surname only if more than 100 people share it. So there are about four million last names, covering 30 million people, that are not included because those surnames are just too uncommon.
IRA FLATOW: Hm, so does this mean that the tool is more accurate for, let’s say, John Smith but less accurate for somebody like you, Kasia Chmielinski?
KASIA CHMIELINSKI: Yes, that’s exactly right. So there’s just more data on John Smiths because there are more John Smiths. So the model has more to go on when it’s trained. In fact, the model performs best with males and people 65 years or older and for those who identify as white because those are the best represented demographics.
But the model is extremely insensitive. In some cases it’s basically useless on particular communities like American Indian, Alaska Native, and multiracial communities.
IRA FLATOW: OK, you said the model is used for many purposes. What kind of impacts, then, can we expect these limitations to have?
KASIA CHMIELINSKI: Yeah, it’s a great question. So for example, the BISG algorithm is actually used by several federal agencies, including the Consumer Financial Protection Bureau. One of the things the Bureau does is it fines lenders if they violate fairness rules.
So for example, the Bureau ordered Ally Bank to pay $98 million in damages to minority borrowers in 2013. They used BISG on auto lender data sets that did not have race or ethnicity data to determine that African-American, Hispanic, and AAPI borrowers paid $200 to $300 more in interest than their non-Hispanic white counterparts in the same geography.
So it’s actually really good that BISG could detect that bias. But in the actual payout, some white Americans received checks. And some non-white borrowers had to apply. So the algorithm didn’t get it totally right at the individual level.
More recently, BISG’s also been used extensively to fill in the missing COVID-related data that I talked about in the beginning. And there’s a lot at stake here because a majority of the recent funding from the CDC is actually earmarked for a program that increases vaccination equity.
And how are you supposed to determine what’s equitable if you don’t have that data? So inaccurate data at this level could mean millions or even billions of dollars going to approximate rather than actual areas of need.
IRA FLATOW: I’m thinking about my household appliance when it’s broken. Is it better to fix it or throw it out and just buy a whole new one? Can we fix this model? Or is it better not to use these tools at all?
KASIA CHMIELINSKI: Yeah, I think that’s the first time anyone’s ever compared BISG to a household appliance. But I like it.
And in this case, I’d say keep the refrigerator, right? I don’t think that the answer is not to use the tool, right? It’s just to use it thoughtfully. So BISG is actually very important. And it’s pretty good at what it was meant to do, which is to predict race and ethnicity for an entire population, especially majority populations.
The key here is the model is really only as good as the data it was trained on. So if you want to mitigate harms, this is the place to start. It’s not to throw away the entire refrigerator, right? It’s to focus on the ways that it’s broken and then, in this case, identify better data or improve the data and then work up from there.
IRA FLATOW: OK, so as we wrap up here, what are the take-homes that we should be paying attention to?
KASIA CHMIELINSKI: Yeah, that’s a good question. I think with data science, especially data science when you apply it to people, there often isn’t a single right answer or perfect data set. And we have to keep that in mind.
Every data set we build is going to have inherent bias, not to mention whatever bias it picks up from society. And our goal isn’t to remove the bias entirely, because it’s not possible. Rather, we have to understand it so that we can mitigate those issues.
But despite all those challenges, it’s also very important that we continue to find ways and innovate to make our data sets more complete. And that means filling in the gaps with tools like BISG so that we can track and address potential discriminatory harms and get closer to a better solution.
So the result of this is always going to be an approximation of reality. And we need to be constantly monitoring and improving our data sets and our models to assess just how far from the truth we believe we are.
IRA FLATOW: Terrific report. Thank you for taking time to be with us today.
KASIA CHMIELINSKI: Thanks.
IRA FLATOW: Kasia Chmielinski, a technologist and affiliate at the Berkman Klein Center at Harvard University researching data ethics.
Harriet Bailey is a science producer and director whose work has appeared on the BBC, National Geographic, Discovery Channel, Al Jazeera, PBS, and more.