65 Genomes Expand Our Picture Of Human Genetics

The first complete draft of the human genome was published back in 2003. Since then, researchers have worked both to improve the accuracy of human genetic data, and to expand its diversity, looking at the genetics of people from many different backgrounds. Three genetics experts join Host Ira Flatow to talk about a recent close examination of the genomes of 65 individuals from around the world, and how it may help researchers get a better understanding of genomic functioning and diversity.

Segment Guests

Christine Beck

Dr. Christine Beck is an associate professor of genetics and genome sciences at the University of Connecticut Health Center and the Jackson Laboratory.

Segment Transcript

IRA FLATOW: This is Science Friday. I’m Ira Flatow. Remember the Human Genome Project? Well, the initial draft was declared complete back in 2003, but researchers then realized that one genome doesn’t paint a complete picture of the human race. So fast forward a decade or so, there came the 1000 Genomes Project, an attempt to expand the picture by sampling people from all over the world with different backgrounds and try to get a fuller look at how we’re the same or how we’re different.

Writing this month in the journal Nature, two teams of researchers take another look at some of those 1,000 genomes, resequencing, reassembling with more advanced techniques to lessen the number of typos, and really firm up how the pieces of the genome puzzle fit together.

Joining me now to talk about how it’s going are my guests, Dr. Christine Beck, associate professor of genetics and genome sciences at the University of Connecticut Health Center and the Jackson Laboratory. She’s in Farmington, Connecticut; Dr. Glennis Logsdon, assistant professor of genetics and core member of the Epigenetics Institute at the University of Pennsylvania in Philly; and Dr. Adam Phillippy, a senior investigator in the Center for Genomics and Data Science Research at the National Human Genome Research Institute at the NIH in Bethesda, Maryland. Welcome all of you to Science Friday.

GLENNIS LOGSDON: Thanks.

ADAM PHILLIPPY: Thanks, Ira. Great to be here.

IRA FLATOW: Nice to have you all. Dr. Logsdon, let me start with you. What’s the thousand-mile view of this paper? What were you all trying to do here?

GLENNIS LOGSDON: Yeah, the goal of the paper was really to generate complete sequence assemblies of 65 diverse humans from around the world. And we were mainly interested in trying to resolve all sequences within all the 46 chromosomes within these genomes. And that includes the most challenging regions of the genomes that have been kind of plaguing scientists for decades as they try to resolve these regions.

And the point in doing this is to understand the genetic and epigenetic variation of these genomes, to understand how we differ in our sequences, our structures, and trying to understand how proteins also differ amongst us. Some of the most interesting regions that we resolved are the centromeres, which are these essential chromosomal regions found on every single chromosome in our genome.

And they’re important for ensuring that our chromosomes are equally and accurately segregated during mitosis and meiosis. And so, for the first time, we were able to resolve about 1,200 centromeres from among these 65 genomes and understand how they differ between each of us and what that difference might mean in terms of function.

IRA FLATOW: Does that mean the older genome sequences had problems with them?

GLENNIS LOGSDON: Yes, that’s exactly what that means. Sequencing technologies from about a decade ago weren’t really able to resolve these regions of the genome, and that’s because they’re really highly repetitive and very large. And the sequencing data that was generated back then was smaller than the size of the repeat itself. But now we’re able to actually resolve these regions in their entirety, traverse them from one side of the chromosome to the other, and finally get complete maps at high resolution of these regions of the genome.

IRA FLATOW: Interesting. Dr. Beck, tell me more about the 65 individuals that Dr. Logsdon talked about. Where did these samples come from?

CHRISTINE BECK: Sure. The 65 samples were actually part of the 1000 Genomes Project. So the samples are from around the world. And basically, with data from previous sequencing projects, we had a good handle on how much variation there was between individual samples and a reference genome. So therefore, we chose cell lines from individuals that would maximize the amount of novel sequence variation that was discovered in our work because if we sequenced a bunch of individuals that were really, really similar, we’d have less return on our investment for sequencing each individual person.

So we sequenced these 65 people, and from them, we discovered a large amount of DNA variation from person to person. And part of the reason why that’s important is because we don’t really have a good handle on how much DNA variation there is in some of these complex regions of the genome. So without a good understanding of that kind of background topology of the genome, it’s really, really hard to separate benign differences in the population from pathogenic.

IRA FLATOW: So you really did find diversity, similarity, dissimilarity in all these different genomes?

CHRISTINE BECK: So we looked between all of these genomes, and between them, we cataloged 188,000-plus variations between people that were greater than 50 base pairs in length.

IRA FLATOW: Is that surprising, all those?

CHRISTINE BECK: To a certain degree, it’s not. So previous studies looking at an individual genome versus the reference assembly, we were able to find a decent number of variants, but in these complex regions, like Glennis was talking about, and in other loci comprised largely of repeat sequences, we were able to uncover vast amounts of differences between individuals in the population that had heretofore been undiscovered because of the quality of sequencing.

So just as a quick side-by-side example, just four years ago, the Human Genome Structural Variation Consortium published another study where they cataloged variants between people. And with that sequencing technology, there were almost 2,000 fewer variants between every individual and the reference genome than there are in this recent compendium.

IRA FLATOW: Wow. Dr. Phillippy, what is it that is letting you do this work now? Is it better machinery to actually do sequence? Is it better tools to assemble all the data? What is it?

ADAM PHILLIPPY: Yeah, all of the above, including better computers. You mentioned, at the top, the 1000 Genomes Project, which was initiated almost 20 years ago now. That was the original collection and sequencing of these samples. But it wasn’t just typos that we had at the time. There was, if you want to continue the book of life analogy, entire pages missing from each of these individual genomes. And as was noted by Glennis, a lot of this was due to repeats. And so these repeating pages, repeating sentences, so to speak, in the genome, are just like when you’re doing a jigsaw puzzle– hard to put back together again when they’re highly repetitive.

IRA FLATOW: So how complete do you think you’ve gotten? Are we finished?

ADAM PHILLIPPY: Yeah, I was on a number of years ago talking about this Telomere to Telomere Project, which was the first completion of an entire human genome. And we estimated, at the time, that that filled in about 8% of what was missing after the initial Human Genome Project from the early 2000s. And I would say that number holds about the same for these genomes. And so for all of the genomes presented here, each of them has about 8% more sequence than the initial product of the Human Genome Project from 2003.

And the technology, the sequencing methods are able to read a longer stretch of DNA at a time. That helps. The computational methods have advanced. And we have better and more accurate methods of putting those puzzles back together again. And just handling this sheer quantity of data, generating millions and millions of sequencing reads from all of these individual genomes and putting it back together again, is really only possible with the advance of computing that we’ve seen over the past couple of decades as well.

IRA FLATOW: So how close are we to a final number or a final end to all the sequencing?

ADAM PHILLIPPY: Well, how many billions of people deep would you like to go? Obviously, we’re just scratching the surface here, but as Christine said, we’re trying to do it in a way that maximizes our return on investment. And so we can go into a population of people and pick out the ones that look most different from one another, sequence those first. And then, over time, we start to saturate the amount of variation that we return. So now we’re talking about 50-ish genomes. In the next year or two, we’ll be talking about thousands of genomes. And if this field just continues to increase exponentially like it has over the past two decades, yeah, the sky’s the limit.

IRA FLATOW: Now you mentioned that this really helped fill in the bits of the genome that repeat over and over. What does that tell us? Why does it repeat over and over? Is there information there?

GLENNIS LOGSDON: Yeah, there’s absolutely information in the repetitive regions of the genome. And not only just information, there’s function. I mentioned earlier the centromeres. They’re some of the most mutable, highly dynamic regions of our genome. And they’re so mutable that I haven’t yet seen two centromeres that look identical across humans. And despite this variability, we can see quite a bit of variation that, in fact, affects function. So we find that when certain regions of the centromeres are deleted or expanded or duplicated, this could actually affect the way that the chromosomes segregate during meiosis and mitosis.

IRA FLATOW: Is this repetitive stuff what we once called junk DNA?

GLENNIS LOGSDON: It is. It is exactly what you would call junk DNA. But we know for sure that it’s not junk DNA. In fact, it’s very functional, important regions of our genome. It’s important for life. If we didn’t have these regions of the genome, then we wouldn’t be able to live.

CHRISTINE BECK: I think that that’s kind of an important part to touch on because I think repeats of all classes really shine with these novel techniques and novel sequencing modalities, as well as the assemblies. So both the centromeric repeats that Glennis studied, as well as segmental duplications and complex kind of different ways of arraying those puzzle pieces, from beginning to end, have begun to come to light with these new sequences.

And from that, you can infer whether or not the mutations or the differences between these people have actually affected the coding sequences of genes embedded in these repeats or whether or not it might have changed the cis-regulatory landscape. Like, let’s say, the ability to turn a gene off or turn it up to 11 is also altered between some of these genomes. So getting a good picture of that repetitive nature of the underlying sequence is really, really key to understanding differences in function downstream.

IRA FLATOW: Turning a gene up to 11 is something we haven’t spoken about before. How much data do you need to have to tell if something is, quote unquote, “normal” genetically? I just throw that out to any of you.

ADAM PHILLIPPY: I think that’s a great question. And I think it’s really the power of this type of fundamental knowledge generation that we’re doing in these types of projects. Being trained as a computer scientist, I think a lot from that lens.

And in a similar way that something like AlphaFold succeeded at protein prediction, based on this foundation of the protein data bank that was decades in progress, we’re building now this foundation of what typical human genomes look like. And I think, in the next few years, we’ll see genomic language models, so to speak, trained on that data and be able to predict associations quite accurately between atypical sequences and their disease associations.

Exactly how many sequences you need and how many people with diseases and without diseases you need in that training set always depends on the type of the disease, how complex those associations are, and so forth. But I think we’re rapidly approaching a tipping point in being able to make very accurate predictions off of this genomic data alone.

IRA FLATOW: And what kinds of predictions are we talking about?

ADAM PHILLIPPY: So imagine, as a thought experiment, we just mutate a random base in your genome. How well do you think we can predict the effect of that mutation, whether it will be deleterious or not? We’re not quite that good at it, compared to some other aspects of prediction.

But with these resources, we’re getting much, much better– in particular in the noncoding regions of the genome that Dr. Beck was just mentioning. A large fraction of those mutations– you have many millions of them in your genome compared to a typical reference genome, and the vast majority of them are benign. But the few that matter are the important ones. And we’re going to get much better in the coming years at making those predictions and being able to spot, basically at birth with DNA sequencing, those predictions, those variants that will likely result in some form of genetic disease.

IRA FLATOW: Would I be wrong in assuming, Dr. Phillippy, that, as a computer scientist, you’re using a lot of AI here?

ADAM PHILLIPPY: More and more, it’s embedded into a lot of the things we do. The sequencing technologies that we’re using to read off the DNA are using state-of-the-art AI methods to make a prediction from the electrical current or the optical image that you’re seeing to the actual As, Cs, Gs, and Ts. So that translation process uses AI. And yes, these kind of DNA models that I was referring to are also coming of age now, and people are actively using them to make predictions of the suspected pathogenicity of a variant that you see in one genome compared to another.

IRA FLATOW: Final question. I’ll send it to you, Dr. Beck. I remember when the Human Genome Project was announced. It was hailed as a major breakthrough in helping to cure illnesses down the road. How has that been working out? How would you grade the success so far and looking forward?

CHRISTINE BECK: Oh, nice, a softball.

[LAUGHTER]

So I think that at the end of the day, I think that the sequencing of the human genome has allowed a lot of inference into Mendelian diseases. So the architecture of diseases that are highly penetrant in the population, where you have a clear variant and effect– so a cause and effect that you can tie together very clearly– those things have really been helped astronomically by the development of the human genome reference sequence.

And then stepping into the murkier territory of complex disease genetics, I think that there’s still a lot of work to be done to figure out the underlying genetic architecture of those diseases and understanding the combinatorics of alleles and variants that come together to equal the predisposition to diseases with environmental factors added to them.

So I think that getting back to what Dr. Phillippy said earlier, I think that an understanding of this is probably going to be borne out by a much better understanding of variation in genomes, which we’re gaining with studies like ours, mixed with machine learning approaches to plumb the depths of those data for variants that might, in aggregate or individually, contribute to these complex diseases. So long story short, I think that there has been a lot of progress. But I also think, in the future, there’s a lot of work and progress to be done.

ADAM PHILLIPPY: I think I would be remiss to not give credit to the initial Human Genome Project that we’re building on here that finished up, as you said, about two decades ago now. And I find it really informative to look back and realize that project took about 10 years and, in today’s dollars, about $5 billion. Each of these individual genomes that we’re doing now at a better quality can be done in basically a few days for around $5,000. And so just do the simple math. That’s a million-fold reduction in the costs to sequence a human genome thanks to these research investments that have been made over the past 25 years by the NIH and by my home institute, NHGRI.

And it’s just amazing to reflect on the progress that this field has undergone over the past 20 years with those investments. And so if you look back at the economic impact on that, there was a study in 2013 that estimated the economic impact of the Human Genome Project at $1 trillion, and that was 10 years ago. Imagine what those returns are now. So this Human Genome Project is just a gift that keeps giving, both in terms of economic terms and in terms of quality of life.

IRA FLATOW: Well, I want to thank all of you for taking time to be with us. This has been informative. I imagine you’re all very, very hopeful about the future.

CHRISTINE BECK: Yeah, absolutely.

IRA FLATOW: Please come back and tell us more about where this is heading when you get a chance.

CHRISTINE BECK: Will do, thanks.

ADAM PHILLIPPY: Thanks much, Ira.

GLENNIS LOGSDON: Thank you so much.

IRA FLATOW: You’re welcome. Dr. Adam Phillippy at the National Human Genome Research Institute– that is at NIH in Bethesda. Dr. Christine Beck of the University of Connecticut Health Center and the Jackson Laboratory, and Dr. Glennis Logsdon at the University of Pennsylvania. Thank you all for taking time, as I say, to be with us today.

Copyright © 2025 Science Friday Initiative. All rights reserved. Science Friday transcripts are produced on a tight deadline by 3Play Media. Fidelity to the original aired/published audio or video file might vary, and text might be updated or amended in the future. For the authoritative record of Science Friday’s programming, please visit the original aired/published recording. For terms of use and more information, visit our policies pages at http://www.sciencefriday.com/about/policies/

Meet the Producers and Host

About Charles Bergquist

As Science Friday’s director and senior producer, Charles Bergquist channels the chaos of a live production studio into something sounding like a radio program. Favorite topics include planetary sciences, chemistry, materials, and shiny things with blinking lights.

About Ira Flatow

Ira Flatow is the founder and host of Science Friday. His green thumb has revived many an office plant at death’s door.
