A Thumb Drive Made of Genes?


What do Shakespearian sonnets, a music video by the band OK Go (shown below), and an Amazon gift card all have in common?

All of these are pieces of data that have been successfully encoded in strands of DNA. The process takes the 0s and 1s of binary code and turns them into the As, Ts, Cs, and Gs of DNA nucleotides that can then be synthesized, and sequenced later to retrieve the information.
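As a rough illustration of that conversion, here is a minimal sketch that packs two bits into each nucleotide. The particular bit-to-base table and the function names `encode` and `decode` are assumptions chosen for illustration; the schemes used in the published studies differ and add error handling on top.

```python
# Hypothetical lookup table: two bits per nucleotide (an assumption,
# not the mapping used by any specific study).
BITS_TO_BASE = {"00": "A", "01": "C", "10": "G", "11": "T"}
BASE_TO_BITS = {base: bits for bits, base in BITS_TO_BASE.items()}


def encode(data: bytes) -> str:
    """Turn bytes into a string of A/C/G/T, two bits per base."""
    bits = "".join(f"{byte:08b}" for byte in data)
    return "".join(BITS_TO_BASE[bits[i:i + 2]] for i in range(0, len(bits), 2))


def decode(strand: str) -> bytes:
    """Reverse the mapping: bases back to bits, bits back to bytes."""
    bits = "".join(BASE_TO_BITS[base] for base in strand)
    return bytes(int(bits[i:i + 8], 2) for i in range(0, len(bits), 8))


strand = encode(b"Hi")
print(strand)          # CAGACGGC
print(decode(strand))  # b'Hi'
```

A direct table like this ignores the practical problems of synthesis and sequencing, which is why real systems layer redundancy and error correction on top of it.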

Why DNA? In theory, it could solve many of the problems of modern data storage, and hold greater densities of information for longer periods of time without the typical worries over obsolescence or corruption.

Since pioneering work from Harvard in 2012, research has advanced in terms of how much information can be encoded in DNA, and how accurately.

But what will it take to make DNA a mainstream data storage option?

Columbia University computational biologist Yaniv Erlich’s team reported in Science this week that they successfully stored an operating system, an 1895 French film, and several other files at densities previously unachieved. He joins electrical and computer engineer Olgica Milenkovic, of the University of Illinois at Urbana-Champaign, and biochemist Sriram Kosuri, of the University of California-Los Angeles, in a discussion of the hurdles ahead for biological data storage.

Segment Guests

Yaniv Erlich

Yaniv Erlich is an assistant professor of Computer Science and Computational Biology at Columbia University as well as a core member at the New York Genome Center in New York, New York.

Segment Transcript

IRA FLATOW: This is Science Friday, I’m Ira Flatow. DNA– it’s the blueprint for life, but it’s also a string of letters– the As, the Cs, the Gs, and the Ts, not unlike the zeros and ones of the binary code we use to store data. And this similarity has led scientists to speculate that perhaps someday we could use DNA itself to store data more efficiently than silicon. Even better, it would last longer, avoid corruption problems, and never go the way of the VHS tape or the floppy disk.

And this is not a new concept. Researchers have been working on it for years and have successfully stored and retrieved everything from Shakespearean sonnets to photos to a music video from the band OK Go. And one of my guests has even stored an entire computer operating system in DNA. You can play Minesweeper off the code his team retrieved when they sequenced it again.

In 2013, we talked to a researcher who said all the knowledge of humankind written in DNA could fit in the back of a station wagon. So how soon are we actually going to archive all of that? When can you play Minesweeper off a DNA operating system? That’s what we’re going to be talking about today. If you want to join us, 844-724-8255. You can also tweet us @scifri.

Let me introduce my guests. Yaniv Erlich, Assistant Professor of Computer Science and Computational Biology at Columbia University in New York. He’s a core member at the New York Genome Center. Welcome to Science Friday.

YANIV ERLICH: Hello.

IRA FLATOW: Olgica Milenkovic is professor of Electrical and Computer Engineering at the University of Illinois at Urbana-Champaign. Welcome to Science Friday.

OLGICA MILENKOVIC: Hello.

IRA FLATOW: Thank you for joining us. Sri Kosuri is an assistant professor of Biochemistry at UCLA in Los Angeles. Welcome to Science Friday.

SRIRAM KOSURI: Thank you for having us on.

IRA FLATOW: And you have a vial of DNA sitting in front of you, and what is stored on this tiny little bit of it?

YANIV ERLICH: So in this tiny little bit of DNA, we have here a French movie called The Arrival of a Train. It’s a French movie that was filmed in around 1895, so over 100 years ago. And now it’s on DNA.

We have a computer operating system. We have an Amazon gift card of $50. We have a manuscript, one of the most influential manuscripts in information theory. And we also put the Pioneer plaque that we sent to space, so we put the figure over there, and a computer virus on DNA.

IRA FLATOW: All in that tiny little, it looks like the head of a pin.

YANIV ERLICH: Exactly.

IRA FLATOW: Yeah, why is the gift card in there?

YANIV ERLICH: So the gift card is a funny thing. We wanted to encourage someone to reproduce our study. So in fact, I contacted one of my Twitter followers that was really interested in what we are doing. And I told him that you can download the sequencing data from this European archive that we store the data.

Here is the code. And if you can recover the files, you can have the $50 and just purchase something nice for you. So he did this operation, was able to get the file, and got a book about machine learning. We are all geeks.

IRA FLATOW: [LAUGHS] You’re in good company. Sri, we were talking about DNA as if putting data on it were an everyday thing, right? But this was sort of a gee whiz story even a year ago. How do you get the encoding on DNA?

SRIRAM KOSURI: Well, the encoding, when we did it many years ago in a very dumb way, was essentially just ones or zeros replace Ts or Cs and Gs or As. So it was a very redundant code that wasn’t super dense when we did it just as a kind of proof of concept. What Yaniv has done in this new work is really quite nice. DNA has a lot of negative sides to it. So when we synthesize or sequence DNA, we have all sorts of dropouts, or certain sequences are hard to read or write.

And so Yaniv’s code, which he borrowed from computer science and these fountain codes, are basically taken from spotty transmission algorithms that are meant to convey information over noisy communication lines and used it to encode DNA, encode this information in DNA. And what that’s allowed him to do is really bring to scale the types of ideas that have been around for even before we were working on it, decades now, into something that seems pretty reliable in his newest work.
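[Editor's sketch: the earlier proof-of-concept scheme Kosuri describes, one bit per base with two interchangeable bases per bit, might look roughly like the following. Which pair of bases stands for 0 and which for 1 is an assumption here, as are the function names. The spare choice is used to avoid repeating the previous base, which is one way such redundancy can sidestep sequences that are hard to read or write.]

```python
# Assumed pairing (hypothetical): a 0 may be written as A or C, a 1 as G or T.
ZERO_BASES = ("A", "C")
ONE_BASES = ("G", "T")


def encode_bits(bits: str) -> str:
    """One bit per base; pick whichever allowed base differs from the last one."""
    strand = []
    for bit in bits:
        options = ZERO_BASES if bit == "0" else ONE_BASES
        if strand and strand[-1] == options[0]:
            strand.append(options[1])  # avoid a run of identical bases
        else:
            strand.append(options[0])
    return "".join(strand)


def decode_bases(strand: str) -> str:
    """Either base of a pair decodes to the same bit."""
    return "".join("0" if base in ZERO_BASES else "1" for base in strand)


print(encode_bits("0011"))                # ACGT
print(decode_bases(encode_bits("0011")))  # 0011
```

[This stores half as much per base as a two-bit mapping would, which matches the description of a code that was redundant and not very dense.]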

IRA FLATOW: Olgica, let me ask you, how reliable is DNA? Why put all your photos on it? What are the benefits?

OLGICA MILENKOVIC: So if you think of DNA as the media, and my background is coding theory, and basically coding theorists have been [INAUDIBLE] magnetic optical racetrack memories for a very long time. And the interest is really to allow reliable writing, reliable reading of the information. And storing in DNA has many benefits, because DNA offers exceptional densities as we all know. But it’s also a naturally robust media that can survive for long periods of time.

The issues and the bottlenecks in terms of reliability come from the process of writing, which is basically synthesizing the DNA strings, and reading, which is some form of sequencing. And that’s where the interesting action, at least in coding theory happens, because the readout systems used for DNA storage systems are very different from what we have seen in classical recorders. The types of errors, the types of patterns that are error-prone when recorded are nothing like we’ve seen before, and they require very new approaches in coding theory, which is why I’m very excited about this field.

IRA FLATOW: Yaniv, is it error-free recording or how?

YANIV ERLICH: So it’s not error-free. We have some errors in the synthesis and the sequencing, as Olgica said. But our method allows us to go and to correct for these errors, and it’s highly robust. For instance, we showed that every time we read the DNA, we consume some amount of material.

Now for instance, my daughter, she loves to hear Frozen. So every day we hear maybe five times, the song of Frozen. So after a week, if Frozen was encoded in DNA, we would run out of material at all. So using our technique, we show that we can copy the DNA, copy the copy, copy the copy of the copy, and so on, nine times.

And we can retrieve– although we introduce more errors, we can retrieve and correct these errors and get a file that is fully accurate. We can watch this movie, we can play Minesweeper on the operating system. So it’s fully robust.

IRA FLATOW: But it’s not as fast yet as a hard drive, as writing and reading that kind of stuff.

YANIV ERLICH: Much slower than the hard drive, but copying exact copies of the data is extremely fast because it’s an enzymatic reaction.

IRA FLATOW: Hmm. And Olgica, I understand that you’re working on making DNA not just readable, but random access and rewritable like a hard drive, right?

OLGICA MILENKOVIC: So this is a big, interesting project that I’ve been dreaming about for the last few years. And as I mentioned, given the background of my group, which has been working in data storage for over 20 years now, we realized that there is so much that can be done with this media. And as you mentioned, random access is one thing. Another thing we did as well is make it portable, because as people that are really related to the storage industry, we know that once you have the data stored, you also need to read it in a very delay-tolerant manner, which we are not very close to doing right now.

But you also want to have your readout system handy. You don’t want to carry big sequencing devices that cost anywhere from half a million dollars to even more with you. So our recent work is focused on making systems that are portable. And we switched to using nanopore sequencing devices for that purpose.

IRA FLATOW: Let’s go into the phones. A lot of geeky folks like us are calling in. Let’s go to Eric in Fayetteville, Arkansas. Hi, Eric.

ERIC: Hi, Ira. I’m curious, how vulnerable is this data to corruption? Could a DNA drive mutate, and what would happen to the data if it did?

IRA FLATOW: Could it mutate? Wow, Yaniv, is it going to mutate? Thank you for the question.

YANIV ERLICH: That’s a great question, thank you so much. So we showed, in fact this copying process mutates the DNA quite a lot. It introduces many, many errors. But we have this type of error correcting codes of redundancy that allows us to complete that information. I will give you an analogy. The way that we encode the data is not kind of like we just transmit the file.

We first organize the file as if it was a Sudoku puzzle. And then what we show– but it’s a very simple Sudoku puzzle, like a kids’ version. And we just send many, many hints. Every DNA molecule is a hint about the file.

Now, think that I will give you a DNA Sudoku puzzle for kids, and I’m going to be mean. I’m going to erase some of the cells. Probably you can still go and complete the Sudoku puzzle, although I was mean and didn’t give you the entire puzzle. This is the same way that it works here. Although not all the DNA molecules are going to make it, some of them are erroneous, we can still solve the puzzle and get back the file.
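[Editor's sketch: the "many hints" idea can be illustrated with a toy fountain code. Each hint, called a droplet here, is the XOR of a small pseudo-random subset of the file's chunks, and a seed lets the decoder work out which chunks went into it. The function names, the one-to-three chunk degree choice, and the "lose every third droplet" rule are all assumptions made for illustration; this is not the team's actual code or degree distribution.]

```python
import random


def droplet_indices(seed: int, n_chunks: int) -> set:
    """Rebuild which chunks a droplet combines, using only its seed."""
    rng = random.Random(seed)
    degree = rng.randint(1, min(3, n_chunks))  # placeholder degree distribution
    return set(rng.sample(range(n_chunks), degree))


def make_droplet(chunks: list, seed: int) -> tuple:
    """A droplet is (seed, XOR of the selected chunks)."""
    payload = 0
    for i in droplet_indices(seed, len(chunks)):
        payload ^= chunks[i]
    return seed, payload


def decode(droplets: list, n_chunks: int):
    """Peeling decoder: repeatedly solve droplets with one unknown chunk left."""
    pending = [(droplet_indices(seed, n_chunks), payload) for seed, payload in droplets]
    solved = {}
    progress = True
    while progress and len(solved) < n_chunks:
        progress = False
        for indices, payload in pending:
            unknown = [i for i in indices if i not in solved]
            if len(unknown) == 1:
                for i in indices:
                    if i in solved:
                        payload ^= solved[i]  # strip out chunks already known
                solved[unknown[0]] = payload
                progress = True
    if len(solved) < n_chunks:
        return None  # not enough hints yet
    return [solved[i] for i in range(n_chunks)]


chunks = list(b"DNA data")  # eight one-byte chunks
received, recovered = [], None
for seed in range(10_000):
    if seed % 3 == 0:
        continue  # simulate a molecule that never made it
    received.append(make_droplet(chunks, seed))
    recovered = decode(received, len(chunks))
    if recovered is not None:
        break

print(len(received), recovered == chunks)
```

[The decoder never needs any particular droplet. It keeps collecting whichever ones survive until the puzzle can be completed, which is the property the Sudoku analogy describes.]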

IRA FLATOW: Mm-hm. Sri, you’re a biochemist. Do you worry about things mutating, like DNA?

SRIRAM KOSURI: Well, I think Yaniv is talking about the physical nature of DNA. I think sometimes people think about oh well, this is DNA and it could turn into a real life virus. I think we’ve talked about things like that in the past, but all the storage mechanisms we’ve been talking about thus far are outside of a living organism. So getting DNA back into an organism is extremely difficult.

And so the safety profile of such a thing is A, your likeliness that you’ll actually encode something living is little to none. But even then, getting it inside of the entire machinery of the cell is also extremely unlikely. So I think there’s very little thought that there would be danger there, especially since the size of these DNA molecules are quite short.

IRA FLATOW: Let’s go to Washington, thanks for that answer. Let’s go to Washington. Benjamin, welcome to Science Friday.

BENJAMIN: Thank you. I was curious as to whether you could use this technique with any other chemical structure other than biologically living DNA. Could you design something that’s sort of silicon-based, that would still use this same molecular encoding strategy?

YANIV ERLICH: I think that Olgica did some work?

OLGICA MILENKOVIC: Yes.

IRA FLATOW: Sri? Yeah.

OLGICA MILENKOVIC: Oh, please go ahead, Sri, if you want to say–

SRIRAM KOSURI: Oh, I think one point to make is that this is a biological polymer. And this is just a very specific version of information storage inside of a polymer suit. So one point could be, you could use any other polymer, one that might be easier to read and write, for instance.

But I think one advantage– there’s a couple of advantages of DNA, one being that there exists a wealth of enzymes and natural things that have evolved that make our lives a lot easier for example, than the ability to make copies from individual molecules, incredibly easy through the use of polymerases that have evolved for billions of years. Though we use a different polymer, it would be nice to have the tools to be able to deal with things like that.

OLGICA MILENKOVIC: If I may just add–

IRA FLATOW: Olgica, let me just interrupt because I have to be rude here. This is Science Friday from PRI, Public Radio International. OK, Olgica, go ahead.

OLGICA MILENKOVIC: Oh, sorry, yes. So what I wanted to say to the caller is that there are a lot of endeavors right now that try to use different polymers and synthetic [AUDIO OUT].

IRA FLATOW: Oh, her line has dropped now.

OLGICA MILENKOVIC: Oh, sorry. Can you hear me?

IRA FLATOW: Yes, go ahead.

OLGICA MILENKOVIC: Yeah, a group in France is using synthetic polymers, which have the big advantage of being very easy and cheap to synthesize, which is not the case with DNA. But as Sri pointed out, the fact that it’s a synthetic polymer really makes copying the part of the system that Yaniv is so efficiently using with DNA, copying is not very efficient with synthetic poly– [AUDIO OUT] Oh, go ahead.

IRA FLATOW: So we’re sort of using a CRISPR then, to edit the DNA? Is that one of the tools we might use?

OLGICA MILENKOVIC: Potentially. But it’s still pretty expensive and not very precise.

IRA FLATOW: How do all of you see using DNA in the long run? Is it going to be on a thumb drive, on a server farm? Yaniv, what do you think so far?

YANIV ERLICH: So I think the way that you use DNA is that most users will not even know that they are using DNA as their storage media. We think about something like a cloud service where you want to store some files for a very long amount of time, very cheaply. And there will be a service [INAUDIBLE] for a great price for you. But you will not know that it uses DNA. But this service can now take all the benefits of DNA, the compactness of this molecule, and do all the operation of sequencing, synthesizing, and taking care, rather than us actually carrying DNA with us around the city.

IRA FLATOW: And Sri, how are we going to make this cheaper and make it practical to use?

SRIRAM KOSURI: Yeah, I think that’s probably the million dollar question. Right now we’re on order, a million fold too expensive on the synthesis side. And we have seen drops that are very large in biology. And there are ways to think about getting to such drops. But that’s a long way from here to there, getting to about a million fold drop in cost.

IRA FLATOW: Olgica, what does your vision for the future look like?

OLGICA MILENKOVIC: I believe that it’s going to be a long road ahead. I agree with Sri. Ideally, I would like to see a DNA flash memory, because I think that is the DNA analog of a flash memory, because that would be the first thing we could hope for and the technology is very close to allowing us to do something like that, except for the cost. And I agree with Sri again that we need to drive the cost of synthesis and the delay of sequencing down in order to make this plausible.

IRA FLATOW: Let me get one quick call in from Amy in Manhattan. Hi Amy.

AMY: Hi. I was wondering, do we need to worry about coding DNA being hacked? And would the self-correcting mechanism make that less likely?

IRA FLATOW: Yaniv? Can you hack it?

YANIV ERLICH: So you cannot really hack it. It’s not a living organism. It’s like thinking about your coffee in the morning, that you drink the milk and there is DNA of cow, and somehow this DNA is taking over you. It’s just a molecule that you cannot really hack it, or it’s not a living organism. So I don’t think there is any risk over here.

IRA FLATOW: OK, this is quite fascinating. We’ll have to revisit this. Yaniv Erlich is assistant professor of Computer Science and Computational Biology at Columbia University in New York. Olgica Milenkovic is professor of Electrical and Computer Engineering at the University of Illinois at Urbana-Champaign. And Sri Kosuri is assistant professor of Biochemistry at UCLA in Los Angeles. Thank you all for joining us today.

Copyright © 2017 Science Friday Initiative. All rights reserved. Science Friday transcripts are produced on a tight deadline by 3Play Media. Fidelity to the original aired/published audio or video file might vary, and text might be updated or amended in the future. For the authoritative record of ScienceFriday’s programming, please visit the original aired/published recording. For terms of use and more information, visit our policies pages at http://www.sciencefriday.com/about/policies/

Meet the Producer

About Christie Taylor

@ctaylsaurus

Christie Taylor was a producer for Science Friday. Her days involved diligent research, too many phone calls for an introvert, and asking scientists if they have any audio of that narwhal heartbeat.

