Nick Goldman and his DNA encoding scheme

Forget hard disks or DVDs. If you want to store vast amounts of information, look instead to DNA, the molecule of which genes are made. Scientists in the UK have stored about a megabyte’s worth of text, images and speech in a speck of DNA and then retrieved that data almost faultlessly. They say that a larger-scale version of the technology could provide an extremely dense and long-lived form of digital storage that is particularly well suited to data archiving.

As ever-greater quantities of electronic data are produced, the problem of how to store that data becomes more acute. There are many options for archiving data but all have their drawbacks. For example, hard disks used in data centres are expensive and need a constant source of electricity, and magnetic tape, while requiring no power, starts to degrade after a few years.

Neanderthal bones

In the latest research, Nick Goldman and colleagues at the European Bioinformatics Institute near Cambridge have stored digital information by encoding it in the four different bases that make up DNA. While the storage technique offers neither random access nor the ability to rewrite data, it does have a couple of major advantages. One is its extremely high density – as a result of the information being stored at the molecular level – and the other is its durability. As Goldman points out, intact DNA has been extracted from Neanderthal bones tens of thousands of years old. “Nature has discovered that this molecule is very stable,” he says. “And we are piggy-backing on nature.”

The group used DNA that was produced in the lab rather than DNA from inside living organisms, since the latter is vulnerable to mutation and hence data loss. But in choosing this approach the researchers had to overcome a couple of significant hurdles. One was the fact that, using current technology, it is only possible to make, or “synthesize”, DNA in short strings – and the shorter a string, the lower its information-carrying capacity. To get round this problem, Goldman and colleagues devised a coding scheme in which a fraction of each string is reserved for indexing purposes, specifying which file the string belongs to and at what point in the file it is located, so allowing a single file to be made up of many strings.
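The indexing idea can be illustrated with a minimal Python sketch. It is not the researchers' actual format: the `Fragment` fields, the function names and the chunk length are all hypothetical, chosen only to show how a file identifier and a position let shuffled short strings be put back together.

```python
from typing import List, NamedTuple


class Fragment(NamedTuple):
    """One short string: index information plus a slice of the file (hypothetical layout)."""
    file_id: int    # which file this fragment belongs to
    position: int   # where in that file the fragment sits
    payload: str    # the slice of data carried by this fragment


def split_with_index(data: str, file_id: int, chunk_len: int) -> List[Fragment]:
    """Cut data into chunks of at most chunk_len and tag each with its index."""
    return [
        Fragment(file_id, start // chunk_len, data[start:start + chunk_len])
        for start in range(0, len(data), chunk_len)
    ]


def reassemble(fragments: List[Fragment], file_id: int) -> str:
    """Pick out one file's fragments and join them in position order."""
    own = sorted(
        (f for f in fragments if f.file_id == file_id),
        key=lambda f: f.position,
    )
    return "".join(f.payload for f in own)


if __name__ == "__main__":
    pool = split_with_index("ACGTACGTAC", file_id=1, chunk_len=4)
    pool += split_with_index("TTGGCC", file_id=2, chunk_len=4)
    pool.reverse()  # the order in which fragments are read back does not matter
    print(reassemble(pool, 1))
    print(reassemble(pool, 2))
```

Because every fragment carries its own address, the fragments can be mixed together in one sample and read in any order.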

Encoding trits

The second challenge was how to avoid errors that occur during both writing and reading, a particular problem when neighbouring bases are of the same variety. The solution was simply to encode data in trits – digits with the values 0, 1 or 2 – and stipulate that a given trit is represented by one of the three bases not used to code the trit immediately preceding it. An additional measure was to copy the final 75% of each string into the start of the successive string.
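A rough Python sketch shows how such a rule keeps neighbouring bases different. Several details here are assumptions rather than the team's published method: each byte is written as six fixed-width trits, the bases are taken in alphabetical order, the base "before" the first trit is arbitrarily set to A, and the segment length in the overlap helper is illustrative.

```python
from typing import List

BASES = "ACGT"
TRITS_PER_BYTE = 6  # assumption: 3**6 = 729 is enough to cover 256 byte values


def bytes_to_trits(data: bytes) -> List[int]:
    """Write each byte as six base-3 digits, most significant first."""
    trits = []
    for byte in data:
        digits = []
        for _ in range(TRITS_PER_BYTE):
            digits.append(byte % 3)
            byte //= 3
        trits.extend(reversed(digits))
    return trits


def trits_to_bytes(trits: List[int]) -> bytes:
    """Inverse of bytes_to_trits."""
    out = bytearray()
    for i in range(0, len(trits), TRITS_PER_BYTE):
        value = 0
        for trit in trits[i:i + TRITS_PER_BYTE]:
            value = value * 3 + trit
        out.append(value)
    return bytes(out)


def trits_to_dna(trits: List[int], prev: str = "A") -> str:
    """Each trit selects one of the three bases that differ from the previous base."""
    out = []
    for trit in trits:
        choices = [b for b in BASES if b != prev]
        prev = choices[trit]
        out.append(prev)
    return "".join(out)


def dna_to_trits(dna: str, prev: str = "A") -> List[int]:
    """Inverse of trits_to_dna; a repeated base raises ValueError."""
    trits = []
    for base in dna:
        choices = [b for b in BASES if b != prev]
        trits.append(choices.index(base))
        prev = base
    return trits


def overlapping_segments(dna: str, length: int, overlap: float = 0.75) -> List[str]:
    """Cut dna into segments whose final `overlap` fraction starts the next segment."""
    step = max(1, round(length * (1 - overlap)))
    if len(dna) < length or (len(dna) - length) % step:
        raise ValueError("this sketch does not pad; pick a length that fits")
    return [dna[i:i + length] for i in range(0, len(dna) - length + 1, step)]


if __name__ == "__main__":
    dna = trits_to_dna(bytes_to_trits(b"DNA"))
    print(dna)
    print(overlapping_segments(dna, length=8))
    print(trits_to_bytes(dna_to_trits(dna)))
```

Since the base chosen for a trit is always drawn from the three that differ from its predecessor, no run of identical bases can occur. With a 75% overlap, each stretch of the sequence appears in four different segments, so a segment that is lost or misread can be covered by its neighbours.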

The team tested the scheme by encoding five data files into single DNA sequences and then splitting those sequences up into roughly 150,000 individual strings, all 117 bases long. Fittingly, one of the files was a PDF of Watson and Crick’s famous double-helix paper – successfully encoded into double helices. The text of Shakespeare’s sonnets was also stored, as was a 30 s audio recording of Martin Luther King’s “I have a dream” speech in MP3 format. The team then uploaded the encoded files to a private webpage to enable Agilent Technologies in California to synthesize the DNA. This involved using a sophisticated kind of inkjet printer to fire chemical reagents onto a microscope slide in such a way as to add one molecule at a time to a growing string of DNA, and then repeating the process to produce the thousands of strings required.

Sent as a tiny quantity of powder at room temperature and without specialized packaging, the DNA arrived in Heidelberg, Germany, at the main site of the European Molecular Biology Laboratory, of which the European Bioinformatics Institute is a part. After being put into solution the DNA was read, or “sequenced”, using a now fairly standard laboratory machine, and the resulting series of bases was then decoded on a computer to reproduce the five files. Four of the files were identical copies of the originals, while the fifth required some minor adjustment to recover its full set of data.

Video in a teacup

Goldman and colleagues claim to have achieved a density of 2 petabytes (2 × 10^15 bytes) per gram of DNA which, they calculate, would allow at least 100 million hours of high-definition video to be stored in a teacup. Their DNA sample was therefore very small. “In our test tube the DNA looks like a speck of dust,” says Goldman. “In fact the sample is so small that when it arrived it looked like the test tube was empty.”
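The scale of that claim can be checked with a few lines of Python. The density is the figure quoted above, but the video data rate is purely an assumption made here for illustration (5 gigabytes per hour); the researchers' own figure for high-definition video is not given in this article.

```python
DENSITY_BYTES_PER_GRAM = 2e15   # 2 petabytes per gram, as quoted by the team
ASSUMED_BYTES_PER_HOUR = 5e9    # assumption: 5 GB per hour of HD video


def grams_needed(total_bytes: float, density: float = DENSITY_BYTES_PER_GRAM) -> float:
    """Mass of DNA required to hold total_bytes at the given storage density."""
    return total_bytes / density


if __name__ == "__main__":
    hours = 100e6  # 100 million hours of video
    total_bytes = hours * ASSUMED_BYTES_PER_HOUR
    print(f"{total_bytes / 1e15:.0f} petabytes -> {grams_needed(total_bytes):.0f} g of DNA")
```

Under that assumed data rate the archive comes to 500 petabytes, or about 250 g of DNA; a different data rate would scale the mass in proportion.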

Currently the technology is too expensive to be competitive for all but the most long-term archiving. But Goldman is confident that prices will come down, given the continuing interest in DNA research. He says that if the cost of synthesizing DNA falls by a factor of 100 over the next decade, which he believes is possible, the technique will be as cheap as magnetic tape for archives extending over at least 50 years. This is because unlike tapes, which need to be periodically rewritten, DNA remains unchanged as long as it is stored somewhere that is cold, dry and dark.

The current work follows similar research done last year by a team that included Sriram Kosuri of Harvard Medical School. His group used an encoding scheme that involved bits rather than trits and which included relatively little redundancy. However, he says that the two techniques are nevertheless “similar approaches to the same concept,” adding that both sets of research show DNA storage to be “approaching scales that should be of interest to investors”.

The latest research is published in Nature.

