Life as Digital Information


James Watson, one of the two scientists credited with the discovery of the structure of DNA, was once asked by a journalist to summarize the significance of his discovery in a single sentence. Watson thought hard for a moment, and then replied, "All life is digital information."


As people living in the 21st century, we are deeply embedded in a digital world. From playing music on our MP3 players to checking our e-mail, we have become increasingly dependent on digital information. When we check e-mail, we take for granted what is happening behind the scenes within the buzzing box on our desk. Recall that a computer relies on two essential components: information, and a way to interpret that information and translate it into function. A computer stores information in bits, each of which is a simple distinction between "on" and "off," or "yes" and "no." With a sufficient number of bits, a computer can represent and store large numbers, images, or words in a language. That information is useless, however, unless the computer also has rules for interpreting and manipulating it. As users, we install software, or programs, onto our computer to serve as those rules.


Language is just one way to convey information. In the English language, the smallest unit of information is a letter. However, the smallest unit of useful information is a word. Combining our simple list of twenty-six letters in a variety of ways will make many different words, but not all of the possible combinations of letters are allowed. In fact, the average person only uses 3500 to 4000 words to communicate. Just as random letters don’t always create a meaningful word, neither does a string of words necessarily convey information. It is the rules of language that dictate how words can be combined. The results can be as beautiful and varied as a haiku or a long novel.


Life, by nature, is very complex. Just as a novel consists of thousands of words that are created from a limited alphabet of letters, the instructions for life are written using an alphabet of just four letters. These letters are simply molecules that we call bases: adenine (A), cytosine (C), guanine (G), and thymine (T). Each base is combined with a sugar molecule and a phosphate molecule to make a larger molecule that is called a nucleotide. The nucleotides bind together to make a long string. Two of these strings can then line up side by side. The bases of the nucleotides in one string bind to the bases in the other string, like the rungs that join the two sides of a ladder together. The structure of the DNA molecule, this double-stranded ladder that twists like a spiral staircase, was the major discovery credited to James Watson and Francis Crick in the 1950s.


This simple alphabet of just four letters can easily be represented digitally as “bits”. For instance, if we had just two letters, we could represent each letter with a single bit, 0 or 1. However, in this case we have four letters, so we need two bits per letter. We can use pairs of 1s and 0s to represent the four letters of our alphabet as follows:


A = 1 1


T = 1 0


C = 0 1


G = 0 0
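The table above can be turned into a small program. The following is only a sketch in Python: the names BASE_TO_BITS, encode, and decode are made up for this illustration, and the sample sequence is arbitrary.

```python
# Two-bit code for each base, copied from the table above.
BASE_TO_BITS = {"A": "11", "T": "10", "C": "01", "G": "00"}

# The reverse lookup, from a pair of bits back to a base.
BITS_TO_BASE = {bits: base for base, bits in BASE_TO_BITS.items()}


def encode(sequence):
    """Turn a string of bases into a string of 0s and 1s."""
    return "".join(BASE_TO_BITS[base] for base in sequence)


def decode(bits):
    """Turn a string of 0s and 1s back into bases, two bits at a time."""
    return "".join(BITS_TO_BASE[bits[i:i + 2]] for i in range(0, len(bits), 2))


message = "GATTACA"           # an arbitrary example sequence
as_bits = encode(message)
print(as_bits)                # 00111010110111
print(decode(as_bits))        # GATTACA
```

Seven letters become fourteen bits, and decoding those bits gives back the original letters, so nothing is lost in the translation.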



Now that we have an alphabet, we need a set of rules to help make sense of the alphabet. The genetic code that governs all life has only a couple of rules. The first rule concerns the actual structure of the molecules that make up DNA. The shape of the molecules A, T, G, and C means that they can combine to make the rungs of the ladder in only a few ways: adenine and thymine can combine with each other, and cytosine and guanine can combine with each other. Other pairings are highly unlikely. So the four possible rungs of the DNA ladder are AT, TA, CG, and GC. This means that one side of the ladder is complementary to the other; you can predict the string of molecules comprising one side of the ladder if you know the other side. The other rule relates to the way the information contained in a DNA molecule is read. If we were to read only one side of the ladder at a time, from top to bottom, we wouldn’t be sure where one word ends and another begins. Life eliminates this problem by reading the ladder three nucleotides at a time; each word consists of only three letters. We call each three-letter word a codon. With only four letters, there are 4 × 4 × 4 = 64 ways that they can combine in groups of three. Remember that each letter can be translated into two bits of information. If there are three letters in each codon, then a single codon represents 2 × 3 = 6 bits of information.
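Both rules are simple enough to try out in a few lines of Python. This is a rough sketch, not a model of real biology: the names PAIRS, complement, and codons are invented for this example, the sample sequences are arbitrary, and the pairing is done rung by rung exactly as described above.

```python
from itertools import product

# Rule 1: A pairs with T, and C pairs with G.
PAIRS = {"A": "T", "T": "A", "C": "G", "G": "C"}


def complement(strand):
    """Predict the other side of the ladder, one rung at a time."""
    return "".join(PAIRS[base] for base in strand)


def codons(strand):
    """Rule 2: read three letters at a time, dropping any leftover letters."""
    usable = len(strand) - len(strand) % 3
    return [strand[i:i + 3] for i in range(0, usable, 3)]


# Every possible three-letter word built from the four letters.
ALL_CODONS = ["".join(letters) for letters in product("ATCG", repeat=3)]

print(complement("ATGCCG"))   # TACGGC
print(codons("ATGCCGTA"))     # ['ATG', 'CCG']
print(len(ALL_CODONS))        # 64
```

Listing every codon confirms the count of 4 × 4 × 4 = 64, and applying complement twice returns the original side of the ladder.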


We learned in basic biology that genes are the units of heredity in our cells that serve as the blueprint for life. Genes are made up of DNA, and can therefore be thought of as sequences of codons. We can think of a gene as analogous to a paragraph in the English language. Each gene contains several thousand bits of information. A human has roughly 20,000 to 30,000 genes, depending on the estimate. Simple organisms have far fewer. The total number of base pairs (rungs of the ladder) in our DNA is about 3 billion, or 3 × 10^9. If we wanted to, we could calculate the digital information content of the human genetic code. Each base pair carries two bits, because one side of the ladder determines the other, so the information content is 2 × 3 × 10^9 = 6 billion bits. Recall that 8 bits equals one byte. So the information content is (6 × 10^9)/8 = 7.5 × 10^8 bytes, or 750 megabytes, equal to the information contained in about 500 books, or roughly the capacity of one compact disc. You could easily store the information of the human genetic code on a personal computer or small memory card… in your pocket!
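The arithmetic above is easy to check. The short Python sketch below uses the same round figure of 3 billion base pairs and takes a megabyte to be one million bytes; the variable names are chosen only for this example.

```python
BASE_PAIRS = 3 * 10**9     # rungs of the ladder, the round figure used above
BITS_PER_BASE_PAIR = 2     # one of four letters takes two bits
BITS_PER_BYTE = 8

total_bits = BASE_PAIRS * BITS_PER_BASE_PAIR
total_bytes = total_bits // BITS_PER_BYTE
total_megabytes = total_bytes / 10**6   # assuming 1 megabyte = 1,000,000 bytes

print(total_bits)        # 6000000000
print(total_bytes)       # 750000000
print(total_megabytes)   # 750.0
```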