Massive Data, Massive Value | Inescapable Data: Harnessing the Power of Convergence (paperback)

DNA

DNA is the information molecule of life. "Interestingly, the DNA of any two individuals on Earth is actually 99.9% alike," says Dr. Phil Reilly, CEO of Interleukin Genetics. "There are 3.1 billion base pairs in the human genome," continues Dr. Reilly, "yet a mere 0.1% of 6.2 billion is 6.2 million differences, which explains our remarkable distinctions." The study of those variations leads to the discovery of genetic predisposition to various diseases, and hence the great interest in mapping the human genome and studying it intimately.
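The arithmetic in Dr. Reilly's remark can be checked in a few lines. The quote does not say where the 6.2 billion figure comes from; the sketch below assumes it counts both copies of the 3.1-billion-base-pair genome that each person carries, and that reading is an assumption rather than something the source states.

```python
# Check the figures quoted above. The doubling step is an assumption: it treats
# the 6.2 billion figure as both copies of the genome (one from each parent).
BASE_PAIRS_PER_COPY = 3_100_000_000
COPIES = 2                       # assumption: two copies per person
DIFFERENCES_PER_THOUSAND = 1     # 0.1% expressed as 1 in 1,000

total_bases = BASE_PAIRS_PER_COPY * COPIES
differences = total_bases * DIFFERENCES_PER_THOUSAND // 1000

print(f"{total_bases:,} bases compared")
print(f"{differences:,} expected differences")
```

Integer arithmetic is used throughout so that the result is exact rather than subject to floating-point rounding.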

"Suppose you and I compare our Gene A," adds Dr. Reilly, "and we find that two or three places out of 30,000 bits [for a gene] are slightly different. Perhaps you have a greater incidence of heart disease, and it is related to those bit differences affecting the functionality of a protein that is coded for that gene and now has a different efficiency of some metabolic pathway downstream." Through extremely computer-laborious number crunching and cross-analysis, key associations are learned that in turn lead to new drug therapies or treatment protocols, only possible through the power intrinsic in the data and the ability to analyze it.

DNA Analysis and Anti-Spam

In a rather odd twist, computational biologists at IBM's Watson Research Center have created an anti-spam filter based on tools and techniques used to analyze genetic sequences.^[1] E-mail spam is a growing and annoying problem for almost everyone. The companies and people creating the spam are becoming increasingly clever in how they craft e-mails to evade the commonplace filters. Part of the challenge is that erroneously blocking valid e-mail is considered a far worse problem than letting some spam through. Spammers exploit this by adding extra text to messages and subject headers to make their e-mail seem legitimate and desired. IBM's Bioinformatics Research Group started adapting some of its biological pattern-matching algorithms for e-mail analysis back in 2003. The anti-spam algorithm grew out of an algorithm called Teiresias, which researchers were using for protein annotation. Determining the properties of a protein (such as function and structure) turns out to be similar to analyzing and understanding e-mail messages.

The researchers trained the algorithm on nearly 100,000 messages. Over time, the algorithm learned the similarity between a dollar sign ($) and the letter S, because spammers often disguise words like this: "$ave big on new enhancing pills." In tests, this new and unconventional approach recognized more than 97 percent of true spam e-mails. The advances in computing and algorithms for gene sequencing and analysis turn out to have a valid deployment in more commonplace areas. Inescapable Data and networking allow such cross-uses to be made and their value applied to broader areas of our lives.
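To make the "$" for "S" trick concrete, here is a minimal sketch of undoing such substitutions before looking for telltale phrases. It is not the Teiresias-based algorithm, which learned these equivalences from training messages; the hand-written look-alike table, the phrase list, and the function names are all assumptions made for illustration.

```python
# Hypothetical look-alike table. The source mentions only "$" standing in for
# "S"; the other entries are assumptions added for illustration.
LOOKALIKES = str.maketrans({"$": "s", "@": "a", "0": "o"})


def normalize(text: str) -> str:
    """Lower-case the text and map look-alike characters to plain letters."""
    return text.lower().translate(LOOKALIKES)


def looks_like_spam(subject: str, phrases: tuple[str, ...] = ("save big", "pills")) -> bool:
    """Flag a subject line if any listed phrase appears after normalization."""
    cleaned = normalize(subject)
    return any(phrase in cleaned for phrase in phrases)


print(normalize("$ave big on new enhancing pills"))
print(looks_like_spam("$ave big on new enhancing pills"))
print(looks_like_spam("Lunch on Friday?"))
```

A fixed table like this is easy for spammers to sidestep with a new substitution, which is one reason a learned approach is attractive.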

^[1] http://news.bbc.co.uk/1/hi/technology/3584534.stm.

Sequencing of the human genome cost approximately $3 billion and took nearly 15 years. The amount of data this project produced is staggering. There are approximately 25,000 genes hidden among the 6 billion bases to express the "generic" human genome. The first step was to acquire the data. But that process provides only a generic set of data for a species (useful, to be sure, for detailed analysis of individual genes and events). However, genetics researchers now project that within as few as 5 to 10 years, they will be able to map an individual's personal genome and do so at a cost of as little as $1,000 per individual.

What is the value of handing someone $1,000 to map your own personal genome? Some medical researchers believe that knowing your personal genome (carried with you, perhaps on your PDA or ID card) will allow your doctor to prescribe customized medication regimens to more effectively treat specific diseases.

"There is a particular drug for lung cancer, Iressa, that for most of the people taking the drug it is of little or no value," Dr. Reilly explains. "But, for a small number who have a particular gene variant, the drug is of great value for some reasons of protein interaction that we don't fully understand. If we could merely identify ahead of time which people would react perfectly to this drug, we have achieved a major advance in medicine. We could now choose a drug that matches your particular genome and we enter an age of personalized medicine." Reaching this conclusion was only possible through massive information sifting.

"We should soon be able to screen for gene variants which will signify risk for many more disorders. This would allow early interventions to avoid becoming ill at all," exclaims Dr. Reilly. Dr. Reilly believes that with fast mapping and identification of individual genomes, people could potentially avoid contracting thousands of diseases and that this could be done at a very early stage in life, but we need many more genome mappings and analysis (without waiting 15 years and spending another $3 billion).

Today, no individual computer available for use in a clinical setting can process such massive quantities of data in any reasonable amount of time. Your laptop, for example, has less than half a terabyte of disk space, and it would take your computer three hours just to read and digest every byte of your 3.1 billion base pairs before it could even begin to process them. To exploit this gene-centric therapy opportunity, the computer world will have to embrace many changes. One change under discussion is the move to "federated computing," where a great many computers each work on a very small part of a large problem and then make the results generally available. In the Inescapable Data world, the value of information is in its sharing.
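The federated idea of splitting a large problem into very small parts, working on them separately, and then combining the results can be sketched on one machine. In the sketch below a thread pool stands in for the many cooperating computers, and counting bases stands in for the real analysis; the function names, chunk size, and sequence fragment are all assumptions made for illustration.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def split_into_chunks(sequence: str, chunk_size: int) -> list[str]:
    """Cut the sequence into consecutive pieces of at most chunk_size bases."""
    return [sequence[i:i + chunk_size] for i in range(0, len(sequence), chunk_size)]


def count_bases(chunk: str) -> Counter:
    """The small job each participant would run on its own piece."""
    return Counter(chunk)


def federated_base_count(sequence: str, chunk_size: int = 4, workers: int = 3) -> Counter:
    """Farm the chunks out to workers, then merge the partial results."""
    chunks = split_into_chunks(sequence, chunk_size)
    total = Counter()
    # The thread pool is a stand-in for separate machines in a federation.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for partial in pool.map(count_bases, chunks):
            total += partial
    return total


sequence = "ATGGCCTTAGACCGT"  # made-up fragment
result = federated_base_count(sequence)
print(dict(result))
```

Counting works well here because partial results can be merged in any order. Analyses whose pieces depend on one another are harder to federate.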

Fortunately, there is a slow but steady movement in the medical field to allow data to be shared as information (in self-describing XML or other formats). As a citizen concerned about your own longevity, would you be willing to make available your genome (in its totality or in pieces) for analysis by a large faculty of university-based researchers, or even a young but promising college-level pre-med student with a theory on the best way to quantify prostate cancer risk? Would you make available a fraction of your computer's CPU time and resources as a participant in a global federation of computers dedicated to finding genetically linked cures for certain diseases? In the world of Inescapable Data, both are distinct possibilities.

Simply restated, our Inescapable Data view is that, while analyzing relationships among data elements within a single data set is an interesting and useful endeavor, it is the analysis of relationships between disparate data sets that yields new and possibly more significant value. Suppose, for example, you correlate the electronic records of your visits to the supermarket with your genome and with the genomes, diets, and disease histories of 2 billion other people. You might discover some hidden dangers lurking on those supermarket shelves that could be avoided. What then happens when you correlate those results with the frequency of your visits to the health club? (Okay, we already know that answer.) Impossible? Today, yes, because we lack total electronic record keeping. But the data is starting to be gathered today, and the connections are being made today that will allow such analyses in the near future.
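Correlating disparate data sets comes down to two steps: line up the records that describe the same person, then measure how the values move together. The sketch below does this for two tiny made-up tables using the Pearson correlation coefficient. The person IDs, the figures, and the choice of Pearson correlation are all assumptions made for illustration.

```python
from math import sqrt

# Made-up records keyed by a hypothetical shared person ID.
weekly_snack_purchases = {"p1": 2, "p2": 5, "p3": 8, "p4": 11}
weekly_gym_visits = {"p1": 5, "p2": 4, "p3": 2, "p4": 1, "p5": 3}


def pearson(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    return covariance / sqrt(spread_x * spread_y)


# Step 1: keep only the people who appear in both data sets.
shared_ids = sorted(weekly_snack_purchases.keys() & weekly_gym_visits.keys())

# Step 2: measure how the two series move together.
snacks = [weekly_snack_purchases[pid] for pid in shared_ids]
visits = [weekly_gym_visits[pid] for pid in shared_ids]
r = pearson(snacks, visits)

print(shared_ids)
print(round(r, 2))
```

A value of r near -1 or +1 indicates that the two series move together strongly; it does not by itself show that one causes the other. In practice the hard part is step 1, because separate organizations rarely share a common identifier for the same person.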

Data Collected About You

How farfetched is it that someone today could analyze your food consumption against your workout schedule? Nearly every supermarket now accepts bank debit and credit cards for purchases. With the high prices of food, a typical family's weekly shopping trip costs more than $200. Many people are reluctant to carry that much cash and opt for the convenience of bank cards. As a result, your supermarket knows your name and every box of cereal and loaf of carbohydrate-laden bread you purchase. As for the health club, even YMCAs today are using electronic ID badges for daily entry, and these badges track every time you enter and leave a facility. Furthermore, newer workout machines also enable you to swipe your card through the exercise equipment itself to better track or help you adhere to your workout schedule. So, there is already a meaningful amount of health data being collected about you that, through analysis, could be correlated with a variety of other common databases, such as geography, socio-economics, prevailing weather, and perhaps now your personal medical information. The valuable difference the Inescapable Data world brings is the interaction of personal and detailed databases that allows for new insights.

The famous Framingham Heart Study took 20 years before researchers who studied and gathered data from thousands of individuals produced their findings. That data is now available in some electronic form for other researchers to exploit and attempt to correlate with other data streams. Such studies are taking place continuously in the medical world, and their details and results are increasingly being made available for wider exploitation. Web connectivity and standard electronic records for describing the data (such as XML-expressed information) allow for more rapid integration into other research. What once would have taken a team of programmers weeks to decode via a proprietary database can now be done nearly instantly and even by average citizens (much like average citizens can look up real estate values in their town or search for a blender at WalMart.com). Furthermore, just finding the interesting databases is dramatically simpler due to search tools and a new emphasis on sharing within the medical communities. In the world of Inescapable Data, it is highly likely that the data needed to do meaningful medical research already exists somewhere. As a researcher or even as an individual, you can tap into it and make correlations.
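The advantage of self-describing records is that each value carries its own label, so a few lines of general-purpose code can read them without knowledge of a proprietary layout. The sketch below parses a small XML record with Python's standard library. The element names, attributes, and values are invented for illustration and do not reflect the format of any actual study.

```python
import xml.etree.ElementTree as ET

# Hypothetical record layout, invented for illustration only.
RECORD = """
<studyRecord>
  <participant id="0001">
    <age units="years">52</age>
    <systolicPressure units="mmHg">138</systolicPressure>
    <smoker>false</smoker>
  </participant>
</studyRecord>
"""


def read_participants(xml_text: str) -> list[dict]:
    """Turn each <participant> element into a dictionary of its fields."""
    root = ET.fromstring(xml_text)
    participants = []
    for node in root.findall("participant"):
        fields = {"id": node.get("id")}
        for child in node:
            # The tag name itself says what the value means.
            fields[child.tag] = child.text
        participants.append(fields)
    return participants


people = read_participants(RECORD)
print(people[0])
```

Note that the reader never hard-codes the field names: a record with an extra measurement would be picked up without any change to the code. Values arrive as text, so a real analysis would still need to convert them and honor the stated units.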