Bioinformatics/Pharmaceuticals

"The pharmaceutical industry can no longer count on blockbuster drugs," says Kris Joshi, global strategy executive for IBM's Healthcare and Life Sciences division. "The model has to change from hunting elephants to hunting game. $800 million today for drug discovery brought through to market is far too high and needs to come down to the $100 million arena. Most of the large-scale diseases have drugs (to treat them with) now, and so the shift needs to be to somewhat narrower opportunities. Technology and changes to how technology is used will get us there."

Following the run-up in the bioinformatics industry in the year 2000, analysts started predicting that data-driven drug discoveries would come easier and faster. "People saw so much data from genome-scale work that they figured with that vast amount of data there must be answers," Joshi explains. The reality was different. There is no doubt that bioinformatics is heavy on data and data analysis, which is why most of the premier research has been done on large supercomputers. However, big data plus big computers does not necessarily equal larger-than-life results. "Two things need changing, and we're starting to see the beginnings," explains Joshi, "a move toward federated computing and a move to openness and self-describing data."

Computers have never scaled linearly in cost relative to performance. A single-CPU computer costs a mere fraction of a four-CPU one, and the cost disparity grows even larger as more CPUs (and memory and disk) are added. This is partly because of the interconnect technologies between the CPUs and partly because the market for such machines is far smaller. Bioinformatics has been a solid consumer of the highest-horsepower machinery around, which has also restricted the opportunities to the largest bio players and the largest elephants. That is a trend that, according to Joshi, needs to change if we want to see a broader set of solutions to medical problems.

Federated Computing

What is federated computing? The pioneer program in federated computing was the 1996 SERENDIP (Search for Extraterrestrial Radio Emissions from Nearby Developed Intelligent Populations) project that originated from the University of California, Berkeley. Thousands of Internet-connected enthusiasts volunteered part of their computer's time to searching massive databases of accumulated signals from space in an effort to find a minute signal in a monumental sea of static hiss, a signal that could be identified as originating from some form of extraterrestrial life. Volunteers were sent portions of radio-emissions data and specialized software to analyze the data. The results were then collected back from each individual volunteer and aggregated. Although only a hint of a possible extraterrestrial signal was subsequently identified, the success of this initial SETI project gave rise to many other large-scale distributed computing projects.

More recently, Stanford University inaugurated the Folding@Home project[2] to model protein "folding." Understanding how proteins fold (or misfold) can lead to discoveries related to specific diseases and related medical treatments. The largest supercomputer available today would still take 30 years to complete such a simulation. Fortunately, the process of simulating protein folding can be fairly easily broken up into separate pieces, making it ideal for federated computing. The Folding@Home project currently boasts that more than one million CPUs throughout the world have participated in its folding simulations; people simply visit the Web site and download the software to their home PC. Each PC then processes its piece in its "spare" time, transparently, and transmits the results back to the main organizing system. With Inescapable Data comes pervasive computing. Federating literally millions of individual computers into a huge processing complex is a technique that we think will bring solutions to some of medicine's most daunting problems.


[2] http://www.stanford.edu/group/pandegroup/folding/.
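
The mechanics behind such volunteer projects are simple at heart. The sketch below (in Python, with a made-up server address and a stand-in "analysis" function rather than the actual Folding@Home or SETI protocol) shows the fetch-compute-report loop each participating PC runs:

    import json
    import time
    import urllib.request

    # Hypothetical coordinator address and endpoints; real projects such as
    # SETI@home or Folding@Home use their own protocols and client software.
    SERVER = "http://example.org/workunits"

    def fetch_work_unit():
        # Ask the coordinating server for the next unprocessed piece of data.
        with urllib.request.urlopen(SERVER + "/next") as resp:
            return json.load(resp)          # e.g. {"id": 42, "samples": [...]}

    def analyze(samples):
        # Stand-in for the real science (a signal search, a folding step, ...).
        return {"peak": max(samples), "mean": sum(samples) / len(samples)}

    def report(unit_id, result):
        # Send the finished result back so the server can aggregate it.
        body = json.dumps({"id": unit_id, "result": result}).encode()
        req = urllib.request.Request(SERVER + "/result", data=body,
                                     headers={"Content-Type": "application/json"})
        urllib.request.urlopen(req)

    while True:                             # run only in the machine's spare time
        unit = fetch_work_unit()
        report(unit["id"], analyze(unit["samples"]))
        time.sleep(60)                      # be polite between work units

The server's only remaining job is to hand out pieces and aggregate the returned results, which is what lets millions of ordinary PCs behave like one enormous machine.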

Computers make two primary contributions in the bioinformatics and pharmaceutical space. One common use is massive data collection, analysis, and reduction. Terabyte data sources are common and need to be combined and analyzed with other similar-sized sources in processes that take days of supercomputer time. The second common use is actual simulation of biological activities, known as in-silico modeling, which can save years of experiments and costly, dangerous field trials. Although many problems in either category can exploit federated computing, many others cannot and require computing power that challenges the imagination. Tightly coupled clusters are a common approach, wherein high-end servers from IBM, Sun, SGI, or HP are tied directly together as a single large machine, often with 2,000 processors working in concert. As large as those systems are, they are still not powerful enough for the more daunting data-based problems. A teraflop (a trillion mathematical floating-point operations per second) is not enough, and the industry is looking for petaflop-size machines (1,000 times more powerful).
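
The distinction between those two workloads matters for how the computing is organized. Data-reduction work can often be chopped into independent pieces, whereas a tightly coupled simulation cannot. The following small Python sketch, using the standard multiprocessing module and made-up numeric "chunks" in place of real terabyte-scale inputs, shows the first kind of problem, the kind that splits naturally across many processors:

    from multiprocessing import Pool

    # Illustrative only: pretend each "chunk" is one slice of a terabyte-scale
    # data set (in practice these would be files or database partitions).
    CHUNKS = [range(i * 1_000_000, (i + 1) * 1_000_000) for i in range(8)]

    def reduce_chunk(chunk):
        # Each chunk is reduced independently; no worker talks to any other.
        total = 0
        count = 0
        for value in chunk:
            total += value
            count += 1
        return total, count

    if __name__ == "__main__":
        with Pool() as pool:                # one worker per available CPU
            partials = pool.map(reduce_chunk, CHUNKS)
        grand_total = sum(t for t, _ in partials)
        grand_count = sum(c for _, c in partials)
        print("overall mean:", grand_total / grand_count)

Because each piece is independent, the same pattern scales from the CPUs in one box to a cluster, a grid, or a worldwide federation of volunteer PCs.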

Enter the much-talked-about grid computing, which connects groups of machines within or between institutions and across different computer manufacturers' models. Part of the challenge for a commercial or research institution is balancing its bursty computing needs against its bank account. Typically, institutions under-buy computing horsepower because there are gaps in usage. As a consequence, many analyses take months instead of perhaps a week (a week if they had access to the highest-power machinery during the needed interval). Sun's Sun ONE Grid Engine (among similar offerings from competitors) enables organizations to pool various high-end computing resources under one management umbrella and exploit that power more efficiently. It is this sort of cooperation and sharing that is fundamental in the Inescapable Data world and nets us faster time to important discoveries.
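
In day-to-day use, handing work to such a grid is deliberately unglamorous. The sketch below assumes a typical Grid Engine installation where batch jobs are submitted with qsub; the job name, the 100-task array, and the analyze_slice program are illustrative placeholders, not part of any particular product:

    import subprocess

    # Assumes a typical (Sun) Grid Engine installation where batch work is
    # submitted with `qsub`; the job name, array size, and analyze_slice
    # program below are illustrative placeholders.
    job_script = (
        "#!/bin/sh\n"
        "#$ -N analysis_array\n"            # job name
        "#$ -cwd\n"                         # run from the submission directory
        "#$ -t 1-100\n"                     # array job: tasks 1..100, one slice each
        "./analyze_slice input.$SGE_TASK_ID\n"
    )

    with open("analysis.job", "w") as fh:
        fh.write(job_script)

    # The grid engine queues the 100 tasks and spreads them across whatever
    # pooled machines happen to be idle, which is the resource sharing the
    # text describes.
    subprocess.run(["qsub", "analysis.job"], check=True)

The researcher sees one submission command; the pooled hardware underneath it may belong to several departments or even several institutions.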

The pharmaceutical industry is changing, driven partly by the lack of elephants and partly by more ingenious and enterprising players. Keep in mind that because we now live far longer than ever before, we are less likely to die from some common mass-scale disease (for which drugs have been discovered) and more likely to die of a narrower variant of a disease or a combination of diseases. This is propelling the industry toward change, and several elements are in place, leading to a synergy:

  • The ability to combine separate computers for a single task (federations, clusters, and grids) on an unprecedented scale

  • Open source in bioinformatics and related arenas

  • Data sharing via self-describing formats

Although part of the industry is very much proprietary, secretive, and closely guarded (due to the historically high value of elephant-level drug discoveries and the costs of getting there), such companies will prove to be more the dinosaurs than the norm going forward. Universities and smaller research institutions are finding power in banding together and working the common elements of different problems, somewhat of an outsourcing model but with a research flavor. Participating companies get to share hardware infrastructure and a core amount of analytical work, which frees them to focus on their particular drug-discovery details.

The open-source movement in the bio/pharma space is particularly interesting. Typical commercial bioinformatics software can cost a million dollars or more for a site license. As a result of this extreme cost, as much as 25 percent of all software used in the life-sciences field is now created in the open-source (free) community,[3] a share far higher than in any other discipline. Quite likely, the adoption and use of open source will continue. The heart of bioinformatics is data, and data is being collected and discovered (via analysis) at historically high levels. Shifting to a model of higher cooperation and sharing of information and tools should yield more discoveries.

[3] http://www.lifescienceit.com/biwaut02secrecy.html.
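
Biopython (www.biopython.org) is one concrete example of that free, community-built tooling. The short sketch below, with a made-up sequence and file name, does the kind of routine sequence work (translating DNA, reading a FASTA file) that researchers perform constantly:

    # Biopython is an open-source (free) community toolkit; the sequence and
    # file name here are made up for illustration.
    from Bio import SeqIO
    from Bio.Seq import Seq

    # Translate a DNA sequence to protein using nothing but open-source code.
    dna = Seq("ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG")
    print(dna.translate(to_stop=True))

    # Read records from a FASTA file, a plain-text community format.
    for record in SeqIO.parse("sequences.fasta", "fasta"):
        print(record.id, len(record.seq))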

The Clinical Data Interchange Standards Consortium (CDISC, www.cdisc.org) is a nonprofit organization supported by major industry players to enable better information sharing and exchange. Its standards are based on XML and other open technologies and allow a wide range of data related to clinical and nonclinical work to be readily consumed by the FDA and other interested parties. As with any XML implementation, the formats are vendor neutral and both machine and human readable, allowing for faster integration and use. In the Inescapable Data world, more and more intercompany groups are forming for the purpose of paving faster pathways to information usage and sharing, particularly when some government approval or regulatory step is involved. We should expect this level of cooperation, openness, and standards in other industries as well, for the same economic and time-efficiency reasons.
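
To see why self-describing XML helps, consider a schematic fragment in the spirit of CDISC's XML-based models, processed with nothing more than Python's standard library. The element names and values below are simplified for illustration; real documents carry namespaces, identifiers, and audit metadata omitted here:

    import xml.etree.ElementTree as ET

    # A schematic fragment in the spirit of CDISC's XML-based models; real
    # documents carry namespaces, OIDs, and audit metadata omitted here.
    doc = """
    <ODM>
      <ClinicalData StudyOID="STUDY-001">
        <SubjectData SubjectKey="0042">
          <ItemData ItemOID="SYSBP" Value="128"/>
          <ItemData ItemOID="DIABP" Value="84"/>
        </SubjectData>
      </ClinicalData>
    </ODM>
    """

    root = ET.fromstring(doc)
    # Because the format is self-describing text, any party (a sponsor, the
    # FDA, a reviewer with a text editor) can read the same file without
    # vendor-specific tools.
    for subject in root.iter("SubjectData"):
        for item in subject.iter("ItemData"):
            print(subject.get("SubjectKey"), item.get("ItemOID"), item.get("Value"))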


