DM FOR BIOINFORMATICS
Bioinformatics is the science of storing, extracting, organizing, analyzing, interpreting, and utilizing information from biological sequences and molecules. It has been fueled mainly by advances in DNA sequencing and mapping techniques. The Human Genome Project has resulted in an exponentially growing DB of genetic sequences. KDD techniques are playing an increasingly important role in the analysis and discovery of sequence, structure, and functional patterns or models from large sequence DBs. High-performance techniques are also becoming central to this task (Han et al., 2001; Han & Kamber, 2001).
Bioinformatics provides opportunities for developing novel mining methods. Some of the main challenges in bioinformatics include protein structure prediction, homology search, multiple alignment and phylogeny construction, genomic sequence analysis, gene finding and gene mapping, as well as applications in gene expression data analysis and drug discovery in the pharmaceutical industry. As a consequence of the large amounts of data produced in the field of molecular biology, most current bioinformatics projects deal with the structural and functional aspects of genes and proteins. Many of these projects are related to the Human Genome Project. The data produced by thousands of research teams all over the world are collected and organized in DBs specialized for particular subjects; examples include GDB, SWISS-PROT, GenBank, and PDB. Computational tools are needed to analyze the collected data in the most efficient manner. For example, bioinformaticists are working on the prediction of the biological functions of genes and proteins based on structural data (Chalifa-Caspi, Prilusky, & Lancet, 1998). Another example of a bioinformatics application is the GeneCards encyclopedia (Rebhan, Chalifa-Caspi, Prilusky, & Lancet, 1997). This resource contains data about human genes, their products, and the diseases in which they are involved.
Since DM offers the ability to discover patterns and relationships from large amounts of data, it seems ideally suited for use in the analysis of DNA. This is because DNA is essentially a sequence or chain of four main components called nucleotides. A gene is a specific stretch of this sequence, typically hundreds to many thousands of nucleotides long. Early estimates put the number of genes in the human genome at about 100,000; analyses of the finished sequence have since revised this to roughly 20,000 protein-coding genes. Aside from the task of integrating DBs of biological information noted above, another important application is comparison and similarity search on DNA sequences. This is useful in the study of genetics-linked diseases, as it makes it possible to compare and contrast the gene sequences of normal and diseased tissues and attempt to determine which sequences are found in the diseased, but not in the normal, tissues. A number of projects are under way in this area, addressing the topics discussed above as well as the analysis of microarray data and related topics. Among the centers doing research in this area are the European Bioinformatics Institute (EBI) in Cambridge, UK, and the Weizmann Institute of Science in Israel.
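The comparison of diseased and normal sequences described above can be illustrated with a small sketch. It is not a method taken from the cited projects; it simply assumes that each sample is given as a plain string over the letters A, C, G, and T, and it uses fixed-length substrings (k-mers) as a stand-in for "sequences". The function names, the toy samples, and the choice of k = 4 are all hypothetical, chosen only for illustration. Real similarity search relies on alignment tools and statistical testing rather than exact matching.

```python
def kmers(sequence, k):
    """Return the set of all length-k substrings (k-mers) of a DNA string."""
    sequence = sequence.upper()
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}


def disease_specific_kmers(diseased, normal, k):
    """Return k-mers found in every diseased sample and in no normal sample.

    `diseased` and `normal` are lists of DNA strings (hypothetical toy input).
    """
    if not diseased:
        return set()
    # k-mers shared by all diseased samples
    in_all_diseased = set.intersection(*(kmers(s, k) for s in diseased))
    # k-mers that occur in at least one normal sample
    in_any_normal = set().union(*(kmers(s, k) for s in normal))
    return in_all_diseased - in_any_normal


if __name__ == "__main__":
    # Toy sequences invented for this example; not real genomic data.
    normal_samples = ["ACGTACGTGA", "ACGTACGTGC"]
    diseased_samples = ["ACGTTTGTGA", "CCGTTTGTGA"]
    print(sorted(disease_specific_kmers(diseased_samples, normal_samples, 4)))
```

Exact set operations keep the idea visible: candidate patterns are those common to the diseased group and absent from the normal group. In practice, sequencing errors and natural variation mean that approximate matching and frequency thresholds would be needed in place of strict intersection and difference.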