7.4 Sequence Matching in Computational Biology

One of the most exciting application areas for clusters is bioinformatics. An enormous amount of fundamental data is becoming available in the form of sequences: either nucleotide sequences (RNA and DNA) or amino acid sequences (proteins). In both cases the data comes encoded in the form of long strings of characters. Important biological information can be extracted from this data by matching, either exactly or inexactly, single strings of characters against other strings, or, more commonly, against large databases of strings in order to find similarities. The process is like a glorified grep.

7.4.1 BLAST

Widely distributed (sequential) tools exist for matching a string, or a small file of strings, against a database of other strings. One of the most widely used is a program called BLAST [2]. BLAST can deal with both nucleotide and amino acid sequences. Here we will focus on amino acid sequences, which describe the structure of proteins. BLAST, together with many other tools, uses FASTA format. Here is a single protein in FASTA format. Each letter in the sequence represents a single amino acid.

 >sp|P28469 Alcohol dehydrogenase alpha chain (ADH).  - Macaca mulatta MSTAGKVIKCKAAVLWEVMKPFSIEDVEVAPPKAYEVRIKMVTVGICGTDDH VVSGTMVTPLPVILGHEAAGIVESVGEGVTTVEPGDKVIPLALPQCGKCRI CKTPERNYCLKNDVSNPRGTLQDGTSRFTCRGKPIHHFLGVSTFSQYTVVD ENAVAKIDAASPMEKVCLIGCGFSTGYGSAVKVAKVTPGSTCAVFGLGGVG LSAVMGCKAAGAARIIAVDINKDKFAKAKELGATECINPQDYKKPIQEVLK EMTDGGVDFSFEVIGRLDTMMASLLCCHEACGTSVIVGVPPDSQNLSINPM LLLTGRTWKGAVYGGFKSKEDIPKLVADFMAKKFSLDALITHVLPFEKINE GFDLLRSGKSIRTILTF

Files containing proteins in FASTA format are fed into a program called formatdb to create an indexed database, consisting of three files, that is structured to facilitate searching for matches. Suppose we have constructed a small database of proteins and we wish to search for substring similarities between our protein above (P28469) and the proteins in our database (the three files produced from the file 'homodimer.faa' by formatdb). Then we put our protein in a file that we might call 'homodimerstest.faa' and do:

     blastall -i homodimerstest.faa              -d homodimer.faa              -p blastp

The last parameter specifies the standard protein-protein matching algorithm. We get the following output, which lists the proteins that are similar in some way, sorted in decreasing order of similarity.

 Query= sp|P28469 Alcohol dehydrogenase alpha chain  (ADH). - Macaca mulatta  (375 letters) Database: homodimer.faa            30 sequences; 10,597 total letters Searching.done                                                          Score    E Sequences producing significant alignments:              (bits) Value sp|P28469 Alcohol dehydrogenase alpha chain  (ADH). ...   721   0.0 sp|P14139 Alcohol dehydrogenase  (ADH). - Papio hama...   682   0.0 sp|Q03505 Alcohol dehydrogenase alpha chain  (ADH). ...   623   0.0 sp|Q64415 Alcohol dehydrogenase A chain . - Geomys k...   568   e-165 sp|Q64413 Alcohol dehydrogenase A chain . - Geomys b...   568   e-165 sp|P19631 Alcohol dehydrogenase alpha chain  (ADH3)....   538   e-156 sp|P80338 Alcohol dehydrogenase I . - Struthio camelus    536   e-155 sp|P49645 Alcohol dehydrogenase I . - Apteryx australis   533   e-155 ...

We also get details about just exactly what the similarities were.

 >sp|P14139 Alcohol dehydrogenase  (ADH). - Papio hamadryas           Length = 375  Score =  682 bits (1761), Expect = 0.0  Identities = 336/375 (89%), Positives = 346/375 (92%) Query: 1   MSTAGKVIKCKAAVLWEVMKPFSIEDVEVAPPKAYEVRIKMVTVGICGTDDHVVSGTMVT 60            MSTAGKVIKCKAAVLWEV KPFSIEDVEVAPPKAYEVRIKMV VGIC TDDHVVSG +V+ Sbjct: 1   MSTAGKVIKCKAAVLWEVKKPFSIEDVEVAPPKAYEVRIKMVAVGICRTDDHVVSGNLVS 60 Query: 61  PLPVILGHEAAXXXXXXXXXXXXXXXXDKVIPLALPQCGKCRICKTPERNYCLKNDVSNP 120            PLP ILGHEAA                DKVIPL  PQCGKCR+CK+PE NYC+KND+SNP Sbjct: 61  PLPAILGHEAAGIVESVGEGVTTVKPGDKVIPLFTPQCGKCRVCKSPEGNYCVKNDLSNP 120 ...

7.4.2 Running BLAST in Parallel

A single machine is sufficient for running small BLAST jobs, but some of the most important information is extracted by running jobs involving large numbers of sequences. One widely used database contains roughly 1.4 million sequences. Suppose we want to compare all the sequences in the database against one another. This is in some sense the largest possible BLAST job, but many other smaller BLAST jobs are still large.

Fortunately, parallelism abounds. Clusters are popular platforms in computational biology precisely because there is so much parallelism in many biologically significant computations. Moreover, most of them fit the manager/worker structure we have been discussing in this chapter. Here we will describe one way to carry out a large BLAST computation.

The plan is to use the manager/worker scheme described in the previous section with a number of changes.

The database will be distributed ahead of time to all the nodes of the cluster using either rsync or rdist.
The workers will run blastall with an input file consisting of some subset of the large number of input sequences, defined by the manager and referred to here as a chunk of input sequences. Each chunk will be sent to a worker by the manager over a socket connected to the worker.
Each subtask will consist of running blastall with a chunk of input sequences against the database.
The workers will be persistent. That is, instead of a new process being started by ssh for each subtask, each worker will remain running, exchanging messages over a socket with the manager, until the end of the job.
When a worker finishes a chunk, the output of the BLAST run will be copied to a directory of output files using scp.
The manager is not responsible for starting the workers. The manager will start off selecting only on his "listening" socket; as new workers are started (by whatever means) they connect to the manager and their sockets are added to the "select" list. Thus workers can come and go independently.
Individual worker processes can die, or nodes crash altogether, with no impact on the job as long as the manager keeps running. If a worker dies, it can be replaced by another one, which just connects to the manager on the manager's advertised listening port and joins the worker pool.
The manager keeps track of the chunks that have been assigned to workers, those that have been completed, and those that have not yet been assigned. Every time this information changes, the manager writes it to a file. Thus, the whole job can be restarted if the system crashes.

The code is not given here, but can be constructed using the Python code we used in Section 7.2.6.

The combination of allowing the pool of workers to vary in size and keeping track of exactly what work has been done contributes to the fault tolerance of this scheme: if workers or the machines they are running on fail, they can be replaced; and even if the manager dies or must be halted due to scheduling constraints, it can be restarted and pick up again where it left off.

We note that a number of useful tools for dealing with FASTA format in Python, and other Python-based tools for computational biology, can be found at www.biopython.org.