Many of the software tools used in studying genomes involve sequence analysis, one of the many subfields of computational molecular biology. Sequence analysis includes pattern and motif searching, sequence comparison, multiple sequence alignment, sequence composition determination, and secondary structure prediction. Because sequence data consists primarily of character strings, it's relatively easy to process the sequence entries in a flat file. Bioinformaticians use a variety of tools to perform sequence analysis, including:
Finding these tools is pretty easy, but remembering all the command-line options for your favorites is often more difficult.
Nearly all of these tools were written to manipulate and analyze data stored in databases. Many of the most important biological databases have existed for a decade or more, making them almost ancient in this fast-moving field. The first public release of GenBank (Release 3), in December 1982, held 606 sequences containing 680,338 base pairs. Release 132, from October 2002, held 19,808,101 sequences containing 26,525,934,656 base pairs. SWISS-PROT has grown from 3939 protein sequences containing 900,163 amino acids (Release 2.0 in September 1986) to 101,602 protein sequences containing 37,315,215 amino acids (Release 40.0 in October 2001).
Plenty of data is available, and finding it is easy. Downloading it is almost as simple, assuming you've got a broadband Internet connection and plenty of disk space. The hard part is dealing with the plethora of flat file formats and trying to remember what their specific field codes mean. Most of us survive by either having hard copies of README files lying around or remembering exactly where to go look for something we need. The need to remember details about our favorite tools and databases prompted us to gather the information and organize it into this book.
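To make the flat-file point concrete, here is a minimal Python sketch of how entries keyed by line codes might be read. It assumes a simplified SWISS-PROT-style layout: each annotation line starts with a two-letter code, sequence lines are indented, and `//` ends an entry. The sample entries, accession numbers, and the names `parse_entries` and `sequence_of` are invented for illustration; real files carry more fields and stricter rules than this handles.

```python
from collections import defaultdict

# Hypothetical sample data in a simplified SWISS-PROT-style layout.
# These are not real database entries.
SAMPLE = """\
ID   EXAMPLE1_HUMAN
AC   X00001;
DE   Hypothetical example protein 1.
SQ   SEQUENCE
     MKTAYIAKQR QISFVKSHFS
//
ID   EXAMPLE2_MOUSE
AC   X00002;
DE   Hypothetical example protein 2.
SQ   SEQUENCE
     MSDNELLQ
//
"""


def parse_entries(text):
    """Yield one dict per entry, mapping each line code to a list of values."""
    entry = defaultdict(list)
    for line in text.splitlines():
        if line.startswith("//"):
            # End-of-entry marker: emit what we have and start a new entry.
            if entry:
                yield dict(entry)
            entry = defaultdict(list)
        elif line.startswith("  "):
            # Indented lines hold sequence data; drop the spacing blanks.
            entry["sequence"].append(line.replace(" ", ""))
        elif line.strip():
            # Annotation line: the code comes first, the value follows it.
            code, _, value = line.partition(" ")
            entry[code].append(value.strip())
    if entry:
        # Emit a final entry that lacks a terminating "//".
        yield dict(entry)


def sequence_of(entry):
    """Join the sequence chunks of a parsed entry into one string."""
    return "".join(entry.get("sequence", []))


if __name__ == "__main__":
    for record in parse_entries(SAMPLE):
        print(record["ID"][0], len(sequence_of(record)))
```

Storing every value as a list keeps the sketch simple when a code repeats within an entry; a real parser would also need to know what each code means, which is exactly the detail that is hard to keep in your head.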