Copyright © 2003 O'Reilly & Associates, Inc.

Printed in the United States of America.

Published by O'Reilly & Associates, Inc., 1005 Gravenstein Highway North, Sebastopol, CA 95472.

O'Reilly & Associates books may be purchased for educational, business, or sales promotional use. Online editions are also available for most titles ( For more information, contact our corporate/institutional sales department: (800) 998-9938 or

Nutshell Handbook, the Nutshell Handbook logo, and the O'Reilly logo are registered trademarks of O'Reilly & Associates, Inc. Many of the designations used by manufacturers and sellers to distinguish their products are claimed as trademarks. Where those designations appear in this book, and O'Reilly & Associates, Inc. was aware of a trademark claim, the designations have been printed in caps or initial caps. The association between the image of a liger and the topic of sequence analysis is a trademark of O'Reilly & Associates, Inc.

Material in Chapter 3 (SWISS-PROT) and Chapter 5 (PROSITE) is used with the permission of the Swiss Institute of Bioinformatics. Material in Chapter 8 (BLAT) is used with the permission of Jim Kent. Material in Chapter 10 (HMMER) is used with the permission of Sean Eddy. Material in Chapter 11 (MEME/MAST) is used with the permission of Michael Gribskov and Tim Bailey.

While every precaution has been taken in the preparation of this book, the publisher and authors assume no responsibility for errors or omissions, or for damages resulting from the use of the information contained herein.


Gene sequence data is the most abundant type of data available, and there is a rich array of computational methods and tools that can help analyze patterns within that data. This book brings together the detailed terms, definitions, and command-line options found in the key databases and tools used in sequence analysis. It's meant for use by bioinformaticians in both industry and academia, as well as students. This book is a handy resource and an invaluable reference for anyone who needs to know about the practical aspects and mechanics of sequence analysis.

It's no coincidence that the gene sequences of related species of plants, animals, and microorganisms show complex patterns of similarity to one another. This is one of the most fascinating aspects of the study of evolution. In fact, many molecular biologists are convinced that an understanding of sequence evolution is the first step toward understanding evolution itself. The comparison of gene sequences, or biological sequence analysis, is one of the processes used to understand sequence evolution. It is an important discipline within computational biology and bioinformatics.

If you're new to the field, this book won't teach you how to perform sequence analysis, but it will help you sort out the details of the common tools and data sources used for sequence analysis. If sequence analysis is part of your daily life (as it is for us), you'll want this easy-to-use book on your desk. We've included many references (especially URLs) for further information on the tools we document, but with this book handy we hope you won't need to use them.

Sequence Analysis Tools and Databases

Many of the software tools used in studying genomes involve sequence analysis, which is one of the many subfields of computational molecular biology. The field of sequence analysis includes pattern and motif searching, sequence comparison, multiple sequence alignment, sequence composition determination, and secondary structure prediction. Because sequence data consists primarily of character strings, it's relatively easy to process the sequence entries in a flat file. Bioinformaticians use a variety of tools to perform sequence analysis, including:

  • Standard Unix tools (e.g., the grep family, sed, awk, and cut).

  • Publicly available tools (e.g., BLAST, the EMBOSS package).

  • Open source libraries (e.g., BioPerl, BioJava, BioPython, BioRuby).

  • Custom tools.

Finding these tools is pretty easy, but remembering all the command-line options for your favorites is often more difficult.
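To illustrate why character-string data in a flat file is easy to process, here is a minimal sketch of a custom tool in Python that computes sequence composition. It assumes FASTA-style records (a ">" header line followed by sequence lines), a format chosen here only for illustration. The function names and the two example records are hypothetical and do not come from any of the tools or databases documented in this book.

```python
from collections import Counter
from typing import Dict, Iterable, Iterator, Tuple


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (identifier, sequence) pairs from FASTA-style lines.

    Assumption: each record starts with a '>' header line, and the
    identifier is the first whitespace-delimited token after '>'.
    """
    name = None
    chunks = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            fields = line[1:].split()
            name = fields[0] if fields else ""
            chunks = []
        elif name is not None:
            # Sequence lines are joined and normalized to uppercase.
            chunks.append(line.upper())
    if name is not None:
        yield name, "".join(chunks)


def composition(sequence: str) -> Dict[str, int]:
    """Count how often each character occurs in the sequence."""
    return dict(Counter(sequence))


def gc_fraction(sequence: str) -> float:
    """Return the fraction of G and C characters (0.0 for an empty string)."""
    if not sequence:
        return 0.0
    gc = sum(1 for base in sequence if base in "GC")
    return gc / len(sequence)


# Made-up records, used only to exercise the functions above.
EXAMPLE = """\
>seq1 hypothetical example record
ATGCGC
GGTA
>seq2 another made-up record
ttaacc
"""

records = list(read_fasta(EXAMPLE.splitlines()))
for name, seq in records:
    print(name, len(seq), composition(seq), round(gc_fraction(seq), 2))
```

The same generator works on an open file handle in place of EXAMPLE.splitlines(), since it only needs an iterable of lines. Real flat files carry many more field codes than this sketch handles, which is where a reference to their formats becomes useful.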

Nearly all of these tools were written to manipulate and analyze data stored in databases. Many of the most important biological databases have existed for a decade or more, making them almost ancient in this fast-moving field. The first public release of GenBank (Release 3), in December 1982, contained 606 sequences totaling 680,338 base pairs. Release 132, from October 2002, had 19,808,101 sequences containing 26,525,934,656 base pairs. SWISS-PROT has grown from 3,939 protein sequences containing 900,163 amino acids (Release 2.0 in September 1986) to 101,602 protein sequences containing 37,315,215 amino acids (Release 40.0 in October 2001).

Plenty of data is available, and finding it is easy. Downloading it is almost as simple, assuming you've got a broadband Internet connection and plenty of disk space. The hard part is dealing with the plethora of flat file formats and trying to remember what their specific field codes mean. Most of us survive by either having hard copies of README files lying around or remembering exactly where to go look for something we need. The need to remember details about our favorite tools and databases prompted us to gather the information and organize it into this book.