What You Need to Know to Use This Book



This book assumes that you have some experience with Perl, including a working knowledge of writing, saving, and running programs; basic Perl syntax; control structures such as loops and conditional tests; the most common operators such as addition, subtraction, and string concatenation; input and output from the user, files, and other programs; subroutines; the basic data types of scalar, array, and hash; and regular expressions for searching and for altering strings. In other words, you should be able to program Perl well enough to extract data from sources such as GenBank and the Protein Data Bank using pattern matching and regular expressions.

If you are new to Perl but feel you can forge ahead using a language summary and examples of programs, Appendix A provides a summary of the important parts of the Perl language. Previous programming experience in a high-level language such as C, Java, or FORTRAN (or any similar language); some experience at using subroutines to break a large problem into smaller, appropriately interrelated parts; and a tinkerer's delight in taking things apart and seeing what makes them tick may be all the computer-science prerequisites you need.

This book is primarily written for biologists, so it assumes you know the elementary facts about DNA, proteins, and restriction enzymes; how to represent DNA and protein data in a Perl program; how to search for motifs; and the structure and use of the databases GenBank, PDB, and Rebase. Because the book assumes you are a biologist, it does not explain biology concepts in detail and concentrates instead on programming skills.

Biological data appears in many forms. The most important sources of biological data include the repository of public genetic data called GenBank (Genetic Data Bank) and the repository of public protein structure data called PDB (Protein Data Bank). Many other similar sources of biological data such as Rebase (Restriction Enzyme Database) are in wide use. All the databases just mentioned are most commonly distributed as text files, which makes Perl a good programming tool to find and extract information from the databases.


Organization of This Book

Here's a quick summary of what the book covers. If you're still relatively new to Perl, you may want to work through the chapters in order. If you have some programming experience and are looking for ways to approach problems in bioinformatics with Perl, feel free to skip around.

Part I

Chapter 1

Modules are the standard Perl way of "packaging" useful programs so that other programmers can easily use previous work. Such standard modules as CGI, for instance, put the power of interactive web site programming within reach of a programmer who knows basic Perl. Also discussed in later chapters are Bioperl, for manipulating biological data, and DBI, for gaining access to relational databases. Modules are sometimes considered the most important part of Perl because that's where a lot of the functionality of Perl has been placed. In this chapter I show how to write your own modules, as well as how to find useful modules and use them in your programs.

Chapter 2

Complex data structures and references are fundamentally important to Perl. The basic Perl data structures of scalar, array, and hash go a long way toward solving many (perhaps most) Perl programming problems. However, many commonly used data structures, such as multidimensional arrays, require more sophisticated Perl data structures built from references. Perl enables you to define quite complex data structures, and we'll see how all that works.
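As a rough illustration of the idea (not an excerpt from the chapter), the following minimal sketch builds a two-dimensional array and a hash of arrays out of references. The gene names and exon lengths are made up for the example.

```perl
use strict;
use warnings;

# A two-dimensional array is an array whose elements are references to arrays.
my @matrix = (
    [ 1, 2, 3 ],
    [ 4, 5, 6 ],
);

# Row 1, column 2 (indices start at 0).
my $element = $matrix[1][2];

# A hash of arrays: each (hypothetical) gene name maps to a list of exon lengths.
my %exon_lengths = (
    geneA => [ 120, 85, 240 ],
    geneB => [ 300 ],
);

# Dereference the array stored under 'geneA' to count its exons.
my $exon_count = scalar @{ $exon_lengths{geneA} };

print "element: $element, exons in geneA: $exon_count\n";
```

The inner square brackets create anonymous arrays and return references to them, which is what lets an ordinary array or hash hold other arrays as values.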

String algorithms are standard techniques used in bioinformatics for finding important data in biological sequences; with them, you can compare two sequences, align two or more sequences, assemble a collection of sequence fragments, and so forth. String algorithms underlie many of the most commonly used programs in biology research, such as BLAST. In this chapter, a string matching algorithm that finds the closest match to a motif, based on the technique of dynamic programming, is presented in the form of a working Perl program.
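To give a flavor of the technique, here is a minimal sketch of approximate matching by dynamic programming. It is not the chapter's program: the subroutine name closest_match, the unit edit costs, and the sample sequences are all assumptions chosen for illustration.

```perl
use strict;
use warnings;
use List::Util qw(min);

# Return the smallest edit distance between $pattern and any substring of
# $text, plus the 1-based position in $text where the first such match ends.
# (closest_match is a hypothetical name; every edit is assumed to cost 1.)
sub closest_match {
    my ($pattern, $text) = @_;
    my @p = split //, $pattern;
    my @t = split //, $text;

    # @prev holds the previous row of the table. Row 0 is all zeros
    # because a match may start at any position in the text at no cost.
    my @prev = (0) x (@t + 1);

    for my $i (1 .. @p) {
        # Matching i pattern characters against empty text costs i.
        my @curr = ($i);
        for my $j (1 .. @t) {
            my $cost = $p[$i - 1] eq $t[$j - 1] ? 0 : 1;
            $curr[$j] = min(
                $prev[$j - 1] + $cost,    # match or substitution
                $prev[$j] + 1,            # pattern character missing from text
                $curr[$j - 1] + 1,        # extra character in text
            );
        }
        @prev = @curr;
    }

    # The best match is the smallest entry in the final row.
    my ($best, $end) = ($prev[0], 0);
    for my $j (1 .. @t) {
        ($best, $end) = ($prev[$j], $j) if $prev[$j] < $best;
    }
    return ($best, $end);
}

my ($distance, $end) = closest_match('GATTACA', 'CCGATTTCACC');
print "closest match has distance $distance and ends at position $end\n";
```

Only two rows of the table are kept at a time, which is enough to find the best score; recovering the full alignment would require keeping the whole table and tracing back through it.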

Chapter 3

Object-oriented programming is a standard approach to designing programs. I assume, as a prerequisite, that you are familiar with the programming style called imperative (or procedural) programming. (For example, C and FORTRAN are imperative; C++ and Java are object-oriented; Perl can be either.) It's important for the Perl programmer to be familiar with the object-oriented approach. For instance, modules are usually defined in an object-oriented manner.

This chapter presents, step by step, the concepts and techniques of object-oriented Perl programming, in the context of a module that defines a simple class for keeping track of genes.
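As a hint of what such a class might look like, here is a minimal sketch. The class name Gene, its two attributes, and the sample values are assumptions for illustration and are not taken from the chapter's module.

```perl
package Gene;    # hypothetical class name, for illustration only
use strict;
use warnings;

# Constructor: store the attributes in a hash and bless it into the class.
sub new {
    my ($class, %args) = @_;
    my $self = {
        name     => $args{name},
        organism => $args{organism},
    };
    return bless $self, $class;
}

# Read-only accessors for the two attributes.
sub name     { my ($self) = @_; return $self->{name} }
sub organism { my ($self) = @_; return $self->{organism} }

package main;

my $gene = Gene->new(name => 'BRCA1', organism => 'Homo sapiens');
print $gene->name, " (", $gene->organism, ")\n";
```

In a real module the class would normally live in its own file (here, Gene.pm) and be loaded with use; keeping everything in one file just makes the sketch easy to run.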

Chapter 4

In this chapter, object-oriented programming is further explored in the context of developing software to convert sequence files to alternate formats (FASTA, GCG, etc.). The concept of class inheritance is introduced and implemented.

Chapter 5

This chapter further develops object-oriented programming by writing a class that handles Rebase restriction enzyme data, a class that calculates restriction maps, and a class that draws restriction maps.

Part II

Chapter 6

Relational databases are important in programming because they save, organize, and retrieve data sets. This chapter introduces relational databases and the SQL language and includes information on designing and administering databases. I take a close look at how one such relational database management system, the popular MySQL, is used from the Perl language.

Chapter 7

Web programming is one of Perl's areas of strength. In this chapter, I start an example that puts a laboratory up on the Web using Perl and the CGI module. The software developed in previous chapters for restriction mapping is made accessible from the Web.

Chapter 8

Using computer graphics to display data is one of the most important programming skills in bioinformatics. In this chapter, graphics programs are used to dynamically display the output of restriction maps and data presented as graphs on the Web. The Perl module GD is discussed and used to generate maps on the fly from web page queries.

Chapter 9

Bioperl is a set of modules used by Perl programmers to write bioinformatics applications. In this chapter you'll see an introduction to the Bioperl project. Bioperl is open source (free under a very nonrestrictive copyright) and developed by a group of volunteers, many based in supportive research organizations. In recent years it has achieved critical mass and is now adequately documented and fairly broad in scope. If you do Perl bioinformatics programming, you should certainly be aware of what Bioperl has to offer, to avoid reinventing the wheel.

Part III

Appendix A

This appendix summarizes the parts of Perl we've covered.

Appendix B

This appendix outlines how to install Perl.