5.4 Drawing Restriction Maps


One of the most important lessons of scientific programming is the importance of a good display of a program's results.

In this section, I'm going to add the ability to output a restriction map by inheriting Restriction.pm and making a new derived class Restrictionmap.pm . The restriction map will be shown very simply as the sequence printed in lines of simple text. The locations of restriction sites are written over the lines of text, giving the names of the restriction enzymes at the restriction sites. In Chapter 8, this simple text-based graphic output is replaced by a real picture (with colors, different fonts, and whatever bells and whistles you choose to add). I designed my base class Restriction.pm to represent the restriction map as a simple list of recognition site locations because I wanted my software to be flexible enough to be extended to accommodate any of the many different graphic formats that might be desired.

The difference between an unreadable mass of data, and a good clean graphic display that leads the eye towards an interesting result, is profound. It's the difference between a successful program and a dud, between hours or days spent sorting through columns of data and a quick discovery of a region of interest. It's even the difference between a scientific discovery, and none at all.

Graphics programming is a bit of an advanced topic. (You'll get your feet wet in Chapter 8.) But even if you need a program that's restricted to text output, you still need to spend programming time displaying that output in a useful manner. So, in this section, I'll add a very simple map drawing capability to the software, drawn as simple text.

5.4.1 Storing Graphics Output in an Attribute

I want to display a graphic. Do I need to add _graphic and _graphictype attributes to my object? The question boils down to: shall I compute and display a graphic whenever needed, or shall I compute a graphic and store it in an attribute? If you're new to computer graphics, just think of a graphic as a mass of data, which can be stored in a variable, or in a file on disk, and can be used to generate a graphical display on the computer screen by the right software.

I do already compute the restriction map, by which I mean the actual locations of the recognition sites for each enzyme requested , and stored it in the _map attribute. I can also compute a graphic at the same time and store it in the proposed _graphic attribute.

Let's think about the pluses and minuses of storing graphics output in an attribute.

I'm not sure which graphics system I'll use in the future; for now, a simple text output may work as a stored attribute, but there are a lot of graphics outputs possible. Also, it seems likely that I'll be able to compute the graphics output very quickly, so the need for storing it is less compelling. And storing a large image for possibly hundreds or thousands of objects will be a strain on the computer system.

However, I may not be able to calculate quickly for fancier, full color , high resolution graphics that I might want in the future. And perhaps I'll need to flip between graphics very quickly, in which case having them precalculated would be a necessity!

Also, how about multiple digests? Should I just create one graphic for the entire set of enzymes in an object or add a single digest graphic for each enzyme?

Maybe I can't answer all these questions yet, either. At this point, I'm not sure of the kind of graphics I'll be producing, or of their size, or of the time to generate them. Perhaps I can decide to make just one graphic per object (the user can always make a new object with just one enzyme if desired).

Answering these questions is a judgment call. When the answers to such questions aren't known, you should at least try to make the code flexible enough to accommodate whatever decisions are eventually reached.

For now, let's add a graphics attribute and precalculate the graphics output. After all, if in the future I decide that, for instance, I can generate graphics quickly enough on the fly, but that they're a bit large, I'll just add a method that can generate the images when needed.

5.4.2 The Restrictionmap Class

Here's a simple extension of the Restriction class I'll call Restrictionmap , which adds a new _mapgraphics attribute. It doesn't compute the graphics output until the first time it's called; it then stores the result for future lookups:

 package Restrictionmap; use base ( "Restriction" ); # # A class to find locations of restriction enzyme recognition sites in #  DNA sequence data, and to display them. # use strict; use warnings; use Carp; # Class data and methods {     # A list of all attributes with default values.     my %_attributes = (         #    key   = restriction enzyme name         #    value = space-separated string of recognition sites => regular expressions         _rebase      => { },    # A Rebase.pm object         _sequence    => '',     # DNA sequence data in raw format (only bases)         _enzyme      => '',     # space separated string of one or more enzyme names         _map         => { },    # hash: enzyme names => arrays of locations         _graphictype => 'text', # one of 'text' or 'png' or some other         _graphic     => '',     # a graphic display of the restriction map     );     # Return a list of all attributes     sub _all_attributes {         keys %_attributes;     } } 

Notice that I've declared a new class Restrictionmap using the base class Restriction . Notice also that no AUTOLOAD mechanism is provided.

Also, in the block containing the class data and methods, there's a new definition of the hash %_attributes that adds the new attribute _graphictype to indicate the type of graphic (at first, this will be " text "), and the new attribute _graphi c as a simple scalar.

Why am I defining it as a scalar? As you'll see, the routines that make the text-based graphics that I'm adding here assemble the graphics output as an array. To store it as a scalar requires an extra call to the Perl join function.

The reason I'm storing the graphics display as a scalar is that image data comes in a variety of formats. If I plan on just saving an image as a scalar, it will perhaps cover the most common possibilities; it's always possible to read a file into a scalar in Perl. Hopefully, this decision to store the image data as a scalar gives the most flexibility for future development.

This opening block is similar to the opening block of the base class Restriction . Notice the only helper method I'm redefining is the _all_attributes method, which uses the redefined %_attributes data structure that appears in the same block with it. The other methods in this block in the base class count the number of objects in existence during the running of the program; those methods and the _count variable they depend on are inherited by this new derived class Restrictionmap and will work fine.

5.4.2.1 Adding graphics capability to the class

Next , come the methods that add the graphics capability to the class:

 sub get_graphic {     my($self) = @_;     # If the graphic is not stored, calculate and store it     unless($self->{_graphic}) {         unless($self->{_graphictype}) {             croak 'Attribute graphictype not set (default is "text")';         }         # if graphictype is "xyz", method that makes the graphic is "_drawmap_xyz"         my $drawmapfunctionname = "_drawmap_" . $self->{_graphictype};         # Calculate and store the graphic         $self->{_graphic} = $self->$drawmapfunctionname;     }     # Return the stored graphic     return $self->{_graphic}; } 

This is the public method that is meant to be called from a program. Recall that I decided to only calculate the graphics for an object the first time the graphic is requested; it's then saved in the _mapgraphic attribute for subsequent calls.

5.4.2.2 Creation of the graphic

The first step in graphic creation is an initialization routine. This doesn't do much here, but other graphics I might create may well benefit from a separate initialization step, so I plan ahead and add it here as well.

Each enzyme is called in turn , and its matching positions are retrieved as you've seen before with the get_enzyme_map method. The appropriate annotation is added to the @annotation array. The resulting annotation is then formatted for output and returned:

 # # Methods to output graphics in text format # sub _drawmap_text {     my($self) = @_;     my @annotation = ();     push(@annotation, _initialize_annotation_text($self->get_sequence));     foreach my $enzyme ($self->get_enzyme_names) {         _add_annotation_text(  annotation, $enzyme, $self->get_enzyme_map($enzyme));     }     # Format the restriction map as sequence and annotation     my @output = _formatmaptext(50, $self->get_sequence, @annotation);     # Return output as a string, not an array of lines     return join('', @output); } #  Make a blank string of the same length as the given sequence string sub _initialize_annotation_text {     my($seq) = @_;     return '' x length($seq); 

The idea behind this text-based annotation is as follows . The lines of sequence output are separated by blank lines just for readability. Directly above the sequence lines are additional annotation lines containing the names of enzymes, which appear with their names starting exactly where the recognition site starts. Sometimes, enzyme names collide ”two or more enzyme names might require the same space on an annotation line. In that case, an additional annotation line is created to print these overlapping enzyme names. There's an example of this in the program output following this discussion.

The code is fairly well commented. Check out the exercises: one will challenge you to improve it or to start from scratch and invent a better way to display this kind of text-based annotation. I'll omit a detailed commentary on exactly how this code accomplishes its job; as they say, it will be left as an exercise for the reader. Or, as one of my mathematics professors used to delight in saying, "A moment's reflection should clear the matter up for you."

 #   Add annotation to an annotation string sub _add_annotation_text {     my($array, $enz, @pos) = @_;     # $array is a reference to an array of annotations     # Put the labels for the enzyme name at the correct positions in the annotation     foreach my $location (@pos) {         # Loop through all the annotation strings as necessary         for( my $i = 0 ; $i < @$array ; ++$i ) {             # If the annotation contains only space characters at that position,             # insert the annotation             if(substr($$array[$i], $location-1, length($enz)) eq (' ' x length($enz))){                substr($$array[$i], $location-1, length($enz)) = $enz;                last;             # If the annotation collides, add it to the next annotation string on the             # next iteration of the "for" loop.             # But first, if there is not another annotation string, make one             }elsif($i == (@$array - 1)) {                 push(@$array, _initialize_annotation_text($$array[0]));             }         }     } } # Sequence with annotation lines formatted for the page with line breaks sub _formatmaptext {     my($line_length, $seq, @annotation) = @_;     my(@output) = ();     # Split strings into lines of $line_length     for ( my $pos = 0 ; $pos < length($seq) ; $pos += $line_length ) {         # Print annotation on top of sequence, using reverse         foreach my $string ( reverse ($seq, @annotation) ) {             # Discard blank lines?             # if ( substr($string, $pos, $line_length) !~ /[^ \n]/ ) {             #     next;             # }             # Add line to output             push(@output, substr($string, $pos, $line_length) . "\n");         }         # separate the lines         push(@output,"\n");     }     # Return the merged annotation and sequence     return @output; } =head1 Restrictionmap Restrictionmap: Given a Rebase object, sequence, and list of restriction enzyme     names, return the locations of the recognition sites in the sequence =head1 Synopsis     use Restrictionmap;     use Rebase;     use strict;     use warnings;     my $rebase = Rebase->new(         dbmfile => 'BIONET',         bionetfile => 'bionet.212'     );     my $restrict = Restrictionmap->new(         rebase => $rebase,         enzyme => 'EcoRI',         enzyme => 'HindIII',         sequence => 'ACGAATTCCGGAATTCG',         graphictype => 'text',     );         print "Locations are ", join ' ', $restrict->get_enzyme_map('EcoRI'), "\n";     print $restrict->get_graphic; =head1 AUTHOR James Tisdall =head1 COPYRIGHT Copyright (c) 2003, James Tisdall =cut 1; 
5.4.2.3 Running the program

If you run the short program given in the Synopsis section of the documentation, you won't get very interesting output; the input data is too short. The following is a better example. There's some DNA sequence data that contains EcoRI and HindIII recognition sites, saved in a file called sampleecori.dna . The class SeqFileIO is used to read it in, and then Restrictionmap makes and displays a restriction map of that sequence.

Here's the program, an extension, really, of the short program that appears in the Synopsis section of the POD documentation. Let's use it to test this new Restrictionmap class:

 use lib "/home/tisdall/MasteringPerlBio/development/lib"; use Restrictionmap; use Rebase; use SeqFileIO; use strict; use warnings; my $rebase = Rebase->new(     dbmfile => 'BIONET',     bionetfile => 'bionet.212',     mode => '0666', ); my $restrict = Restrictionmap->new(     rebase => $rebase,     enzyme => 'EcoRI HindIII',  # GAATTC # AAGCTT     sequence => 'ACGAATTCCGGAATTCG',     graphictype => 'text', ); print "Locations are ", join ' ', $restrict->get_enzyme_map('EcoRI'), "\n"; print $restrict->get_graphic; ## Some bigger sequence my $biggerseq = SeqFileIO->new; #$biggerseq->read(filename => 'map.fasta'); $biggerseq->read(filename => 'sampleecori.dna'); my $restrict2 = Restrictionmap->new(     rebase => $rebase,     enzyme => 'EcoRI HindIII',  # GAATTC # AAGCTT     sequence => $biggerseq->get_sequence,     graphictype => 'text', ); print "\nHere is the map of the bigger sequence:\n\n"; print $restrict2->get_graphic; 

Notice that a use lib directive is added to tell Perl where to find the modules; you can accomplish the same thing with the PERL5LIB environmental variable or with a command-line argument to Perl.

Here's is the output from the test program:

 Locations are 3 11    EcoRI   EcoRI   ACGAATTCCGGAATTCG Here is the map of the bigger sequence: AGATGGCGGCGCTGAGGGGTCTTGGGGGCTCTAGGCCGGCCACCTACTGG TTTGCAGCGGAGACGACGCATGGGGCCTGCGCAATAGGAGTACGCTGCCT                      EcoRI                         GGGAGGCGTGACTAGAAGCGGGAATTCAAGTAGTTGTGGGCGCCTTTGCA ACCGCCTGGGACGCCGCCGAGTGGTCTGTGCAGGTTCGCGGGTCGCTGGC                                                             HindIII                                   GGGGGTCGTAAGCTTGAGGGAGTGCGCCGGGAGCGGAGATATGGAGGGAG ATGGTTCAGACCCAGAGCCTCCAGATGCCGGGGAGGACAGCAAGTCCGAG                                                    AATGGGGAGAATGCGCCCATCTACTGCATCTGCCGCAAACCGGACATCAA                                                                          HindIII                      CTGCTTCATGATCGGGTGTGACAAGCTTAACTGCAATGAGTGGTTCCATG                                                    GGGACTGCATCCGGATCAGCGGGATGGCAATGAGCGGGACAGCAGTGAGC                                                    CCCGGGATGAGGGTGGAGGGCGCAAGAGGCCTGTCCCTGATCCAGACCTG                                                    CAGCGCCGGGCAGGGTCAGGGACAGGGGTTGGGGCCATGCTTGCTCGGGG                                                                                                 EcoRI CTCTGCTTCGCCCCACAAATCCTCTCCGCAGCCCTTGGTGGCCACGAATT                       CACCCAGCCAGCATCACCAGCAGCAGCAGCAGCAGATCAAACGGTCAGCC       EcoRI                                        HindIII                                            AAGCTTGAATTCCGCATGTGTGGTGAGTGTGAGGCACCAGTGACGCCCTC                                                    AGAGTCCCTGCCAAGGCCCCGCCGGCCACTGCCCACCCAACAGCAGCCAC                                                                       EcoRI                           AGCCATCACAGAAGTTAGGGAATTCGCGCATCCGTGAAGATGAGGGGGCA                                                                                   EcoRI               GTGGCGTCATCAACAGTCAAGGAGCCTCCTGGAATTCAGGCTACAGCCAC ACCTGAGCCACTCTCAGATGAGGACCTA 

As you see, the sequence is printed out formatted for the page with the names of restriction enzymes appearing above their recognition sites.



Mastering Perl for Bioinformatics
Mastering Perl for Bioinformatics
ISBN: 0596003072
EAN: 2147483647
Year: 2003
Pages: 156

flylib.com © 2008-2017.
If you may any questions please contact us: flylib@qtcs.net