5.2 Rebase.pm: A Class Module | Mastering Perl for Bioinformatics

Here is a very simple interface to the Rebase data contained in the bionet file that is part of its distribution:

 package Rebase; # # A simple class to provide access to restriction enzyme data from Rebase #  including regular expression translations of recognition sites # use strict; use warnings; use Carp; use DB_File; # Class data and methods {     # A hash of all attributes with default values     my %_attributes = (         _rebase      => { },              #    key   = restriction enzyme name              #    value = space-separated string of sites => regular expressions         _bionetfile  => '??',         _dbmfile     => '??',         _mode        => 0444,     );     # Return a list of all attributes     sub _all_attributes {         keys %_attributes;     }     # Return the value of an attribute     sub _attribute_value {         my($self,$attribute) = @_;         $_attributes{$attribute};     } }

Notice that the opening block is considerably pared down, compared to earlier classes. For instance, I've tossed the code that keeps count of all objects. Why? Because it's unlikely that more than one of these objects will be necessary in a program: so why bother?

5.2.1 Attributes: Short and Sweet

Notice that the list of attributes is short:

_rebase: A hash that will be populated to provide the lookup, with enzyme names for keys, and recognition sites (and their translation to regular expressions) for values. (Make sure you see how in the hash %_attributes the value of the key _rebase is itself an anonymous hash.)
_bionetfile: The name of the datafile from the Rebase distribution. In my examples, I use the version numbered bionet.212 , and by the time you read this book, more recent versions will be available (you can get bionet.212 from this book's web site).
_dbmfile: The DBM filename that resides on disk and stores the data in the hash _rebase . ^[1]

^[1] Recall that DBM files are tied to hashes in Perl and provide a simple, easy-to-program database for key/value pairs. They serve as a way to keep a hash on disk between invocations of a program, and so can help you avoid the cost of recalculating a hash each time a program is run. For more information, see O'Reilly's Programming Perl , the documentation for the DB_File module, and the documentation for the dbmopen and tie functions.
_mode: Contains the permissions with which you will create, or attempt to read, the DBM file. This is important for security purposes.

With so few attributes, the class methods can easily handle each attribute individually, without recourse to the use of AUTOLOAD to define various accessors and mutators, as seen in previous chapters.

5.2.2 Creating a Rebase Object

Here's how a Rebase object is created and initialized :

 # The constructor method # Called from class, e.g. $obj = Rebase->new( dbmfile => 'DBMFILE' ); sub new {     my ($class, %arg) = @_;     # Create a new object     my $self = bless {  }, $class;     # DBM file must be given as "dbmfile" argument     unless($arg{dbmfile}) {         croak("No dbm file specified");     }     # Set the attributes for the provided arguments     foreach my $attribute ($self->_all_attributes(  )) {         # E.g. attribute = "_name",  argument = "name"         my($argument) = ($attribute =~ /^_(.*)/);         # Initialize to defaults         $self->{$attribute} = $self->_attribute_value($attribute);         # Override defaults with arguments         if (exists $arg{$argument}) {             if($argument eq 'rebase') {                 croak "Cannot set attribute rebase";             }             $self->{$attribute} = $arg{$argument};         }     }     # Open or create the DBM file     unless(tie %{$self->{_rebase}}, 'DB_File', $arg{dbmfile}, O_RDWRO_CREAT, $self-> {_mode}, $DB_HASH) {         my $permissions = sprintf "%lo", $self->{_mode};         croak "Cannot open DBM file $arg{dbmfile} with mode $permissions";     }     # If "bionetfile" argument given, calculate the hash from the bionet file     if($arg{bionetfile}) {         # Empty the hash         %{$self->{_rebase}} = (  );         # Recalculate the hash         $self->parse_rebase;     }     return $self; } # For this simple class I have no AUTOLOAD or DESTROY # No get_rebase method, I don't want to pass around a huge hash # No "set" mutators: all initialization done by way of "new" constructor

The new constructor method is short. It requires a DBM database filename (existing or new), to which the hash data structure _rebase is tied. If the DBM file doesn't exist, the pathname of the bionet Rebase file is also required (that's where the data comes from that populates the DBM file). If a bionet datafile is given, the method calls the parse_rebase method that parses the bionet file to create the _rebase hash.

As the comments indicate , my class is so simple I've even decided to do away with AUTOLOAD and DESTROY , and I've dispensed with the _set mutators as well.

5.2.3 Methods for the Rebase Class

Now, let's continue by looking at the methods for the Rebase class. Given an enzyme, the following two methods, get_recognition_sites and get_regular_expressions , retrieve the enzyme's recognition sites and the translations of the recognition sites into regular expressions.

How do these two methods work? One method returns all recognition sites for an enzyme as given in the Rebase database; the other returns all the translations of the recognition sites into regular expressions. They both work very similarly. First, the enzyme is looked up in the _rebase hash. The value for each enzyme in the _rebase hash is a space-separated string that alternates recognition sites with their regular-expression translations. In both methods, the space-separated string is split to get a list of alternating recognition sites and regular expressions. This list is then assigned to the hash %sites to populate it with keys as recognition sites (the data that's actually in the Rebase bionet file) and values as regular expressions. The Perl operators keys and values are then used to generate the list of recognition sites (keys) or regular expressions (values).

The get_bionetfile , get_dbmfile , and get_mode methods just report on the arguments that are set to specific filenames or mode strings when the object was created:

 sub get_regular_expressions {     my($self, $enzyme) = @_;     my(%sites) = split(' ', $self->{_rebase}{$enzyme});     # May have duplicate values     return values %sites; } sub get_recognition_sites {     my($self, $enzyme) = @_;     my(%sites) = split(' ', $self->{_rebase}{$enzyme});     return keys %sites; } sub get_bionetfile {     my($self) = @_;     return $self->{_bionetfile}; } sub get_dbmfile {     my($self) = @_;     return $self->{_dbmfile}; } sub get_mode {     my($self) = @_;     return $self->{_mode}; }

5.2.4 parse_rebase

The workhorse method of the class is parse_rebase , which reads the bionet Rebase datafile (with a suffix that indicates the release version, such as bionet.212 ). The bionet input datafile begins like this:

 REBASE version 212                                              bionet.212       =-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=     REBASE, The Restriction Enzyme Database   http://rebase.neb.com     Copyright (c)  Dr. Richard J. Roberts, 2002.   All rights reserved.     =-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=   Rich Roberts                                                    Dec 01 2002   AaaI (XmaIII)                     C^GGCCG AacI (BamHI)                      GGATCC AaeI (BamHI)                      GGATCC AagI (ClaI)                       AT^CGAT AaqI (ApaLI)                      GTGCAC AarI                              CACCTGCNNNN^ AarI                              ^NNNNNNNNGCAGGTG AasI (DrdI)                       GACNNNN^NNGTC AatI (StuI)                       AGG^CCT AatII                             GACGT^C AauI (Bsp1407I)                   T^GTACA

As you can see, the header information ends with a line containing Rich Roberts . Apart from a blank line or two, the rest of the file contains records, one per line. Each record begins with a restriction enzyme name, optionally followed by another enzyme name in parentheses. The last field of each line is the recognition site. These are given using IUB codes for nucleotides. (For the IUB codes, see the comments in the program.)

They also contain cut sites, indicated by a caret symbol ^ . Cut sites contribute very important information about a restriction enzyme; they show where the enzyme makes the break when it cuts the DNA. Among other things, they are needed to correctly perform restriction digests in the computer when determining if there are overhangs that will be useful when inserting vectors or otherwise reassembling the fragments . However, in the code here we'll ignore the cut sites taking as a goal the virtual fingerprinting of DNA by just locating the recognition sites. Cut sites are omitted to simplify the code. (See the exercises for more on handling cut sites.)

 sub parse_rebase {     my($self) = @_;     # handles multiple definition lines for an enzyme name     # also handles alternate enzyme names on a line     # Read in the bionet(Rebase) file     unless(open(BIONETFH, $self->{_bionetfile})) {         croak("Cannot open bionet file $self->{_bionetfile}");     }     while(<BIONETFH>) {         my @names = (  );         # Discard header lines         next if ( 1 .. /Rich Roberts/ );# discard all lines from the first line                                         # to the first line containing "Rich Roberts"         # Discard blank lines         next unless /\S/; # discard a line unless it contains something not                           # whitespace         # Split the two (or three if includes parenthesized name) fields         my @fields = split;         # Get and store the recognition site         my $site = pop @fields;         # For the purposes of this exercise, I'll ignore cut sites (^).         # This is not something you'd want to do in general, however!         $site =~ s/\^//g;         # Get and store the name and the recognition site.         # Add alternate (parenthesized) names         # from the middle field, if any         foreach my $name (@fields) {             $name =~ tr/)(//d;  # delete parentheses             push @names, $name;         }         # Store the data into the hash, avoiding duplicates (ignoring ^ cut sites)         # and ignoring reverse complements         # Because these values are stored via DBM, I cannot use anything but         #  a scalar string to store the site/regularexpression pairs, space-separated         #  (but see the exercises)         foreach my $name (@names) {             # Add new enzyme definition             unless(exists $self->{_rebase}{$name}) {                 $self->{_rebase}{$name} = "$site " . IUB_to_regexp($site);                 next;             }             my(%defined_sites) = split(' ', $self->{_rebase}{$name});             # Omit already defined sites             if(exists $defined_sites{$site}) {                 next;             # Omit reverse complements of already defined sites             }elsif(exists $defined_sites{revcomIUB($site)}) {                 next;             # Add the additional site             }else{                 $self->{_rebase}{$name}  .= " $site " . IUB_to_regexp($site);             }         }     }     return 1; }

This subroutine is a bit complex, corresponding to the nature of the data that it's processing. For instance, because enzymes can appear on more than one line, it has to check if an enzyme was already entered as a key in the hash.

Let me remind you of the range operator that is used here to skip header lines:

 # Discard header lines  next if ( 1 .. /Rich Roberts/ );  # discard all lines from the first line                                    # to the first line containing "Rich Roberts"

The expression ( 1 . . /Rich Roberts/ ) returns true (and leads to the line being skipped ) only when the line being read is included in the range bordered by the first line and the first line containing the regular expression /Rich Roberts/ . (See the perlop section of the Perl manual for all the details on the range operator.)

The parse_rebase subroutine, after skipping the header and any blank lines, then processes each data line in a while loop. Each line is split into either two fields (name and recognition site) or three fields (name, parenthesized alternate name, and recognition site). The name or names are placed in the @names array and looped through. In the last foreach loop, if the enzyme name hasn't yet been defined in the DBM-tied hash, it is added as a key. The value assigned to the key is a string with recognition site followed by a translation of the recognition site to a regular expression. The program passes to the next name.

If the enzyme name has previously been entered as a key, the previously entered recognition sites are examined, and if the new site is there, the program passes to the next name. Similarly, if the reverse complement of the site has been entered, the program passes to the next name. But otherwise (if the enzyme name was entered, but neither the site nor its reverse complement were in the list of sites for that enzyme), the recognition site is added with its translation to a regular expression.

This method has to handle reverse complements of recognition sites. Many restriction enzymes are palindromic in the sense that their reverse complements are the same. (For instance, the reverse complement of GAATTC is GAATTC.) These biological palindromes indicate that the enzyme can cut the site on both strands.

Although the code presented here ignores cut sites, the exercises will ask you to reconsider them; note that if the cut site isn't exactly in the middle, there will be "sticky ends" of single stranded DNA that make it possible to anneal the fragment with a complementary sticky end. Refer to a standard molecular biology textbook for the essential biology of restriction enzymes. (The logic used here to handle reverse complements might not be ideal for all situations: see the exercises.)

5.2.5 Methods to Translate Nucleotides to Regular Expressions

Finally, the remaining methods translate IUB-coded nucleotide sequence data to Perl regular expressions. They also perform reverse complementation on IUB-coded sequence data.

 sub revcomIUB {     my($seq) = @_;     my $revcom = reverse complementIUB($seq);     return $revcom; } sub complementIUB {     my($seq) = @_;     (my $com = $seq) =~ tr [ACGTRYMKSWBDHVNacgtrymkswbdhvn]                               [TGCAYRKMWSVHDBNtgcayrkmwsvhdbn];     return $com; } # Translate IUB ambiguity codes to regular expressions  # IUB_to_regexp # # A subroutine that, given a sequence with IUB ambiguity codes, # outputs a translation with IUB codes changed to regular expressions # # These are the IUB ambiguity codes # (Eur. J. Biochem. 150: 1-5, 1985): # R = G or A # Y = C or T # M = A or C # K = G or T # S = G or C # W = A or T # B = not A (C or G or T) # D = not C (A or G or T) # H = not G (A or C or T) # V = not T (A or C or G) # N = A or C or G or T  sub IUB_to_regexp {     my($iub) = @_;     my $regular_expression = '';     my %iub2character_class = (         A => 'A',         C => 'C',         G => 'G',         T => 'T',         R => '[GA]',         Y => '[CT]',         M => '[AC]',         K => '[GT]',         S => '[GC]',         W => '[AT]',         B => '[CGT]',         D => '[AGT]',         H => '[ACT]',         V => '[ACG]',         N => '[ACGT]',     );     # Remove the ^ signs from the recognition sites     $iub =~ s/\^//g;     # Translate each character in the iub sequence     for ( my $i = 0 ; $i < length($iub) ; ++$i ) {         $regular_expression           .= $iub2character_class{substr($iub, $i, 1)};     }     return $regular_expression; } 1; =head1 Rebase Rebase: A simple interface to recognition sites and translations of them into         regular expressions, from the Restriction Enzyme Database (Rebase) =head1 Synopsis     use Rebase;     # Use "bionetfile" to create and populate dbm file     my $rebase = Rebase->new(         dbmfile => 'BIONET',         bionetfile => 'bionet.212',         mode => 0644     );     # Use without "bionetfile" to attach to existing dbm file     my $rebase = Rebase->new(         dbmfile => 'BIONET',         mode => 0444     );     my $enzyme = 'EcoRI';     print "Looking up restriction enzyme $enzyme\n";     my @sites = $rebase->get_recognition_sites($enzyme);     print "Sites are @sites\n";     my @res = $rebase->get_regular_expressions($enzyme);     print "Regular expressions are @res\n";     print "DBM file is ", $rebase->get_dbmfile, "\n";     print "Rebase bionet file is ", $rebase->get_bionetfile, "\n"; =head1 AUTHOR James Tisdall =head1 COPYRIGHT Copyright (c) 2003, James Tisdall =cut

5.2.6 Testing the Module

Ending the module, as usual, is some POD documentation for the module. Recall that you can view the output of this documentation in various ways, as HTML on a web page, as PostScript, etc. However, the simplest way is to say the following at the command line:

 perldoc Rebase.pm

Let's try running the sample code given in the documentation. Notice that the Rebase.pm module is available, as is the bionet.212 file from the Rebase distribution. Also, notice there are two alternate calls to Rebase->new , so you should comment out the first one, then the other, in tests. Save the sample code from the documentation in a file called testRebase , and when you run it with the command:

 perl testRebase

you get the following output:

 Looking up restriction enzyme EcoRI Sites are GAATTC Regular expressions are GAATTC DBM file is BIONET Rebase bionet file is bionet.212