Here is a very simple interface to the Rebase data contained in the bionet file that is part of its distribution: package Rebase; # # A simple class to provide access to restriction enzyme data from Rebase # including regular expression translations of recognition sites # use strict; use warnings; use Carp; use DB_File; # Class data and methods { # A hash of all attributes with default values my %_attributes = ( _rebase => { }, # key = restriction enzyme name # value = space-separated string of sites => regular expressions _bionetfile => '??', _dbmfile => '??', _mode => 0444, ); # Return a list of all attributes sub _all_attributes { keys %_attributes; } # Return the value of an attribute sub _attribute_value { my($self,$attribute) = @_; $_attributes{$attribute}; } } Notice that the opening block is considerably pared down, compared to earlier classes. For instance, I've tossed the code that keeps count of all objects. Why? Because it's unlikely that more than one of these objects will be necessary in a program: so why bother? 5.2.1 Attributes: Short and SweetNotice that the list of attributes is short:
With so few attributes, the class methods can easily handle each attribute individually, without recourse to the use of AUTOLOAD to define various accessors and mutators, as seen in previous chapters. 5.2.2 Creating a Rebase ObjectHere's how a Rebase object is created and initialized : # The constructor method # Called from class, e.g. $obj = Rebase->new( dbmfile => 'DBMFILE' ); sub new { my ($class, %arg) = @_; # Create a new object my $self = bless { }, $class; # DBM file must be given as "dbmfile" argument unless($arg{dbmfile}) { croak("No dbm file specified"); } # Set the attributes for the provided arguments foreach my $attribute ($self->_all_attributes( )) { # E.g. attribute = "_name", argument = "name" my($argument) = ($attribute =~ /^_(.*)/); # Initialize to defaults $self->{$attribute} = $self->_attribute_value($attribute); # Override defaults with arguments if (exists $arg{$argument}) { if($argument eq 'rebase') { croak "Cannot set attribute rebase"; } $self->{$attribute} = $arg{$argument}; } } # Open or create the DBM file unless(tie %{$self->{_rebase}}, 'DB_File', $arg{dbmfile}, O_RDWRO_CREAT, $self-> {_mode}, $DB_HASH) { my $permissions = sprintf "%lo", $self->{_mode}; croak "Cannot open DBM file $arg{dbmfile} with mode $permissions"; } # If "bionetfile" argument given, calculate the hash from the bionet file if($arg{bionetfile}) { # Empty the hash %{$self->{_rebase}} = ( ); # Recalculate the hash $self->parse_rebase; } return $self; } # For this simple class I have no AUTOLOAD or DESTROY # No get_rebase method, I don't want to pass around a huge hash # No "set" mutators: all initialization done by way of "new" constructor The new constructor method is short. It requires a DBM database filename (existing or new), to which the hash data structure _rebase is tied. If the DBM file doesn't exist, the pathname of the bionet Rebase file is also required (that's where the data comes from that populates the DBM file). If a bionet datafile is given, the method calls the parse_rebase method that parses the bionet file to create the _rebase hash. As the comments indicate , my class is so simple I've even decided to do away with AUTOLOAD and DESTROY , and I've dispensed with the _set mutators as well. 5.2.3 Methods for the Rebase ClassNow, let's continue by looking at the methods for the Rebase class. Given an enzyme, the following two methods, get_recognition_sites and get_regular_expressions , retrieve the enzyme's recognition sites and the translations of the recognition sites into regular expressions. How do these two methods work? One method returns all recognition sites for an enzyme as given in the Rebase database; the other returns all the translations of the recognition sites into regular expressions. They both work very similarly. First, the enzyme is looked up in the _rebase hash. The value for each enzyme in the _rebase hash is a space-separated string that alternates recognition sites with their regular-expression translations. In both methods, the space-separated string is split to get a list of alternating recognition sites and regular expressions. This list is then assigned to the hash %sites to populate it with keys as recognition sites (the data that's actually in the Rebase bionet file) and values as regular expressions. The Perl operators keys and values are then used to generate the list of recognition sites (keys) or regular expressions (values). The get_bionetfile , get_dbmfile , and get_mode methods just report on the arguments that are set to specific filenames or mode strings when the object was created: sub get_regular_expressions { my($self, $enzyme) = @_; my(%sites) = split(' ', $self->{_rebase}{$enzyme}); # May have duplicate values return values %sites; } sub get_recognition_sites { my($self, $enzyme) = @_; my(%sites) = split(' ', $self->{_rebase}{$enzyme}); return keys %sites; } sub get_bionetfile { my($self) = @_; return $self->{_bionetfile}; } sub get_dbmfile { my($self) = @_; return $self->{_dbmfile}; } sub get_mode { my($self) = @_; return $self->{_mode}; } 5.2.4 parse_rebaseThe workhorse method of the class is parse_rebase , which reads the bionet Rebase datafile (with a suffix that indicates the release version, such as bionet.212 ). The bionet input datafile begins like this: REBASE version 212 bionet.212 =-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-= REBASE, The Restriction Enzyme Database http://rebase.neb.com Copyright (c) Dr. Richard J. Roberts, 2002. All rights reserved. =-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-= Rich Roberts Dec 01 2002 AaaI (XmaIII) C^GGCCG AacI (BamHI) GGATCC AaeI (BamHI) GGATCC AagI (ClaI) AT^CGAT AaqI (ApaLI) GTGCAC AarI CACCTGCNNNN^ AarI ^NNNNNNNNGCAGGTG AasI (DrdI) GACNNNN^NNGTC AatI (StuI) AGG^CCT AatII GACGT^C AauI (Bsp1407I) T^GTACA As you can see, the header information ends with a line containing Rich Roberts . Apart from a blank line or two, the rest of the file contains records, one per line. Each record begins with a restriction enzyme name, optionally followed by another enzyme name in parentheses. The last field of each line is the recognition site. These are given using IUB codes for nucleotides. (For the IUB codes, see the comments in the program.) They also contain cut sites, indicated by a caret symbol ^ . Cut sites contribute very important information about a restriction enzyme; they show where the enzyme makes the break when it cuts the DNA. Among other things, they are needed to correctly perform restriction digests in the computer when determining if there are overhangs that will be useful when inserting vectors or otherwise reassembling the fragments . However, in the code here we'll ignore the cut sites taking as a goal the virtual fingerprinting of DNA by just locating the recognition sites. Cut sites are omitted to simplify the code. (See the exercises for more on handling cut sites.) sub parse_rebase { my($self) = @_; # handles multiple definition lines for an enzyme name # also handles alternate enzyme names on a line # Read in the bionet(Rebase) file unless(open(BIONETFH, $self->{_bionetfile})) { croak("Cannot open bionet file $self->{_bionetfile}"); } while(<BIONETFH>) { my @names = ( ); # Discard header lines next if ( 1 .. /Rich Roberts/ );# discard all lines from the first line # to the first line containing "Rich Roberts" # Discard blank lines next unless /\S/; # discard a line unless it contains something not # whitespace # Split the two (or three if includes parenthesized name) fields my @fields = split; # Get and store the recognition site my $site = pop @fields; # For the purposes of this exercise, I'll ignore cut sites (^). # This is not something you'd want to do in general, however! $site =~ s/\^//g; # Get and store the name and the recognition site. # Add alternate (parenthesized) names # from the middle field, if any foreach my $name (@fields) { $name =~ tr/)(//d; # delete parentheses push @names, $name; } # Store the data into the hash, avoiding duplicates (ignoring ^ cut sites) # and ignoring reverse complements # Because these values are stored via DBM, I cannot use anything but # a scalar string to store the site/regularexpression pairs, space-separated # (but see the exercises) foreach my $name (@names) { # Add new enzyme definition unless(exists $self->{_rebase}{$name}) { $self->{_rebase}{$name} = "$site " . IUB_to_regexp($site); next; } my(%defined_sites) = split(' ', $self->{_rebase}{$name}); # Omit already defined sites if(exists $defined_sites{$site}) { next; # Omit reverse complements of already defined sites }elsif(exists $defined_sites{revcomIUB($site)}) { next; # Add the additional site }else{ $self->{_rebase}{$name} .= " $site " . IUB_to_regexp($site); } } } return 1; } This subroutine is a bit complex, corresponding to the nature of the data that it's processing. For instance, because enzymes can appear on more than one line, it has to check if an enzyme was already entered as a key in the hash. Let me remind you of the range operator that is used here to skip header lines: # Discard header lines next if ( 1 .. /Rich Roberts/ ); # discard all lines from the first line # to the first line containing "Rich Roberts" The expression ( 1 . . /Rich Roberts/ ) returns true (and leads to the line being skipped ) only when the line being read is included in the range bordered by the first line and the first line containing the regular expression /Rich Roberts/ . (See the perlop section of the Perl manual for all the details on the range operator.) The parse_rebase subroutine, after skipping the header and any blank lines, then processes each data line in a while loop. Each line is split into either two fields (name and recognition site) or three fields (name, parenthesized alternate name, and recognition site). The name or names are placed in the @names array and looped through. In the last foreach loop, if the enzyme name hasn't yet been defined in the DBM-tied hash, it is added as a key. The value assigned to the key is a string with recognition site followed by a translation of the recognition site to a regular expression. The program passes to the next name. If the enzyme name has previously been entered as a key, the previously entered recognition sites are examined, and if the new site is there, the program passes to the next name. Similarly, if the reverse complement of the site has been entered, the program passes to the next name. But otherwise (if the enzyme name was entered, but neither the site nor its reverse complement were in the list of sites for that enzyme), the recognition site is added with its translation to a regular expression. This method has to handle reverse complements of recognition sites. Many restriction enzymes are palindromic in the sense that their reverse complements are the same. (For instance, the reverse complement of GAATTC is GAATTC.) These biological palindromes indicate that the enzyme can cut the site on both strands.
5.2.5 Methods to Translate Nucleotides to Regular ExpressionsFinally, the remaining methods translate IUB-coded nucleotide sequence data to Perl regular expressions. They also perform reverse complementation on IUB-coded sequence data. sub revcomIUB { my($seq) = @_; my $revcom = reverse complementIUB($seq); return $revcom; } sub complementIUB { my($seq) = @_; (my $com = $seq) =~ tr [ACGTRYMKSWBDHVNacgtrymkswbdhvn] [TGCAYRKMWSVHDBNtgcayrkmwsvhdbn]; return $com; } # Translate IUB ambiguity codes to regular expressions # IUB_to_regexp # # A subroutine that, given a sequence with IUB ambiguity codes, # outputs a translation with IUB codes changed to regular expressions # # These are the IUB ambiguity codes # (Eur. J. Biochem. 150: 1-5, 1985): # R = G or A # Y = C or T # M = A or C # K = G or T # S = G or C # W = A or T # B = not A (C or G or T) # D = not C (A or G or T) # H = not G (A or C or T) # V = not T (A or C or G) # N = A or C or G or T sub IUB_to_regexp { my($iub) = @_; my $regular_expression = ''; my %iub2character_class = ( A => 'A', C => 'C', G => 'G', T => 'T', R => '[GA]', Y => '[CT]', M => '[AC]', K => '[GT]', S => '[GC]', W => '[AT]', B => '[CGT]', D => '[AGT]', H => '[ACT]', V => '[ACG]', N => '[ACGT]', ); # Remove the ^ signs from the recognition sites $iub =~ s/\^//g; # Translate each character in the iub sequence for ( my $i = 0 ; $i < length($iub) ; ++$i ) { $regular_expression .= $iub2character_class{substr($iub, $i, 1)}; } return $regular_expression; } 1; =head1 Rebase Rebase: A simple interface to recognition sites and translations of them into regular expressions, from the Restriction Enzyme Database (Rebase) =head1 Synopsis use Rebase; # Use "bionetfile" to create and populate dbm file my $rebase = Rebase->new( dbmfile => 'BIONET', bionetfile => 'bionet.212', mode => 0644 ); # Use without "bionetfile" to attach to existing dbm file my $rebase = Rebase->new( dbmfile => 'BIONET', mode => 0444 ); my $enzyme = 'EcoRI'; print "Looking up restriction enzyme $enzyme\n"; my @sites = $rebase->get_recognition_sites($enzyme); print "Sites are @sites\n"; my @res = $rebase->get_regular_expressions($enzyme); print "Regular expressions are @res\n"; print "DBM file is ", $rebase->get_dbmfile, "\n"; print "Rebase bionet file is ", $rebase->get_bionetfile, "\n"; =head1 AUTHOR James Tisdall =head1 COPYRIGHT Copyright (c) 2003, James Tisdall =cut 5.2.6 Testing the ModuleEnding the module, as usual, is some POD documentation for the module. Recall that you can view the output of this documentation in various ways, as HTML on a web page, as PostScript, etc. However, the simplest way is to say the following at the command line: perldoc Rebase.pm Let's try running the sample code given in the documentation. Notice that the Rebase.pm module is available, as is the bionet.212 file from the Rebase distribution. Also, notice there are two alternate calls to Rebase->new , so you should comment out the first one, then the other, in tests. Save the sample code from the documentation in a file called testRebase , and when you run it with the command: perl testRebase you get the following output: Looking up restriction enzyme EcoRI Sites are GAATTC Regular expressions are GAATTC DBM file is BIONET Rebase bionet file is bionet.212 |