5.3 Restriction.pm: Finding Recognition Sites | Mastering Perl for Bioinformatics

The time has now come to write the class that creates an object out of a sequence, enzyme name (s), and the map of the location(s) of the enzyme recognition sites in the sequence.

This module depends a great deal on the module Rebase developed in the previous section, but it is fairly short because it just tries to do a small job. This new Restriction class takes a Rebase object (which has the Rebase database translated to regular expressions), some sequence, and a list of enzymes, and uses the regular expressions to find the recognition sites in the sequence. (Note that it doesn't use inheritance; it simply creates a Rebase object to use.)

In this module, the restriction map (the list of locations where the enzymes have recognition sites in the sequence) is obtained through an auxiliary method map_enzyme that simply lists the locations. Clearly, a more graphical display would be easier and more useful. I'll consider that as the book progresses.

5.3.1 The Restriction.pm Module

Here is the Restriction.pm module:

 package Restriction; # # A class to find locations of restriction enzyme recognition sites in #  DNA sequence data. # use strict; use warnings; use Carp; # Class data and methods {    # A list of all attributes with default values.    # "enzyme" is given as an argument possibly multiple time, set as key to _map hash     my %_attributes = (         _rebase      => { },  # A Rebase.pm hash-based object         # key   = restriction enzyme name         # value = space-separated string of recognition sites => regular expressions         _sequence    => '', # DNA sequence data in raw format (only bases)         _map         => { },# a hash: keys are enzyme names,                              # values are arrays of locations          _enzyme      => '', # space- or comma-separated enzyme names,                              # set as key to _map hash     );     # Global variable to keep count of existing objects     my $_count = 0;     # Return a list of all attributes     sub _all_attributes {         keys %_attributes;     }     # Manage the count of existing objects     sub get_count {         $_count;     }     sub _incr_count {         ++$_count;     }     sub _decr_count {         --$_count;     } }

This opening block shows that Restriction is also a class with a small and simple set of attributes. One is just the DNA sequence data, _sequence .

The attribute _map is a hash that stores the computed restriction map with a key for each restriction enzyme and a value consisting of an array of the positions in the sequence where recognition sites for that enzyme occur.

The attribute _enzyme is one or more enzyme names separated by spaces or commas.

The remaining attribute _rebase is an object of the class I developed in the last section, the class Rebase . Recall that a Rebase object gives you the ability to get regular expressions for recognition sites of restriction enzymes. When you call Restriction->new you must include as one of its arguments a Rebase object, as is shown in the example given in the POD documentation later in this section.

Since a program can have many restriction maps, I also maintain the count of Restriction objects.

5.3.1.1 Initializing Restriction objects

Next , the new constructor method creates and initializes Restriction objects:

 # The "new" constructor method, called from class, e.g. sub new {     my ($class, %args) = @_;     # Create a new object     my $self = bless {  }, $class;     # Set the attributes for the provided arguments     foreach my $attribute ($self->_all_attributes(  )) {         # E.g. attribute = "_name",  argument = "name"         my($argument) = ($attribute =~ /^_(.*)/);         if (exists $args{$argument}) {             if($argument eq 'enzyme') {                 # permit space or comma separated enzyme names                 $args{$argument} =~ s/,/ /g;             }             $self->{$attribute} = $args{$argument};         }     }     # Check that the correct arguments are given     if( not defined $self->{_rebase} ) {         croak "A Rebase object must be given as an argument";     }elsif( ref($self->{_rebase}) ne 'Rebase' ) {         croak "The argument to rebase is not a Rebase object";     }elsif( not defined $self->{_sequence} ) {         croak "A sequence must be given as an argument";     }     # Calculate the locations for each enzyme, store in _map hash attribute     foreach my $enzyme (split(" ", $self->{_enzyme})) {         $self->map_enzyme($enzyme);     }     $self->_incr_count;     return $self; } # For this simple class I have no AUTOLOAD or DESTROY # No get_rebase method, I don't want to pass around a huge hash # No set mutators: all initialization done by way of "new" constructor # No clone method.  Each sequence and set of enzymes can be easily calculated #  by means of a "new" command.

5.3.1.2 The methods explained

The new constructor method checks that the proper and required arguments are provided: a sequence, enzyme(s), and a Rebase object.

Then the new constructor performs the required computation by calling the method map_enzyme for each enzyme to determine the (possibly empty) array of locations in which the enzyme can cut the sequence. These enzyme-array key/value pairs are stored in the _map attribute.

As you can see in the next segment of code from Restriction.pm , the method map_enzyme works by getting the regular expressions that have been calculated for the enzyme and then finding the positions where the regular expressions match the sequence.

The get_regular_expressions method accesses the Rebase object that was passed to the Restriction->new constructor method and appears as the $self->{_rebase} attribute. In that object, the attribute _rebase is the hash of the database, which looks for the value of the key $enzyme .

The match_positions method puts a match of a regular expression in a while loop, where it loops once for each location it finds. The offset of the matching sequence is available in the special variable $-[0] after a successful regular-expression match (see the perlvar section of the Perl documentation for more details).

 sub map_enzyme {     my($self, $enzyme) = @_;     my(@positions) = (  );     my(@res) = $self->get_regular_expressions($enzyme);     foreach my $re (@res) {         push @positions, $self->match_positions($re);     }     @{$self->{_map}{$enzyme}} = @positions;     return @positions; } sub get_regular_expressions {     my($self, $enzyme) = @_;     my(%sites) = split(' ', $self->{_rebase}{_rebase}{$enzyme});     # May have duplicate values     return values %sites; } # Find positions of a regular expression in the sequence sub match_positions {     my($self, $regexp) = @_;     my @positions = (  );     # Determine positions of regular expression matches     while ( $self->{_sequence} =~ /$regexp/ig ) {         push @positions, ($-[0] + 1 );     }     return(@positions); }

Finally, here are some helper methods that report on the attributes of sequence and of the calculated map for a given enzyme (or the entire hash that contains the map of all the enzymes). I didn't use AUTOLOAD here because there are only a handful of attributes that return a few different data types: array, scalar string, and hash.

 sub get_enzyme_map {     my($self, $enzyme) = @_;     @{$self->{_map}{$enzyme}}; } sub get_enzyme_names {     my($self) = @_;     keys %{$self->{_map}}; } sub get_sequence {     my($self) = @_;     $self->{_sequence}; } sub get_map {     my($self) = @_;     %{$self->{_map}}; }

5.3.1.3 Documentation

Here's the (bare bones) documentation for the class embedded right in the module using the POD documentation language:

 =head1 Restriction Restriction: Given a Rebase object, sequence, and list of restriction enzyme     names, return the locations of the recognition sites in the sequence =head1 Synopsis     use Restriction;     use Rebase;     use strict;     use warnings;     my $rebase = Rebase->new(         dbmfile => 'BIONET',         bionetfile => 'bionet.212'     );     my $restrict = Restriction->new(         rebase => $rebase,         enzyme => 'EcoRI, HindIII',         sequence => 'ACGAATTCCGGAATTCG',     );         print "Locations for EcoRI are ", join(' ',                                      $restrict->get_enzyme_map('EcoRI')), "\n"; =head1 AUTHOR James Tisdall =head1 COPYRIGHT Copyright (c) 2003, James Tisdall =cut 1;

Saving the Synopsis example in a file and running it (making sure the required bionet.212 file is in place) gives the output:

 EcoRI data in Rebase is GAATTC GAATTC Sequence is ACGAATTCCGGAATTCG Locations for EcoRI are 3 11