6.8 A Rebase Database Implementation | Mastering Perl for Bioinformatics

In earlier chapters, I developed an object-oriented interface to Rebase that stores the data in a simple DBM hash database, and in this chapter I showed how to interact with a MySQL relational database management system.

Now, let's put the two things together.

In this section, I'll take the Rebase.pm module that implements the Rebase class and modify it to use a relational database instead of Perl's simple hash-based DBM database. I'll call the resulting class RebaseDB .

For a small problem like this, either approach works pretty well. In fact, on my computer, the DBM approach is considerably faster than the MySQL version.

The relational database implementation has a slower response due to the more complicated DBMS system involved and increased overhead in programming because a greater number of programming statements must be written. However, the reason for using it is probably clear: the relational database provides a lot more flexibility in making queries, organizing the data, and, especially , expanding the application to handle a greater variety of data.

This flexibility is very important. The Rebase database from http://www.neb.com/rebase includes several more datafiles, such as where to get the enzymes. It wouldn't be difficult to add a table or tables to store that information in the relational version of my Perl program; it would be a major pain to keep adding new DBM hashes for various complex relationships. So, I want a relational database because it has good scalability as the database application grows.

6.8.1 RebaseDB Class: Accessing Restriction Enzyme Data

Following is the code for a new class, RebaseDB , which aims to provide the same functionality as the previous class Rebase , but with a relational database instead of a DBM hash data storage.

In the interest of space, the code for the following subroutines isn't reproduced here; look for them in Chapter 5: revcomIUB , complementIUB , and IUB_to_regexp . They are, of course, included in the RebaseDB.pm module available for downloading from this book's web page.

 package RebaseDB; # # A simple class to provide access to restriction enzyme data from Rebase #  including regular expression translations of recognition sites # Data is stored in a MySQL database # use strict; use warnings; use DBI; use Carp; # Class data and methods {     # A hash of all attributes with default values     my %_attributes = (         _rebase      => { },  # unused in this implementation                               #    key   = restriction enzyme name                               #    value = pairs of sites and regular expressions         _mysql       => '??', #  e.g. mysql => 'rebase:localhost',         _dbh         => '',   # database handle from DBI->connect         _bionetfile  => '??', # source of data from e.g. "bionet.212" file     );              # Return a list of all attributes     sub _all_attributes {             keys %_attributes;     } } # The constructor method # Called from class, e.g. $obj = Rebase->new( mysql => 'localhost:rebase' ); sub new {     my ($class, %arg) = @_;     # Create a new object     my $self = bless {  }, $class;     # Set the attributes for the provided arguments     foreach my $attribute ($self->_all_attributes(  )) {         # E.g. attribute = "_name",  argument = "name"         my($argument) = ($attribute =~ /^_(.*)/);         if (exists $arg{$argument}) {             if($argument eq 'rebase') {                 croak "Cannot set attribute rebase";             }             $self->{$attribute} = $arg{$argument};         }     }     # MySQL host:database string must be given as "mysql" argument     unless($arg{mysql}) {         croak("No MySQL host:database specified");     }     # Connect to the Rebase database     my $user = 'tisdall';     my $passwd = 'NOTmyPASSWORD';     my $dbh;     unless($dbh = DBI->connect("dbi:mysql:$arg{mysql}", $user, $passwd)) {         carp "Cannot connect to MySQL database at $arg{dbmfile}";         return;     }     $self->setDBhandle($dbh);     # If "bionetfile" argument given, populate the database from the bionet file     if($arg{bionetfile}) {         $self->parse_rebase(  );     }     return $self; } # For this simple class I have no AUTOLOAD or DESTROY # No "set" mutators: all initialization done by way of "new" constructor sub get_regular_expressions {     my($self, $enzyme) = @_;     my $dbh = $self->getDBhandle;     my $sth = $dbh->prepare(         'select Regex from REGEXES, ENZYMES where          ENZYMES.EnzId = REGEXES.EnzId and ENZYMES.Enzyme=?'     );     $sth->execute($enzyme);     my @regexes;     while( my $row = $sth->fetchrow_arrayref) {             push(@regexes, $$row[0]);     }     return @regexes; } sub getDBhandle {     my($self) = @_;     return $self->{_dbh}; } sub setDBhandle {     my($self, $dbh) = @_;     return $self->{_dbh} = $dbh; } sub get_recognition_sites {     my($self, $enzyme) = @_;     my $dbh = $self->getDBhandle;     my $sth = $dbh->prepare(         'select Site from SITES, ENZYMES          where ENZYMES.EnzId = SITES.EnzId and ENZYMES.Enzyme=?'     );     $sth->execute($enzyme);     my @sites;     while( my $row = $sth->fetchrow_arrayref) {             push(@sites, $$row[0]);     }     return @sites; } sub get_bionetfile {     my($self) = @_;     return $self->{_bionetfile}; } sub parse_rebase {     my($self) = @_;     # handles multiple definition lines for an enzyme name     # also handles alternate enzyme names on a line     # Get database handle     my $dbh = $self->getDBhandle(  );     # Delete existing tables, recreate them     # Prepare statement handles with "bind" variables and autoincrement     # ENZYMES table     my $drop = $dbh->prepare('drop table if exists ENZYMES');     $drop->execute(  );     my $create = $dbh->prepare(         "CREATE TABLE ENZYMES ( EnzId int(11) NOT NULL auto_increment default '0',          Enzyme varchar(255) NOT NULL default '', PRIMARY KEY  (EnzId)) TYPE=MyISAM"     );     $create->execute(  );     # Prepare filehandles outside of "while" loop     my $enzymes_select = $dbh->prepare(         'select EnzId from ENZYMES where Enzyme=?'     );     my $enzymes_insert =  $dbh->prepare(         'insert ENZYMES ( EnzId, Enzyme ) values ( NULL, ? )'     );      # SITES table     $drop = $dbh->prepare('drop table if exists SITES');     $drop->execute(  );     $create = $dbh->prepare(         "CREATE TABLE SITES ( SiteId int(11) NOT NULL auto_increment default '0',          EnzId int(11) NOT NULL default '0', Site varchar(255) NOT NULL default '',          PRIMARY KEY  (SiteId)) TYPE=MyISAM"     );     $create->execute(  );     # Prepare filehandles outside of "while" loop     my $sites_insert = $dbh->prepare(         'insert SITES ( SiteId, EnzId, Site ) values ( NULL, ?, ? )'     );     my $sites_select = $dbh->prepare(         'select EnzId, Site from SITES where EnzId=? and Site=?'     );     my $sitesrevcom_select = $dbh->prepare(         'select EnzId, Site from SITES where EnzId=? and Site=?'     );     # REGEXES table     $drop = $dbh->prepare('drop table if exists REGEXES');     $drop->execute(  );     $create = $dbh->prepare(         "CREATE TABLE REGEXES ( RegexId int(11) NOT NULL auto_increment default '0',          EnzId int(11) NOT NULL default '0', Regex varchar(255) NOT NULL default '',          PRIMARY KEY  (RegexId)) TYPE=MyISAM"     );     $create->execute(  );     # Prepare filehandles outside of "while" loop     my $regexes_insert = $dbh->prepare(         'insert REGEXES ( RegexId, EnzId, Regex ) values ( NULL, ?, ? )'     );     my $lastid =  $dbh->prepare('select LAST_INSERT_ID(  ) as pk');     # Read in the bionet(Rebase) file     unless(open(BIONETFH, $self->get_bionetfile)) {         croak("Cannot open bionet file " . $self->get_bionetfile);     }     while(<BIONETFH>) {         my @names = (  );         # Discard header lines         ( 1 .. /Rich Roberts/ ) and next;         # Discard blank lines         /^\s*$/ and next;              # Split the two (or three if includes parenthesized name) fields         my @fields = split( " ", $_);         # Get and store the recognition site         my $site = pop @fields;         # For the purposes of this exercise, I'll ignore cut sites (^).         # This is not something you'd want to do in general, however!         $site =~ s/\^//g;         # Get and store the name and the recognition site.         # Add alternate (parenthesized) names         # from the middle field, if any         foreach my $name (@fields) {             if($name =~ /\(.*\)/) {                 $name =~ s/\((.*)\)//;             }             push @names, $name;         }         # Store the data, avoiding duplicates (ignoring ^ cut sites)         # and ignoring reverse complements         foreach my $name (@names) {             my $pk;             my $row;             # if enzyme exists             $enzymes_select->execute($name);             if($row = $enzymes_select->fetchrow_arrayref) {                     # get its "pk"                 $pk = $$row[0];             }else{                 # Add new enzyme definition                 $enzymes_insert->execute($name);                 # Get last autoincremented primary id                 $lastid->execute(  );                 my $pkhash = $lastid->fetchrow_hashref;                 $pk = $pkhash->{pk};             }             # if pk,site exist go to top of loop             $sites_select->execute($pk, $site);             if($row = $sites_select->fetchrow_arrayref) {                 next;             }             # and if pk,revcomIUB(site) exist go to top of loop             $sitesrevcom_select->execute($pk, revcomIUB($site));             if($row = $sitesrevcom_select->fetchrow_arrayref) {                 next;             }             # Add new site definition             #  since neither pk,site nor             #  pk,revcomIUB(site) exists.             $sites_insert->execute($pk, $site);             # Add new regex definition             $regexes_insert->execute($pk, IUB_to_regexp($site));         }     }     return 1; } 1; =head1 RebaseDB Rebase: A simple interface to recognition sites and translations of them into         regular expressions, from the Restriction Enzyme Database (Rebase) =head1 Synopsis     use RebaseDB;     my $rebase = RebaseDB->new(         mysql => 'rebase:localhost',         bionetfile => 'bionet.212'     );     my $enzyme = 'EcoRI';     print "Looking up restriction enzyme $enzyme\n";     my @sites = $rebase->get_recognition_sites($enzyme);     print "Sites are @sites\n";     my @res = $rebase->get_regular_expressions($enzyme);     print "Regular expressions are @res\n";     my $enzyme = 'HindIII';     print "Looking up restriction enzyme $enzyme\n";     my @sites = $rebase->get_recognition_sites($enzyme);     print "Sites are @sites\n";     my @res = $rebase->get_regular_expressions($enzyme);     print "Regular expressions are @res\n";     print "Rebase bionet file is ", $rebase->get_bionetfile, "\n"; =head1 AUTHOR James Tisdall =head1 COPYRIGHT Copyright (c) 2003, James Tisdall =cut

6.8.2 testRebaseDB: A Testing Program

The following test program testRebaseDB is taken from the RebaseDB.pm documentation and slightly altered .

 use lib "/home/tisdall/MasteringPerlBio/development/lib";          use RebaseDB;     my $rebase = RebaseDB->new(         mysql => 'rebase:localhost',         bionetfile => 'bionet.212'     );     my $enzyme = 'EcoRI';     print "Looking up restriction enzyme $enzyme\n";     my @sites = $rebase->get_recognition_sites($enzyme);     print "Sites are @sites\n";     my @res = $rebase->get_regular_expressions($enzyme);     print "Regular expressions are @res\n";     my $enzyme = 'HindIII';     print "Looking up restriction enzyme $enzyme\n";     my @sites = $rebase->get_recognition_sites($enzyme);     print "Sites are @sites\n";     my @res = $rebase->get_regular_expressions($enzyme);     print "Regular expressions are @res\n";     print "Rebase bionet file is ", $rebase->get_bionetfile, "\n";

Here's the output of testRebaseDB :

 Looking up restriction enzyme EcoRI Sites are GAATTC Regular expressions are GAATTC Looking up restriction enzyme HindIII Sites are AAGCTT Regular expressions are AAGCTT Rebase bionet file is bionet.212

6.8.3 Analyzing RebaseDB

Let's take a walk through this RebaseDB.pm class module to see how it uses the DBI relational database interface.

For starters, the %_attributes hash has changed to reflect the new relational database method. The _rebase key is no longer needed, but the bionetfile is again used to load the database; in the absence of this argument, the program attempts to use a previously loaded database.

Two new argument keys are present. _mysql gives the database name (e.g., rebase ) and the user's computer ( localhost if the user is running the program on the same computer the database is served from). _dbh holds the DBI object returned from the DBI->connect call.

The new constructor method has changed in significant ways. It checks the arguments a little bit differently, of course, because this version of the class has different arguments coming in. It then actually connects to the MySQL database using the DBI calls seen previously in this chapter. In case of failure, it doesn't carp or die; it prints an error message and returns ”a much better behavior than just dying. If this were a web script, for instance, you might want to keep running so you can return more input from the user in case of failure.

In case of success, the DBI->connect object reference is saved in the attribute reserved for this purpose, $self->{_dbh} . ( dbh stands for "data base handle".) Elsewhere in the module two methods are defined: the accessor method, getDBhandle and the mutator method, setDBhandle . They get and set this MySQL database object handle in the class attribute _dbh .

The program then calls the parse_rebase program. Let's look closely at this method and the workhorse that reads Rebase data in from a file and populates the MySQL database with it.

After retrieving as $dbh the object that points to the database, the method sends several SQL statements to the DBMS. These statements delete each of the three tables in this MySQL database called rebase , tossing out all their data in the process. The calls then create the tables anew (and empty).

This section of the code also prepares SQL statements that insert new rows with the values being passed in as bind arguments and selects various sets of rows. These SQL statements are used repeatedly in the while loop that follows , and the program saves time by having the SQL statements prepared just once before the loop.

I won't examine the various SQL statements in detail here. However, I do want to point out the use of autoincrementing on the ID field of each table. This option, applicable only to single-field keys that serve as the primary key for a table, has the effect of always picking the next value for the field, and is perfect for the unique ID field I usually want in a table as the primary key. It's just one less thing to worry about in the module. An SQL statement is also prepared that uses an SQL function to return the value of the last autoincremented key.

Now that the new database tables are fresh and empty, the program opens the bionet file and prepares to read it within a while loop. The beginning of this loop is unchanged from the previous version Rebase.pm because it discards the file header and blank lines and extracts the names of the restriction enzymes and the associated recognition sites.

First, the program checks to see if there is already a definition for the enzyme entered in the database; if so, it just retrieves its unique ID, the primary key $pk . If the enzyme hasn't been entered, it is now, and the ID that is automatically created for it is saved.

Next, the program checks if that primary key ID and site are already paired in the SITES table; if so, it goes back to the top of the while loop.

Again, the program checks if that primary key ID and the reverse complement of the site are already paired in the SITES table; if so, it goes back to the top of the while loop.

However, if the ID and the site are still unknown, they are entered into the SITES table, and the ID and the regular expression generated from the site are entered into the REGEXES table.

That wraps it up for the parse_rebase method. Because of all the database interactions, this is a pretty slow bit of code (see the exercises).

The only parts of the RebaseDB.pm module code we haven't looked at are the two similar methods called get_recognition_sites and get_regular_expressions . They retrieve all recognition sites (or regular expressions) associated with a given enzyme. The SQL statement in each of these methods is an example of a join of two tables:

 select Regex from REGEXES, ENZYMES where ENZYMES.EnzId = REGEXES.EnzId and ENZYMES.Enzyme=?;

(This is one SQL statement.) The statement asks for the regular expressions from the REGEXES table that have the same EnzID as the enzyme given in argument has in the ENZYMES table.

That's the end of my discussion of the RebaseDB.pm class module. See the exercises for suggestions on how to improve this module.