6.7 Perl DBI and DBD Interface Modules | Mastering Perl for Bioinformatics

SQL is a fairly simple and easy-to-learn language, considered by most to be well-tailored to its task. However, there are things available in most programming languages, such as control flow ( while , for , foreach ) and conditional branches ( if-else ) that aren't provided in most implementations of the language. The lack of these abilities severely restricts the use of SQL as a standalone language.

Most applications that use a relational database are written in another language such as Perl. Perl provides a link between the application, the programmer, the user , the files, the web server, and so on. Perl also provides the program logic. The interaction between the application and the database is typically to execute some database commands, such as fetching data from the database and processing it using Perl's capabilities. The logic of the program may depend on the data found in the database, but it is Perl, not SQL, that provides this logic (for the most part).

In Perl, a set of modules have been written that allow interaction with relational databases. The DataBase Independent (DBI) module handles most of the interaction from the program code; the DataBase Dependent (or DataBase Driver) (DBD) modules, different for each particular DBMS, handle communicating with the DBMS.

6.7.1 Installing and Configuring Perl DBI and DBD Modules

To use a MySQL database from Perl, you need first to have installed and properly configured MySQL. This is not a Perl job, but a database administration job; you have to get MySQL and install it on your system and set up the appropriate user accounts and permissions.

You have to then install the Perl DBI module (http://www.symbolstone.org/technology/perl/DBI) using CPAN from the command line:

 perl -MCPAN -e shell;

Then type:

 install DBI

You can also install it by downloading the module from CPAN, for example, via a web browser and following the QUICK START GUIDE instructions shown here:

 QUICK START GUIDE:     The DBI requires one or more 'driver' modules to talk to databases.     Check that a DBD::* module exists for the database you wish to use.     Read the DBI README then Build/test/install the DBI by doing             perl Makefile.PL             make             make test             make install     Then delete the source directory tree since it's no longer needed.     Use the 'perldoc DBI' command to read the DBI documentation.     Fetch the DBD::* driver module you wish to use and unpack it.     http://search.cpan.org/ (or www.activestate.com if on Windows)     It is often important to read the driver README file carefully.     Generally the build/test/install/delete sequence is the same     as for the DBI module. The DBI.pm file contains the DBI specification and other documentation. PLEASE READ IT. It'll save you asking questions on the mailing list which you will be told are already answered in the documentation. For more information and to keep informed about progress you can join the a mailing list via mailto:dbi-users-help@perl.org To help you make the best use of the dbi-users mailing list, and any other lists or forums you may use, I strongly recommend that you read "How To Ask Questions The Smart Way" by Eric Raymond:     http://www.tuxedo.org/~esr/faqs/smart-questions.html Much useful information and online archives of the mailing lists can be found at http://dbi.perl.org/ See also http://search.cpan.org/

Finally, you have to install the Perl DBD driver for MySQL, called DBD::MySQL . Look in CPAN at http://cpan.org/modules/by-module/DBD/ for the latest version; at the time of writing, it's http://cpan.org/modules/by-module/DBD/DBD-mysql-2.1026.tar.gz.

The combination of MySQL (the DBMS), DBD (the particular driver for your DBMS), and DBI (the Perl interface to the DBI and DBMS), is what gives the actual connection from Perl to the database and enables you to send SQL statements to the database and retrieve results.

Getting these components installed is sometimes the most difficult part of getting involved with database programming. Installing and configuring MySQL has several steps, and if you are very new to computers, you may find some of the instructions difficult to follow, as they may assume that you know more about your computer system than you do. DBI and DBD are typically much easier to install, but you may run into snags with them as well. The help of experienced hands, either directly or by means of the type of mailing list mentioned in the QUICK START GUIDE , can make the difference between days of frustration and a successful installation.

6.7.2 Handling Tab-Delimited Input Files

Let's say you have the components installed (MySQL, Perl, DBD, DBI), and you want to write a program that talks to the database. I'll assume you've implemented a new version of the homologs database as shown. We'll now walk through a small Perl example that shows how to read data in from a file, populate a database, send queries, and retrieve results.

First, here is the data as you might find it in a file. All the whitespace between the words is the tab character in the file, not space characters :

 TABLE        ORGANISM OrgId        Organism 1            human 2            worm 3            mouse TABLE      GENES GeneId     Gene        Date 118        aging       1984-07-13 9223       wrinkle     1987-08-15 273        hairy       1990-09-30 TABLE    VARIANTS VarId    OrgId    GeneId 1        1        118 2        2        118 3        1        9223 4        3        9223 5        3        273

There are several ways to find the data in a database. It's common to have it in a plain file that has tables represented by lines, one table row on each line, with the field values separated by tabs or some other character that doesn't appear in any of the values of any field.

You should bear in mind that this is just one of several possibilities for the source and format of your input data. See the interesting book Data Munging with Perl , by David Cross (Manning) for lots of useful lore about getting data in and out of various sources.

SQL itself provides a utility for this purpose, called load , which assumes that you have a file consisting of only rows of data. You can specify what columns to load, what delimiter the file uses (tab by default), and a few other options. Its performance is optimized, and it is much faster than executing several SQL insert statements. However, you still need to read data in from files in different formats: what better than your own program that you can alter to suit any occasion?

To start developing such a utility, here is a short program to populate the database. It reads the file and knows the table it's reading, the field names , and the data for each row:

 #!/usr/bin/perl use strict; use warnings; my $flag = 0; my $table; my @table; my @fieldnames; my @fields; while(<>) {     if(/^\s*$/) {         # skip blank lines         ;     }elsif(/^TABLE\t(\w+)/) {         # output previous table         print(@table) if $flag;         $flag = 1;         # begin new table         @table = (  );         $table = ;         push(@table, "\nTable is $table\n");     } elsif($flag =  = 1) {         @fieldnames = split;         $flag = 2;         push(@table, "Fields are ", join("", @fieldnames), "\n");     } elsif($flag =  = 2) {         @fields = split;         push(@table, join("", @fields) . "\n");     } } # output last table print @table;

This program understands the file format I gave previously, reads it in, and then reformats it and prints it out. It's just an example of how you might read in data. In the following, I'll modify this program to read in the file, but instead of printing out the (reformatted) tables, it sends SQL commands to the MySQL database to insert the data into the appropriate tables.

As you see, this first version of the program uses the $flag variable to keep track of what it's reading. Every time the input line begins with TABLE\t (that \t is a tab that actually shows up as whitespace), the program outputs the previously read table (if $flag indicates there was one). It then saves the next word as the table's name , sets the $flag to 1, and prepares some output in the array @table .

Otherwise, if the $flag variable is set to 1, the program knows it's on the second line of a table (remember, this program is specially written for the input file format I gave previously). In this case, it saves the names of the fields in an array, and then reformats them and adds them to the @table output array.

Finally, if the $flag variable is set to 2, the program knows it's reading rows of the table; it reformats them and adds them to the @table output array.

When all the input is done, and the while loop finishes, there will be the last table's reformatted output ready to be printed from the @table output array.

If I call this program homologs.getdata a nd give it my data file homologs.tabs like so:

 % perl homologs.getdata homologs.tabs

I get the following output:

 Table is ORGANISM Fields are OrgIdOrganism 1human 2worm 3mouse Table is GENES Fields are GeneIdGeneDate 118aging1984-07-13 9223wrinkle1987-08-15 273hairy1990-09-30 Table is VARIANTS Fields are VarIdOrgIdGeneId 11118 22118 319223 439223 53273

Notice that all I've really done here is read in the data and print it out in a slightly different format; among other things, I've changed the delimiter between fields from a tab to a vertical bar, a common type of task with these database dumps. But now, let's see how to interact with an actual database.

6.7.3 DBI Examples

Let's take the homologs.getdata program from the previous section and add the DBI calls to the MySQL database that will populate the MySQL database with the read-in data.

6.7.3.1 homologs.tabs

For starters, let's just see a Perl program that connects to the database, asks a simple question ("What tables are in this database?"), displays the results, and disconnects:

 #!/usr/bin/perl use strict; use warnings; # Make connection with MySQL database use DBI; my $database = 'homologs'; my $server   = 'localhost'; my $user     = 'tisdall'; my $passwd   = 'NOTmyPASSWORD'; my $homologs = DBI->connect("dbi:mysql:$database:$server", $user, $passwd); # prepare an SQL statement my $query    = "show tables"; my $sql      = $homologs->prepare($query); # execute an SQL statement $sql->execute(  ); # retrieve and print results while (my $row = $sql->fetchrow_arrayref) {     print join("\t", @$row), "\n"; } # Break connection with MySQL database $homologs->disconnect; exit;

Here's the result of running this program:

 GENES ORGANISM VARIANTS

This program does the basic tasks that all DBI programs have to do, so let's examine them in useful detail.

After the obligatory use DBI; that loads the DBI module, I declare some variables to hold the string that specifies to DBI who is connecting to what and where. The actual connection happens here:

 my $homologs = DBI->connect("dbi:mysql:$database:$server", $user, $passwd);

The connect method is asked to connect to the particular database (in this case the homologs database) on the local computer ( localhost ; this can be replaced by a URL of another computer) as the user with the given password. mysql is specified; if you are using another DBMS, you should change this to reflect your database system. Other optional arguments, not used here, are possible (see the documentation). The connect method is the DBI new method that creates (and initializes) a DBI object.

The output of the connect call is saved as a new DBI object in the reference variable $homologs .

Next, the DBI object prepares an SQL statement, and the result is saved as a new statement object, which I call $sql here. This statement object $sql then calls its own execute method which actually does the job of sending the SQL to the database.

The actual SQL here is a very simple one. You've already seen that DBI->connect specifies the particular database on your MySQL ( homologs in this case). This SQL statement show tables simply asks for a list of the names of the tables that are defined in that database.

After execute , the program retrieves the results of executing the SQL statement. There are several ways to retrieve results. Here, the statement object method fetchrow_arrayref is called in a loop to fetch all the rows of the result; at each pass through the loop, the return value in $row points to the array of fields in that row. Here, I simply separate the fields with tab characters with the help of the Perl join function on the dereferenced array @$row and print the row with a newline.

The last DBI call is to disconnect from the database. This is actually an important call to make; depending on your implementation and how you're running your program, it is sometimes possible to open a number of connections and eventually tax the MySQL DBMS to the point where it has to refuse any more connect requests . Especially if you have an active database with regular queries coming in, you want to disconnect as soon as possible from each connect .

The program you've just seen is the kind of sample program you should run when you first try to install and use the DBI module. If it works, you're in business. If not, you have to closely examine the error messages you get to identify where the problem is. One thing that can help is to add an additional argument to the connect call that asks for more error reporting, like so:

 my $homologs = DBI->connect(         "dbi:mysql:$database:$server", $user, $passwd, {RaiseError=>1} );

This terminates the program with extra error messages if the connect call fails; this is usually where things go wrong when you first try to use this software. (See the documentation for other such options to connect .) Remember that you need a username and password for MySQL itself, and that these aren't related in any way to the username and password on your computer outside of MySQL. You also need to have the database defined (e.g., the SQL statement create database life creates a database called life ), and you have to have your MySQL permissions set properly to allow you to do the things you want to do, such as create or modify databases, or see other databases on the system. If there's a problem, ask your database administrator, consult your MySQL documentation, or visit the MySQL web site.

6.7.3.2 homologs.load

This next program does a little more than the previous; it reads in the tab-delimited file homologs.tabs , extracts the data, and uses it to load the tables in a MySQL database. Read it over: most of it will be familiar from previous programs in this chapter.

 #!/usr/bin/perl use strict; use warnings; # Make connection with MySQL database use DBI; my $database = 'homologs'; my $server   = 'localhost'; my $user     = 'tisdall'; my $passwd   = 'NOTmyPASSWORD'; my $homologs = DBI->connect("dbi:mysql:$database:$server", $user, $passwd); my $sqlinit  = $homologs->prepare("show tables"); $sqlinit->execute(  ); while (my $row = $sqlinit->fetchrow_arrayref) {         print join("\t", @$row), "\n"; } my $flag = 0; my $table; my @tables; my $sql; while(<>) {     # skip blank lines     if(/^\s*$/) {         next;     # begin new table     }elsif(/^TABLE\t(\w+)/) {         $flag = 1;         $table = ;         push(@tables, $table);         # Delete all rows in database table         my $droprows = $homologs->prepare("delete from $table");         $droprows->execute(  );     # get fieldnames, prepare SQL statement     } elsif($flag =  = 1) {         $flag = 2;         my @fieldnames = split;         my $query = "insert into  $table ("                      . join(",", @fieldnames)                      . ") values ("                      . "?, " x (@fieldnames-1)                      . "?)";         $sql = $homologs->prepare($query);     # get row, execute SQL statement     } elsif($flag =  = 2) {         my @fields = split;         $sql->execute( @fields);     } } # Check if tables were updated foreach my $table (@tables) {         my $query = "select * from $table";         my $sql = $homologs->prepare($query);         $sql->execute(  );         while (my $row = $sql->fetchrow_arrayref) {             print join("\t", @$row), "\n";         } } # Break connection with MySQL database $homologs->disconnect; exit;

This program is called by giving it the name of the tab-delimited file on the command line (the same file used previously with the homologs.getdata program):

 % perl homologs.load homologs.tabs

This is the output of the program:

 GENES ORGANISM VARIANTS Table: ORGANISM 1        human 2        worm 3        mouse Table: GENES 118        aging        1984-07-13 9223       wrinkle     1987-08-15 273        hairy        1990-09-30 Table: VARIANTS 1        1        118 2        2        118 3        1        9223 4        3        9223 5        3        273

This homologs.load program is very much like the previous homologs.getdata program. However, instead of building an output array @tables and printing the text to the screen, homologs.load puts the data into the MySQL database using SQL statements. When the database is loaded, it retrieves the data from the tables and prints it to the screen.

Notice that each time homologs.load finds a new table, it first empties all rows of that table in the database and then proceeds to read in the lines of data and insert new rows into the table.

Notice also that when the program reads the line of the input file that names the fields (when $flag equals 1), it prepares the SQL statement, saving the object in the $sql variable. Then, when the actual lines of data are read (when $flag equals 2), the values of the fields are passed to the execute command, which sends the SQL command by the $sql object to the database system.

6.7.3.3 An SQL query

The SQL statement that is the argument to the prepare method is built up from information that the program knows at that point. Here it is again:

 my $query = "insert into  $table ("              . join(",", @fieldnames)              . ") values ("              . "?, " x (@fieldnames-1)              . "?)";

This is a bit hard to read at first sight. However, it is typical of what happens when you use one language (Perl) to make a statement in another language (SQL). So I'll explain this one carefully as a good example of the breed.

The SQL query is formed by five strings that are joined by the dot (.) string operator ”recall that " r " . " DNA " has the value " rDNA ". Here are the five strings being joined:

 "insert into  $table (" join(",", @fieldnames) ") values (" "?, " x (@fieldnames-1) "?)"

The question marks in the SQL statement are bind variables that are passed the values from the @fields array when the execute statement is called with:

 $sql->execute( @fields );

If the query is called with:

 $table = EXONS

and:

 @fieldnames = (Exon, Position)

the resulting SQL statement is:

 "insert into  EXONS (Exon,Position) values (?, ?)"

How does this statement get constructed ? Let's look at it in detail:

The first string just interpolates the table name EXONS into the string.
The second string joins the field names with commas.
The third string appears as is.
The fourth string uses the Perl x string operator to make a new string that has a certain number of copies of the original string " ? , ". The desired number of copies is specified on the right side of the x operator. The string itself is on the left side of the x operator, " ? , ".

@fieldnames in a scalar context returns the number of elements of the @fieldnames array; I need one less " ? , " than that plus an additional question mark without a comma (because no commas are allowed after the last item in the list in SQL). So, @fieldnames-1 is the desired number of copies of the string " ? , ".
The fifth string " ?) " is just the last question mark and the closing parenthesis.

The final result is:

 insert into EXONS (Exon,Position) values (?, ?)

As the Perl program reads in the data rows from the input file (when $flag equals 2), the values are placed in the array @fields and then passed as variables (to take the place of the question marks in the SQL statement that's been prepared) to the execute method, which sends the SQL statement to the database system. These question marks are the bind variables; here, I need one for each field, because I'll pass in the field values when execute is eventually called on this statement.

To check what actually happened to the database after the reading in and processing of the file is complete, the program sends an SQL query for each table to see what's in it. This is done with the SQL select command ”a general-purpose command to get information out of a database that has a great many options.

Here, I'm asking to see all fields (*) from the database table, and because no restrictions are added, SQL shows us all the fields of all the rows. As before, the result of this SQL query is read using the fetchrow_arrayref DBI method in a while loop, and each resulting row is printed with tab-separated fields.

This last program homologs.load is a typical DBI program, interacting with the world through Perl (e.g., reading in files and displaying the results to the user) and also interacting with the database through the Perl DBI module and SQL statements.