6.4 Structured Query Language | Mastering Perl for Bioinformatics

The Structured Query Language (SQL, pronounced "s q l" or "see quel") can be thought of as the working definition of a relational database. It provides the bioinformatics programmer with the wherewithal to create, populate, interrelate, query, and update a relational database on a computer system. Your DBMS comes with its own implementation of SQL, which will have all the basic commands plus some, or maybe all, the less-used commands and features in the standard definition of the language, perhaps even some special extensions to the standard.

SQL dates back to the 1970s when it was developed at IBM. The most widely used versions of SQL are based on the standard published in 1992 and commonly called SQL2. A newer standard called SQL3 is available and supports emerging database functionality such as object-oriented and object-relational data models. MySQL is based on a subset of the most commonly used parts of SQL2, with the goal of providing a very fast implementation of the key components of SQL. Some features of SQL3 are also being added.

SQL is actually a fairly simple language to learn. Most people find that getting an account established on their computer, reading through a quick tutorial, and then having example code to copy and modify with the SQL documentation close at hand, is enough to get started writing useful SQL code.

I'm not going to present an extensive SQL tutorial here, for three reasons. First, such tutorials are easily and widely available. Second, each DBMS has its own version of SQL, so the DBMS documentation (such as that which comes with MySQL, for example) is necessary and available to you anyway. Third, SQL is such a basically simple language that it's quite useful to learn the basics of it by simply seeing a few examples. That's the approach I'll take.

If you are new to SQL, the best way to get familiar with it is by using the interactive command-line interface to try out different commands. The following section demonstrates my Linux system running MySQL.

6.4.1 SQL Commands

First, I enter the interactive mysql program, providing my MySQL username ("tisdall") and interactively entering my MySQL account password:

 [tisdall@coltrane tisdall]$ mysql -u tisdall -p Enter password:  Welcome to the MySQL monitor.  Commands end with ; or \g. Your MySQL connection id is 2 to server version: 3.23.41 Type 'help;' or '\h' for help. Type '\c' to clear the buffer.

Next , I ask for a list of all the databases that are defined in my MySQL DBMS:

 mysql> show databases; +----------+  Database  +----------+  caudyfly   dicty      gadfly     master     mysql      poetry     yeast     +----------+ 7 rows in set (0.15 sec)

6.4.1.1 Creating a database

I want to create a database called "homologs". First, I create it, then I check that it's there, and finally I make it the active database with use homologs; :

 mysql> create database homologs;   Query OK, 1 row affected (0.00 sec) mysql> show databases; +----------+  Database  +----------+  caudyfly   dicty      gadfly     homologs   master     mysql      poetry     yeast     +----------+ 8 rows in set (0.01 sec) mysql> use homologs; Database changed

6.4.1.2 Creating tables

The next commands create the two tables for the homologs database. Initially they are empty. I ask to see the fields that have been defined with show fields ( show full columns also works):

 mysql> create table genename ( name char(20), id int, date date ); Query OK, 0 rows affected (0.00 sec) mysql> create table organism ( organism char(20), gene char(20) ); Query OK, 0 rows affected (0.00 sec) mysql> show tables; +--------------------+  Tables_in_homologs  +--------------------+  genename             organism            +--------------------+ 2 rows in set (0.00 sec) mysql> show fields from genename; +-------+----------+------+-----+---------+-------+  Field  Type      Null  Key  Default  Extra  +-------+----------+------+-----+---------+-------+  name   char(20)  YES        NULL             id     int(11)   YES        NULL             date   date      YES        NULL            +-------+----------+------+-----+---------+-------+ 3 rows in set (0.00 sec) mysql> show fields from organism; +----------+----------+------+-----+---------+-------+  Field     Type      Null  Key  Default  Extra  +----------+----------+------+-----+---------+-------+  organism  char(20)  YES        NULL             gene      char(20)  YES        NULL            +----------+----------+------+-----+---------+-------+ 2 rows in set (0.00 sec) mysql>

6.4.1.3 Populating the tables

Now, I've got a new database with two tables defined, and I'm ready to populate the tables. First, I verify that the genename table is empty by making a select command; then I issue three insert commands, one for each row that I want to insert in the table. After inserting the rows, I verify that the genename table now has the desired three rows by means of a select command:

 mysql> select * from genename; Empty set (0.00 sec) mysql> insert into genename (name,id,date) values ('aging',118,'1984-07-13'); Query OK, 1 row affected (0.00 sec) mysql> insert into genename (name,id,date) values ('wrinkle',9223,'1987-08-15'); Query OK, 1 row affected (0.00 sec) mysql> insert into genename (name,id,date) values ('hairy',273,'1990-09-30'); Query OK, 1 row affected (0.01 sec) mysql> select * from genename; +---------+------+------------+  name     id    date        +---------+------+------------+  aging     118  1984-07-13   wrinkle  9223  1987-08-15   hairy     273  1990-09-30  +---------+------+------------+ 3 rows in set (0.00 sec)

Now, I repeat the same process to populate the other organism table:

 mysql> show fields from organism; +----------+----------+------+-----+---------+-------+  Field     Type      Null  Key  Default  Extra  +----------+----------+------+-----+---------+-------+  organism  char(20)  YES        NULL             gene      char(20)  YES        NULL            +----------+----------+------+-----+---------+-------+ 2 rows in set (0.00 sec) mysql> insert into organism ( organism, gene ) values ( 'human', 118 ); Query OK, 1 row affected (0.00 sec) mysql> insert into organism ( organism, gene ) values ( 'human', 9223 ); Query OK, 1 row affected (0.00 sec) mysql> insert into organism ( organism, gene ) values ( 'mouse', 9223 ); Query OK, 1 row affected (0.01 sec) mysql> insert into organism ( organism, gene ) values ( 'mouse', 273 ); Query OK, 1 row affected (0.00 sec) mysql> insert into organism ( organism, gene ) values ( 'worm', 118 ); Query OK, 1 row affected (0.00 sec) mysql> select * from organism; +----------+------+  organism  gene  +----------+------+  human     118    human     9223   mouse     9223   mouse     273    worm      118   +----------+------+ 5 rows in set (0.00 sec)

Let's find out which organisms have a homolog of the wrinkle gene. My query has two stages. First, I get the ID of the gene and search for it in the ORGANISM table. Then I write it as a single SQL statement:

 mysql> select id from genename where name = 'wrinkle'; +------+  id    +------+  9223  +------+ 1 row in set (0.00 sec) mysql> select organism from organism where gene = 9223; +----------+  organism  +----------+  human      mouse     +----------+ 2 rows in set (0.00 sec) mysql> select organism from organism, genename     -> where genename.name = 'wrinkle' and genename.id = organism.gene; +----------+  organism  +----------+  human      mouse     +----------+ 2 rows in set (0.00 sec) mysql>  mysql>

Notice how the last statement asks the same question as the two preceding statements combined.