6.3 Relational Database Definitions | Mastering Perl for Bioinformatics

A relational database is essentially a set of tables ( relations ) that are interrelated in certain ways. A table is a two-dimensional matrix with a name . Each table is composed of rows ( tuples ), and no two rows can have exactly the same values. Each row is composed of named fields ( attributes ); each field has a certain data type and a name, and a field can only contain one value. ^[3]

^[3] As the name "relation" suggests to the mathematically inclined, each table represents a subset of the Cartesian product of the domains of the fields, in which each row is an element of the relation. However, the order of the fields in a table is not significant (because each field has a unique name), and that's an important departure from the standard set- theoretic definition of a relation.

For instance, a table may be defined with the fields Name as a character string of at most 50 characters , an ID as an integer, and a Date as a special date datatype. Each row then has a Name, ID, and Date value. Table 6-1 and Table 6-2 show a table called genename with three fields and three rows, and one called organism with two fields and five rows.

Table 6-1. genename

Name	ID	Date
aging	118	1984-07-13
wrinkle	9223	1987-08-15
hairy	273	1990-09-30

Table 6-2. organism

Organism	Gene
human	118
human	9223
mouse	9223
mouse	273
worm	118

You'll notice that the tables each have a field with the same set of values as the other; ID in Table 6-1 and Gene in Table 6-2 use values from the same domain. (A typical domain is the set of positive integers, for example.) Such shared fields can be used to join information from two or more tables. For instance, given a gene Name from the Table 6-1, I can find its associated ID, and look for that value in the Gene field of the Table 6-2 to find what organism or organisms have a version of that gene.

In Table 6-1, the wrinkle gene has ID value 9223, and in Table 6-2, there are two entries in field Gene with value 9223, namely human and mouse.

Each row in a relation is unique and can therefore serve as its own unique identifier for the row. There may also be some smaller group of fields that, together, are unique for each row and for which removing any field destroys the uniqueness; such a group is a candidate key . Very often tables are defined so that each row has its own unique ID field (it may be called something besides ID) that alone may serve as a key; this usually is recommended.

In any event, some candidate key is designated as the primary key for the table.

If in another table there is a field with the same domain as the primary key, it can relate the information in the two tables. That field, in the other table, is called a foreign key , and the primary key's table is called the foreign key's home relation . In MySQL, foreign keys aren't specified as such; they are used only in joins, as you'll see.

In defining fields, you may also specify that no row has a null (undefined) entry in a field; this is required for primary key fields. Similarly, a foreign key is required to refer to an actually existing value in some field in its home relation. These constraints on the data are known as entity integrity and referential integrity .

Using some SQL statements, you can write a (very short) program that, given a gene name, returns the list of organisms in which a version of that gene is found (in this database). In fact, I'll do just that in the next section.

That's simple enough. However, there is a bit more to learn about designing tables that work well, implementing the tables in SQL, and writing the (Perl) programs that send SQL statements to the DBMS and that compute and display the results. These areas of database design, implementation, and application development, are often staffed by specialists in large projects. Each specialization has its own techniques and lore; each can be a full-time job, and there are even more areas of specialization than these. Most bioinformatics programmers know enough about each area to design and implement a web-based, database-driven interface to their laboratory, which is the basic skill set that I'm attempting to impart. Assuming you are a beginner interested in learning the ropes , the following sections and chapters will hopefully give you enough of a jump start to be able to tackle all these tasks for a small project and prepare you to attempt bigger jobs.