6.6 Relational Database Design


Database design is the process of effectively organizing data into tables in a relational database. When thinking about how to organize a database you need to ask, "What fields should I put together into tables, and how should I interrelate the tables?" In this section I will show you a short example that demonstrates some common and useful techniques for doing just that.

Relational database software projects are best broken down into separate stages. This list from Database Systems: a Practical Approach to Design Implementation, and Management (see Section 6.10), shows the typical stages of database design and construction:

Database planning
System definition
Requirements collection and analysis
Database design
DBMS selection
Application design
Prototyping
Implementation
Data conversion and loading
Testing
Operational maintenance

For the small biology lab in which database programming may be a one-person project, some of these stages may be brief and informal, but they still apply.

How should the tables be defined for a new database? The answer depends on the problems to be answered by the data, but it's also largely a matter of common sense and a feel for the data. The database beginner typically looks at the data, tries her hand at a few designs, and begins to get a sense of how tables can be used for a specific problem. Let yourself experiment and try a few alternatives, and you'll soon get the hang of it.

Tables are commonly interrelated by indexing and by joining fields from different tables. The SQL language implemented with your DBMS provides these abilities . Also, a group of techniques called normalization can help you produce a good design and avoid some problems. Simply putting data into tables doesn't guarantee a good design.

A set of rules called normal forms helps you arrange the data into tables in a way that avoids certain problems. One such problem is data redundancy , which is unnecessary duplication of data in different tables. A related problem is update anomalies caused by having the same data in more than one location. When such data is updated, copies may not be updated properly, and the database can become inaccurate.

Here are some simple rules to follow when designing your database:

  • Each entry of each table has a single value. This is first normal form .

  • Each table has a a unique identifier (called a primary key) for each row. This is second normal form .

  • Names aren't used as identifiers because they can lead to data redundancy.

Third and other normal forms as well as other design considerations aren't covered in this book due to space limitations. See Section 6.10 at the end of the book for more information about relational database design. Consider Table 6-3, which shows this alternate, unnormalized version of my homologs database.

Table 6-3. Unnormalized homolog data

Name

Date

Organisms

aging

1984-07-13

human, worm

wrinkle

1987-08-15

human, mouse

hairy

1990-09-30

mouse

This table has multiple values in some of the locations in the table. To put the table into first normal form, I make new rows for each multiple entry (see Table 6-4).

Table 6-4. First normal form homolog data

Name

Date

Organisms

aging

1984-07-13

human

aging

1984-07-13

worm

wrinkle

1987-08-15

human

wrinkle

1987-08-15

mouse

hairy

1990-09-30

mouse

In relational databases, a functional dependency exists between fields for which the value of one field is always associated with no more than one value in a second field. In Table 6-4, the value "wrinkle" in the Name field is associated with Organism "mouse" in one row, and with Organism "human" in another row. This isn't a functional dependency. On the other hand, the value "wrinkle" in the Name field is associated only with the value "1987-08-15" in the Date field. This is a functional dependency. In a real database, you need to ascertain that the test for a functional dependency will hold even as the database is updated; it requires knowledge of the use of the database and the possible range of values of the fields. In Table 6-4, the functional dependencies are:

 Name        ->        Date Date        ->        Name 

The primary key is a minimal group of fields that uniquely identifies each row. In Table 6-4, I can choose the pair of fields Organism/Name or Organism/Date as a primary key. Remember, a field is fully functionally dependent on a primary key if removing any field from the primary key destroys the functional dependency.

Practically speaking, you can ensure second normal form by making sure each table has a field that is a unique identifier: usually this field has integer values such as 1, 2, 3, etc; no row is missing an identifier integer, and no two rows have the same integer. More formally , a relation in first normal form for which every non-primary-key field is fully functionally dependent on the primary key, is in second normal form. The rule-of-thumb solution is to divvy up the fields into new tables, and possibly add some new unique identifier keys. But how exactly do you do that?

Looking over the data, I notice that Gene and Date are always associated with the same values (say they represent a gene name and the date it was first reported ). I can make a new table Genes (Table 6-5) out of them, with Gene as the primary key. I can then make a second table Organism (Table 6-6) from the Organism field, adding an ID field OrgID. Finally, I can make a third table Variants (Table 6-7) with its own field VarID unique for each row, an OrgID field, and a GeneID field. Table 6-7 contains a row for each gene in each organism in the database. Table 6-5 through Table 6-7 show my new design for the homologs database (in second normal form).

Table 6-5. homologs database design in second normal form: Genes

Gene

Date

aging

1984-07-13

wrinkle

1987-08-15

hairy

1990-09-30

Table 6-6. homologs database design in second normal form: Organism

OrgID

Organism

1

human

2

worm

3

mouse

Table 6-7. homologs database design in second normal form: Variants

VarID

OrgID

Gene

1

1

aging

2

2

aging

3

1

wrinkle

4

3

wrinkle

5

3

hairy

Each table contains a field that isn't null, that is unique for each row, and that I can use as a primary key. Such keys are called unique identifiers . These are Gene, OrgID, and VarID, in Table 6-5, Table 6-6, and Table 6-7, respectively. Notice that in each table the other fields are functionally dependent on the unique identifier row. And, finally, it's clear that the definition of "fully functionally dependent," which involves removing fields from a primary key, is satisfied when the primary key has only one field.

My tables are now in second normal form and are much improved. You'll notice, however, that there is still a problem of data redundancy, which can lead to update anomalies. If, for instance, the name of the "aging" gene was changed to "fountain of youth," three changes would have to be made to this database to keep all the data in sync. Using names as unique identifiers is a problem. Sometimes it works, but you must consider the operation of your database over time. In this case, genetic nomenclature may (and does) change. If I use a numeric ID as a unique identifier for a gene name, I can handle a gene name change by making a single update in only one table in my database (see Table 6-8, Table 6-9, and Table 6-10).

Table 6-8. homologs database design: Genes

GeneID

Gene

Date

118

aging

1984-07-13

9223

wrinkle

1987-08-15

273

hairy

1990-09-30

Table 6-9. homologs database design: Organism

OrgID

Organism

1

human

2

worm

3

mouse

Table 6-10. homologs database design: Variants

VarID

OrgID

GeneID

1

1

118

2

2

118

3

1

9223

4

3

9223

5

3

273

Here's an example from my Linux system in which I drop the previous definition of the homologs database in my MySQL RDMS and redefine it with the three tables as just shown. I just define the tables here, but I don't populate them, as that will be done in Section 6.7.2:

 [tisdall@coltrane tisdall]$ mysql -u tisdall -p Enter password: Welcome to the MySQL monitor.  Commands end with ; or ,,. Your MySQL connection id is 9 to server version: 3.23.41 Type 'help;' or '\h' for help. Type '\c' to clear the buffer. mysql> drop database homologs; Query OK, 9 rows affected (0.13 sec) mysql> show databases; +----------+  Database  +----------+  caudyfly   dicty      gadfly     jkl        master     mysql      poetry     rebase     test       yeast     +----------+ 10 rows in set (0.12 sec) mysql> create database homologs; Query OK, 1 row affected (0.00 sec) mysql> use homologs; Database changed mysql> CREATE TABLE organism (           orgid int(11) default NULL,           organism char(20) default NULL        ) TYPE=MyISAM; Query OK, 0 rows affected (0.00 sec) mysql> CREATE TABLE genes (           geneid int(11) default NULL,           gene char(20) default NULL,           date date default NULL        ) TYPE=MyISAM; Query OK, 0 rows affected (0.01 sec) mysql> CREATE TABLE variants (           varid int(11) default NULL,           orgid int(11) default NULL,           geneid int(11) default NULL        ) TYPE=MyISAM; Query OK, 0 rows affected (0.00 sec) mysql> show tables; +--------------------+  Tables_in_homologs  +--------------------+  genes                organism             variants            +--------------------+ 3 rows in set (0.00 sec) mysql> 

I have only briefly introduced some of the techniques useful in the design of a small database. There is, of course, training and skill involved in database design, as well as certain standard (and occasionally competing) methodologies; far more than I have the space to introduce here.



Mastering Perl for Bioinformatics
Mastering Perl for Bioinformatics
ISBN: 0596003072
EAN: 2147483647
Year: 2003
Pages: 156

flylib.com © 2008-2017.
If you may any questions please contact us: flylib@qtcs.net