Designing a Database | Perl Database Programming

Designing your database tables is partly an art: There are always trade-offs among simplicity, keeping data consistent, and performance. Here are some goals to aim for and guidelines to help you.

Normalization

The steps we go through in this section are called, in database jargon, normalization. Data is in First Normal Form (1NF) when you:

Eliminate repeating groups (the different phone numbers)
Create separate tables for different data: people in a people table, books in a books table, CDs in a CD table
Choose a primary key for each table

The Second Normal Form (2NF) is when you:

Move data that is repeated in several rows to a new table (the address table, in this case)
Add the primary key of the new table to the original table, if it's not already there

Data is Third Normal Form (3NF) when you move independent data on the primary key to separate tables: the city and state tables.

Of course, you have to repeat the whole process for each of the new tables created. At the end of all this, your data should allow no duplication. You will be able to make any query (and get the same answer) that you can on the original unnormalized table.

Eliminate repeating data

Figure 1-7 is a first draft of a table with names, addresses, and phone numbers. This table is simple, but it has a problem. If you ever want to add a new type of number (a mobile phone, a pager, a FAX and so on), the structure will change. The different phone numbers are a 'repeating group' that you need to get rid of somehow.

click to expand
Figure 1-7: A simple table

Figure 1-8 shows another version of the table. Now there is one row for each number, and we can add extra types of phone numbers easily. We've added a phone_type column so we can tell which number is which.

click to expand
Figure 1-8: Table from Figure 1-7 with repeating groups removed

Make each row uniquely identifiable

Most DBMS require that each table have a primary key: one or more columns that are unique-and not NULL-for each row. Uniqueness is usually enforced by creating an index on these columns.

Use primary keys to join related tables by putting the primary key of one table into a related table, where it is called a foreign key. The primary key is the name column in Figure 1-7 and the name and phone_type columns in Figure 1-8.

Some columns are more useful than others as primary keys. Names work only in very small tables: In one office, six people of twelve might be named Chris. Social Security Numbers are often used as primary keys: They should be unique, but apart from the privacy and security issues, you may not know the value; some people (children, foreign visitors, and so on) don't have them.

If a table doesn't have an obvious primary key, a sequential number provided by the DBMS or some other unique number such as a UUID (Universal Unique ID) is the solution.

Eliminate redundant data

The new structure solves one problem: we can add as many numbers as we want. But in many ways it's worse than we started with. Because there is no number for Brigitte, one row has a NULL value in a primary key column. There is also a lot of duplicated data. The duplication causes several problems:

Adding new rows requires reentering existing information.
If someone changes his or her name or address, we must update multiple rows.
If we delete all the phone numbers for someone, we might also lose his or her address unless we're careful.
Where data is duplicated, some copies of it will end up wrong (because of miskeying, or because a value has changed but only some of the copies have been updated).

The next step is to put data that's in multiple rows into a separate table. In Figure 1-9, the data is separated into separate tables based on addresses and the phone numbers.

click to expand
Figure 1-9: Repeated data placed in a separate table

Now we can have any number of phone numbers for each person-including none! We can get the original data back by joining the address and phone tables on the name column. If we want to include people without phone numbers, we must perform an OUTER JOIN.

This is probably as far as we should go with this application, but there are additional useful steps in larger systems.

Eliminate data not dependent on the key

The next step is to move to another table any data that is not directly related to the key of the row. In Figure 1-10, the state has nothing to do with the person's name, but the state is related to the city, so we can create a table with city and state. If all of these rules are applied to all tables recursively, we can end up with a lot of small tables.

click to expand
Figure 1-10: Separate tables for address, city, and state

Because the state name is in multiple rows, it can be moved into its own table. We end up with our single name, address, and phone numbers table split into four. All the data is available; none is repeated unnecessarily, and we can recreate the original single table with joins.

De-Normalization

A problem with normalization, if you follow all the rules, is that you can end up with many small tables. This leads to complex and possibly inefficient queries with many joins. In this simple address and phone example, separate tables for city and state are overkill. Every query for an address needs to join the address, city, and state tables.

If you find yourself in this position, the answer may be de-normalization: adding some redundancy to simplify your application. In this case, putting the state column back into the address table makes the structure much simpler at the cost of typing a state name for each address.

Add indexes for performance

No one likes using a slow computer, and no one wants to wait very long for a database to find the data he or she has asked for. Apart from the DBMS and the computer it runs on, the things that affect performance are good data design and indexes.

For small tables (up to, say, a few hundred rows), indexes may not be necessary: There is an overhead in maintaining and using an index. With larger tables, indexes enable the DBMS to be much more efficient, especially when joining tables.

Add an index for each column, or each group of columns, used to join tables. Specify a unique index when you want to prevent duplicates in a column. Add an index to a column that is searched frequently, so that the DBMS does not have to search sequentially.