Using ALTER TABLE to Normalize a Table | Modifying Tables with ALTER TABLE

8.10.1 Problem

You have a table that's not in normal form.

8.10.2 Solution

ALTER TABLE can help you normalize it.

8.10.3 Discussion

The preceding sections describe how to use ALTER TABLE in fairly general terms. This section shows a more concrete application of the statement by demonstrating how ALTER TABLE can help you redesign tables that you discover to contain unnecessarily redundant data and therefore to be in non-normal form.

Suppose you have a client_billing table for recording billable items that is defined as follows:

CREATE TABLE client_billing
(
 id INT UNSIGNED NOT NULL, # client ID number
 name CHAR(20) NOT NULL, # client name
 address CHAR(20) NOT NULL, # client adddress
 date DATE NOT NULL, # billable item date
 minutes INT NOT NULL, # number of billable minutes
 description CHAR(60) NOT NULL # what was done
);

When you have just one row of information per client, the table looks perfectly fine:

+----+------+---------------+------------+---------+------------------+
| id | name | address | date | minutes | description |
+----+------+---------------+------------+---------+------------------+
| 21 | John | 46 North Ave. | 2001-07-15 | 48 | consult by phone |
| 43 | Toby | 123 Elm St. | 2001-07-13 | 12 | office visit |
+----+------+---------------+------------+---------+------------------+

But when you add more records and have multiple rows per client, it becomes apparent that some of the information is redundant. In particular, names and addresses are stored in every record, even though you really need that information only once for any given client:

+----+------+---------------+------------+---------+------------------+
| id | name | address | date | minutes | description |
+----+------+---------------+------------+---------+------------------+
| 21 | John | 46 North Ave. | 2001-07-15 | 48 | consult by phone |
| 21 | John | 46 North Ave. | 2001-07-19 | 120 | court appearance |
| 43 | Toby | 123 Elm St. | 2001-07-13 | 12 | office visit |
| 43 | Toby | 123 Elm St. | 2001-07-14 | 60 | draft proposal |
| 43 | Toby | 123 Elm St. | 2001-07-16 | 180 | present proposal |
+----+------+---------------+------------+---------+------------------+

The way to fix the problem is to split the information into two tables and associate the records in them using the id values:

One table (client_info) holds the information that is unique to each client, one row per client: the ID, name, and address.
The other table (bill_item) holds information about each billable item: date, number of minutes, and the description of what was done. Each row also should include the client ID number so that the item can be associated with the proper client_info table record.

In other words, the client information from the client_billing table can be separated as follows into the client_info and bill_item tables:

client_info table:

+----+------+---------------+
| id | name | address |
+----+------+---------------+
| 21 | John | 46 North Ave. |
| 43 | Toby | 123 Elm St. |
+----+------+---------------+

bill_item table:

+----+------------+---------+------------------+
| id | date | minutes | description |
+----+------------+---------+------------------+
| 21 | 2001-07-15 | 48 | consult by phone |
| 21 | 2001-07-19 | 120 | court appearance |
| 43 | 2001-07-13 | 12 | office visit |
| 43 | 2001-07-14 | 60 | draft proposal |
| 43 | 2001-07-16 | 180 | present proposal |
+----+------------+---------+------------------+

To accomplish this redesign, first create the client_info and bill_item tables using the following statements, where each table's columns are defined the same way as the corresponding columns from the original client_billing table:

CREATE TABLE client_info
(
 id INT UNSIGNED NOT NULL, # client ID number
 name CHAR(20) NOT NULL, # client name
 address CHAR(20) NOT NULL # client adddress
);
CREATE TABLE bill_item
(
 id INT UNSIGNED NOT NULL, # client ID number
 date DATE NOT NULL, # billable item date
 minutes INT NOT NULL, # number of billable minutes
 description CHAR(60) NOT NULL # what was done
);

Next, use INSERT INTO ... SELECT to copy the appropriate columns from the client_billing table into the two new tables. For the client_info table, the client information can be copied like this:

INSERT INTO client_info (id,name,address)
SELECT id,name,address FROM client_billing;

Copying information to the bill_item table is similar:

INSERT INTO bill_item (id,date,minutes,description)
SELECT id,date,minutes,description FROM client_billing;

Records in the two new tables are associated based on id values, so it's a good idea to index that column in each table to allow them to be related efficiently. However, that's not simply a matter of issuing an ALTER TABLE tbl_name ADD INDEX (id) statement for each table. For one thing, the client_info table at this point has multiple records per client that should be collapsed down to a single record for each client. That means we should create a PRIMARY KEY or UNIQUE index on the id column, using the IGNORE keyword to tell MySQL to discard duplicate records. In addition, queries on the bill_item table are likely often to be date-based, so we can include the date column in the index. ALTER TABLE statements to create these indexes look like this:

ALTER IGNORE TABLE client_info ADD PRIMARY KEY (id);
ALTER TABLE bill_item ADD INDEX (id, date);

After performing the procedure just described, the client_billing table no longer is needed and can be removed:

DROP TABLE client_billing;

With billing records stored in multiple tables, queries to retrieve information become somewhat more complex, but relating tables is what relational databases are good at, after all. For example, to show the client name and address from the client_info table along with a sum of the billable minutes listed per client in the bill_item table, the query looks like this:

mysql> SELECT client_info.id, client_info.name, client_info.address,
 -> SUM(bill_item.minutes) AS 'total minutes'
 -> FROM client_info, bill_item
 -> WHERE client_info.id = bill_item.id
 -> GROUP BY client_info.id;
+----+------+---------------+---------------+
| id | name | address | total minutes |
+----+------+---------------+---------------+
| 21 | John | 46 North Ave. | 168 |
| 43 | Toby | 123 Elm St. | 252 |
+----+------+---------------+---------------+

Multiple-table queries are discussed further in Chapter 12.

The preceding example illustrates how to normalize a table for which data values are repeated across rows. Another kind of non-normal form occurs when you have multiple columns within rows that record the same kind of information. For example, if you're running a study that involves administering two tests to subjects and recording the date and score for each test, you might use a table that has the following structure:

CREATE TABLE test_subject
(
 id INT UNSIGNED NOT NULL AUTO_INCREMENT,
 name CHAR(20) NOT NULL, # subject name
 date1 DATE, # date and score of first test
 result1 INT,
 date2 DATE, # score of second test
 result2 INT,
 PRIMARY KEY (id)
);

Information in the table looks like this:

+----+--------+------------+---------+------------+---------+
| id | name | date1 | result1 | date2 | result2 |
+----+--------+------------+---------+------------+---------+
| 1 | Fred | 2001-07-13 | 78 | 2001-07-14 | 85 |
| 2 | Barry | 2001-07-12 | 79 | 2001-07-14 | 82 |
| 3 | Portia | 2001-07-16 | 82 | 2001-07-18 | 95 |
+----+--------+------------+---------+------------+---------+

When the information is contained in a table of this form, if you want to do something such as computing each subject's average score, you can do so with the following query:

SELECT id, name, (result1 + result2) / 2 FROM test_subject;

If you decide you need to add a third test, and a fourth, you can add columns to the table easily:

ALTER TABLE test_subject
ADD date3 DATE, ADD result3 INT,
ADD date4 DATE, ADD result4 INT;

But the query to compute average scores becomes more complex:

SELECT id, name, (result1 + result2 + result3 + result4) / 4
FROM test_subject;

In fact, if scores can be NULL to signify that a subject hasn't taken a test yet, the query is much messier than that, because you must add only the non-NULL values to avoid getting a NULL sum. This ballooning complexity is a clue that the table design could be improved. The problem is that the structure of the table is dependent on the number of tests you plan to give. If you change that number, you must change the table. A way around this problem is to use two tables, structured so that they need not be changed if you decide to administer additional tests:

The first table records information unique to each subject.
The second table records test results, one row per test. Each row can be associated with the proper subject using the subject ID number.

This is very similar to the approach used in the preceding example, where we used two tables to hold client billing information. However, instead of creating two new tables and then removing the original table, we'll use a different procedure that preserves the original table, albeit in modified form:

Create a table test_result to hold test dates and scores, as well as the subject ID number. To allow scores to be ordered, we can also add a test number column. (The date might serve for ordering, unless it's possible for subjects to take multiple tests on the same day. An explicit test number does not have that problem.)
Copy the subject ID, date, and score information from test_subject to test_result.
Remove the date and score columns from test_subject, leaving only subject IDs and names.

Begin by creating a table to hold test results:

CREATE TABLE test_result
(
 id INT UNSIGNED NOT NULL,
 test_num INT NOT NULL,
 date DATE,
 result INT
);

Copy the test information from test_subject to test_result. It's necessary to do this separately for each test, because each result is copied from a different set of columns:

INSERT INTO test_result (id,test_num,date,result)
SELECT id,1,date1,result1 FROM test_subject WHERE result1 IS NOT NULL;

INSERT INTO test_result (id,test_num,date,result)
SELECT id,2,date2,result2 FROM test_subject WHERE result2 IS NOT NULL;

Each INSERT INTO ... SELECT statement specifies the test_num value "manually," because the appropriate value is not present in the test_subject table and cannot be derived directly from its contents. The WHERE clause serves to copy only rows with non-NULL test results. This handles the possibility that your test_subject records might be incomplete, such that a NULL score indicates a subject hasn't taken a given test. (If the test_subject table has no such missing test results, the WHERE clauses are unnecessary and can be omitted.)

The test_result table is populated now and can be indexed. Note that although the id column is a unique-valued column in the test_subject table, it cannot be so in test_result because there will be multiple rows per subject. However, it's still possible to create a unique index by using both id and test_num, assuming each test is given to each subject only once:

ALTER TABLE test_result ADD PRIMARY KEY (id, test_num);

The test result columns in test_subject are now unneeded and can be removed:

ALTER TABLE test_subject DROP date1, DROP result1, DROP date2, DROP result2;

The advantage of using a two-table layout like this is that queries to perform operations such as score-averaging become independent of the number of tests administered:

SELECT id, AVG(result) FROM test_result GROUP BY id;

To show subject names, too, join the test_result table to the test_subject table:

SELECT test_result.id, test_subject.name, AVG(test_result.result)
FROM test_subject, test_result
WHERE test_subject.id = test_result.id
GROUP BY test_result.id;

You also can easily identify subjects who have not completed all the tests. For example, if you've presented a total of four tests, you can find subjects who don't have four scores in the test_result table like this:

SELECT test_subject.id, test_subject.name, COUNT(test_result.result) AS count
FROM test_subject LEFT JOIN test_result ON test_subject.id = test_result.id
GROUP BY test_subject.id
HAVING count < 4;

This query uses LEFT JOIN to make sure that a count for every subject is generated, even those to whom no tests have yet been administered. (A regular join will fail to identify a subject who has a record in the test_subject table but no records yet in the test_result table, because it will find no match between the tables.) LEFT JOIN is discussed in Chapter 12.

Using the mysql Client Program

Writing MySQL-Based Programs

Record Selection Techniques

Working with Strings

Working with Dates and Times

Sorting Query Results

Generating Summaries

Modifying Tables with ALTER TABLE

Obtaining and Using Metadata

Importing and Exporting Data

Generating and Using Sequences

Using Multiple Tables

Statistical Techniques

Handling Duplicates

Performing Transactions

Introduction to MySQL on the Web

Incorporating Query Resultsinto Web Pages

Processing Web Input with MySQL

Using MySQL-Based Web Session Management

Appendix A. Obtaining MySQL Software

Appendix B. JSP and Tomcat Primer

Appendix C. References