38.2. Normalization | MySQL 5.0 Certification Study Guide

Normalization refers to the process of restructuring tables to eliminate design problems. Normalizing your tables removes redundant data, makes it possible to access data more flexibly, and eliminates the possibility that inappropriate modifications will take place that make the data inconsistent. Normalization of a complex table often amounts to taking it through a process of decomposition into a set of smaller tables. This process removes repeating groups within rows and then removes duplicate data within columns.

Normalization has several levels: First normal form, second normal form, and so forth. Each successive form depends on the preceding form and provides stronger guarantees about the data modification anomalies that are eliminated. This section discusses the first through third normal forms. These are the most common, though higher-level forms are possible.

A table is in first normal form (1NF) if it contains no repeating groups within rows.
A table is in second normal form (2NF) if it is in 1NF and every non-key value is fully dependent on the primary key value. The latter constraint means that a non-key value cannot depend only on some columns of the primary key.
A table is in third normal form (3NF) if it is in 2NF and every non-key value depends directly on the primary key and not on some other non-key value.

The normalization process and the problems that it solves can be illustrated using the parts-and-suppliers scenario of which database designers are so fond. The following example serves as a demonstration that begins with a poorly designed table, and then improves the table design in stages until we reach third normal form. At each stage, the SQL statements are shown that produce the desired table modifications.

Begin with a table that is intended to keep track of a parts inventory. It lists supplier numbers, locations, and ZIP codes. For each supplier, it lists part numbers, names, and quantity-on-hand values: The table definition and some sample data follow:

 mysql> CREATE TABLE Inventory     -> (     ->     sno INT,     ->     sloc CHAR(20),     ->     szip CHAR(20),     ->     pno1 INT,     ->     pname1 CHAR(20),     ->     qty1 INT,     ->     pno2 INT,     ->     pname2 CHAR(20),     ->     qty2 INT     -> ); mysql> INSERT INTO Inventory VALUES     -> (1,'Chicago','60632', 1,'stool',5,2,'lamp',15),     -> (2,'Dallas','75206',1,'stool',25,3,'desk',10),     -> (3,'Chicago','60632',2,'lamp',10,4,'chair',3); mysql> SELECT * FROM Inventory ORDER BY sno; +------+---------+-------+------+--------+------+------+--------+------+ | sno  | sloc    | szip  | pno1 | pname1 | qty1 | pno2 | pname2 | qty2 | +------+---------+-------+------+--------+------+------+--------+------+ |    1 | Chicago | 60632 |    1 | stool  |    5 |    2 | lamp   |   15 | |    2 | Dallas  | 75206 |    1 | stool  |   25 |    3 | desk   |   10 | |    3 | Chicago | 60632 |    2 | lamp   |   10 |    4 | chair  |    3 | +------+---------+-------+------+--------+------+------+--------+------+

This table design has several problems:

The structure uses repeating groups. That is, it has multiple columns for similar types of information (parts for a given supplier). Designing a table this way doesn't allow for more than a fixed number of parts per supplier. The table could be altered to include more part columns, but the number of parts remains fixed and another change would be required should more parts need to be listed for a supplier some day.
The design wastes space. Rows for suppliers with fewer than two parts have empty columns. Also, the columns for the second part cannot be declared NOT NULL because they might need to be set to NULL to indicate that a second part is not listed in a record.
It is difficult to formulate efficient queries. To test a condition on part values, you must write an expression that has two terms (one for each part). Such queries are not easy to maintain: If you add more part columns, the query must be rewritten to add terms to conditional expressions.

To put the information in first normal form, the repeating groups must be eliminated. This can be done by creating a table in which rows associate the supplier information with information for only a single part, as follows:

 mysql> CREATE TABLE Inventory2     -> (     ->     sno INT NOT NULL,     ->     sloc CHAR(20) NOT NULL,     ->     szip CHAR(20) NOT NULL,     ->     pno INT NOT NULL,     ->     pname CHAR(20) NOT NULL,     ->     qty INT,     ->     PRIMARY KEY (sno, pno)     -> ); mysql> INSERT INTO Inventory2 (sno, sloc, szip, pno, pname, qty)     -> SELECT sno, sloc, szip, pno1, pname1, qty1 FROM Inventory; mysql> INSERT INTO Inventory2 (sno, sloc, szip, pno, pname, qty)     -> SELECT sno, sloc, szip, pno2, pname2, qty2 FROM Inventory; mysql> SELECT * FROM Inventory2 ORDER BY sno, pno; +-----+---------+-------+-----+-------+------+ | sno | sloc    | szip  | pno | pname | qty  | +-----+---------+-------+-----+-------+------+ |   1 | Chicago | 60632 |   1 | stool |    5 | |   1 | Chicago | 60632 |   2 | lamp  |   15 | |   2 | Dallas  | 75206 |   1 | stool |   25 | |   2 | Dallas  | 75206 |   3 | desk  |   10 | |   3 | Chicago | 60632 |   2 | lamp  |   10 | |   3 | Chicago | 60632 |   4 | chair |    3 | +-----+---------+-------+-----+-------+------+

The Inventory2 table has no repeating groups and is in 1NF. However, it has a lot of redundancy. Each row has supplier number, location, and ZIP code, when only the number is needed to associate a part with its supplier.

As a result, inconsistencies can easily result from updates. Deleting a row for a given part also deletes supplier information. If the part was the only one for the supplier, there is no longer any information in the table about the existence of that supplier. If you want to change a supplier's location or ZIP code, you must change multiple rows. There are also constraints on how you add data. You cannot insert a supplier without having a part for it first.

To fix these problems, split apart the Inventory2 table into separate Supplier and Part tables, and associate each part only with the supplier number:

 mysql> CREATE TABLE Supplier     -> (     ->     sno INT NOT NULL,     ->     sloc CHAR(20) NOT NULL,     ->     szip CHAR(20) NOT NULL,     ->     PRIMARY KEY (sno)     -> ); mysql> CREATE TABLE Part     -> (     ->     sno INT NOT NULL,     ->     pno INT NOT NULL,     ->     pname CHAR(20) NOT NULL,     ->     qty INT NOT NULL,     ->     PRIMARY KEY (sno, pno)     -> ); mysql> INSERT INTO Supplier     -> SELECT DISTINCT sno, sloc, szip FROM Inventory2; mysql> INSERT INTO Part (sno, pno, pname, qty)     -> SELECT sno, pno, pname, qty FROM Inventory2; mysql> SELECT * FROM Supplier ORDER BY sno; +-----+---------+-------+ | sno | sloc    | szip  | +-----+---------+-------+ |   1 | Chicago | 60632 | |   2 | Dallas  | 75206 | |   3 | Chicago | 60632 | +-----+---------+-------+ mysql> SELECT * FROM Part ORDER BY sno, pno; +-----+-----+-------+-----+ | sno | pno | pname | qty | +-----+-----+-------+-----+ |   1 |   1 | stool |   5 | |   1 |   2 | lamp  |  15 | |   2 |   1 | stool |  25 | |   2 |   3 | desk  |  10 | |   3 |   2 | lamp  |  10 | |   3 |   4 | chair |   3 | +-----+-----+-------+-----+

The key for the Supplier table is the supplier number. The key for the Part table is a composite key based on both supplier and part number. (The key for parts must include the supplier number because a given part might be available from more than one supplier.)

The Supplier table is 2NF because it is in 1NF and every non-key column depends on the primary key (the supplier number). On the other hand, the Part table is not in 2NF. Although the quantity depends on the entire composite primary key (the sno and pno columns), the part name depends only on the part number. This requires a further modification to split the name information off into another table:

 mysql> CREATE TABLE PartName     -> (     ->     pno INT NOT NULL,     ->     pname CHAR(20) NOT NULL,     ->     PRIMARY KEY (pno)     -> ); mysql> INSERT INTO PartName (pno, pname)     -> SELECT DISTINCT pno, pname FROM Part; mysql> ALTER TABLE Part DROP pname; mysql> SELECT * FROM Part ORDER BY sno, pno; +-----+-----+-----+ | sno | pno | qty | +-----+-----+-----+ |   1 |   1 |   5 | |   1 |   2 |  15 | |   2 |   1 |  25 | |   2 |   3 |  10 | |   3 |   2 |  10 | |   3 |   4 |   3 | +-----+-----+-----+ mysql> SELECT * FROM PartName ORDER BY pno; +-----+-------+ | pno | pname | +-----+-------+ |   1 | stool | |   2 | lamp  | |   3 | desk  | |   4 | chair | +-----+-------+

At this point, the Part and PartName are in 3NF because they are in 2NF and each non-key value depends directly on the primary key and not on some other non-key value. However, the Supplier table is in 2NF but not 3NF because there is a transitive dependency. The szip column depends on the primary key, but not directly: It depends on sloc, which depends on the primary key. This allows certain updates to cause problems. For example, you cannot add information about locations and ZIP codes without having a supplier for the given location.

To place the supplier information in 3NF, it's necessary to split out the ZIP code from the Supplier table and create a table that maps supplier location to ZIP code. The resulting Supplier and SupplierZip tables have the following structure.

 mysql> CREATE TABLE SupplierZip     -> (     ->    sloc CHAR(20) NOT NULL,     ->    szip CHAR(20) NOT NULL,     ->    PRIMARY KEY (sloc)     -> ); mysql> INSERT INTO SupplierZip SELECT DISTINCT sloc, szip FROM Supplier; mysql> ALTER TABLE Supplier DROP szip; mysql> SELECT * FROM Supplier ORDER BY sno; +-----+---------+ | sno | sloc    | +-----+---------+ |   1 | Chicago | |   2 | Dallas  | |   3 | Chicago | +-----+---------+ mysql> SELECT * FROM SupplierZip ORDER BY sloc; +---------+-------+ | sloc    | szip  | +---------+-------+ | Chicago | 60632 | | Dallas  | 75206 | +---------+-------+

Now Supplier and SupplierZip both are in 3NF because no non-key column depends on another non-key column. To modify any non-key value, the row to modify can be identified uniquely by referring to the primary key.

The original non-normal Inventory table has been decomposed into a set of normalized tables. For normalization to be correct, it must result in no loss of data. That is, it must be possible to reconstruct the original data by joining the normalized tables. Let's check that:

 mysql> SELECT S.sno, S.sloc, SZ.szip, P.pno, PT.pname, P.qty     -> FROM Supplier S, SupplierZip SZ, Part P, PartName PT     -> WHERE S.sloc = SZ.sloc AND S.sno = P.sno AND P.pno = PT.pno     -> ORDER BY S.sno, P.pno; +-----+---------+-------+-----+-------+-----+ | sno | sloc    | szip  | pno | pname | qty | +-----+---------+-------+-----+-------+-----+ |   1 | Chicago | 60632 |   1 | stool |   5 | |   1 | Chicago | 60632 |   2 | lamp  |  15 | |   2 | Dallas  | 75206 |   1 | stool |  25 | |   2 | Dallas  | 75206 |   3 | desk  |  10 | |   3 | Chicago | 60632 |   2 | lamp  |  10 | |   3 | Chicago | 60632 |   4 | chair |   3 | +-----+---------+-------+-----+-------+-----+

That is indeed the same as the contents of the original Inventory table.

A summary of normalization benefits:

Tables do not contain redundant data. One result is reduced storage requirements due to elimination of duplicate values. Another result is better data integrity when updates are performed due to the reduced chance of updating one instance of a value but not others when all are to be changed, or updating too many values when only one is to be changed.
Individual tables become smaller, which improves performance in various ways. For example, index creation is faster, and table locks don't lock as much data so concurrency is better due to reduced contention.
Normalization makes it easier to identify specific objects uniquely. Breaking data into multiple tables provides the flexibility to combine information in different ways (using joins) more easily.
Normalized tables make it easier to write better joins, so the optimizer works better. There are fewer indexes per table, so the optimizer doesn't have to consider as many execution plans. This helps both retrievals and updates.