Normalization refers to the process of restructuring tables to eliminate design problems. Normalizing your tables removes redundant data, makes it possible to access data more flexibly, and eliminates the possibility that inappropriate modifications will take place that make the data inconsistent. Normalization of a complex table often amounts to taking it through a process of decomposition into a set of smaller tables. This process removes repeating groups within rows and then removes duplicate data within columns. Normalization has several levels: First normal form, second normal form, and so forth. Each successive form depends on the preceding form and provides stronger guarantees about the data modification anomalies that are eliminated. This section discusses the first through third normal forms. These are the most common, though higher-level forms are possible.
The normalization process and the problems that it solves can be illustrated using the parts-and-suppliers scenario of which database designers are so fond. The following example serves as a demonstration that begins with a poorly designed table, and then improves the table design in stages until we reach third normal form. At each stage, the SQL statements are shown that produce the desired table modifications. Begin with a table that is intended to keep track of a parts inventory. It lists supplier numbers, locations, and ZIP codes. For each supplier, it lists part numbers, names, and quantity-on-hand values: The table definition and some sample data follow: mysql> CREATE TABLE Inventory -> ( -> sno INT, -> sloc CHAR(20), -> szip CHAR(20), -> pno1 INT, -> pname1 CHAR(20), -> qty1 INT, -> pno2 INT, -> pname2 CHAR(20), -> qty2 INT -> ); mysql> INSERT INTO Inventory VALUES -> (1,'Chicago','60632', 1,'stool',5,2,'lamp',15), -> (2,'Dallas','75206',1,'stool',25,3,'desk',10), -> (3,'Chicago','60632',2,'lamp',10,4,'chair',3); mysql> SELECT * FROM Inventory ORDER BY sno; +------+---------+-------+------+--------+------+------+--------+------+ | sno | sloc | szip | pno1 | pname1 | qty1 | pno2 | pname2 | qty2 | +------+---------+-------+------+--------+------+------+--------+------+ | 1 | Chicago | 60632 | 1 | stool | 5 | 2 | lamp | 15 | | 2 | Dallas | 75206 | 1 | stool | 25 | 3 | desk | 10 | | 3 | Chicago | 60632 | 2 | lamp | 10 | 4 | chair | 3 | +------+---------+-------+------+--------+------+------+--------+------+ This table design has several problems:
To put the information in first normal form, the repeating groups must be eliminated. This can be done by creating a table in which rows associate the supplier information with information for only a single part, as follows: mysql> CREATE TABLE Inventory2 -> ( -> sno INT NOT NULL, -> sloc CHAR(20) NOT NULL, -> szip CHAR(20) NOT NULL, -> pno INT NOT NULL, -> pname CHAR(20) NOT NULL, -> qty INT, -> PRIMARY KEY (sno, pno) -> ); mysql> INSERT INTO Inventory2 (sno, sloc, szip, pno, pname, qty) -> SELECT sno, sloc, szip, pno1, pname1, qty1 FROM Inventory; mysql> INSERT INTO Inventory2 (sno, sloc, szip, pno, pname, qty) -> SELECT sno, sloc, szip, pno2, pname2, qty2 FROM Inventory; mysql> SELECT * FROM Inventory2 ORDER BY sno, pno; +-----+---------+-------+-----+-------+------+ | sno | sloc | szip | pno | pname | qty | +-----+---------+-------+-----+-------+------+ | 1 | Chicago | 60632 | 1 | stool | 5 | | 1 | Chicago | 60632 | 2 | lamp | 15 | | 2 | Dallas | 75206 | 1 | stool | 25 | | 2 | Dallas | 75206 | 3 | desk | 10 | | 3 | Chicago | 60632 | 2 | lamp | 10 | | 3 | Chicago | 60632 | 4 | chair | 3 | +-----+---------+-------+-----+-------+------+ The Inventory2 table has no repeating groups and is in 1NF. However, it has a lot of redundancy. Each row has supplier number, location, and ZIP code, when only the number is needed to associate a part with its supplier. As a result, inconsistencies can easily result from updates. Deleting a row for a given part also deletes supplier information. If the part was the only one for the supplier, there is no longer any information in the table about the existence of that supplier. If you want to change a supplier's location or ZIP code, you must change multiple rows. There are also constraints on how you add data. You cannot insert a supplier without having a part for it first. To fix these problems, split apart the Inventory2 table into separate Supplier and Part tables, and associate each part only with the supplier number: mysql> CREATE TABLE Supplier -> ( -> sno INT NOT NULL, -> sloc CHAR(20) NOT NULL, -> szip CHAR(20) NOT NULL, -> PRIMARY KEY (sno) -> ); mysql> CREATE TABLE Part -> ( -> sno INT NOT NULL, -> pno INT NOT NULL, -> pname CHAR(20) NOT NULL, -> qty INT NOT NULL, -> PRIMARY KEY (sno, pno) -> ); mysql> INSERT INTO Supplier -> SELECT DISTINCT sno, sloc, szip FROM Inventory2; mysql> INSERT INTO Part (sno, pno, pname, qty) -> SELECT sno, pno, pname, qty FROM Inventory2; mysql> SELECT * FROM Supplier ORDER BY sno; +-----+---------+-------+ | sno | sloc | szip | +-----+---------+-------+ | 1 | Chicago | 60632 | | 2 | Dallas | 75206 | | 3 | Chicago | 60632 | +-----+---------+-------+ mysql> SELECT * FROM Part ORDER BY sno, pno; +-----+-----+-------+-----+ | sno | pno | pname | qty | +-----+-----+-------+-----+ | 1 | 1 | stool | 5 | | 1 | 2 | lamp | 15 | | 2 | 1 | stool | 25 | | 2 | 3 | desk | 10 | | 3 | 2 | lamp | 10 | | 3 | 4 | chair | 3 | +-----+-----+-------+-----+ The key for the Supplier table is the supplier number. The key for the Part table is a composite key based on both supplier and part number. (The key for parts must include the supplier number because a given part might be available from more than one supplier.) The Supplier table is 2NF because it is in 1NF and every non-key column depends on the primary key (the supplier number). On the other hand, the Part table is not in 2NF. Although the quantity depends on the entire composite primary key (the sno and pno columns), the part name depends only on the part number. This requires a further modification to split the name information off into another table: mysql> CREATE TABLE PartName -> ( -> pno INT NOT NULL, -> pname CHAR(20) NOT NULL, -> PRIMARY KEY (pno) -> ); mysql> INSERT INTO PartName (pno, pname) -> SELECT DISTINCT pno, pname FROM Part; mysql> ALTER TABLE Part DROP pname; mysql> SELECT * FROM Part ORDER BY sno, pno; +-----+-----+-----+ | sno | pno | qty | +-----+-----+-----+ | 1 | 1 | 5 | | 1 | 2 | 15 | | 2 | 1 | 25 | | 2 | 3 | 10 | | 3 | 2 | 10 | | 3 | 4 | 3 | +-----+-----+-----+ mysql> SELECT * FROM PartName ORDER BY pno; +-----+-------+ | pno | pname | +-----+-------+ | 1 | stool | | 2 | lamp | | 3 | desk | | 4 | chair | +-----+-------+ At this point, the Part and PartName are in 3NF because they are in 2NF and each non-key value depends directly on the primary key and not on some other non-key value. However, the Supplier table is in 2NF but not 3NF because there is a transitive dependency. The szip column depends on the primary key, but not directly: It depends on sloc, which depends on the primary key. This allows certain updates to cause problems. For example, you cannot add information about locations and ZIP codes without having a supplier for the given location. To place the supplier information in 3NF, it's necessary to split out the ZIP code from the Supplier table and create a table that maps supplier location to ZIP code. The resulting Supplier and SupplierZip tables have the following structure. mysql> CREATE TABLE SupplierZip -> ( -> sloc CHAR(20) NOT NULL, -> szip CHAR(20) NOT NULL, -> PRIMARY KEY (sloc) -> ); mysql> INSERT INTO SupplierZip SELECT DISTINCT sloc, szip FROM Supplier; mysql> ALTER TABLE Supplier DROP szip; mysql> SELECT * FROM Supplier ORDER BY sno; +-----+---------+ | sno | sloc | +-----+---------+ | 1 | Chicago | | 2 | Dallas | | 3 | Chicago | +-----+---------+ mysql> SELECT * FROM SupplierZip ORDER BY sloc; +---------+-------+ | sloc | szip | +---------+-------+ | Chicago | 60632 | | Dallas | 75206 | +---------+-------+ Now Supplier and SupplierZip both are in 3NF because no non-key column depends on another non-key column. To modify any non-key value, the row to modify can be identified uniquely by referring to the primary key. The original non-normal Inventory table has been decomposed into a set of normalized tables. For normalization to be correct, it must result in no loss of data. That is, it must be possible to reconstruct the original data by joining the normalized tables. Let's check that: mysql> SELECT S.sno, S.sloc, SZ.szip, P.pno, PT.pname, P.qty -> FROM Supplier S, SupplierZip SZ, Part P, PartName PT -> WHERE S.sloc = SZ.sloc AND S.sno = P.sno AND P.pno = PT.pno -> ORDER BY S.sno, P.pno; +-----+---------+-------+-----+-------+-----+ | sno | sloc | szip | pno | pname | qty | +-----+---------+-------+-----+-------+-----+ | 1 | Chicago | 60632 | 1 | stool | 5 | | 1 | Chicago | 60632 | 2 | lamp | 15 | | 2 | Dallas | 75206 | 1 | stool | 25 | | 2 | Dallas | 75206 | 3 | desk | 10 | | 3 | Chicago | 60632 | 2 | lamp | 10 | | 3 | Chicago | 60632 | 4 | chair | 3 | +-----+---------+-------+-----+-------+-----+ That is indeed the same as the contents of the original Inventory table. A summary of normalization benefits:
|