What Is Normalization?

Publisher William F. Buckley Jr. told a wonderful story about a waiter who had a complaint about Buckley's journal, National Review. "I love reading the articles," said the waiter. "But why do you have to use so many big words?"

A year later, Buckley met the same waiter, who thanked Buckley for taking his advice and dumbing down the vocabulary. Of course, Buckley had done no such thing. The waiter had learned the "big words" that had given him such trouble, so they were longer "big."

The database term normalization reminds me of that story. The word itself sounds peculiarly technical and opaque. The database theory that it defines is difficult and, for many, impenetrable. (This is by no means a put-down: Database theory is absolutely essential and fundamental to what database designers do; it is just hard to understand.)

But if you look beyond the technical language in which the theory is expressed, you'll find a body of principle that makes much sense. In fact, if you've read and understood this chapter, I think you'll be able to understand the most important rules of normalization. This section will actually serve as a review of what you already know.

An exposure to formal database theory and normalization won't necessarily improve your database design skills. But you will come across some of its terms and rules, and you shouldn't feel intimidated by them (well, no more than necessary). Hopefully this short discussion will put you in the mind-set of Mr. Buckley's waiter: Normalization just won't seem like a big word anymore.

As you read through the chapter, you probably noted that much of your work involved excluding duplicate values. The elimination of repeating data is the overall purpose of creating a normal form. The process of creating a database involves satisfying various levels of the normal form. You start at the first normal form, make sure your database satisfies its requirements, proceed to the second normal form, satisfy its requirements, and so on. As you satisfy each requirement, you eliminate redundancy and set up your data so that it can be processed more efficiently and so that it also is less likely to become inconsistent.

As I've indicated, each level of the normal form is stated in dense theoretical language. I try to describe them in simple language. Although there are several levels of the normal form, the first three are generally considered the most important.

First Normal Form

To meet the requirements of the first normal form, you move repeating and multivalue fields to another table. You've done that throughout the chapter. For example, consider a Books field (or fields) in an Authors table; the field(s) contain the authors' major works. Each author has written several books.

There are two commonsense approaches to resolving this issue. Table 3.35 shows the "repeated field" approach: You create as many additional columns as you need, one for each book. (The author names are not part of the table and are included for information purposes only.) Table 3.36 shows the "multivalue" approach: You stuff all the values (books) into one field.

Table 3.35. The "Repeated Fields" Approach to Storing Values
AuthorID	Book 1	Book 2	Book 3
1 (Dostoevsky)	The Brothers Karamazov	The Gambler	Crime and Punishment
2 (Turgenev)	Fathers and Sons
3 (Tolstoy)	War and Peace	Anna Karenina

Table 3.36. The "Multivalue" Approach to Storing Values
AuthorID	Books	<<other fields>>
1 (Dostoevsky)	The Brothers Karamazov, The Gambler, Crime and Punishment
2 (Turgenev)	Fathers and Sons
3 (Tolstoy)	War and Peace, Anna Karenina

Neither approach is used in relational databases. As you've seen, a solution is to create an entirely new table (see Table 3.37). You copy the primary key from the original table and include it as a foreign key. Each book is now a separate value. Each record comprises the value of the primary key of the original table and the book name.

Table 3.37. A Resolved Multivalue Field
AuthorID	Book
1 (Dostoevsky)	The Brothers Karamazov
1 (Dostoevsky)	The Gambler
1 (Dostoevsky)	Crime and Punishment
2 (Turgenev)	Fathers and Sons
3 (Tolstoy)	War and Peace
3 (Tolstoy)	Anna Karenina

This is just what you did much earlier in the chapter when we removed TV characters from the Programs table and placed them in a separate table. You placed the primary key of the Programs table, along with the character names, in an entirely new table. This new table is the child table, and the original table is the parent table.

Your work is by no means finished: You still have to decide on a primary key for the new table, identify what other fields will be included (if any), and determine how the table fits into the web of relationships that bind the database together. But you've made an important start.

Second Normal Form

To comply with the second normal form, you need to remove fields that do not depend on the entire primary key. In other words, each column should be kept with the table it describes.

Table 3.38 is similar to the Roles table you've seen, but with one addition and one assumption. The addition is that I've added a Network field. The assumption is that there is no separate Networks table. (Again, solely for illustrative purposes, I've included the program and actor names for each ID in parentheses.)

Table 3.38. The Values in the `Network` Field Are Repeated Unnecessarily Because the `Network` Field Depends on Only One Component of the Composite Primary Key
Roles with `Network` Field
ProgramID (CPK/FK)	ActorID (CPK/FK)	Network	RoleFirstName	<<otherfields>>
14 (Bewitched)	20 (Elizabeth Montgomery)	ABC	Samantha
14 (Bewitched)	22 (Dick York)	ABC	Darrin
18 (MASH)	27 (Alan Alda)	CBS	Hawkeye
18 (MASH)	28 (McLean	CBS	Henry
	Stevenson)
18 (MASH)	30 (Gary	CBS	Radar
	Burghoff)

Recall that the table has a composite primary keythat is, the primary key comprises both the ProgramID and the ActorID. Take a look at the values in the Network column. The values are repeated for the same program. For example, there is one value of ABC for each value of 14 (Bewitched), so the table tells you twice that ABC is the network of Bewitched.

The reason for the redundant data is that the Network field depends on only one part of the primary key of the Roles table: ProgramID. The network has nothing to do with the other component of the primary key: ActorID. Thus, the Network field is in the wrong table; it belongs in the Programs table (see Table 3.39) because only the program identifies the network, not the actor.

Table 3.39. The `Network` Field Belongs in the Programs Table Because Only the Program Describes the Network
ProgramID	Program	Network
14	Bewitched	ABC
15	The Odd Couple	ABC
16	The Twilight Zone	CBS
17	Northern Exposure	CBS
18	MAS*H	CBS

Third Normal Form

The third normal form should be mostly familiar to you. To meet its requirements, you need to remove any field that depends on other, nonkey fields.

You tackled this problem in the Classic TV database when you found a calculated fieldspecifically, Years on Air. The number in that field can be derived by subtracting the start date from the end date. In this case, the value in the calculated field depends on the values in one or more fields in the table. Thus, the field is unnecessary.

In Table 3.40, the Total Discount field is calculated by multiplying the discount by the unit price. This field should be eliminated because it totally depends on the Unit Price and Discount fields for its own value.

Table 3.40. The `Total Discount` Field Is a Calculated Field That Violates the Third Normal Form
Order Details with Calculated Field
OrderID	ProductID	UnitPrice	Discount	Total Discount
11600	Clam Chowder	$10.00	5%	$0.50
11600	Dried Apples	$15.00	10%	$1.50
11601	Tartar Sauce	$12.00	15%	$1.80
11601	Mozzarella	$12.00	5%	$ .60
11601	Ravioli Angelo	$15.00	5%	$ .75

First Normal Form

Table 3.35. The "Repeated Fields" Approach to Storing Values

Table 3.36. The "Multivalue" Approach to Storing Values

Table 3.37. A Resolved Multivalue Field

Second Normal Form

Table 3.38. The Values in the `Network` Field Are Repeated Unnecessarily Because the `Network` Field Depends on Only One Component of the Composite Primary Key

Table 3.39. The `Network` Field Belongs in the Programs Table Because Only the Program Describes the Network

Third Normal Form

Table 3.40. The `Total Discount` Field Is a Calculated Field That Violates the Third Normal Form

What Is Normalization?

First Normal Form

Table 3.35. The "Repeated Fields" Approach to Storing Values

Table 3.36. The "Multivalue" Approach to Storing Values

Table 3.37. A Resolved Multivalue Field

Second Normal Form

Table 3.38. The Values in the Network Field Are Repeated Unnecessarily Because the Network Field Depends on Only One Component of the Composite Primary Key

Table 3.39. The Network Field Belongs in the Programs Table Because Only the Program Describes the Network

Third Normal Form

Table 3.40. The Total Discount Field Is a Calculated Field That Violates the Third Normal Form

Table 3.38. The Values in the `Network` Field Are Repeated Unnecessarily Because the `Network` Field Depends on Only One Component of the Composite Primary Key

Table 3.39. The `Network` Field Belongs in the Programs Table Because Only the Program Describes the Network

Table 3.40. The `Total Discount` Field Is a Calculated Field That Violates the Third Normal Form