Understanding Normalization | Applied ADO.NET: Building Data-Driven Solutions

If you've been working with databases for a while, you're probably familiar with the term normalization. Database designers or developers often ask whether a database is normalized. So, what's normalization? Normalization is a process of eliminating data redundancy (except for the redundancy required by foreign keys) in multiple relational tables, and it ensures that data is organized efficiently. When you normalize a database, you basically have three goals:

Ensuring you've organized the data correctly into groups that minimize the amount of duplicate data stored in a database
Organizing the data such that, when you (or your users) modify data in it (such as a person's address or email), the change only has to be made once
Designing a database in which you can access and manipulate the data quickly and efficiently without compromising the integrity of the data in storage

E. F. Codd first proposed the normalization process in 1972. Initially, he proposed three normal forms, which he called first, second, and third normal forms (1NF, 2NF, and 3NF). Subsequently, he proposed a stronger definition of 3NF, known as Boyce-Codd Normal Form (BCNF). Later, others proposed fourth normal form (4NF) and fifth normal form (5NF). Most people agree that a database that's in 3NF is probably good enough—in other words, normalized enough.

First Normal Form (1NF)

A table is in 1NF if the values of its columns are atomic, containing no repeating values. Applying 1NF on a table eliminates duplicate columns from the same table, which creates separate tables for each group of related data, and adds a primary key to the table to identify table rows with a unique value. For example, say you have a Customers table (see Figure A-1). The Customers table stores data about customer orders. The columns store data about a customer such as name, address, and order description.

Figure A-1: Customers table before normalization

The data of the Customers table looks like Figure A-2.

Figure A-2: Two rows of data from the Customers table

Figure A-1 shows the design of a Customers table. As you can see, it's been designed to store up to three orders for any customer—sort of mimicking an array structure in spreadsheet format. There are obvious problems with this design. As you can see from Figure A-2, there are many columns related to orders (that's a classic giveaway of a poorly designed table—mixing two or more entities, or kinds of data). Whenever a customer posts a new order, a new column will be added to the table. Not only that, if the same customer posts more than one order, the duplicate Address and City column data will be added to the table. This scenario adds duplicate data to the table. This table isn't in 1NF because the 1NF rule says that a table is in 1NF if and only if a table's columns have no repeating values.

You apply 1NF on this table by eliminating the details about the orders, providing just enough relational information to link the Customers table with a new Orders table. The new format of this table after 1NF looks like Figure A-3. You still have some information about orders in this Customers table but only basic link information: OrderNumber. You've fixed the repeating fields problem by jamming all the other order information into a single OrderDescription field.

Figure A-3: Customers table schema after 1NF

The data of the table now looks like Figure A-4.

click to expand
Figure A-4: Data of the Customers table after 1NF

Second Normal Form (2NF)

2NF eliminates redundant data. A table is in 2NF if it's in 1NF and every nonkey column (not a key column—primary or foreign) is fully dependent upon the primary key.

Applying 2NF on a table removes duplicate data on multiple rows, places data in separate rows, places grouped data in new tables, and creates relationships between the original tables and new tables through foreign keys. As you can see from Figure A-4, there are six records for two customers and three columns— CustomerName, Address, and City has the same information three times. 2NF eliminate these cases. Under 2NF, you separate data into two different tables and relate these tables using a foreign key. Records in tables should not depend on anything other than the table's primary key.

In this example, you now create a separate Orders table. As you can probably guess, the Orders table stores information related to customer orders. It relates back to the correct customer via the CustomerId column. Now the two tables, Customers and Orders, look like Figure A-5 and Figure A-6.

click to expand
Figure A-5: Customers table after 2NF

Figure A-6: Orders table after 2NF

As you can see from Figures A-5 and A-6, both tables now have a primary key—denoted by the key icon to the left of the column name. This key will be unique for each record. The CustomerId column of the Customers table is mapped to the CustomerId column of the Orders table. The CustomerId column of the Orders table is a foreign key. The relationship between the Customers and Orders tables looks like Figure A-7.

Figure A-7: Relationship between the Customers and Orders tables

Now the data in these tables look like Figure A-8 and Figure A-9.

click to expand
Figure A-8: Customers table after 2NF

click to expand
Figure A-9: Orders table after 2NF

Third Normal Form (3NF)

A table is in 3NF if all columns of a table depend upon the primary key. 3NF eliminates columns that don't depend on the table's primary key. (Note that primary keys don't have to be single columns as shown in these simple examples.)

For example, as you can see from Figure A-10, CustomerId, UnitPrice, and Quantity depend on OrderId (the primary key). But the Total column doesn't depend on the OrderId column; it depends upon the UnitPrice and the Quantity columns.

click to expand
Figure A-10: Orders table after 3NF

The data of the Orders table look like Figure A-11 after applying the 3NF rule on it, and you can calculate Total as the multiplication of UnitPrice and Quantity in your SQL statement.

click to expand
Figure A-11: Data of the Orders table after 3NF

For example:

 SELECT UnitPrice * Quantity AS Total FROM Orders

Boyce-Codd Normal Form (BCNF)

A determinant column is the column on which other columns are fully dependent. BCNF is an extended version of 3NF. A database is in BCNF if and only if every determinant is a candidate key. A candidate key is a combination of columns that can be uniquely used to identify a row. Each table may have one or more candidate keys.

Fourth Normal Form (4NF)

4NF and 5NF seem to be of interest only in computer science classes and in the academic world in general. But, if you're really interested, a database is in 4NF if and only if it's in BCNF and all multivalued dependencies are also functional dependencies. For example, a customer can have multiple orders and multiple addresses. This data (the customer having multiple addresses and multiple orders) can be stored in a single table, but after applying 4NF, the data will be stored in two tables. The first table stores the CustomerId with addresses, and the second table stores the CustomerId with orders.

Fifth Normal Form (5NF or PJ/NF)

A database is in 5NF—or Projection/Join Normal Form (PJ/NF)—if it can't have a lossless decomposition into any number of smaller tables. In other words, this means a table that has been decomposed into three or more smaller tables must be capable of being joined again on common keys to form the original table.

Note

This ends the quick overview of normal forms. SQL Server Magazine has numerous articles about database design and normal forms that you might find helpful at www.sqlmag.com/Articles/Index.cfm?AuthorID=436.

Note

If you've heard about database normal forms and normalization, chances are you've also heard about denormalization, which basically refers to the process of embracing a less than rigorously perfect relational design in favor of performance. You might, for example, include calculated fields or aggregations in your design. By including redundant information in your design, you denormalize it.