Normalization | Beginning ASP.NET 2.0 and Databases (Wrox Beginning Guides)

Chapter 2 - Relational Databases

byJohn Kauffman, Fabio Claudio Ferracchiatiet al.?
Wrox Press ?2002

Now that we've taken a look at primary keys and foreign keys, and (briefly) at selecting, inserting, updating, and deleting the information in a database, we can have a little look at one final aspect of using and designing relational databases. Normalization is a term that's often bandied about, but tends to be poorly understood. In this section, we'll try to make things a little clearer.

For an explanation of exactly what database normalization is, we can return to our friend Dr Codd, who outlined the first three rules of normalization in his 1972 paper, Further Normalization of the Data Base Relational Model. Since his work, other rules have been added to the list, but we won't discuss them here - this is not a book about database theory, and to go further would take us beyond the scope of what it necessary for us to cover.

Normalization has several purposes. One of these is to prevent the possibility of redundant data that can become out of sync (such as when a field in one record with a particular value gets updated, but other duplicate instances of the data fail to get updated). Another is to prepare for unforeseen scalability issues. Thinking about the Customers table of the Northwind database, what if a later requirement were to mandate that we should store more than one shipping address per client? That would precipitate a change to the database itself, to all related SQL code, and possibly to the code of applications that use it too.

More succinctly stated, the purpose of normalization is to reduce the chances for anomalies to occur in a database during the processes of inserting, deleting, and updating data.

An insertion anomaly is a failure to place a value into all the places in the database where it should be stored. In a properly normalized database, a single value needs to be inserted into only one place in the database; in an inadequately normalized database, data may need to be inserted into more than one place. There is a risk that one or more insertions will get overlooked.
A deletion anomaly is a failure to remove data. In a properly normalized database, obsolete data needs to be deleted from only one place in the database; in an inadequately normalized database, obsolete data may need to be deleted from several places.
An update anomaly is very similar to an insertion anomaly. It occurs when a value is stored in multiple places in the database, and our UPDATE fails to update one or more of those values.

When the first three rules of normalization (the so-called first, second, and third normal forms) are adhered to, these three potential problems are dealt with. Through normalization, we ensure that our database will:

Serve current as well as unforeseeable future needs
Have as little redundancy as possible
Accommodate multiple values for types of data that require them
Permit efficient updates of the data in the database
Avoid the danger of losing data unknowingly

First Normal Form (1NF)

Note

Formal definition: A relation is in first normal form (1NF) if and only if all underlying simple domains contain only atomic values.

To put the formal definition of the first normal form in more practical terms, 1NF really has two parts. The first is that any value in a field should be atomic (indivisible). An example of a table that is not in 1NF is:

    ID       Supplier_Products                                  Price ----------- -------------------------------------------------- ------    1        Lenox Serving Platter                              156.95    2        Mikasa Serving Platter                             93.95    3        Noritake Serving Platter                           206.95

This violates the principle of atomic data because the values in the Supplier_Products field contain two distinct data elements: the supplier, and the product they supply. One step we could take to fix this problem would be:

    ID          ServingPlatter  LenuxPrice MikasaPrice NoritakePrice    ----------- --------------- ---------- ----------- -------------    1           Serving Platter 156.95     93.95       206.95

But this 'solution' violates the second part of 1NF, which demands that an atomic value should not be spread across various fields. In this case, the price is an atomic value, but we've spread it across LenuxPrice, MikasaPrice, and NoritakePrice. Let's make one last attempt to reach 1NF in this example:

    Products    ID          Name    ----------- --------------    1           Serving Platter    Suppliers    ID      ProductID   Name            Price ---------- ----------- --------------- -----    1       1           Lenux           156.95    2       1           Mikasa          93.95    3       1           Noritake        206.95

By creating separate Products and Suppliers tables, we now have a Price field with atomic data: the price.

Second Normal Form (2NF)

Note

Formal definition: A relation is in second normal form (2NF) if and only if it is in 1NF, and every non-key attribute is fully dependent on the primary key.

The first part of this rule says that for a database to be in 2NF, it must first fulfill all the criteria of a 1NF database. In fact, all of the rules are cumulative - each rule depends on the criteria of all previous rules being met.

If we examine our Products table above, we see that it meets 2NF. The key is ID, so the only non-key attribute is Name. Name is fully dependent on the ID field, which identifies a unique product.

If we look at the Suppliers table, however, we find that it does not meet 2NF. Again, Name is fully dependent on the ID, since the latter is the key field that identifies a unique supplier. However, the non-key ProductID and Price fields don't completely depend on ID - not all of Lenux's products cost $156.9.5, for example. We need to add a third table to reach 2NF and retain all of the data and the relationships between them:

    Products    ID       Name ----------- --------------    1        Serving Platter    Suppliers    ID      Name ---------- --------------    1       Lenux    2       Mikasa    3       Noritake    Products_Suppliers    ProductID   SupplierID  Price    ----------- ----------- ------    1           1           156.95    1           2           93.95    1           3           206.95

Here, the Products table remains the same - Products.Name still depends entirely on Products.ID, so it's still in 2NF. Suppliers is also now in 2NF, since it has a key field (ID) and a non-key field (Name) that's completely dependent on it. Products_Suppliers has a compound key made up of ProductID and SupplierID. The only non-key field is Price, and Price is completely dependent on the key.

Third Normal Form (3NF)

Note

Formal definition: A relation is in third normal form (3NF) if and only if it is in 2NF and every non-key attribute is non-transitively dependent on the primary key.

Let's now add the supplier's phone number to our example, as follows:

    Products_Suppliers    ProductID   SupplierID  Price   SupplierPhone    ----------- ----------- ------- ------------    1           1           156.95  301-3344    1           2           93.95   302-4455    1           3           206.95  303-5566

Does this table still meet 1NF? Certainly - all fields are atomic. Does it meet 2NF? Yes - Price and SupplierPhone depend on the key. Does it meet 3NF? As you may have already guessed, it does not. SupplierPhone depends on the key, but only partially so. That is, it depends on the SupplierID part of the key, but not the ProductID part. This type of dependency is said to be transitive. The solution should be fairly obvious: we need to move the SupplierPhone from the Products_Suppliers table to the Suppliers table. If this sounds too easy, it really is. Once you understand these three normal forms, it becomes quite easy to spot non-normalized tables, and then to do something about them.

Note

If a table is in 2NF and its key is simple (rather than compound), it is automatically in 3NF as well. You only have to worry about 3NF for tables with compound keys.