Now that we've taken a look at primary keys and foreign keys, and (briefly) at selecting, inserting, updating, and deleting the information in a database, we can have a little look at one final aspect of using and designing relational databases. Normalization is a term that's often bandied about, but tends to be poorly understood. In this section, we'll try to make things a little clearer. For an explanation of exactly what database normalization is, we can return to our friend Dr Codd, who outlined the first three rules of normalization in his 1972 paper, Further Normalization of the Data Base Relational Model. Since his work, other rules have been added to the list, but we won't discuss them here - this is not a book about database theory, and to go further would take us beyond the scope of what it necessary for us to cover. Normalization has several purposes. One of these is to prevent the possibility of redundant data that can become out of sync (such as when a field in one record with a particular value gets updated, but other duplicate instances of the data fail to get updated). Another is to prepare for unforeseen scalability issues. Thinking about the Customers table of the Northwind database, what if a later requirement were to mandate that we should store more than one shipping address per client? That would precipitate a change to the database itself, to all related SQL code, and possibly to the code of applications that use it too. More succinctly stated, the purpose of normalization is to reduce the chances for anomalies to occur in a database during the processes of inserting, deleting, and updating data.
When the first three rules of normalization (the so-called first, second, and third normal forms) are adhered to, these three potential problems are dealt with. Through normalization, we ensure that our database will:
First Normal Form (1NF)
To put the formal definition of the first normal form in more practical terms, 1NF really has two parts. The first is that any value in a field should be atomic (indivisible). An example of a table that is not in 1NF is:
ID Supplier_Products Price ----------- -------------------------------------------------- ------ 1 Lenox Serving Platter 156.95 2 Mikasa Serving Platter 93.95 3 Noritake Serving Platter 206.95 This violates the principle of atomic data because the values in the Supplier_Products field contain two distinct data elements: the supplier, and the product they supply. One step we could take to fix this problem would be:
ID ServingPlatter LenuxPrice MikasaPrice NoritakePrice ----------- --------------- ---------- ----------- ------------- 1 Serving Platter 156.95 93.95 206.95 But this 'solution' violates the second part of 1NF, which demands that an atomic value should not be spread across various fields. In this case, the price is an atomic value, but we've spread it across LenuxPrice, MikasaPrice, and NoritakePrice. Let's make one last attempt to reach 1NF in this example:
Products ID Name ----------- -------------- 1 Serving Platter Suppliers ID ProductID Name Price ---------- ----------- --------------- ----- 1 1 Lenux 156.95 2 1 Mikasa 93.95 3 1 Noritake 206.95 By creating separate Products and Suppliers tables, we now have a Price field with atomic data: the price.
Second Normal Form (2NF)
The first part of this rule says that for a database to be in 2NF, it must first fulfill all the criteria of a 1NF database. In fact, all of the rules are cumulative - each rule depends on the criteria of all previous rules being met. If we examine our Products table above, we see that it meets 2NF. The key is ID, so the only non-key attribute is Name. Name is fully dependent on the ID field, which identifies a unique product. If we look at the Suppliers table, however, we find that it does not meet 2NF. Again, Name is fully dependent on the ID, since the latter is the key field that identifies a unique supplier. However, the non-key ProductID and Price fields don't completely depend on ID - not all of Lenux's products cost $156.9.5, for example. We need to add a third table to reach 2NF and retain all of the data and the relationships between them:
Products ID Name ----------- -------------- 1 Serving Platter Suppliers ID Name ---------- -------------- 1 Lenux 2 Mikasa 3 Noritake Products_Suppliers ProductID SupplierID Price ----------- ----------- ------ 1 1 156.95 1 2 93.95 1 3 206.95 Here, the Products table remains the same - Products.Name still depends entirely on Products.ID, so it's still in 2NF. Suppliers is also now in 2NF, since it has a key field (ID) and a non-key field (Name) that's completely dependent on it. Products_Suppliers has a compound key made up of ProductID and SupplierID. The only non-key field is Price, and Price is completely dependent on the key.
Third Normal Form (3NF)
Let's now add the supplier's phone number to our example, as follows:
Products_Suppliers ProductID SupplierID Price SupplierPhone ----------- ----------- ------- ------------ 1 1 156.95 301-3344 1 2 93.95 302-4455 1 3 206.95 303-5566 Does this table still meet 1NF? Certainly - all fields are atomic. Does it meet 2NF? Yes - Price and SupplierPhone depend on the key. Does it meet 3NF? As you may have already guessed, it does not. SupplierPhone depends on the key, but only partially so. That is, it depends on the SupplierID part of the key, but not the ProductID part. This type of dependency is said to be transitive. The solution should be fairly obvious: we need to move the SupplierPhone from the Products_Suppliers table to the Suppliers table. If this sounds too easy, it really is. Once you understand these three normal forms, it becomes quite easy to spot non-normalized tables, and then to do something about them.
|