Understanding Database Normalization


  • Define entities, considering normalization and denormalization.

    • Specify degree of normalization.

Normalization is a design process intended to eliminate duplicate data; it does not dictate how you physically implement or store that data. In a normalized entity, redundant data is removed and the entity is simplified to its most basic form, which usually leads to a more granular design with more entities. Denormalization works in the opposite direction: it is the planned reintroduction of redundant data into a simplified design. Together, normalization and planned denormalization aim to balance performance with simplicity.

Although both normalization and denormalization are valid, necessary processes, the two pursue opposite goals. Neither by itself achieves maximum performance and simplicity; instead, a good design strikes a balance between performance (favored by denormalization) and simplicity (favored by normalization). At its core, normalization means no duplicate data.

In the early 1970s, Dr. E. F. Codd defined rules that a relational database should adhere to, known as normal forms, and today known as the first, second, and third normal forms. (More normal forms exist, but the first three are the only ones widely used.) The goal of an initial database design is to bring the database into the third normal form.

Before starting to normalize a database, look at the following guidelines:

  • All columns should be decomposed (broken down to their most basic form).

  • Many-to-many relationships should be converted to pairs of one-to-many relationships. This is not part of normalization, but part of the physical implementation.

  • Primary and Foreign Keys should be created and identified.

Normalizing a database is generally beneficial, but it can hamper performance. In many cases a designer has to consider denormalizing a database, which is discussed a little later. The following sections discuss the three normal forms and how to implement them.

First Normal Form

The first normal form, or 1NF, defines the foundation of the relational database model. 1NF requires that all attributes be atomic: each attribute must hold a single value that cannot (or, given the business needs, should not) be broken down further, and no attribute or group of attributes may repeat within an entity. In other words, attributes must already be decomposed into their most basic, singular form.

A non-repeating, indivisible attribute holds what is known as an atomic value: a value that is represented exactly once. You saw attribute decomposition in the earlier discussion of using Street, City, State, and Zip fields rather than a single general Address field; that is precisely what 1NF requires.
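The Street/City/State/Zip decomposition can be sketched in SQL. The book targets SQL Server 2000; the sketch below uses Python's built-in sqlite3 module purely for illustration, and the table and column names are hypothetical, not from the book's figures:

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Violates the spirit of 1NF: Address is a composite value that cannot be
# searched or sorted by city or state without string parsing.
conn.execute(
    "CREATE TABLE EmployeeFlat (EmployeeID INTEGER PRIMARY KEY, "
    "Name TEXT, Address TEXT)")

# Atomic (1NF-friendly) version: each address component is its own column.
conn.execute("""
    CREATE TABLE Employee (
        EmployeeID INTEGER PRIMARY KEY,
        Name       TEXT,
        Street     TEXT,
        City       TEXT,
        State      TEXT,
        Zip        TEXT
    )
""")

conn.execute(
    "INSERT INTO Employee VALUES (1, 'Ann Lee', '12 Elm St', 'Reno', 'NV', '89501')")

# Atomic columns make component-level queries trivial.
rows = conn.execute("SELECT Name FROM Employee WHERE State = 'NV'").fetchall()
print(rows)  # [('Ann Lee',)]
```

The same `WHERE State = 'NV'` filter against the single `Address` column would require fragile substring matching.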

Bringing a design into 1NF generally improves data access and performance, and a violation of the first normal form is usually easy to resolve. To determine whether an entity violates 1NF, look at its attributes and check whether any attribute repeats or is essentially a copy of another attribute. If so, the entity violates the first normal form.

For example, consider a variation of the Employee entity in which a repeating array of values holds information about an employee's multiple machine capabilities, as shown in Figure 2.6.

Figure 2.6. The Employee entity violating a restriction of 1NF.


To resolve this type of situation, find all the attributes that seem to be repeating. Create a new entity named after the repeating attribute and move the repeating attributes into it; in this example, a Machine entity would be created if one did not already exist. After creating the new entity and moving the repeated attributes into it, create a key that uniquely identifies each row in the new entity, along with a related key in the Employee entity. In this example the relationship is best treated as many-to-many: many employees can run a machine, and an employee can run many machines. The relationship would look similar to the illustration in Figure 2.7.

Figure 2.7. Employee/machinery relationship.

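The Employee/Machine resolution above can be sketched as a junction (associative) entity. As before, this is an illustrative SQLite sketch, not the book's SQL Server DDL, and the column names are assumptions:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Employee (
        EmployeeID  INTEGER PRIMARY KEY,
        Name        TEXT
    );
    CREATE TABLE Machine (
        MachineID   INTEGER PRIMARY KEY,
        MachineName TEXT
    );
    -- Junction entity resolving the many-to-many relationship into a
    -- pair of one-to-many relationships; its compound key prevents
    -- duplicate employee/machine pairs.
    CREATE TABLE EmployeeMachine (
        EmployeeID INTEGER REFERENCES Employee(EmployeeID),
        MachineID  INTEGER REFERENCES Machine(MachineID),
        PRIMARY KEY (EmployeeID, MachineID)
    );
""")
conn.execute("INSERT INTO Employee VALUES (1, 'Ann'), (2, 'Bob')")
conn.execute("INSERT INTO Machine VALUES (10, 'Lathe'), (20, 'Press')")

# Ann runs both machines, Bob runs the press: no repeating columns needed.
conn.executemany("INSERT INTO EmployeeMachine VALUES (?, ?)",
                 [(1, 10), (1, 20), (2, 20)])

machines_for_ann = [r[0] for r in conn.execute(
    "SELECT m.MachineName FROM Machine m "
    "JOIN EmployeeMachine em ON em.MachineID = m.MachineID "
    "WHERE em.EmployeeID = 1 ORDER BY m.MachineName")]
print(machines_for_ann)  # ['Lathe', 'Press']
```

Adding a fourth machine skill for an employee is now a one-row insert, rather than a schema change to add another repeating column.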

After eliminating duplicates and adding additional entities, you can then proceed with the second normal form and ensure that attributes are assigned to the most appropriate entity.

Second Normal Form

The first requirement of the second normal form (2NF) is that the entity meet the requirements of the first normal form. The second requirement is that all non-key attributes be dependent on the entity's entire key: every attribute must depend on the whole key, not just part of it. An entity that fails either requirement violates 2NF.

To determine whether an entity violates 2NF, check whether any attribute depends on only part of the Primary Key. 2NF applies only to entities that have a compound Primary Key; a violation occurs when a data element depends on only a portion of that key. The running example contains no 2NF violation, so Figure 2.8 uses a separate, simple example to illustrate the point. In Figure 2.8, the Square Footage attribute is partially dependent: it depends only on the Warehouse ID and not on the Region ID.

Figure 2.8. The RegionWarehouse entity violating a restriction of 2NF.


To resolve a 2NF violation, create a new entity for the partially dependent attributes and move them into it, then create a key that relates the new entity to the original one. 2NF keeps the model flexible and helps prevent redundant data.
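The RegionWarehouse fix can be sketched as follows. This is again an illustrative SQLite sketch under assumed column names, not the book's actual schema:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- 2NF fix: SquareFootage depends only on WarehouseID, so it moves
    -- into its own Warehouse entity keyed by WarehouseID alone.
    CREATE TABLE Warehouse (
        WarehouseID   INTEGER PRIMARY KEY,
        SquareFootage INTEGER
    );
    -- The original entity keeps only the attributes that depend on the
    -- whole compound key (RegionID, WarehouseID).
    CREATE TABLE RegionWarehouse (
        RegionID    INTEGER,
        WarehouseID INTEGER REFERENCES Warehouse(WarehouseID),
        PRIMARY KEY (RegionID, WarehouseID)
    );
""")
conn.execute("INSERT INTO Warehouse VALUES (7, 50000)")

# The same warehouse can serve two regions without repeating its footage.
conn.executemany("INSERT INTO RegionWarehouse VALUES (?, ?)", [(1, 7), (2, 7)])

footage = conn.execute(
    "SELECT DISTINCT w.SquareFootage FROM RegionWarehouse rw "
    "JOIN Warehouse w ON w.WarehouseID = rw.WarehouseID").fetchall()
print(footage)  # [(50000,)]
```

Before the split, the square footage would have been stored once per region row, and an update to it could leave the copies inconsistent.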

Meeting the second normal form brings the design closer to an efficient model. A final set of adjustments to consider is the third normal form.

Third Normal Form

The third normal form (3NF) shares the common objective of the other normal forms: eliminating redundant data. It is most similar to the second normal form, because it accomplishes this by cutting down dependencies among non-key attributes. 3NF states that all non-Primary Key attributes must depend directly on the key; you cannot have any indirect, or transitive, dependencies, as described in the following paragraphs. Meeting 3NF gives the entity layout the greatest consistency after normalization.

As stated previously, the initial goal of a database design is to reach the third normal form and thereby remove data redundancy. 3NF achieves this by eliminating interdependencies between non-key attributes.

To identify a 3NF violation, look at each attribute and ask whether it depends on the availability of any other non-key attribute.

For example, examine the entity in Figure 2.9.

Figure 2.9. The Trainee entity violating a restriction of 3NF.


The Trainee entity violates one of the restrictions of 3NF because it has three interdependent non-key attributes: EnrolledCourseID, EnrolledCourse, and DateEnrolled. These are referred to as interdependent attributes because they depend on the attribute EnrolledCourseID rather than on the Trainee key. To resolve this 3NF violation, move the course-related attributes into the Course table and eliminate the extra, unnecessary columns from Trainee. This removes the redundant data from the entity (see Figure 2.10).

Figure 2.10. The Trainee entity is no longer violating 3NF.

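The Trainee fix can be sketched by pulling the course name out into a Course entity so it is stored once per course rather than once per trainee. This is an illustrative SQLite sketch of that idea; the exact columns the book moves (it also relocates the enrollment date) differ slightly, and these names are assumptions:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- 3NF fix: the course name depends on EnrolledCourseID, not on the
    -- Trainee key, so it moves into a Course entity of its own.
    CREATE TABLE Course (
        CourseID   INTEGER PRIMARY KEY,
        CourseName TEXT
    );
    CREATE TABLE Trainee (
        TraineeID        INTEGER PRIMARY KEY,
        Name             TEXT,
        EnrolledCourseID INTEGER REFERENCES Course(CourseID)
    );
""")
conn.execute("INSERT INTO Course VALUES (101, 'SQL Server Design')")

# Two trainees share the course; its name is stored exactly once.
conn.executemany("INSERT INTO Trainee VALUES (?, ?, ?)",
                 [(1, 'Ann', 101), (2, 'Bob', 101)])

names = [r[0] for r in conn.execute(
    "SELECT c.CourseName FROM Trainee t "
    "JOIN Course c ON c.CourseID = t.EnrolledCourseID "
    "WHERE t.Name = 'Bob'")]
print(names)  # ['SQL Server Design']
```

Renaming the course is now a single-row update in Course, instead of an update to every enrolled trainee's row.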

Normalizing a design provides efficient use of storage, but in many situations it affects performance. Planned redundancy, or denormalization, is often brought into a design to provide better performance or to clarify data.

When to Denormalize

Denormalization is the planned placement of redundant data to minimize table joins, reduce network traffic, eliminate repeated calculations, and support application-specific procedures. A database used primarily for decision support (as opposed to update-intensive transaction processing) may not suffer from redundant updates and may be more understandable and efficient for queries if the design is not fully normalized. In data warehousing, the results of calculations are often stored with the data so that the calculation does not have to be repeated every time the data is read. Many reporting systems also denormalize data to produce results specific to the application.

Sometimes a fully normalized data model simply won't perform well in the environment you place it in, and you have to denormalize the database. A normalized database requires more join queries to gather information from multiple entities (because normalization divides entities into smaller ones). As a result, CPU usage can increase dramatically and cause an application to slow or freeze. In situations like this, denormalization is appropriate.
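The "store the calculation with the data" form of denormalization mentioned above can be sketched as follows. The order-line scenario and column names are hypothetical, and the sketch again uses SQLite rather than SQL Server:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE OrderItem (
        OrderID   INTEGER,
        Quantity  INTEGER,
        UnitPrice REAL,
        -- Planned redundancy: LineTotal could be recomputed on every read
        -- as Quantity * UnitPrice, but storing it avoids repeating the
        -- calculation in every report query.
        LineTotal REAL
    );
""")
conn.executemany(
    "INSERT INTO OrderItem VALUES (?, ?, ?, ?)",
    [(1, 3, 2.50, 3 * 2.50),
     (1, 2, 4.00, 2 * 4.00)])

# Reports read the precomputed column instead of recalculating per row.
total = conn.execute(
    "SELECT SUM(LineTotal) FROM OrderItem WHERE OrderID = 1").fetchone()[0]
print(total)  # 15.5
```

The trade-off is the one the chapter describes: the stored total must be maintained on every write, in exchange for cheaper, simpler reads.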

EXAM TIP

Normalization on the Exam The exam will require that you know what to denormalize, as stated in the Microsoft exam sub-objective: "Specify degree of normalization." Normalization is not always the best design for a given database. Normalization creates numerous small, interrelated tables, and processing the data in these tables can incur a great deal of extra work and overhead to combine the related data. That extra processing reduces the performance of the database. In these situations, denormalizing the database slightly to simplify complex processes can improve performance.


The normalization and denormalization processes begin to put a high-performance database system together in a logical fashion. Any seasoned database professional knows, however, that performance isn't the only concern. Any database system must also minimize the administrative effort needed to keep the database functional and accurate. The aim here is integrity.



MCSE Training Guide (70-229): Designing and Implementing Databases with Microsoft® SQL Server™ 2000 Enterprise Edition (2003, 228 pages)