Data Normalization


Normalization is the process of dividing entities into smaller, related entities to make the most efficient use of data storage. Denormalization, by contrast, is planned redundancy introduced afterward to improve response time and resource usage. Designing the logical structure of a database means combining a degree of normalization with selective denormalization to produce optimum storage efficiency while still providing acceptable performance and resource utilization. In most instances, data is fully normalized before any denormalization is considered; in fact, denormalization usually is not approached until the physical model is in development.

Normalization is usually applied in practice from the outset of data modeling. After you're comfortable with the modeling concepts, you will find yourself implementing normalized structures as a matter of good practice. As with other good development habits, you must first work through the concepts at a granular level before they become second nature.

Normalization in general governs how you structure and store data. It is a design process intended to eliminate duplicate data: in a normalized entity, redundant data is removed and the entity is reduced to its most basic form, which usually leads to a more involved design with a larger number of smaller entities. Taken together, normalization and planned denormalization balance simplicity of design against performance; the denormalization step involves the planned addition of redundant data.

Although both normalization and denormalization are valid, necessary processes, the two pursue opposite goals. Neither by itself achieves maximum performance and simplicity; instead, the two together strive for a balance between performance (denormalization) and simplicity (normalization). At its core, normalization means no duplicate data.

Data Normalization with Normal Forms

In 1970, Dr. E. F. Codd introduced the relational model, and from his work came the rules a relational database should adhere to, known as normal forms. Today the most important of these are known as the first, second, and third normal forms. (More than three normal forms exist, but the first three are the only ones widely used.) The goal of the initial database design is to bring the database to the third normal form.

Using normal forms provides these basic advantages:

  • Elimination of redundant data, which contributes to data integrity

  • Indexable key columns for faster sorting and searching

  • Smaller entities, which reduce table locking and data contention

  • More effective query optimization

Although in most cases a data model is taken only to the third normal form, it is worth noting that there are actually five forms. Because the fourth and fifth normal forms are not required for the exam, we mention them here only for completeness.


The rules provided by these normal forms are discussed in the following sections.

First Normal Form

The first normal form, or 1NF, defines the foundation for the relational database system. An attribute that holds a single value and is represented only once, with no repetition, is known as an atomic value. Attributes should be atomic, meaning they cannot (or should not) be broken down further, based on the business need for the attribute. The first normal form requires that all attributes be atomic and nonrepeating.

In relational database terms, an attribute of an entity shouldn't contain more than one definable piece of data or any repeating groups. 1NF states that all attributes must be defined in their most singular form: already fully decomposed, so they are not further divisible.

For example, a single attribute called customer name, which could still be divided into first name and last name, breaks the first normal form rule. The first name is a piece of data independent of the last name, so each should be a separate attribute; a combined full name attribute should never be used.
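As a minimal sketch of this rule, assuming a hypothetical customer entity (the table and column names here are illustrative, not from the exam), the unnormalized and 1NF versions might look like this:

    -- Violates 1NF: customer_name packs two facts into one column,
    -- and phone1/phone2 form a repeating group.
    CREATE TABLE Customer_Unnormalized (
        customer_id   INT PRIMARY KEY,
        customer_name VARCHAR(100),
        phone1        VARCHAR(20),
        phone2        VARCHAR(20)
    );

    -- 1NF: atomic attributes; the repeating group becomes its own entity.
    CREATE TABLE Customer (
        customer_id INT PRIMARY KEY,
        first_name  VARCHAR(50),
        last_name   VARCHAR(50)
    );

    CREATE TABLE CustomerPhone (
        customer_id INT REFERENCES Customer (customer_id),
        phone       VARCHAR(20),
        PRIMARY KEY (customer_id, phone)
    );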

Second Normal Form

The purpose behind the second normal form is to ensure that each attribute belongs in the entity: every non-key attribute must depend on the entire primary key, not just a portion of the key. For example, if the primary key of an orders entity contained two fields, customer id and product id, an attribute such as product description wouldn't belong, because it depends only on product id and has no connection to customer id. Including it would break the second normal form rule.
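A sketch of that same example follows; the column types and the quantity attribute are assumptions added for illustration:

    -- Violates 2NF: product_description depends only on product_id,
    -- which is just a portion of the composite primary key.
    CREATE TABLE OrderLine_Unnormalized (
        customer_id         INT,
        product_id          INT,
        quantity            INT,
        product_description VARCHAR(200),
        PRIMARY KEY (customer_id, product_id)
    );

    -- 2NF: the partially dependent attribute moves to a product entity.
    CREATE TABLE Product (
        product_id          INT PRIMARY KEY,
        product_description VARCHAR(200)
    );

    CREATE TABLE OrderLine (
        customer_id INT,
        product_id  INT REFERENCES Product (product_id),
        quantity    INT,
        PRIMARY KEY (customer_id, product_id)
    );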

Third Normal Form

The third normal form states that a non-key attribute must not depend on another non-key attribute. The most obvious example of this rule involves address information: the zip code is determined by the area of the address, and the city can in turn be determined from the zip code, so neither depends directly on the key of the address entity. A large corporation or government agency might store zip code information in a separate table, outside the base data, to achieve a perfectly normalized form. Although most table designs leave this situation denormalized, in a pure normal form a separate entity would supply additional address information, such as city and zip code, based on the address.
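One way to sketch this separation (the names and types are assumptions, and real zip-to-city mapping is not always strictly one-to-one):

    -- Violates 3NF: city depends on zip_code, a non-key attribute.
    CREATE TABLE Address_Unnormalized (
        address_id INT PRIMARY KEY,
        street     VARCHAR(100),
        zip_code   CHAR(5),
        city       VARCHAR(50)
    );

    -- 3NF: the transitively dependent attribute moves to its own entity.
    CREATE TABLE ZipCode (
        zip_code CHAR(5) PRIMARY KEY,
        city     VARCHAR(50)
    );

    CREATE TABLE Address (
        address_id INT PRIMARY KEY,
        street     VARCHAR(100),
        zip_code   CHAR(5) REFERENCES ZipCode (zip_code)
    );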

Two other normal forms exist but aren't commonly implemented. It's entirely possible that by adhering to the third normal form, you have already satisfied the fourth and fifth forms.


Fourth and Fifth Normal Forms

The fourth normal form dictates that an entity in third normal form also have no multivalued dependencies. A multivalued dependency arises when two independent multivalued facts are stored in the same entity, forcing every value of one attribute to appear in a row with every value of the other attribute.
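A standard textbook illustration, not drawn from the exam material: an instructor's courses and textbooks are independent multivalued facts, so storing them together forces a row for every combination.

    -- Violates 4NF: course and textbook are independent of each other,
    -- so every course must be paired with every textbook.
    CREATE TABLE InstructorInfo (
        instructor_id INT,
        course        VARCHAR(50),
        textbook      VARCHAR(50),
        PRIMARY KEY (instructor_id, course, textbook)
    );

    -- 4NF: each independent multivalued fact gets its own entity.
    CREATE TABLE InstructorCourse (
        instructor_id INT,
        course        VARCHAR(50),
        PRIMARY KEY (instructor_id, course)
    );

    CREATE TABLE InstructorTextbook (
        instructor_id INT,
        textbook      VARCHAR(50),
        PRIMARY KEY (instructor_id, textbook)
    );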

The fifth normal form is intended to eliminate join dependencies. This is a theoretical consideration that is generally thought to have no practical value. If you disregard these two forms, the design of the database might be less than perfect, but it should suffer no loss of functionality.

Normalizing a database is generally desirable, but it can hamper performance. In many cases a designer has to consider denormalizing a database: planned redundancy, or denormalization, is often brought into the design to provide better performance or to clarify data.
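As a hypothetical example of planned redundancy, a designer might copy a product's unit price into each order line at the time of sale; the duplicate column avoids a join on every read (and, as a side effect, preserves the historical price):

    -- Denormalized: unit_price is duplicated from the product entity
    -- so that common reads need no join against Product.
    CREATE TABLE OrderLineDenormalized (
        customer_id INT,
        product_id  INT,
        quantity    INT,
        unit_price  MONEY,  -- planned redundancy
        PRIMARY KEY (customer_id, product_id)
    );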


