Normalization Rules | Beginning Transact-SQL With SQL Server 2000 and 2005

Because this is not a book about database design, I will not engage in a lengthy discussion on the background behind these rules. Volumes have been written on these subjects. On the surface, a short discussion on database design is an important prerequisite to using the Transact-SQL language. The problem with this is that it's nearly impossible to engage in a short discussion on a topic that is so conceptual and subject to individual style and technique. Like so many "simple" concepts in this industry, this one can be debated almost endlessly. Having written and rewritten this section a few times now, I have decided not to walk through an example and align this with the true rules of normal form, as so many books on this subject do. Rather, I'll briefly present the definitions of each rule and then walk you through an example of distilling an unnormalized database into a practical, normalized form without the weighty discussion of the rules.

Unless you have a taste for mathematical theory, you may not even be interested in the gory details of normalized database design. Throughout this book, I discuss query techniques for normalized and de-normalized data. It would be convenient to say that when a person designs any database, he should do so according to certain rules and patterns. In fact, a number of people do prescribe one single approach regardless of the system they intend to design. Everyone wants to be normal, right? Well, maybe not. Perhaps it will suffice to say that most folks want their data to be normal. But, what does this mean in terms of database design? Are different values stored in one table or should they be stored in multiple tables with some kind of association between them? If the latter approach is taken, how are relationships between these tables devised? This is the subject of a number of books on relational database design. If you are new to this subject and find yourself in the position of a database designer, I would recommend that you pick up a book or research this topic to meet your needs. This subject is discussed in greater detail in Rob Viera's books on SQL Server programming, mentioned at the beginning of the previous chapter. I'll discuss some of the fundamentals here but this is a complex topic that goes beyond the scope of the SQL language.

In the early 1970s, a small group of mathematicians at IBM proposed a set of standards for designing relational data systems. In 1970, Dr. Edger (E. F.) Codd wrote a paper entitled "A Relational Model of Data for Large Shared Data Banks" for the Association of Computing Machinery. He later published 12 principles for relational database design in 1974. These principles described the low-level mechanics of relational database systems as well as the higher-level rules of entity-relation design for a specific database. Dr. Codd teamed with others who also wrote papers on these subjects including Chris (CJ) Date and Raymond F. Boyce. Boyce and Codd are now credited as the authors of relational database design. Codd's original 12 principles of design involved using set calculus and algebraic expressions to access and describe data. One of the goals of this effort was to reduce data redundancy and minimize storage space requirements. Something to consider is that, at the time, data was stored on magnetic tape, paper punch cards, and, eventually, disks ranging from 5 to 20 megabytes in capacity. As the low-level requirements were satisfied by file system and database products, these 12 rules were distilled into the five rules of normal form taught in college classes today.

In short, the rules of normal form, or principles of relational database design, are aimed at the following objectives:

Present data to the relational engine that is set accessible
Label and identify unique records and columns within a table
Promote the smallest necessary result set for data retrieval
Minimize storage space requirements by reducing redundant values in the same table and in multiple tables
Describe standards for relating records in one table to those in another table
Create stability and efficiency in the use of the data system while creating flexibility in its structure

To apply these principles, tables are created with the fewest number of columns (or fields) to define a single entity. For example, if your objective is to keep track of customers who have ordered products, you will store only the customer information in a single table. The order and product information would be stored their own respective tables.

The idea behind even this lowest form of normalizations is to allow straightforward management of the business rules and the queries that implement these rules against data structures that are flexible to accommodate these changes.

The real purpose of first normal form is to standardize the shape of the entity (relation) — to form a two-dimensional grid that is easily accessed and managed using set-based functions in the data engine.

It's really quite difficult to take a table and apply just one rule. One of the tenets of all the rules of normal form is that each rule in succession must conform to its predecessor. In other words, a design that conforms to second normal form must also conform to first normal form. Also, to effectively apply one, you may also be applying a subsequent rule. Although each of these rules describes a distinct principle, they are interrelated. This means that generally speaking, normalization, up to a certain level, is kind of a package deal.

First Normal Form — 1NF

The first rule of normal form states that an entity shouldn't contain duplicate types of attributes. This means that a table shouldn't contain more than one column that represents the same type of non-distinct value.

To convert flat data to First Normal Form, additional tables are created. Duplicate columns are eliminated and the corresponding values are placed into unique rows of a second table. This rule is applied to reduce redundancy along the horizontal axis (columns).

Second Normal Form — 2NF

This rule states that non-key fields may not depend on a portion of the primary key. These fields are placed into a separate table from those that depend on the key value.

To meet Second Normal Form, you must satisfy First Normal Form and decompose attributes that have partial dependencies to the key attribute.

Without a composite key or by correcting a partial dependency by constructing a new entity with its Reference Key, you arrive at Second Normal Form. Then move to the transitive dependencies of Third Normal Form.

Third Normal Form — 3NF

The first rule states that rows are assigned a key value for identification. This rule takes this principle one step further by stating that the uniqueness of any rule depends entirely upon the primary key. My friend Rick, who teaches and writes books on this topic, uses a phrase to help remember this rule: "The uniqueness of a row depends on the key, the whole key, and nothing but the key; so help me Dr. Codd."

In some cases it makes sense for the primary key to be a combination of columns. Redundant values along multiple rows should be eliminated by placing these values into a separate table as well. Compared with First Normal Form, this rule attempts to reduce duplication along the vertical axis (rows).

Fourth and Fifth Normal Form

Boyce and Codd built their standards — Boyce-Codd Normal Form (BCNF) — on earlier ideals that recognized only those discussed thus far. You must satisfy First and Second and Third Normal Form before moving on to satisfy subsequent forms. In fact, it is the process of the First, Second, and Third Normal Forms that drives the need for BCNF. Through the decomposition of attribute functional dependencies, many-to-many relationships develop between some entities. This is sometimes inaccurately left in a state where each entity involved has duplicate candidate keys in one or more of the entities.

Attributes upon which non-key attributes depend are candidate keys. BCNF deals with the dependencies within candidate keys. The short version of what could be a lengthy and complex discussion of mathematical theory is that fourth and fifth normal forms are used to resolve many-to-many relationships. On the surface this seems to be a simple matter — and for our purposes, we'll keep it that way. Customers can buy many different products and products can be purchased by multiple customers. Concerning ourselves with only customers and products, these two entities have a many-to-many relationship. The fact is that you cannot perform many-to-many joins with just two tables. This requires another table, sometimes called a bridge or intermediary table, to make the association. The bridge table typically doesn't need its own specific key value because the combination of primary key values from the two outer tables will always be unique (keep in mind that this is not a requirement of this type of association but is typically the case). Therefore, the bridge table conforms to third normal form by defining its primary key as the composite of the two foreign keys, each corresponding to the primary keys of the two outer, related tables. Fifth normal form is a unique variation of this rule, which factors in additional business logic, disallowing certain key combinations. For our purposes, this should suffice.

Other Normal Forms

A number of disciplines and conceptual approaches to data modeling and database design exist. Among others, these include Unified Modeling Language (UML) and Object Role Modeling (ORM). These include additional forms that help to manage special anomalies that might arise to describe constraints within and between groups or populations of information. The forms that qualify these descriptions usually move into user-defined procedures added to the database and not the declarative structures that have been addressed so far.

Transforming Information into Data

In the real world, the concepts and information you deal with exist in relationships and hierarchies. Just look around you and observe the way things are grouped and associated. As I write this, I'm sitting on a ferry boat. The ferry contains several cars, and cars have passengers. If I needed to store this information in a relational database, I would likely define separate tables to represent each of the entities I just mentioned. These are simple concepts but when applied at all possible levels, some of the associations may take a little more thought and cautious analysis. At times the business rules of data are not quite so straightforward. Often, the best way to discover these rules (and the limits of these rules) is to ask a series of "what if" questions. Given the ferry/car/passenger scenario, what if a passenger came onto the ferry in one car and left in another? What if she walked on and then drove off? Is this important? Do we care? These questions are not arbitrarily answered by a database designer but through the consensus of designers and system stakeholders.

At some point you will need to decide upon the boundaries of your business rules. This is where you decide that a particular exception or condition is beyond the scope of your database system. Don't treat this matter lightly. It is imperative to define specific criteria while also moving quickly past trivial decision points so that you can move forward and stay on schedule. This is the great balancing act of project management.

When you attempt to take this information and store it in a flat, two-dimensional table as rows and columns, you can't help but create redundant or repeating values. Take a look at a simple example using data from the Northwind sample database. The table in Figure 2-1 shows employee records. Each employee has a name and may have two addresses and two phone numbers. Most employees also have a supervisor. This is the way this data might appear in a simple spreadsheet.

image from book
Figure 2-1:

The <NULL> text is SQL Server's way of telling you that there is nothing in that field. Each employee has a name, title, one or two residence locations, a home and work phone number, and a supervisor. This data is easy to read in this form but it may be difficult to use in a proper database system.