Applying Normalization Rules | Beginning Transact-SQL With SQL Server 2000 and 2005

Using the previous Employees table, look for violations of the first rule of normal form. Is there more than one column containing information about the same type of attributes? Beginning with the numbered Address and CityLine fields, each "location" consists of a column for the address and another column for the city, state, and zip code. Because there are two pairs of these columns, this may be a problem. Each phone number is a single column, designated as either the home or work phone. How would I make a single list of all phone numbers? What happens if I need to record a mobile phone for an employee? I could add a third column to the table. How about a fourth? How about the Title column? The Supervisor column may be viewed as a special case but the fact is that the EmployeeName and Supervisor columns store the same type of values. They both represent employees.

I can move all of these columns into separate tables but how do I keep them associated with the employee? This is accomplished through the use of keys. A key is just a simple value used to associate a record in one table to a record in another table (among other things). To satisfy the first rule of normal form, I'll move these columns to different tables and create key values to wire up the associations. In the following example, I have removed the address and city information and have placed it into a separate table.

I have devised a method to identify each employee with a six-character character key, using part of their last and first names. I chose this method because this was once a very popular method for assigning key values. This allows me to maintain the associations between employees and their addresses. In this first iteration (see Figure 2-2), I use this method to make a point. This is a relatively small database for a small company and I don't have any employees with similar first and last names, so this method ought to work just fine, right? Hold that thought for now.

image from book
Figure 2-2:

I do the same thing with the new Addresses table (see Figure 2-3). Each address record is assigned an EmployeeKey value to link it back to the Employees table.

image from book
Figure 2-3:

I have lost a significant piece of information in doing this. I've flattened the address information so I no longer have one address designated as either primary or secondary for an employee. I'll get to this later. For now, I'm only concerned with adhering to the first rule of normal form. Besides, does the information in the old Address1 and CityLine1 columns imply that this is the employee's primary residence? Did I have a complete understanding of the business rules when I began working with this data? Unfortunately, in most ad-hoc projects, it is more often a case of making things up as we go along.

For the phone numbers I'll do the same thing as before, move the phone number values into their own table and then add the corresponding key value to associate them with the employee record. I'm also going to add a column to designate the type of phone number this represents (see Figure 2-4). I could use this as an argument to do the same thing with the addresses, but I'll hold off for now.

image from book
Figure 2-4:

Now that I have three tables with common column values, do I have a relational database? Although it may be true that this is related data, it's not a fully relational database. The key values only give me ability to locate the related records in other tables, but this does nothing to ensure that my data stays intact. Take a look at what I have done so far (see Figure 2-5). The presence of the same key value in all three of these tables is an implied relationship. There is currently no mechanism in place for the database to prevent users from making silly mistakes (such as deleting an employee record without also removing the corresponding address and phone information, for example). This would create a condition, common in early database systems, called orphaned records.

image from book
Figure 2-5:

Before continuing, I must correct a horrible indiscretion. I told you that this business of using parts of different field values (such as the first and last name) to form a meaningful unique key was once a common practice. This is because database system designers in the past often had to create a system where users had to provide a special number to look up a record. To make this easier, they would come up with some kind of intelligent, unique value. It might include characters from a customer or patient's name, or perhaps a series of numbers with digits in specific positions representing an account type or region. For example, when was the last time you called the bank or the telephone company and was asked for your account number? This happens to me all the time. It amazes me that the companies in possession of the most sophisticated, state-of-the-art technology on the planet require me to memorize my account number. How about looking up my account using my name, address, phone number, mother's maiden name, or any of the other information they required when I set up my account?

Using this simple name-based key may have seemed like the right thing to do at the time but the fact is that it will likely get me into a whole lot of trouble down the road. I worked for a company that used this approach in a small, commercial application. The program even appended numbers to the end of the keys so there could be nearly a hundred unique key values for a given last name/first name combination. What they didn't anticipate was that their product would eventually become the most popular medical billing software in the country and would be used in business environments they couldn't possibly have imagined. Eventually this got them into trouble and they had to completely re-architect the application to get around this limitation. One customer, a medical office in the Chicago area, had so many patients with the same or similar names, that they actual ran out of key values.

Thinking Ahead

I'll resolve the EmployeeKey issue by changing it to an auto-sequencing integer called an identity (see Figure 2-6). This is known as a surrogate key, which simply means that key values are absolutely meaningless as far as the user is concerned. The database assigns numbers that will always be unique within this column. The purpose of the key is to uniquely identify each row, not to give employees or users something to memorize.

image from book
Figure 2-6:

The next step is to designate the EmployeeKey in the Employees table as a primary key and the related keys as foreign keys. The foreign key constraints cause the database engine to validate any action that could cause these relationships to be violated. For example, the database would not allow an employee record to be deleted if there were existing, related address or phone records. Related tables are often documented using an entity-relation diagram (ERD). The diagram in Figure 2-7 shows the columns and relationships between these tables.

image from book
Figure 2-7:

There is still work to do. The SupervisorName is also a violation of first normal form because it duplicates some employee names. This is a special case, however, because these names already exist in the Employees table. This can be resolved using a self-join, or relationship on the same table (see Figure 2-8).

image from book
Figure 2-8:

The supervisor designation within the Employees table is now just an integer value referring to another employee record.

The Title column is also in violation of first normal form and could be moved into its own table, as well. A title isn't uniquely owned by an employee, but each employee only has one title. To discern this relationship, you must look at it from both directions:

One employee has one title
One title can have multiple employees

This is a one-to-many relationship from the title to the employee. Resolving this is a simple matter of placing one instance of each title value in a separate table, identified by a unique primary key. A similar column is added to the Employees table as a non-unique foreign key (see Figure 2-9).

image from book
Figure 2-9:

You should see a pattern developing. This is an iterative process that will typically send you in one of two directions in each cycle. You will either continue to move these values into related tables with related keys or you will find discrepancies between your business rules and the data, and then head back to the drawing board to correct the data and table structure.

Multiple Associations

I know that a title can be associated with more than one employee, but what happens if an address is shared by more than one individual? This is a problem in the current database model. I can't use one primary key value and have multiple associations going in both directions. The only way I can do this is to create a primary key that includes two separate values: one for the employee key and one for the address key. However, I can't do this using either of these two tables. If I add the EmployeeKey to the Addresses table, I'm back to the original problem, where I would have duplicate address rows. Because a record in the Addresses table will no longer be directly tied to a record in the Employees table, I must remove the EmployeeKey and create a new primary key for this table and remove the duplicate values. Now the Addresses table conforms to first normal form and third normal form.

Many-to-many relationships are solved using a separate table, often called a join or bridge table. Often, this table contains no user-readable values, only keys to bridge one table to another. However, you may recall that we have a missing bit of information. Remember when I moved the address information from the Address1/CityLine1 columns and Address2/CityLine2 columns into the Address table? I said that we had no way to trace these back to their roots and recall which location was the employee's primary residence? I can now resolve this within the bridge table by adding an additional column (see Figure 2-10).

image from book
Figure 2-10:

The new AddressType column is used to indicate the type of residence. This allows employees to share addresses while eliminating redundant address records. Does the AddressType column violate first normal form? Technically, yes. This could be an opportunity to optimize the database even more by creating yet another table for these values. It looks like there would only be three address type records related to the nine employees (see Figure 2-11).

image from book Figure 2-11:

A simple query is used to obtain detail information about employees at a common address:

 SELECT   EmployeeName, AddressLine, CityLine, AddressType FROM     Employees  INNER JOIN EmployeeAddresses ON Employees.EmployeeKey = EmployeeAddresses.EmployeeKey INNER JOIN Addresses ON EmployeeAddresses.AddressKey = Addresses.AddressKey WHERE    Addresses.AddressKey = 10

It looks like the Vice President of Sales and the Inside Sales Coordinator share a residence only on weekends (see Figure 2-12).

Figure 2-12:

Multi-valued Columns

The last issue with which I must contend is that of having multiple values stored in a single column. There are quite a few examples in these tables. For example, the EmployeeName column in the Employees table contains both the first and last name, the AddressLine column in the Addresses table includes all parts of a street address, and the CityLine contains the city name, U.S. state, and zip code/postal code. Before I just willy-nilly start parsing all the values into separate columns, it's important for me to consider how this data will be used and the advantages and disadvantages of breaking it into pieces. Here are some sample questions that could help to define these business requirements:

Will the employee first name and last name ever be used separately?
Will I ever need to sort on one single value (such as last name)?
Does every employee have a first name and last name? Do they only have a first name and last name (middle names/initials, hyphenated names, and so on)?
Is there any value or need in separating parts of the address line (will I need a list of streets, and so on)?
If I separate parts of the AddressLine or CityLine into separate columns, do I need to accommodate international addresses?

Apparently I do need to consider addresses in at least two locales because I have locations in the UK and the U.S., so I will need to think beyond only one style of address. So, suppose that I have consulted my sponsoring customer and have learned that it would be useful to store separate first names and last names and we don't care about middle names or initials. We also don't plan to accommodate anyone without a first and last name. We have no need to break up the address line. This practice is highly uncommon outside of specialized systems and would be very cumbersome to maintain. We would benefit from storing the city, postal code or zip code, and state or province. It would also be useful to store the country, which is currently not included. Storing geographic information can be tricky due to the lack of consistency across international regions. This may require that you devise your own synonyms for different regional divisions (such as city, township, municipality, county, state, province, and country). In distributing these values into separate columns, you may find even more redundancies. Should these be further normalized and placed into separate tables? Does this ever end? I'll site one example where the city, state, and zip code is normalized. I maintain a system that stores U.S. addresses and stores only the zip code on the individual's record. A separate table contains related city and state information obtained from the U.S. Postal Service.

I won't bore you will the mechanics of separating all of these fields. The process is quite straightforward and very similar to what's already been done. Figure 2-13 shows the completed data model, based on the original flat table.

image from book
Figure 2-13:

To Normalize or to De-normalize?

Depending on how a database is to be used (generally, it will be used for data input or for reporting), it may or may not be appropriate to apply all of the rules just presented. The fact of the matter is that fully normalized databases require some gnarly, complex queries to support reporting and business analysis requirements. To fully comply with all of the rules of normal form often adds more overhead to the application. Without going into detail, here's something to think about: If you are designing a new database system to support a typical business process, you will usually want to follow these rules completely and normalize all of your data structures. After a time, when you have a large volume of data to extract and analyze through reports and business intelligence tools, you may find it appropriate to create a second database system for this purpose. In this database, you will strategically break the rules of normal form, creating redundant values in fewer, larger tables. Here's the catch: Only after you fully understand the rules of normal form will you likely know when and where you should break them.

Question Authority

You should ask yourself an important question as you encounter each opportunity to normalize: "Why?" Know why you should apply the rules and what the benefits and cost are. One of the challenges of applying normalization rules is to know just how far to go and to what degree it makes sense to apply them. At times it just makes sense to break some of the rules. There are good arguments to support both sides of this issue and without a complete understanding of business requirements I would be hard pressed to make a general statement about how data elements (such as phone numbers, titles, or addresses) should always be managed. In short, you need to understand the business requirements for your application and then apply the appropriate level of database normalization to reach that goal. If ever in doubt, it's usually best to err on the side of keeping the rules.