7.2 Normalization

E. F. Codd, then a researcher for IBM, first presented the concept of database normalization in several important papers written in the 1970s. The aim of normalization remains the same today: to eradicate certain undesirable characteristics from a database design. Specifically, the goal is to remove certain kinds of data redundancy and therefore avoid update anomalies. Update anomalies are difficulties with the insert, update, and delete operations on a database due to the data structure. Normalization additionally aids in the production of a design that is a high-quality representation of the real world; thus, normalization increases the clarity of the data model.

7.2.1 First Normal Form

The general concept of normalization includes several "normal forms." An entity is said to be in the first normal form (1NF) when all attributes are single-valued. To apply 1NF to an entity, we have to verify that each attribute in the entity has a single value for each instance of the entity. If any attribute has repeating values, the entity is not in 1NF.

A quick look back at our database reveals that we have repeating values in the Songs attribute, so the CD entity is clearly not in 1NF. An entity with repeating values indicates that we have missed at least one other entity. One way to discover missing entities is to look at each attribute and ask the question "What thing does this describe?"

What does Songs describe? It lists the songs on the CD. So a song is another "thing" that we capture data about and is probably an entity. We will add it to our diagram and give it a Song Name attribute. To complete the Song entity, we need to ask if there is more about a Song that we would like to capture. Earlier, we identified song length as something we might want to capture, so let's add it. Figure 7-3 shows the new data model.

Figure 7-3. A data model with CD and Song entities

Now that Song Name and Song Length are attributes in a Song entity, we have a data model with two entities in 1NF. None of their attributes contain multiple values. Unfortunately, we have not shown any way of relating a CD to a Song.
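As a rough illustration of this step, the following Python sketch splits a record with a repeating Songs attribute into separate CD and Song records. The class names, field names, and sample values are assumptions made up for this example and are not taken from the figures. As in Figure 7-3, nothing yet ties a Song back to its CD.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

# Hypothetical un-normalized record: the "songs" attribute holds several values.
unnormalized_cd = {
    "cd_title": "Example Album",
    "songs": ["First Track", "Second Track", "Third Track"],
}


@dataclass
class CD:
    cd_title: str


@dataclass
class Song:
    song_name: str
    song_length: Optional[int] = None  # seconds; unknown in the sample data


def to_first_normal_form(record: dict) -> Tuple[CD, List[Song]]:
    """Give every repeated song value its own single-valued Song instance."""
    cd = CD(cd_title=record["cd_title"])
    songs = [Song(song_name=name) for name in record["songs"]]
    return cd, songs


cd, songs = to_first_normal_form(unnormalized_cd)
```

Every attribute of `cd` and of each element of `songs` now holds a single value, which is all that 1NF asks for.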

7.2.2 The Unique Identifier

Before discussing relationships, we need to impose one more rule on entities. Each entity must have a unique identifier, which we'll call the ID. An ID is an attribute of an entity that meets the following rules:

It is unique across all instances of the entity.
It has a non-NULL value for each instance of the entity for the entire lifetime of the instance.
It has a value that never changes for the entire lifetime of the instance.

Identifier selection is critical because the identifier is also used to model relationships. If, after you've selected an ID for an entity, you find that it doesn't meet one of the above rules, this could affect your entire data model.

Novice data modelers often make the mistake of choosing attributes that should not be identifiers and making them identifiers. If, for example, you have a Person entity, it might be tempting to use the Name attribute as the identifier because all people have a name and that name never changes. But names do change. What if the person marries? What if the person decides to legally change his name? What if you misspelled the name when you first entered it? If any of these events causes a name change, the third rule of identifiers is violated. Worse, is a name really ever unique? Unless you can guarantee with 100% certainty that the Name is unique, you will be violating the first rule. Finally, you do know that all Person instances have non-NULL names. But are you certain that you will always know the name of a Person when you first enter information about that person in the database? Depending on your application processes, you may not know the name of a Person when a record is first created. There are many problems with taking a non-identifying attribute and making it an identifier.

The solution to the identifier problem is to invent an identifying attribute that has no other meaning except to serve as an identifying attribute. Because this attribute is invented and completely unrelated to the entity, we have full control over it and can guarantee that it meets the rules of unique identifiers. Figure 7-4 adds invented ID attributes to each of our entities. A unique identifier is diagrammed as an underlined attribute.

Figure 7-4. The CD and Song entities with their unique identifiers
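To make the Person discussion concrete, here is a minimal sketch using Python's built-in sqlite3 module. The table and column names (`person`, `person_id`, `name`) and the sample names are assumptions for illustration only. The invented ID stays unique, non-NULL, and unchanged even though names are duplicated, missing, or changed.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE person (
        person_id INTEGER PRIMARY KEY AUTOINCREMENT, -- invented ID, no other meaning
        name      TEXT                               -- may repeat, change, or be unknown
    )
    """
)

# Two people share a name, and a third is entered before the name is known.
conn.execute("INSERT INTO person (name) VALUES (?)", ("Pat Smith",))
conn.execute("INSERT INTO person (name) VALUES (?)", ("Pat Smith",))
conn.execute("INSERT INTO person (name) VALUES (?)", (None,))

# A name change touches only the descriptive attribute, never the ID.
conn.execute("UPDATE person SET name = ? WHERE person_id = ?", ("Pat Jones", 1))

rows = conn.execute(
    "SELECT person_id, name FROM person ORDER BY person_id"
).fetchall()
```

Had Name been the identifier, each of these three operations would have broken one of the rules listed above.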

7.2.3 Relationships

The identifiers in our entities enable us to model their relationships. A relationship describes a binary association between two entities. A relationship may also exist between an entity and itself. Such a relationship is called a recursive relationship. Each entity within a relationship describes and is described by the other entity. Each side of the relationship has two components: a name and a degree.

Each side of the relationship has a name that describes the relationship. Take two hypothetical entities, an Employee and a Department. One possible relationship between the two is that an Employee is "assigned to" a Department. That Department is "responsible for" an Employee. The Employee side of the relationship is thus named "assigned to" and the Department side "responsible for."

Degree, also referred to as cardinality, states how many instances of the describing entity must describe one instance of the described entity. Degree is expressed using two different values: "one and only one" (1) and "one or many" (M). An employee is assigned to one department at a time, so Employee has a one-and-only-one relationship with Department. In the other direction, a department is responsible for many employees. We therefore say Department has a "one-or-many" relationship with Employee. Because "one or many" includes the case of one, a Department could have exactly one Employee.

It is sometimes helpful to express a relationship verbally. One way of doing this is to plug the various components of the relationship into this formula:

entity1 has [one and only one | one or many] entity2

Note that this formula must be applied in both directions to fully describe the relationship between two entities. Using this formula, Employee and Department would be expressed like so:

Each Employee must be assigned to one and only one Department.
Each Department may be responsible for one or many Employees.
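The verbal formula is mechanical enough to sketch in a few lines of Python. The function name `describe_side`, the `DEGREES` table, and the naive "add an s" pluralization are all assumptions of this sketch, not part of any modeling tool.

```python
# Map the two degree values used in this chapter to their verbal form.
DEGREES = {"1": "one and only one", "M": "one or many"}


def describe_side(entity1: str, relationship_name: str, degree: str, entity2: str) -> str:
    """Render one direction of a relationship as a sentence."""
    # Naive pluralization, good enough for names like Employee or Song.
    target = entity2 + "s" if degree == "M" else entity2
    return f"Each {entity1} {relationship_name} {DEGREES[degree]} {target}."


# The formula must be applied once per direction.
employee_side = describe_side("Employee", "must be assigned to", "1", "Department")
department_side = describe_side("Department", "may be responsible for", "M", "Employee")
```

Calling the function twice, once from each side, mirrors the requirement that a relationship is fully described only when both directions are stated.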

We can use this formula to describe the entities in our data model. A CD contains one or many Songs, and a Song is contained on one and only one CD. In reality, a Song can be contained on many CDs, but we ignore this for the purposes of this example. In our data model, this relationship can be shown by drawing a line between the two entities. Degree is expressed with a straight line for "one and only one" relationships or a "crow's foot" for "one or many" relationships. Figure 7-5 illustrates these conventions.

Figure 7-5. Anatomy of a relationship

How does this apply to the relationship between Song and CD? Figure 7-6 shows the data model with the relationships in place.

Figure 7-6. CD/Song relationship

With these relationships firmly in place, we can go back to the normalization process and improve upon the design. So far, we have normalized repeating song values into a new entity, Song, and modeled the relationship between it and the CD entity.
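One common way such a one-to-many relationship ends up in a physical schema is as a foreign key on the "many" side. The sketch below uses Python's built-in sqlite3 module; the table names, column names, and sample rows are assumptions for illustration. Note that a foreign key only enforces the Song side ("one and only one CD"); it does not by itself force a CD to have at least one Song.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite checks foreign keys only when enabled
conn.executescript(
    """
    CREATE TABLE cd (
        cd_id    INTEGER PRIMARY KEY,
        cd_title TEXT
    );
    CREATE TABLE song (
        song_id     INTEGER PRIMARY KEY,
        song_name   TEXT,
        song_length INTEGER,                              -- seconds
        cd_id       INTEGER NOT NULL REFERENCES cd(cd_id) -- one and only one CD
    );
    """
)

conn.execute("INSERT INTO cd (cd_id, cd_title) VALUES (1, 'Example Album')")
conn.executemany(
    "INSERT INTO song (song_name, song_length, cd_id) VALUES (?, ?, ?)",
    [("First Track", 215, 1), ("Second Track", 187, 1)],
)

# A Song that points at a CD that does not exist is rejected.
try:
    conn.execute("INSERT INTO song (song_name, cd_id) VALUES ('Orphan', 99)")
    orphan_rejected = False
except sqlite3.IntegrityError:
    orphan_rejected = True

song_count = conn.execute(
    "SELECT COUNT(*) FROM song WHERE cd_id = 1"
).fetchone()[0]
```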

7.2.4 Second Normal Form

An entity is said to be in the second normal form (2NF) if it is already in 1NF and all non-identifying attributes are dependent on the entity's entire unique identifier. If any attribute is not dependent entirely on the entity's unique identifier, that attribute has been misplaced and must be removed. For example, "Herbie Hancock" is the band name for two different CDs, and therefore Band Name is not entirely dependent on CD ID. To normalize a misplaced attribute, either find the entity where the attribute belongs or create an additional entity for the attribute.

In our example, we have a sign that Band Name should be part of a new entity with some relationship to CD. As before, we resolve this problem by asking the question: "What does a band name describe?" It describes a band, or more generally, an artist. Artist is yet another thing we are capturing data about and is therefore probably an entity. We will add it to our diagram with Band Name as an attribute. Since not all artists are bands, we will rename the attribute Artist Name. Figure 7-7 shows the new state of the model.

Figure 7-7. The data model with the new Artist entity

Of course, the relationships for the new Artist entity are missing. We know that each Artist has one or many CD rows. Each CD could have one or many Artist rows. We model this in Figure 7-8.

Figure 7-8. The Artist relationships in the data model

We originally had the Band Name attribute in the CD entity. It thus seemed natural to make Artist directly related to CD. But is this really correct? On closer inspection, it would seem that there should be a direct relationship between an Artist and a Song. Each Artist has one or more Song rows. Each Song is performed by one and only one Artist. The true relationship appears in Figure 7-9.

Figure 7-9. The real relationship between Artist and the rest of our data model

Not only does this make more sense than a relationship between Artist and CD, but it also addresses the issue of compilation CDs.

7.2.5 Kinds of Relationships

When modeling a relationship between entities, it is important to determine both directions of the relationship. After both sides of the relationship have been determined, we end up with three main kinds of relationships. If both sides of the relationship have a degree of one and only one, it is called a "one-to-one" or "1-to-1" relationship. As we will find out later, one-to-one relationships are rare. We do not have one in our data model.

If one side has a degree of "one or many," and the other side has a degree of "one and only one," the relationship is a "one-to-many" or "1-to-M" relationship. All the relationships in our current data model are one-to-many relationships. This is to be expected since one-to-many relationships are the most common.

The final kind of relationship is where both sides are "one or many" relationships. These are called "many-to-many" or "M-to-M" relationships. In an earlier version of our data model, the Artist/CD relationship was a many-to-many relationship.

7.2.6 Refining Relationships

As we noted earlier, one-to-one relationships are quite rare. In fact, if you encounter one during your data modeling, you should take a closer look at your design. A one-to-one relationship may imply that two entities are really the same and should be folded into a single entity.

Many-to-many relationships are more common than one-to-one relationships. In these relationships, there is often some data we want to capture about the relationship. For example, take a look at the earlier version of our data model in Figure 7-8 that had the many-to-many relationship between Artist and CD. What data might we want to capture about that relationship? An Artist has a relationship with a CD because an artist has one or more songs on that CD. The data model in Figure 7-9 is actually another representation of this many-to-many relationship.

All many-to-many relationships should be resolved using the following technique:

Create a new entity (sometimes referred to as a junction entity). Name it appropriately. If you cannot think of an appropriate name for the junction entity, name it by combining the names of the two related entities (e.g., ArtistCD). In our data model, Song is a junction entity for the Artist/CD relationship.
Relate the new entity to the two original entities. Each of the original entities should have a one-to-many relationship with the junction entity.
If the new entity does not have an obvious unique identifier, inherit the identifying attributes from the original entities into the junction entity and use them together as the unique identifier for the new entity.

In almost all cases, you will find additional attributes that belong in the new junction entity. In any case, the many-to-many relationship needs to be resolved; otherwise, you will have a problem translating your data model into a physical schema.
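The three steps above can be sketched with Python's built-in sqlite3 module. This is an illustration under assumptions: it uses the fallback name ArtistCD from the first step rather than the Song entity our model actually uses, and every table name, column name, and sample row is made up. The junction entity inherits both identifiers and uses them together as its own unique identifier.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite checks foreign keys only when enabled
conn.executescript(
    """
    CREATE TABLE artist (artist_id INTEGER PRIMARY KEY, artist_name TEXT);
    CREATE TABLE cd     (cd_id     INTEGER PRIMARY KEY, cd_title    TEXT);

    -- Junction entity: one-to-many from artist, one-to-many from cd.
    CREATE TABLE artist_cd (
        artist_id INTEGER NOT NULL REFERENCES artist(artist_id),
        cd_id     INTEGER NOT NULL REFERENCES cd(cd_id),
        PRIMARY KEY (artist_id, cd_id)   -- inherited identifiers used together
    );
    """
)

conn.executemany("INSERT INTO artist VALUES (?, ?)", [(1, "Artist A"), (2, "Artist B")])
conn.executemany("INSERT INTO cd VALUES (?, ?)", [(1, "Album One"), (2, "Album Two")])
conn.executemany("INSERT INTO artist_cd VALUES (?, ?)", [(1, 1), (1, 2), (2, 2)])

# The combined identifier rejects a second copy of the same pairing.
try:
    conn.execute("INSERT INTO artist_cd VALUES (1, 1)")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

cds_for_artist_1 = [
    row[0]
    for row in conn.execute(
        "SELECT cd_id FROM artist_cd WHERE artist_id = 1 ORDER BY cd_id"
    )
]
artists_on_cd_2 = [
    row[0]
    for row in conn.execute(
        "SELECT artist_id FROM artist_cd WHERE cd_id = 2 ORDER BY artist_id"
    )
]
```

Any data about the pairing itself would be added as further columns on the junction table.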

7.2.7 More 2NF

Our data model is still not in 2NF. The Record Label attribute has only one value for each CD, but we see the same Record Label in multiple CD rows. As we saw with Band Name, this duplication indicates that Record Label should be part of its own entity. Each Record Label releases one or many CD rows. Each CD is released by only one Record Label. Figure 7-10 shows this relationship with the data model in 2NF.

Figure 7-10. Our data model in second normal form
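A small Python sketch of this kind of extraction follows. The helper name `extract_entity`, the dictionary keys, and the sample rows are assumptions for illustration. Each distinct Record Label value is stored once under an invented ID, and each CD row keeps only a reference to it.

```python
from typing import Dict, List, Tuple


def extract_entity(
    rows: List[dict], attribute: str, id_name: str
) -> Tuple[List[dict], List[dict]]:
    """Move a duplicated attribute into its own entity with invented IDs."""
    ids: Dict[str, int] = {}
    entity_rows: List[dict] = []
    remaining_rows: List[dict] = []
    for row in rows:
        value = row[attribute]
        if value not in ids:
            ids[value] = len(ids) + 1  # invented identifier
            entity_rows.append({id_name: ids[value], attribute: value})
        # Keep everything except the moved attribute, plus a reference to it.
        slim = {key: val for key, val in row.items() if key != attribute}
        slim[id_name] = ids[value]
        remaining_rows.append(slim)
    return entity_rows, remaining_rows


# Hypothetical CD rows in which the same label appears more than once.
cd_rows = [
    {"cd_id": 1, "cd_title": "Album One", "record_label": "Label A"},
    {"cd_id": 2, "cd_title": "Album Two", "record_label": "Label A"},
    {"cd_id": 3, "cd_title": "Album Three", "record_label": "Label B"},
]

labels, cds = extract_entity(cd_rows, "record_label", "record_label_id")
```

After the split, renaming a label means changing one row in `labels` instead of every CD that carries the name.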

7.2.8 Third Normal Form

An entity is said to be in the third normal form (3NF) if it is already in 2NF and no non-identifying attributes are dependent on any other non-identifying attributes. A non-identifying attribute is any attribute that is not a part of the identifier for the entity. Attributes that are dependent on other non-identifying attributes are normalized by moving both the dependent attribute and the attribute on which it is dependent into a new entity.

If we wanted to track Record Label address information, we would have a problem putting it in 3NF. The Record Label entity with address data would have State Name and State Abbreviation attributes. Though we really do not need this information to track CD data, we will add it to our data model for the sake of our example. Figure 7-11 shows address data in the Record Label entity.

Figure 7-11. Record Label address information in our CD database

The values of State Name and State Abbreviation would conform to 1NF because they have only one value per record in the Record Label entity. The problem here is that State Name and State Abbreviation are dependent on each other. In other words, if we change the State Abbreviation for a particular Record Label (from MN to CA), we also have to change the State Name (from Minnesota to California). We would normalize this by creating a State entity with State Name and State Abbreviation attributes. Figure 7-12 shows how to relate this new entity to the Record Label entity.

Figure 7-12. Our data model in third normal form
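The dependency between the two state attributes can be checked against sample data with a short Python sketch. The helper name `determines`, the dictionary keys, the invented `state_id`, and the sample rows are assumptions for illustration. Keep in mind that sample data can only disprove a dependency; whether one truly holds is a statement about the real world, not about the rows on hand.

```python
from typing import List


def determines(rows: List[dict], determinant: str, dependent: str) -> bool:
    """Return False if one determinant value is paired with two dependent values."""
    seen = {}
    for row in rows:
        key = row[determinant]
        if key in seen and seen[key] != row[dependent]:
            return False
        seen[key] = row[dependent]
    return True


# Hypothetical Record Label rows that still carry both state attributes.
record_labels = [
    {"label_id": 1, "label_name": "Label A", "state_abbreviation": "MN", "state_name": "Minnesota"},
    {"label_id": 2, "label_name": "Label B", "state_abbreviation": "MN", "state_name": "Minnesota"},
    {"label_id": 3, "label_name": "Label C", "state_abbreviation": "CA", "state_name": "California"},
]

abbreviation_fixes_name = determines(record_labels, "state_abbreviation", "state_name")
abbreviation_fixes_label = determines(record_labels, "state_abbreviation", "label_name")

# Move both state attributes into a State entity and leave a reference behind.
state_ids = {}
states = []
normalized_labels = []
for row in record_labels:
    abbreviation = row["state_abbreviation"]
    if abbreviation not in state_ids:
        state_ids[abbreviation] = len(state_ids) + 1  # invented identifier
        states.append(
            {
                "state_id": state_ids[abbreviation],
                "state_abbreviation": abbreviation,
                "state_name": row["state_name"],
            }
        )
    normalized_labels.append(
        {
            "label_id": row["label_id"],
            "label_name": row["label_name"],
            "state_id": state_ids[abbreviation],
        }
    )
```

Moving a label from one state to another now changes a single `state_id` value, and the name and abbreviation can no longer drift apart.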

Now our data model is in 3NF, and we can say that it is normalized. There are other normal forms that have some value from a database design standpoint, but these are beyond the scope of this book. For most design purposes, 3NF is sufficient to guarantee a proper design.
