Understanding Normalization

In Part I, "Using MySQL," you created some tables in MySQL. Perhaps you've been using MySQL for a while with small projects where the databases contain one or two tables. But as you become more experienced and begin to tackle bigger projects, you may find that the queries you need become more complex and unwieldy, you begin to experience performance problems, or data anomalies start to creep in. Without some knowledge of database design and normalization, these problems may become overwhelming, and you will be unable to take the next step in your mastery of MySQL. Database normalization is a technique that can help you avoid data anomalies and other problems with managing your data. It consists of transforming a table through various stages: 1st normal form, 2nd normal form, 3rd normal form, and beyond. It aims to:

  • Eliminate data redundancies (and therefore use less space)

  • Make it easier to make changes to data, and avoid anomalies when doing so

  • Make referential integrity constraints easier to enforce

  • Produce an easily comprehensible structure that closely resembles the situation the data represents, and allows for growth

Let's begin by creating a sample set of data. You'll walk through the process of normalization first, without worrying about the theory, to get an understanding of the reasons you'd want to normalize. Once you've done that, I'll introduce the theory and steps of the various stages of normalization, which will make the whole process you're about to carefully go through now much simpler the next time you do it.

Imagine you are working on a system that records plants placed in certain locations and the soil descriptions associated with them.

The location:
Location code: 11
Location name: Kirstenbosch Gardens

contains the following three plants:

Plant code: 431
Plant name: Leucadendron
Soil category: A
Soil description: Sandstone

Plant code: 446
Plant name: Protea
Soil category: B
Soil description: Sandstone/Limestone

Plant code: 482
Plant name: Erica
Soil category: C
Soil description: Limestone

The location:
Location code: 12
Location name: Karbonkelberg Mountains

contains the following two plants:

Plant code: 431
Plant name: Leucadendron
Soil category: A
Soil description: Sandstone

Plant code: 449
Plant name: Restio
Soil category: B
Soil description: Sandstone/Limestone

There is a problem with the previous data. Tables in relational databases are in a grid, or table, format (MySQL, like most modern databases, is a relational database), with each row being one unique record. Let's try and rearrange this data is the form of a tabular report (as shown in Table 8.1).

Table 8.1: Plant Data Displayed As a Tabular Report

Location Code

Location Name

Plant Code

Plant Name

Soil Category

Soil Description

11

Kirstenbosch Gardens

431

Leucadendron

A

Sandstone

  

446

Protea

B

Sandstone/ Limestone

  

482

Erica

C

Limestone

12

Karbonkelberg Mountains

431

Leucadendron

A

Sandstone

  

449

Restio

B

Sandstone/ Limestone

How are you to enter this data into a table in the database? You could try to copy the layout you see in the report above, resulting in a table something like Table 8.2. The null fields reflect the fields where no data was entered.

Table 8.2: Trying to Create a Table with the Plant Data

Location Code

Location Name

Plant Code

Plant Name

Soil Category

Soil Description

11

Kirstenbosch Gardens

431

Leucadendron

A

Sandstone

NULL

NULL

446

Protea

B

Sandstone/ Limestone

NULL

NULL

482

Erica

C

Limestone

12

Karbonkelberg Mountains

431

Leucadendron

A

Sandstone

NULL

NULL

449

Restio

B

Sandstone/ Limestone

This table is not much use, though. The first three rows are actually a group, all belonging to the same location. If you take the third row by itself, the data is incomplete, as you cannot tell the location the Erica is to be found. Also, with the table as it stands, you cannot use the location code, or any other field, as a primary key (remember, a primary key is a field, or list of fields, that uniquely identify one record). There is not much use in having a table if you can't uniquely identify each record in it.

So, the solution is to make sure that each table row can stand alone, and is not part of a group, or set. To achieve this, remove the groups, or sets of data, and make each row a complete record in its own right, which results in Table 8.3.

Table 8.3: Each Record Stands Alone

Location Code

Location Name

Plant Code

Plant Name

Soil Category

Soil Description

11

Kirstenbosch Gardens

431

Leucadendron

A

Sandstone

11

Kirstenbosch Gardens

446

Protea

B

Sandstone/ Limestone

11

Kirstenbosch Gardens

482

Erica

C

Limestone

12

Karbonkelberg Mountains

431

Leucadendron

A

Sandstone

12

Karbonkelberg Mountains

449

Restio

B

Sandstone/ Limestone

Note 

The primary keys are shown in italics in Table 8.3 and the following tables.

Notice that the location code cannot be a primary key on its own. It does not uniquely identify a row of data. So, the primary key must be a combination of location code and plant code. Together these two fields uniquely identify one row of data. Think about it: You would never add the same plant type more than once to a particular location. Once you have the fact that it occurs in that location, that's enough. If you want to record quantities of plants at a location—for this example you're just interested in the spread of plants—you don't need to add an entire new record for each plant; rather, just add a quantity field. If for some reason you would be adding more than one instance of a plant/location combination, you'd need to add something else to the key to make it unique.

So, now the data can go in table format, but there are still some problems with it. The table stores the information that code 11 refers to the Kirstenbosch Gardens three times! Besides the waste of space, there is another serious problem. Look carefully at the data in Table 8.4.

Table 8.4: A Data Anomaly

Location Code

Location Name

Plant Code

Plant Name

Soil Category

Soil Description

11

Kirstenbosch Gardens

431

Leucadendron

A

Sandstone

11

Kirstenbosh Gardens

446

Protea

B

Sandstone/ Limestone

11

Kirstenbosch Gardens

482

Erica

C

Limestone

12

Karbonkelberg Mountains

431

Leucadendron

A

Sandstone

12

Karbonkelberg Mountains

449

Restio

B

Sandstone/ Limestone

Did you notice anything strange in the data in Table 8.4? Congratulations if you did! Kirstenbosch is misspelled in the second record. Now imagine trying to spot this error in a table with thousands of records! By using the structure in Table 8.4, the chances of data anomalies increase dramatically.

The solution is simple. You remove the duplication. What you are doing is looking for partial dependencies—in other words, fields that are dependent on a part of a key and not the entire key. Because both the location code and the plant code make up the key, you look for fields that are dependent only on location code or on plant name.

There are quite a few fields where this is the case. Location name is dependent on location code (plant code is irrelevant in determining project name), and plant name, soil code, and soil name are all dependent on plant number. So, take out all these fields, as shown in Table 8.5.

Table 8.5: Removing the Fields Not Dependent on the Entire Key

Location Code

Plant Code

11

431

11

446

11

482

12

431

12

449

Clearly you can't remove the data and leave it out of the database completely. You take it out and put it into a new table, consisting of the fields that have the partial dependency and the fields on which they are dependent. For each of the key fields in the partial dependency, you create a new table (in this case, both are already part of the primary key, but this doesn't always have to be the case). So, you identified plant name, soil description, and soil category as being dependent on plant code. The new table will consist of plant code as a key, as well as plant name, soil category, and soil description, as shown in Table 8.6.

Table 8.6: Creating a New Table with Plant Data

Plant Code

Plant Name

Soil Category

Soil Description

431

Leucadendron

A

Sandstone

446

Protea

B

Sandstone/Limestone

482

Erica

C

Limestone

449

Restio

B

Sandstone/Limestone

You do the same process with the location data, as shown in Table 8.7.

Table 8.7: Creating a New Table with Location Data

Location Code

Location Name

11

Kirstenbosch Gardens

12

Karbonkelberg Mountains

See how these tables remove the earlier duplication problem? There is only one record that contains Kirstenbosch Gardens, so the chances of noticing a misspelling are much higher. And you aren't wasting space storing the name in many different records. Notice that the location code and plant code fields are repeated in two tables. These are the fields that create the relation, allowing you to associate the various plants with the various locations. Obviously there is no way to remove the duplication of these fields without losing the relation altogether, but it is far more efficient storing a small code repeatedly than a large piece of text.

But the table is still not perfect. There is still a chance for anomalies to slip in. Examine Table 8.8 carefully.

Table 8.8: Another Anomaly

Plant Code

Plant Name

Soil Category

Soil Description

431

Leucadendron

A

Sandstone

446

Protea

B

Sandstone/Limestone

482

Erica

C

Limestone

449

Restio

B

Sandstone

The problem in Table 8.8 is that the Restio has been associated with sandstone, when in fact, having a soil category of B, it should be a mix of sandstone and limestone. (The soil category determines the soil description in this example). Once again you are storing data redundantly: The soil category to soil description relationship is being stored in its entirety for each plant. As before, the solution is to take out this excess data and place it in its own table. What you are in fact doing at this stage is looking for transitive relationships, or relationships where a nonkey field is dependent on another nonkey field. Soil description, although in one sense dependent on plant code (it did seem to be a partial dependency when we looked at it in the previous step), is actually dependent on soil category. So, soil description must be removed: Once again, take it out and place it in a new table, along with its actual key (soil category), as shown in Table 8.9 and Table 8.10.

Table 8.9: Plant Data After Removing the Soil Description

Plant Code

Plant Name

Soil Category

431

Leucadendron

A

446

Protea

B

482

Erica

C

449

Restio

B

Table 8.10: Creating a New Table with the Soil Description

Soil Category

Soil Description

A

Sandstone

B

Sandstone/Limestone

C

Limestone

You've cut down the chance of anomalies once again. It is now impossible to mistakenly assume soil category B is associated with anything but a mix of sandstone and limestone. The soil description to soil category relationships is stored in only one place: the new soil table, where you can be sure they are accurate.

Let's look at this example without the data tables to guide you. Often when you're designing a system you don't yet have a complete set of test data available, and it's not necessary if you understand how the data relates. I've used the tables to demonstrate the consequences of storing data in tables that were not normalized, but without them you have to rely on dependencies between fields, which is the key to database normalization.

At first, the data structure was as follows:

Location code

Location name

1–n Plant numbers (1–n is a shorthand for saying there are many occurrences of this field—in other words, it is a repeating group)

1–n Plant names

1–n Soil categories

1–n Soil descriptions

This is a completely unnormalized structure—in other words, it is in zero normal form. So, to begin the normalization process, you start by moving from zero normal form to 1st normal form.

1st Normal Form

Tables in 1st normal form follow these rules:

  • There are no repeating groups.

  • All the key attributes are defined.

  • All attributes are dependent on the primary key.

What this means is that data must be able to fit into a tabular format, where each field contains one value. This is also the stage where the primary key is defined. Some sources claim that defining the primary key is not necessary for a table to be in 1st normal form, but it's usually done at this stage, and is necessary before we can progress to the next stage. Theoretical debates aside, you'll have to define your primary keys at this point.

Tip 

Although not always seen as part of the definition of 1st normal form, the principle of atomicity is usually applied at this stage as well. This means that all columns must contain their smallest parts, or be indivisible. A common example of this is where someone creates a name field, rather than first name and surname fields. They usually regret it later.

So far, the plant example has no keys, and there are repeating groups. To get it into 1st normal form, you'll need to define a primary key and change the structure so that there are no repeating groups; in other words, each row/column intersection contains one, and only one, value. Without this, you cannot put the data into the ordinary two-dimensional table that most databases require. You define location code and plant code as the primary key together (neither on its own can uniquely identify a record), and replace the repeating groups with a single-value attribute. After doing this, you are left with the data shown in Table 8.11.

Table 8.11: 1st Normal Form

Plant Location Table

Location code

Location name

Plant code

Plant name

Soil category

Soil description

This table is in now in 1st normal form. Is it in 2nd normal form?

2nd Normal Form

A table is in 2nd normal form if it follows these rules:

  • Is in 1st normal form

  • Includes no partial dependencies (where an attribute is only dependent on part of a primary key)

Tip 

For an attribute to be only dependent on part of a primary key, the primary key must consist of more than one field. If the primary key contains only one field, the table is automatically in 2nd normal form if it is in 1st normal form.

Let's examine all the fields. Location name is only dependent on location code. Plant name, soil category, and soil description are only dependent on plant code. (This assumes that each plant only occurs in one soil type, which is the case in this example). So you remove each of these fields, and place them in a separate table, with the key being that part of the original key on which they are dependent. For example, with plant name, the key is plant code. This leaves you with Table 8.12, Table 8.13, and Table 8.14.

Table 8.12: The Plant Location Table with Partial Dependencies Removed

Plant Location Table

Plant code

Location code

Table 8.13: Table Resulting from Fields Dependent on Plant Code

Plant Table

Plant code

Plant name

Soil category

Soil description

Table 8.14: Table Resulting from Fields Dependent on Location Code

Location Table

Location code

Location name

The resulting tables are now in 2nd normal form. Are they in 3rd normal form?

3rd Normal Form

A table is in 3rd normal form if it follows these rules:

  • Is in 2nd normal form

  • Contains no transitive dependencies (where a nonkey attribute is dependent on the primary key through another nonkey attribute)

Tip 

If a table only contains one nonkey attribute, it is obviously impossible for a nonkey attribute to be dependent on another nonkey attribute. Any tables like these that are in 2nd normal form are then automatically in 3rd normal form.

As only the plant table has more than one nonkey attribute, you can ignore the others because they are in 3rd normal form already. All fields are dependent on the primary key in some way, since the tables are in 2nd normal form. But is this dependency through another nonkey field? Plant name is not dependent on either soil category or soil description. Nor is soil category dependent on either soil description or plant name. However, soil description is dependent on soil category. You use the same procedure as before, removing it, and placing it in its own table with the attribute that it was dependent on as the key. You are left with Table 8.15, Table 8.16, Table 8.17, and Table 8.18.

Table 8.15: The Plant Location Table Remains Unchanged

Plant Location Table

Plant code

Location code

Table 8.16: The Plant Table with Soil Description Removed

Plant Table

Plant code

Plant name

Soil category

Table 8.17: The New Soil Table

Soil Table

Soil category

Soil description

Table 8.18: The Location Table Is Unchanged

Location Table

Location code

Location name

All of these tables are now in 3rd normal form. 3rd normal form is usually sufficient for most tables, because it avoids the most common kind of data anomalies. I suggest getting most tables you work with to 3rd normal form before you implement them, as this will achieve the aims of normalization listed at the beginning of the chapter in the vast majority of cases. The normal forms beyond this, such as Boyce-Codd normal form and 4th normal form, are rarely useful for business applications. In most cases, tables in 3rd normal form are already in these normal forms anyway. But any skillful database practitioner should know the exceptions, and be able to normalize to the higher levels when required.

Boyce-Codd Normal Form

E.F. Codd and R.F. Boyce, two of the people instrumental in the development of the database model, have been honored by the name of this normal form. E.F. Codd developed and expanded the relational model, and also developed normalization for relational models in 1970, while R.F. Boyce was one of the creators of Structured Query Language (then called SEQUEL).

In spite of some resources stating the contrary, Boyce-Codd normal form is not the same as 4th normal form. Let's look at an example of data anomalies, which are presented in 3rd normal form and solved by transforming into Boyce-Codd normal form, before defining it (see Table 8.19).

Table 8.19: A Table Containing Data about the Student, Course, and Instructor Relationship

Student Course Instructor Table

Student

Course

Instructor

Assume that the following is true for Table 8.19:

  • Each instructor takes only one course.

  • Each course can have one or more instructors.

  • Each student has only one instructor per course.

  • Each student can take one or more courses.

What would the key be? None of the fields on their own would be sufficient to uniquely identify a record, so you have to use two fields. Which two should you use?

Perhaps student and instructor seem like the best choice, as that would allow you to determine the course. Or you could use student and course, which would determine the instructor. For now, let's use student and course as the key (see Table 8.20).

Table 8.20: Using Student and Course as the Key

Student Course Instructor Table

Student

Course

Instructor

What normal form is this table in? It's in first normal form, as it has a key and no repeating groups. It's also in 2nd normal form, as the instructor is dependent on both other fields (students have many courses and therefore instructors, and courses have many instructors). Finally, it's also in 3rd normal form, as there is only one nonkey attribute.

But there are still some data anomalies. Look at the data sample in Table 8.21.

Table 8.21: More Data Anomalies

Student

Course

Instructor

Conrad Pienaar

Biology

Nkosizana Asmal

Dingaan Fortune

Mathematics

Kader Dlamini

Gerrie Jantjies

Science

Helen Ginwala

Mark Thobela

Biology

Nkosizana Asmal

Conrad Pienaar

Science

Peter Leon

Alicia Ncita

Science

Peter Leon

Quinton Andrews

Mathematics

Kader Dlamini

The fact that Peter Leon teaches science is stored redundantly, as are Kader Dlamini with mathematics and Nkosizana Asmal with biology. The problem is that the instructor determines the course. Or put another way, course is determined by instructor. The table conforms to 3rd normal form rules because no nonkey attribute is dependent upon another nonkey attribute. However, a key attribute is dependent upon a nonkey attribute! Again, you can use the familiar method of removing this field and placing it into another table, along with its key (see Table 8.22 and Table 8.23).

Table 8.22: Student Instructor Table after Removing Course

Student Instructor Table

Student

Instructor

After removing the course field, the primary key needs to include both remaining fields to uniquely identify a record.

Table 8.23: Resulting Instructor Course Table

Instructor Course Table

Instructor

Course

Although we had chosen course as part of the primary key in the original table, the instructor determines the course, which is why we make it the primary key in this table. As you can see, the redundancy problem has been solved.

Thus, a table is in Boyce-Codd normal form if it meets the following conditions:

  • It is in 3rd normal form.

  • Each determinant is a candidate key.

That sounds scary! For most people new to database design, these are new terms. If you followed along with this example, however, the terms will soon become clear:

  • A determinant is an attribute that determines the value of another attribute.

  • A candidate key is either the key or an alternate key (in other words, the attribute could be a key for that table).

Instructor is not a candidate key (alone it cannot uniquely identify the record), yet it determines the course, so the table is not in Boyce-Codd normal form.

Let's look at the example again, and see what happens if you chose student and instructor as the key, as shown in Table 8.24. What normal form is the table in this time?

Table 8.24: Using Student and Instructor as the Key

Student Course Instructor Table

Student

Instructor

Course

Once again it's in 1st normal form because there is a primary key and there are no repeating groups. This time, though, it's not in 2nd normal form because course is determined by only part of the key: the instructor. By removing course and its key, instructor, you get the data shown in Table 8.25 and Table 8.26.

Table 8.25: Removing Course

Student Instructor Table

Student

Instructor

Table 8.26: Creating a New Table with Course

Instructor Course Table

Instructor

Course

Either way you do it, by making sure the tables are normalized into Boyce-Codd normal form, you get the same two resulting tables. It's usually the case that when there are alternate fields to choose as a key, it doesn't matter which ones you choose initially because after normalizing the results you get the same results either way.

4th Normal Form

Let's look at situation where redundancies can creep in even though a table is in Boyce-Codd normal form. Let's take the previous student/instructor/course example but change one of the initial assumptions. This time, a student can have several instructors for a single course (see Table 8.27).

Table 8.27: Student Course Instructor Data, with Several Instructors per Course

Student

Course

Instructor

Conrad Pienaar

Biology

Nkosizana Asmal

Dingaan Fortune

Mathematics

Kader Dlamini

Gerrie Jantjies

Science

Helen Ginwala

Mark Thobela

Biology

Nkosizana Asmal

Conrad Pienaar

Science

Peter Leon

Alicia Ncita

Science

Peter Leon

Quinton Andrews

Mathematics

Kader Dlamini

Dingaan Fortune

Mathematics

Helen Ginwala

The data is the same as before, except that Helen Ginwala is teaching science to Gerrie Jantjies as well as mathematics to Dingaan Fortune, and Dingaan Fortune is being taught by both Helen Ginwala and Kader Dlamini for mathematics.

The only possible key is a combination of all three attributes, as shown in Table 8.28. No other combination will uniquely identify a particular record.

Table 8.28: Three Attributes as Key

Student Course Instructor Table

Student

Instructor

Course

But this still has some potentially anomalous behavior. The fact that Kader Dlamini teaches mathematics is still stored more than once, as is the fact that Dingaan Fortune takes mathematics. The real problem is that the table stores more than one kind of fact: that of a student-to-course relationship, as well as that of a student-to-instructor relationship. You can avoid this, as always, by separating the data into two tables, as shown in Table 8.29 and Table 8.30.

Table 8.29: Creating a Table for the Student to Instructor Relationship

Student Instructor Table

Student

Instructor

Table 8.30: Creating a Table for the Student to Course Relationship

Student Course Table

Student

Course

This situation exists when you have multiple multivalued dependencies. A multivalued dependency exists between two attributes when, for each value of the first attribute, there is one or more associated values of the second attribute. For each value of student, there were many values of course. This is the first multivalued dependency. Then for each value of student, there are one or more associated values of instructor. This is the second multivalued dependency.

Thus, a table is in 4th normal form if it meets the following criteria:

  • Is in Boyce-Codd normal form

  • Does not contain more than one multivalued dependency

5th Normal Form and Beyond

There are further normal forms that are mainly of academic interest, as the problems they exist to solve rarely appear in practice. I won't go into them in much detail, but for those who are interested, the following example provides a taste (see Table 8.31):

Table 8.31: The Sales Rep Example

Sales Rep

Company

Product

Felicia Powers

Exclusive

Books

Afzal Ignesund

Wordsworth

Magazines

Felicia Powers

Exclusive

Magazines

Usually you would store this data in one table, as you need all three records to see which combinations are valid. Afzal Ignesund sells magazines for Wordsworth, but not necessarily books. Felicia Powers happens to sell both books and magazines for Exclusive. However, let's add another condition: If a sales rep sells a certain product, and they sell it for a particular company, then they must sell that product for that company.

Let's look at a larger data set adhering to this condition (see Table 8.32).

Table 8.32: Looking at a Larger Set of Data

Sales Rep

Company

Product

Felicia Powers

Exclusive

Books

Felicia Powers

Exclusive

Magazines

Afzal Ignesund

Wordsworth

Books

Felicia Powers

Wordsworth

Books

Felicia Powers

Wordsworth

Magazines

Now, with this extra dependency, you could normalize Table 8.32 further into three separate tables without losing any facts, as shown in Table 8.33, Table 8.34, and Table 8.35.

Table 8.33: Creating a Table with Sales Rep and Product

Sales Rep

Product

Felicia Powers

Books

Felicia Powers

Magazines

Afzal Ignesund

Books

Table 8.34: Creating a Table with Sales Rep and Company

Sales Rep

Company

Felicia Powers

Exclusive

Felicia Powers

Wordsworth

Afzal Ignesund

Wordsworth

Table 8.35: Creating a Table with Company and Product

Company

Product

Exclusive

Books

Exclusive

Magazines

Wordsworth

Books

Wordsworth

Magazines

Basically, a table is in 5th normal form if it cannot be made into any smaller tables with different keys (most tables can obviously be made into smaller tables with the same key!).

Beyond 5th normal form, you enter the heady realms of domain key normal form, a kind of theoretical ideal. Its practical use to a database designer is similar to that of the concept of infinity to a bookkeeper—i.e., it exists in theory, but is not going to be used in practice. Even the most corrupt executive is not going to expect that of the bookkeeper!

For those interested in pursuing this academic and highly theoretical topic further, I suggest obtaining a copy of An Introduction to Database Systems by C.J. Date (Addison-Wesley, 1999).



Mastering MySQL 4
Mastering MySQL 4
ISBN: 0782141625
EAN: 2147483647
Year: 2003
Pages: 230
Authors: Ian Gilfillan

flylib.com © 2008-2017.
If you may any questions please contact us: flylib@qtcs.net