Given what we've seen so far, wouldn't it be wonderful if there was a standard way to: (a) reduce row lengths by splitting rows vertically, (b) isolate different data on different files so it could be partitioned, and (c) mitigate fragmentation by cramming truly similar data into the same pages? Well, there is a standard way. It's called normalization.
Some people think that normalizing is like Latinsupposedly, learning Latin helps you think clearly and look good, but it's an optional subject. Well, if that's so, it's odd that so many plumbers and engineerspeople who slept through all the artsy stuff in schoolinsist on using normalization. Perhaps it's time to take a second look at the subject.
Consider Table 8-5, which shows an unnormalized "table" called Diplomatic_1 . Diplomatic_1 contains the following information: the name of a diplomat, the language the diplomat speaks, the number of years the diplomat has spoken the language, the diplomat's title, the diplomat's length of service with the diplomatic corps, the diplomat's salary, the group with which the diplomat is affiliated , and the name of the person who heads the group. Each diplomat has a unique name, so the name column can be used as the table's primary key, as indicated by the fact that the name heading is underlined in Table 8-5. (In practice, a key can usually be constructed in more than one wayfor example, a social security number is another value that can uniquely identify an individual.)
Requests of the form "Tell me something about a diplomat" are easily answered by the structure of the data shown in Table 8-5. But answers to requests like "Which diplomats speak Greek?" and "Change the diplomat in charge of the FTA" are far more difficult. Normalization changes the organization of a table so that you can use its data more flexibly.
The first step in normalizing is based on the relational rule that each column of a table may contain only one, atomic, value per row. Looking at Table 8-5, it's easy to see that the language , years_used , work_group , and head_honcho columns violate this rule, because a diplomat may speak multiple languages and may be affiliated with multiple groups. If we add new rows by duplicating the nonrepeating values for each combination of repeating values though, Table 8-5 will be represented in first normal form (1NF). (A 1NF table contains only scalar values.) The Diplomatic_2 table shown in Table 8-6 is the Diplomatic_1 table normalized to first normal form.
Table 8-5. Diplomatic_1 Table
Take another look at Table 8-6. Now that Diplomatic_2 is in first normal form, the name column is no longer sufficient to uniquely identify a row, because several rows may exist for a diplomat who speaks multiple languages or is affiliated with more than one group. One solution to this problem is to create a new, compound primary keyfor example, (name , language , work_group) combinedto uniquely identify a single row of Diplomatic_2 .
Table 8-6 appears to be a step backward. Not only does it require more space to present the data, but responding to requests such as "Change Temple's title" and "Add Campbell to another group" is now more time consuming. This problem is addressed by the remaining normalization steps, which are based on the concept of dependence and the relational rule thatin every row of a tableeach column must be dependent on every part of the primary key.
Briefly, the concept of dependence has two rules. First, if the value of column1 uniquely determines the value of column2 , then column2 is functionally dependent on column1 . Second, if the value of column1 limits the possible values in column2 to a specific set, then column2 is set dependent on column1 . For example, because each diplomat has only one title, name will uniquely determine title therefore, title is functionally dependent on name . Further, work_group is set dependent on name , because each diplomat is assigned to one or more of a specific set of groups.
The concept of dependence tells us that the title , service_length , and salary columns are not dependent on the entire primary key (name , language , work_group) of Diplomatic_2 ; they are dependent on name alone. Because this violates the relational rule, let's create a new tablecontaining only name , title , service_length , and salary . The key for this table will, once again, be name . Let's call the new table Diplomats .
Of the remaining Diplomatic_2 columns, years_used is determined by both name and language and therefore doesn't properly belong to Diplomats , so let's create another new tablecalled Languages using just these three columns. The Languages table's key is (name , language) . Because Languages also contains the name column, it is still possible to associate a diplomat's language experience with his or her other data.
Splitting a table like this prevents it from having columns that are dependent on only part of the table's key. A first normal form table that also has no partial key dependence is said to be in second normal form (2NF). (A 2NF table is a 1NF table that contains only columns that are dependent upon the entire primary key.) Both Diplomats (Table 8-7) and Languages (Table 8-8) are in second normal form.
Table 8-6. Diplomatic_2 Table
The situation with the Diplomatic_2.work_group column is slightly different. We have already noted that a diplomat's name determines the set of groups with which that diplomat is affiliated. This information is independent of the languages spoken by the diplomat, so a table with the work_group column shouldn't have language in its primary key. But work_group does uniquely determine a group's head_honcho .
Remember the Law of Transitivity from Chapter 2, "Simple Searches"? Well, it applies to dependence as well. That is, if column2 is dependent on column1 and column3 is dependent on column2 , then it is also true that column3 is dependent on column1 . This is known as a transitive dependence column3 is transitively dependent on column1 , via column2 .
In our example, head_honcho is transitively dependent on name , because name determines a set of values for work_group , and head_honcho is functionally dependent on work_group . A second normal form table that has no transitive dependence is said to be in third normal form (3NF) and thus fulfills the relational requirement thatin every row of a tableall columns must depend directly on the primary key, without any transitive dependencies through other columns. (A 3NF table is a 2NF table whose non-key columns are also mutually independent; that is, each column can be updated independently of all the rest.)
Because each work_group has only one head_honcho , let's finish our normalization design by creating two more new tables. The first, called Groups , will contain the columns work_group and head_honcho . The Groups table's primary key is work_group . ( head_honcho could also be a key if each diplomat managed only one work_group .) And finally, because each diplomat is affiliated with one or more groups, we'll create an Affiliations table, using the columns name and work_group . Affiliations forms the association between the Diplomats and Groups tables and is "all key"that is, Affiliations has no additional dependent columns, because the only thing dependent on both name and work_group is the fact that they are associated. Table 8-9 shows the third normal form Groups table, and Table 8-10 shows the third normal form Affiliations table.
Table 8-7. Diplomats Table
Table 8-8. Languages Table
Table 8-9. Groups Table
Table 8-10. Affiliations Table
At this point, our normalization process is complete. All tables are in third normal form, and requests such as those listed earlier can easily be dealt with.
Here are some tips for good database design:
Breaking Normalization Rules
Still on the subject of normalization, strictly speaking, an error isn't a performance problem unless you correct it. So the following observations won't convince the true cowboys in the crowd , because they only affect performance when results must be correct. Um, that still sounds nontrivial, so let's observe two scenarios.
Scenario #1: BREAK 1NF
InterBase, Informix, and Oracle have supported arrays for many years, and now ARRAY is an SQL:1999 data type. Another SQL:1999 collection data type, ROW, is supported by Informix. Suppose we have a table that contains addresses. We want to split the address column into four parts becausewhen the data is printedfour lines are used for the address. (In effect, we want to break first normal form.) So, instead of using Definition #1, we make our Addresses table with either Definition #2 or Definition #3:
Definition #1: CREATE TABLE Addresses ( identifier INTEGER PRIMARY KEY, address CHARACTER(100), ... ) Definition #2: CREATE TABLE Addresses ( /* Use ARRAY data type */ identifier INTEGER PRIMARY KEY, address CHAR(25) ARRAY, ... ) Definition #3: CREATE TABLE Addresses ( /* Use ROW data type */ identifier INTEGER PRIMARY KEY, address ROW (r1 CHAR(25), r2 CHAR(25), r3 CHAR(25), r4 CHAR(25)), ... )
At this point, someone needs an answer to the questionHow many addresses are in New York? We can no longer answer the question with this query:
SELECT * FROM Addresses WHERE address LIKE '%New York%'
Instead, we'll have to use one of these queries to get the answer which takes considerably longer to execute. (Yes we tried it, on Informix, whose ROW data type complies with the SQL Standard.)
SELECT * FROM Addresses /* for the ARRAY data type */ WHERE address LIKE '%New York%' OR address LIKE '%New York%' OR address LIKE '%New York%' OR address LIKE '%New York%' SELECT * FROM Addresses /* for the ROW data type */ WHERE address.r1 LIKE '%New York%' OR address.r2 LIKE '%New York%' OR address.r3 LIKE '%New York%' OR address.r4 LIKE '%New York%'
And herein lies the problem with breaking 1NF just to make the printing job easier. Not only are the collection type queries slower, they're not even strictly correct New York could have been split over two or more lines.
Scenario #2: BREAK 2NF
Now, suppose we create two more tables, as follows :
CREATE TABLE States ( state_code CHARACTER(2) PRIMARY KEY, state_name CHARACTER(25)) CREATE TABLE Cities ( identifier INTEGER PRIMARY KEY, city_name CHARACTER(25), state_code CHARACTER(2))
One of the rows in Cities contains a state_code value of AL . We can look up AL in the States table to determine that AL means Alaska . But we find that when we print addresses we must do joins to display the name of the state:
SELECT identifier, city_name, state_name FROM Cities, States WHERE Cities.state_code = States.state_code
Because we don't want to waste time doing the join, we duplicate the state_name in the Cities table and end up with the tables shown in Table 8-11 and Table 8-12.
With the new Cities table, we don't have to join any more. We can simply execute this SQL statement:
SELECT identifier, city_name, state_name FROM Cities
We've broken second normal form ( state_name is dependent on state_code , not just identifier )but performance will obviously improve.
Time passes .
At this point, we find that some user , who didn't understand the system, thought state_name in the Cities table is an updatable column. So she changed it to Saudi Arabia but forgot to update state_code as well. To prevent that from happening again, we add a CHECK constraint to Cities :
ALTER TABLE Cities ADD CONSTRAINT State_Check CHECK ( (state_code, state_name) = (SELECT state_code, state_name FROM States)) /* this could also be done with a FOREIGN KEY constraint */
Table 8-11. Cities Table
Table 8-12. States Table
This is inevitable. Whenever redundant columns are added, errors creep in, and we correct them by adding constraints or triggers. (By the way, not all DBMSs will accept the syntax of this CHECK constraint.) So let's take stock:
The point we want to make here is that denormalizing only appears to make things faster. That is, when we deliberately denormalized , we failed to make an improvement overall. In fact, it appears that denormalizing improved one thing, but caused an offsetting problem to pop out somewhere else.
Relational proponents like to notice that sort of thing. They say it proves that you might as well try to optimize for a broad range of cases, because when you optimize for a single case, you muck things up elsewhere. And here we can address the claim that "it is practical" to denormalize . That's truebut "it" is a singular pronoun. Normalizing is an attempt to optimize generally . Denormalizing is an attempt to normalize for a particular application, to introduce a bias.
By following normalization rules, you will be splitting a tablefor example, where the original table was Table1 (column1 , column2 , column3 , column4) , you end up with Table1 (column1 , column2 , T2_ID) and Table2 (T2_ID , column3 , column4) . The advantages of this arrangement are three-fold :
The Bottom Line: Normalization
Normalizing means smaller rows. Smaller rows mean faster retrievals because (a) the chance of overflowing the operating system cache is smaller, (b) a single page read retrieves more records into cache, and (c) byte-by-byte comparisons involve less bytes. On the other hand, (a) fragmentation is more likely as rows are updated or deleted, (b) string comparisons always involve an initial read of the size (the parser cannot decide in advance how many repetitions to plug into the loop that it produces), and (c) row position within a file can't be calculated in advance because row start is not a simple matter of saying ( row size * row number ).
Normalizing also has these effects: (a) table scans for a particular column are twice as fast (on average) because the number of rows per page is greater, (b) parallel queries are possible, (c) fetch and lock times are reduced, and (d) INSERTs are slowed by primary-key/foreign-key integrity checks.
Normalizing helps design and programming, and it makes things more efficient too (at least as far as 3NF, which is easy to reach with case tools). The downside is that too many slow joins happen if you ask for all information related to an entitybut really, you should be thinking about making the join faster or cutting out unnecessary joins rather than opting to denormalize.