The Normal Forms | SQL Performance Tuning

Given what we've seen so far, wouldn't it be wonderful if there was a standard way to: (a) reduce row lengths by splitting rows vertically, (b) isolate different data on different files so it could be partitioned, and (c) mitigate fragmentation by cramming truly similar data into the same pages? Well, there is a standard way. It's called normalization.

Some people think that normalizing is like Latinsupposedly, learning Latin helps you think clearly and look good, but it's an optional subject. Well, if that's so, it's odd that so many plumbers and engineerspeople who slept through all the artsy stuff in schoolinsist on using normalization. Perhaps it's time to take a second look at the subject.

Consider Table 8-5, which shows an unnormalized "table" called Diplomatic_1 . Diplomatic_1 contains the following information: the name of a diplomat, the language the diplomat speaks, the number of years the diplomat has spoken the language, the diplomat's title, the diplomat's length of service with the diplomatic corps, the diplomat's salary, the group with which the diplomat is affiliated , and the name of the person who heads the group. Each diplomat has a unique name, so the name column can be used as the table's primary key, as indicated by the fact that the name heading is underlined in Table 8-5. (In practice, a key can usually be constructed in more than one wayfor example, a social security number is another value that can uniquely identify an individual.)

Requests of the form "Tell me something about a diplomat" are easily answered by the structure of the data shown in Table 8-5. But answers to requests like "Which diplomats speak Greek?" and "Change the diplomat in charge of the FTA" are far more difficult. Normalization changes the organization of a table so that you can use its data more flexibly.

The first step in normalizing is based on the relational rule that each column of a table may contain only one, atomic, value per row. Looking at Table 8-5, it's easy to see that the language , years_used , work_group , and head_honcho columns violate this rule, because a diplomat may speak multiple languages and may be affiliated with multiple groups. If we add new rows by duplicating the nonrepeating values for each combination of repeating values though, Table 8-5 will be represented in first normal form (1NF). (A 1NF table contains only scalar values.) The Diplomatic_2 table shown in Table 8-6 is the Diplomatic_1 table normalized to first normal form.

Table 8-5. Diplomatic_1 Table

name	language	years_used	title	service_length	salary	work_group	head_honcho
====
Axworthy	French German	3 2	Consul	4	30,000.00	WHO IMF	Greene Craig
Broadbent	Russian Greek	1 3	Diplomat	2	25,000.00	IMF FTA	Craig Crandall
Campbell	French Spanish Italian	2 1 3	Consul	3	28,000.00	EA	Temple
Craig	French Greek Russian Spanish	1 5 2 9	Ambassador	8	65,000.00	IMF	Craig
Crandall	French	9	Ambassador	3	55,000.00	FTA	Crandall
Greene	French Spanish Italian Japanese	3 7 1 4	Ambassador	9	70,000.00	WHO	Greene
Temple	French Russian	4 2	Ambassador	2	60,000.00	EA	Temple

Take another look at Table 8-6. Now that Diplomatic_2 is in first normal form, the name column is no longer sufficient to uniquely identify a row, because several rows may exist for a diplomat who speaks multiple languages or is affiliated with more than one group. One solution to this problem is to create a new, compound primary keyfor example, (name , language , work_group) combinedto uniquely identify a single row of Diplomatic_2 .

Table 8-6 appears to be a step backward. Not only does it require more space to present the data, but responding to requests such as "Change Temple's title" and "Add Campbell to another group" is now more time consuming. This problem is addressed by the remaining normalization steps, which are based on the concept of dependence and the relational rule thatin every row of a tableeach column must be dependent on every part of the primary key.

Briefly, the concept of dependence has two rules. First, if the value of column1 uniquely determines the value of column2 , then column2 is functionally dependent on column1 . Second, if the value of column1 limits the possible values in column2 to a specific set, then column2 is set dependent on column1 . For example, because each diplomat has only one title, name will uniquely determine title therefore, title is functionally dependent on name . Further, work_group is set dependent on name , because each diplomat is assigned to one or more of a specific set of groups.

The concept of dependence tells us that the title , service_length , and salary columns are not dependent on the entire primary key (name , language , work_group) of Diplomatic_2 ; they are dependent on name alone. Because this violates the relational rule, let's create a new tablecontaining only name , title , service_length , and salary . The key for this table will, once again, be name . Let's call the new table Diplomats .

Of the remaining Diplomatic_2 columns, years_used is determined by both name and language and therefore doesn't properly belong to Diplomats , so let's create another new tablecalled Languages using just these three columns. The Languages table's key is (name , language) . Because Languages also contains the name column, it is still possible to associate a diplomat's language experience with his or her other data.

Splitting a table like this prevents it from having columns that are dependent on only part of the table's key. A first normal form table that also has no partial key dependence is said to be in second normal form (2NF). (A 2NF table is a 1NF table that contains only columns that are dependent upon the entire primary key.) Both Diplomats (Table 8-7) and Languages (Table 8-8) are in second normal form.

Table 8-6. Diplomatic_2 Table

name	language	years_used	title	service_length	salary	work_group	head_honcho
====	=======					=========
Axworthy	French	3	Consul	4	30,000.00	WHO	Greene
Axworthy	German	2	Consul	4	300,00.00	IMF	Craig
Broadbent	Russian	1	Diplomat	2	25,000.00	IMF	Craig
Broadbent	Greek	3	Diplomat	2	25,000.00	FTA	Crandall
Campbell	French	2	Consul	3	28,000.00	EA	Temple
Campbell	Spanish	1	Consul	3	28,000.00	EA	Temple
Campbell	Italian	3	Consul	3	28,000.00	EA	Temple
Craig	French	1	Ambassador	8	65,000.00	IMF	Craig
Craig	Greek	5	Ambassador	8	65,000.00	IMF	Craig
Craig	Russian	2	Ambassador	8	65,000.00	IMF	Craig
Craig	Spanish	9	Ambassador	8	65,000.00	IMF	Craig
Crandall	French	9	Ambassador	3	55,000.00	FTA	Crandall
Greene	French	3	Ambassador	9	70,000.00	WHO	Greene
Greene	Spanish	7	Ambassador	9	70,000.00	WHO	Greene
Greene	Italian	1	Ambassador	9	70,000.00	WHO	Greene
Greene	Japanese	4	Ambassador	9	70,000.00	WHO	Greene
Temple	French	4	Ambassador	2	60,000.00	EA	Temple
Temple	Russian	2	Ambassador	2	60,000.00	EA	Temple

The situation with the Diplomatic_2.work_group column is slightly different. We have already noted that a diplomat's name determines the set of groups with which that diplomat is affiliated. This information is independent of the languages spoken by the diplomat, so a table with the work_group column shouldn't have language in its primary key. But work_group does uniquely determine a group's head_honcho .

Remember the Law of Transitivity from Chapter 2, "Simple Searches"? Well, it applies to dependence as well. That is, if column2 is dependent on column1 and column3 is dependent on column2 , then it is also true that column3 is dependent on column1 . This is known as a transitive dependence column3 is transitively dependent on column1 , via column2 .

In our example, head_honcho is transitively dependent on name , because name determines a set of values for work_group , and head_honcho is functionally dependent on work_group . A second normal form table that has no transitive dependence is said to be in third normal form (3NF) and thus fulfills the relational requirement thatin every row of a tableall columns must depend directly on the primary key, without any transitive dependencies through other columns. (A 3NF table is a 2NF table whose non-key columns are also mutually independent; that is, each column can be updated independently of all the rest.)

Because each work_group has only one head_honcho , let's finish our normalization design by creating two more new tables. The first, called Groups , will contain the columns work_group and head_honcho . The Groups table's primary key is work_group . ( head_honcho could also be a key if each diplomat managed only one work_group .) And finally, because each diplomat is affiliated with one or more groups, we'll create an Affiliations table, using the columns name and work_group . Affiliations forms the association between the Diplomats and Groups tables and is "all key"that is, Affiliations has no additional dependent columns, because the only thing dependent on both name and work_group is the fact that they are associated. Table 8-9 shows the third normal form Groups table, and Table 8-10 shows the third normal form Affiliations table.

Table 8-7. Diplomats Table

name	title	service_length	salary
====
Axworthy	Consul	4	30,000.00
Broadbent	Diplomat	2	25,000.00
Campbell	Consul	3	28,000.00
Craig	Ambassador	8	65,000.00
Crandall	Ambassador	3	55,000.00
Greene	Ambassador	9	70,000.00
Temple	Ambassador	2	60,000.00

Table 8-8. Languages Table

name	language	years_used
====	=======
Axworthy	French	3
Axworthy	German	2
Broadbent	Russian	1
Broadbent	Greek	3
Campbell	French	2
Campbell	Spanish	1
Campbell	Italian	3
Craig	French	1
Craig	Greek	5
Craig	Russian	2
Craig	Spanish	9
Crandall	French	9
Greene	French	3
Greene	Spanish	7
Greene	Italian	1
Greene	Japanese	4
Temple	French	4
Temple	Russian	2

Table 8-9. Groups Table

work_group	head_honcho
=========
EA	Temple
FTA	Crandall
IMF	Craig
WHO	Greene

Table 8-10. Affiliations Table

name	work_group
====	=========
Axworthy	WHO
Axworthy	IMF
Broadbent	IMF
Broadbent	FTA
Campbell	EA
Craig	IMF
Crandall	FTA
Greene	WHO
Temple	EA

At this point, our normalization process is complete. All tables are in third normal form, and requests such as those listed earlier can easily be dealt with.

Here are some tips for good database design:

Don't use an existing database as the basis for a new database structureyou don't want to inadvertently duplicate awkward or inconsistent table definitions.
Avoid unnecessary duplication of datamake sure each table represents just one subject.
Define a primary key for every tablenot only will it uniquely identify a row, you'll use it to join tables. A primary key must have unique, NOT NULL values that, preferably, change only rarely.
Define unique keys and foreign keys for your tables.

Breaking Normalization Rules

Still on the subject of normalization, strictly speaking, an error isn't a performance problem unless you correct it. So the following observations won't convince the true cowboys in the crowd , because they only affect performance when results must be correct. Um, that still sounds nontrivial, so let's observe two scenarios.

Scenario #1: BREAK 1NF

InterBase, Informix, and Oracle have supported arrays for many years, and now ARRAY is an SQL:1999 data type. Another SQL:1999 collection data type, ROW, is supported by Informix. Suppose we have a table that contains addresses. We want to split the address column into four parts becausewhen the data is printedfour lines are used for the address. (In effect, we want to break first normal form.) So, instead of using Definition #1, we make our Addresses table with either Definition #2 or Definition #3:

 Definition #1: CREATE TABLE Addresses (   identifier INTEGER PRIMARY KEY,   address CHARACTER(100),   ... ) Definition #2: CREATE TABLE Addresses (          /* Use ARRAY data type */    identifier INTEGER PRIMARY KEY,    address CHAR(25) ARRAY[4],    ... ) Definition #3: CREATE TABLE Addresses (          /* Use ROW data type */   identifier INTEGER PRIMARY KEY,   address ROW (r1 CHAR(25),                r2 CHAR(25),                r3 CHAR(25),                r4 CHAR(25)),   ... )

At this point, someone needs an answer to the questionHow many addresses are in New York? We can no longer answer the question with this query:

 SELECT * FROM Addresses   WHERE address LIKE '%New York%'

Instead, we'll have to use one of these queries to get the answer which takes considerably longer to execute. (Yes we tried it, on Informix, whose ROW data type complies with the SQL Standard.)

 SELECT * FROM Addresses     /* for the ARRAY data type */   WHERE address[1] LIKE '%New York%'      OR address[2] LIKE '%New York%'      OR address[3] LIKE '%New York%'      OR address[4] LIKE '%New York%' SELECT * FROM Addresses     /* for the ROW data type */   WHERE address.r1 LIKE '%New York%'      OR address.r2 LIKE '%New York%'      OR address.r3 LIKE '%New York%'      OR address.r4 LIKE '%New York%'

And herein lies the problem with breaking 1NF just to make the printing job easier. Not only are the collection type queries slower, they're not even strictly correct New York could have been split over two or more lines.

Scenario #2: BREAK 2NF

Now, suppose we create two more tables, as follows :

 CREATE TABLE States (   state_code CHARACTER(2) PRIMARY KEY,   state_name CHARACTER(25)) CREATE TABLE Cities (   identifier INTEGER PRIMARY KEY,   city_name CHARACTER(25),   state_code CHARACTER(2))

One of the rows in Cities contains a state_code value of AL . We can look up AL in the States table to determine that AL means Alaska . But we find that when we print addresses we must do joins to display the name of the state:

 SELECT identifier, city_name, state_name   FROM Cities, States   WHERE Cities.state_code = States.state_code

Because we don't want to waste time doing the join, we duplicate the state_name in the Cities table and end up with the tables shown in Table 8-11 and Table 8-12.

With the new Cities table, we don't have to join any more. We can simply execute this SQL statement:

 SELECT identifier, city_name, state_name   FROM Cities

We've broken second normal form ( state_name is dependent on state_code , not just identifier )but performance will obviously improve.

Time passes .

At this point, we find that some user , who didn't understand the system, thought state_name in the Cities table is an updatable column. So she changed it to Saudi Arabia but forgot to update state_code as well. To prevent that from happening again, we add a CHECK constraint to Cities :

 ALTER TABLE Cities   ADD CONSTRAINT State_Check CHECK (       (state_code, state_name) =          (SELECT state_code, state_name FROM States)) /* this could also be done with a FOREIGN KEY constraint */

Table 8-11. Cities Table

identifier	city_name	state_code	state_name
1	Riyadh	AL	Alaska
2	Medina	AL	Alaska
3	Mecca	AL	Alaska

Table 8-12. States Table

state_code	state_name
KY	Kentucky
AL	Alaska
NY	New York

This is inevitable. Whenever redundant columns are added, errors creep in, and we correct them by adding constraints or triggers. (By the way, not all DBMSs will accept the syntax of this CHECK constraint.) So let's take stock:

All the rows in the Cities table are slightly larger. As we know from the last chapter, this means when we SELECT from Cities alone, the selection is slower.
The Cities table is subject to a new constraint. That means INSERTs will be slower. In fact, because the constraint references a far table, INSERTs could slow down by just as much as we saved by not joining on the SELECTs.
One other thing. AL doesn't stand for Alaska it stands for Alabama . When we fix that, we'll have to remember to fix it in two places.

The point we want to make here is that denormalizing only appears to make things faster. That is, when we deliberately denormalized , we failed to make an improvement overall. In fact, it appears that denormalizing improved one thing, but caused an offsetting problem to pop out somewhere else.

Relational proponents like to notice that sort of thing. They say it proves that you might as well try to optimize for a broad range of cases, because when you optimize for a single case, you muck things up elsewhere. And here we can address the claim that "it is practical" to denormalize . That's truebut "it" is a singular pronoun. Normalizing is an attempt to optimize generally . Denormalizing is an attempt to normalize for a particular application, to introduce a bias.

By following normalization rules, you will be splitting a tablefor example, where the original table was Table1 (column1 , column2 , column3 , column4) , you end up with Table1 (column1 , column2 , T2_ID) and Table2 (T2_ID , column3 , column4) . The advantages of this arrangement are three-fold :

Tables are narrower. Therefore, scans are quicker.
There are more tables. Therefore partitioning is easier.
Irrelevant information is separated out. Therefore a data change affecting column4 won't block a transaction affecting column1 locks will be less frequent.

The Bottom Line: Normalization

"There is basically no need for OLTP denormalization today."

Richard Yevich and Susan Lawson, DB2 High Performance Design and Tuning , Prentice Hall PTR

Normalizing means smaller rows. Smaller rows mean faster retrievals because (a) the chance of overflowing the operating system cache is smaller, (b) a single page read retrieves more records into cache, and (c) byte-by-byte comparisons involve less bytes. On the other hand, (a) fragmentation is more likely as rows are updated or deleted, (b) string comparisons always involve an initial read of the size (the parser cannot decide in advance how many repetitions to plug into the loop that it produces), and (c) row position within a file can't be calculated in advance because row start is not a simple matter of saying ( row size * row number ).

Normalizing also has these effects: (a) table scans for a particular column are twice as fast (on average) because the number of rows per page is greater, (b) parallel queries are possible, (c) fetch and lock times are reduced, and (d) INSERTs are slowed by primary-key/foreign-key integrity checks.

Normalizing helps design and programming, and it makes things more efficient too (at least as far as 3NF, which is easy to reach with case tools). The downside is that too many slow joins happen if you ask for all information related to an entitybut really, you should be thinking about making the join faster or cutting out unnecessary joins rather than opting to denormalize.