Normalization and Denormalization


One of the biggest challenges of DB2 database design and implementation is handling normalization. This section will help you by defining normalization and denormalization and providing tips and guidelines on when and how to denormalize.

Normalization

Normalization is a database design approach that minimizes data redundancy and optimizes data structures by systematically and properly placing data elements into appropriate groupings. A normalized data model can be translated into a physical database that is organized correctly.

Another way of putting it is that normalization is the process of putting one fact in one appropriate place. E.F. Codd, the father of the relational model, also created normalization. Like the relational model, normalization is based on the mathematical principles of set theory.

It is important to remember that normalization is a logical process and does not necessarily dictate physical database design. A normalized data model will ensure that each entity is well-formed and that each attribute is assigned to the proper entity.

Of course, the best situation is when a normalized logical data model can be physically implemented without major modifications. But keep in mind that normalization optimizes updates at the expense of retrievals. When a fact is stored in only one place, retrieving many different but related facts usually requires going to many different places. This tends to slow the retrieval process. Updating is quicker, however, because the fact you're updating exists in only one place.

Your DB2 tables should be based on a normalized logical data model. With a normalized data model, one fact is stored in one place, related facts about a single entity are stored together, and every column of each entity refers non-transitively to only the unique identifier for that entity.

Although an in-depth discussion of normalization is beyond the scope of this book, brief definitions of the first three normal forms follow.

  • In first normal form, all entities must have a unique identifier, or key, that can be composed of one or more attributes. In addition, all attributes must be atomic and non-repeating. (Atomic means that the attribute must not be composed of multiple attributes. For example, EMPNO should not be composed of SSN and FIRSTNAME because these are separate attributes.)

  • In second normal form, all attributes that are not part of the key must depend on the entire key for that entity.

  • In third normal form, all attributes that are not part of the key must not depend on any other non-key attributes.

So, normalization can be summed up as "every attribute must depend upon the key, the whole key, and nothing but the key, so help me Codd."
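To make the third normal form rule concrete, consider a hypothetical order entity that carries the customer's name. Because CUST_NAME depends on CUST_NO, a non-key attribute, the design violates third normal form. The following sketch shows the offending table and a normalized alternative (all names are illustrative):

 CREATE TABLE ORDER_UNNORMALIZED
    (ORDER_NO     CHAR(10)        NOT NULL,
     CUST_NO      CHAR(11)        NOT NULL,
     CUST_NAME    VARCHAR(40)     NOT NULL,  -- depends on CUST_NO, not on the key ORDER_NO
     PRIMARY KEY (ORDER_NO)
    );

 -- Normalized: the customer's name is one fact in one place
 CREATE TABLE CUSTOMER
    (CUST_NO      CHAR(11)        NOT NULL,
     CUST_NAME    VARCHAR(40)     NOT NULL,
     PRIMARY KEY (CUST_NO)
    );

 CREATE TABLE CUST_ORDER
    (ORDER_NO     CHAR(10)        NOT NULL,
     CUST_NO      CHAR(11)        NOT NULL,
     PRIMARY KEY (ORDER_NO)
    );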

Denormalization

Speeding the retrieval of data from DB2 tables is a frequent requirement for DBAs and performance analysts. One way to accomplish this is to denormalize DB2 tables for physical implementation. The opposite of normalization, denormalization is the process of putting one fact in many places. This speeds data retrieval at the expense of data modification. This is not necessarily a bad decision, but should be undertaken only when a completely normalized design will not perform optimally.

All that is really required to physically implement a DB2 database is first normal form; that is, the data must be in rows and columns. But that is the bare minimum, and anything less than fully normalized can cause data integrity problems and needlessly complex INSERT, UPDATE, and DELETE operations.

Denormalization should be done based on the processing needs of applications accessing the data. Any denormalization decision must be based on how much the resulting design will optimize the most important queries. On the other hand, you must consider how difficult it will become to modify the data once it is implemented as a denormalized design. Consider these issues before denormalizing:

  • Can the system achieve acceptable performance without denormalization?

  • Will denormalization render the database design unusable for ad hoc queries? (Is specialized expertise required to code queries against the denormalized design?)

  • Will the performance of the system still be unacceptable after denormalization?

  • Will the system be less reliable due to denormalization?

If the answer to any of these questions is "yes," you should not denormalize your tables because the benefit will not exceed the cost. If, after considering these issues, you decide to denormalize, there are rules you should follow.

  • If enough DASD is available, create the fully normalized tables and populate denormalized versions using the normalized tables. Access the denormalized tables in a read-only fashion. Create a controlled and scheduled population function to keep denormalized and normalized tables synchronized.

  • If sufficient DASD does not exist, maintain denormalized tables programmatically. Be sure to update each denormalized table representing the same entity at the same time; alternatively, provide a rigorous schedule whereby table updates are synchronized. If you cannot avoid inconsistent data, inform all users of the implications.

  • When updating any column that is replicated in many tables, update all copies simultaneously, or as close to simultaneously as possible given the physical constraints of your environment.

  • If denormalized tables are ever out of sync with the normalized tables, be sure to inform users that batch reports and online queries may not show up-to-date information.

  • Design the application so that it can be easily converted from denormalized tables to normalized tables.

There is only one reason to denormalize a relational design: performance. Several indicators help identify systems and tables that are candidates for denormalization. These indicators follow:

  • Many critical queries and reports rely on data from more than one table. Often these requests must be processed in an online environment.

  • Repeating groups must be processed in a group instead of individually.

  • Many calculations must be applied to one or many columns before queries can be answered successfully.

  • Tables must be accessed in different ways by different users during the same timeframe.

  • Many large primary keys are clumsy to query and use a large amount of DASD when carried as foreign key columns in related tables.

  • Certain columns are queried a large percentage of the time. (Consider 60% or greater as a cautionary number flagging denormalization as an option.)

Many types of denormalized tables work around the problems caused by these indicators. Table 5.10 summarizes the types of denormalization, with a short description of when each type is useful. The sections that follow describe these denormalization types in greater detail.

Table 5.10. Types of Denormalization

Denormalization        Use
Pre-joined Tables      When the cost of joining is prohibitive
Report Tables          When specialized critical reports are needed
Mirror Tables          When tables are required concurrently by two types of environments
Split Tables           When distinct groups use different parts of a table
Combined Tables        When one-to-one relationships exist
Redundant Data         To reduce the number of table joins required
Repeating Groups       To reduce I/O and (possibly) DASD
Derivable Data         To eliminate calculations and algorithms
Speed Tables           To support hierarchies


Denormalization: Pre-joined Tables

If two or more tables need to be joined on a regular basis by an application, but the cost of the join is prohibitive, consider creating tables of pre-joined data. The pre-joined tables should

  • Contain no redundant columns.

  • Contain only the columns necessary for the application to meet its processing needs.

  • Be created periodically using SQL to join the normalized tables.

The cost of the join is incurred only once, when the pre-joined tables are created. A pre-joined table can be queried efficiently because every new query does not incur the overhead of the table join process.
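The following is a minimal sketch of the technique using the DSN8810 sample tables; the pre-joined table name, EMP_DEPT_PREJOINED, and its column list are hypothetical choices driven by a single application's needs:

 CREATE TABLE EMP_DEPT_PREJOINED
    (EMPNO        CHAR(6)         NOT NULL,
     LASTNAME     VARCHAR(15)     NOT NULL,
     WORKDEPT     CHAR(3)         NOT NULL,
     DEPTNAME     VARCHAR(36)     NOT NULL,
     PRIMARY KEY (EMPNO)
    );

 -- Scheduled, periodic population from the normalized tables;
 -- the join cost is paid once here instead of in every query
 DELETE FROM EMP_DEPT_PREJOINED;
 INSERT INTO EMP_DEPT_PREJOINED
    SELECT E.EMPNO, E.LASTNAME, E.WORKDEPT, D.DEPTNAME
    FROM   DSN8810.EMP  E,
           DSN8810.DEPT D
    WHERE  E.WORKDEPT = D.DEPTNO;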

Denormalization: Report Tables

Reports requiring special formatting or manipulation often are impossible to develop using SQL or QMF alone. If critical or highly visible reports of this nature must be viewed in an online environment, consider creating a table that represents the report. The table then can be queried using SQL or QMF.

Create the report using the appropriate mechanism in a batch environment. The report data then can be loaded into the report table in the appropriate sequence. The report table should

  • Contain one column for every column of the report.

  • Have a clustering index on the columns that provide the reporting sequence.

  • Not subvert relational tenets (for example, atomic data elements).

Report tables are ideal for storing the results of outer joins or other complex SQL statements. If an outer join is coded and then loaded into a table, you can retrieve the results of the outer join using a simple SELECT statement instead of using the UNION technique discussed in Chapter 1.
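As a sketch, assuming a hypothetical report table named EMP_DEPT_REPORT with one column per report column, the outer join results can be stored once and then read back with a simple SELECT in the reporting sequence:

 INSERT INTO EMP_DEPT_REPORT
    SELECT D.DEPTNO, D.DEPTNAME, E.EMPNO, E.LASTNAME
    FROM   DSN8810.DEPT D
           LEFT OUTER JOIN DSN8810.EMP E
           ON D.DEPTNO = E.WORKDEPT;

 -- The report program needs no outer join logic at all
 SELECT   DEPTNO, DEPTNAME, EMPNO, LASTNAME
 FROM     EMP_DEPT_REPORT
 ORDER BY DEPTNO, LASTNAME;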

Denormalization: Mirror Tables

If an application system is very active, you might need to split processing into two (or more) distinct components. This requires the creation of duplicate, or mirror, tables.

Consider an application system that has heavy online traffic during the morning and early afternoon. The traffic consists of querying and updating data. Decision-support processing also is performed on the same application tables during the afternoon. The production work in the afternoon disrupts the decision-support processing, resulting in frequent timeouts and deadlocks.

These disruptions could be corrected by creating mirror tables: a foreground set of tables for the production traffic and a background set of tables for the decision-support reporting. To keep the application data synchronized, you must establish a mechanism to migrate the foreground data periodically to the background tables. (One such mechanism is a batch job executing the UNLOAD utility on the data followed by running the LOAD utility to the target.) Migrate the information as often as necessary to ensure efficient and accurate decision-support processing.
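Where the utilities are not an option, a minimal SQL-only refresh sketch follows, assuming hypothetical foreground and background tables named EMP_FG and EMP_BG with identical definitions:

 -- Run on the agreed schedule, during a quiet period for EMP_FG
 DELETE FROM EMP_BG;
 INSERT INTO EMP_BG
    SELECT *
    FROM   EMP_FG;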

Because the access needs of decision support and the production environment often are considerably different, different data definition decisions, such as indexing and clustering, may be chosen.

Denormalization: Split Tables

If separate pieces of one normalized table are accessed by different and distinct groups of users or applications, consider splitting the table into one denormalized table for each distinct processing group. Retain the original table if other applications access the entire table; in this scenario, the split tables should be handled as a special case of mirror table.

Tables can be split in two ways: vertically or horizontally. Refer to Figure 5.5. A vertical split cuts a table column-wise, such that one group of columns is placed into a new table and the remaining columns are placed in another new table. Both of the split tables should retain the primary key columns. A horizontally split table is a row-wise split. To split a table horizontally, rows are classified into groups by key ranges. The rows from one key range are placed in one table, those from another key range are placed in a different table, and so on.

Figure 5.5. Two methods of splitting tables.



When splitting tables, designate one of the two tables as the parent table for referential integrity. If the original table still exists, it should be the parent table in all referential constraints. In this case, do not set up referential integrity for the split tables; they are derived from a referentially intact source.

When you split a table vertically, include one row per primary key in both tables to ease retrieval across tables. Do not eliminate rows from either of the two tables. Otherwise, updating and retrieving data from both tables is unnecessarily complicated.

When you split a table horizontally, try to split the rows between the new tables to avoid duplicating any one row in each new table. Simply stated, the operation of UNION ALL , when applied to the horizontally split tables, should not add more rows than those in the original, unsplit tables.
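The following sketch illustrates both techniques against a hypothetical CUSTOMER table; the column groupings, the split-table names, and the key-range boundary at 'M' are illustrative assumptions:

 -- Vertical split: both pieces retain the primary key, CUST_NO
 CREATE TABLE CUST_NAME_ADDR
    (CUST_NO      CHAR(11)        NOT NULL,
     CUST_NAME    VARCHAR(40)     NOT NULL,
     CUST_ADDR    VARCHAR(100),
     PRIMARY KEY (CUST_NO)
    );

 CREATE TABLE CUST_CREDIT
    (CUST_NO       CHAR(11)       NOT NULL,
     CREDIT_LIMIT  DECIMAL(15,2),
     CREDIT_RATING CHAR(3),
     PRIMARY KEY (CUST_NO)
    );

 -- Horizontal split: non-overlapping key ranges, so UNION ALL of
 -- the pieces returns no more rows than the original table
 INSERT INTO CUST_A_TO_L
    SELECT * FROM CUSTOMER WHERE CUST_NO <  'M';
 INSERT INTO CUST_M_TO_Z
    SELECT * FROM CUSTOMER WHERE CUST_NO >= 'M';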

Denormalization: Combined Tables

If tables have a one-to-one relationship, consider combining them into a single table. Sometimes, one-to-many relationships can be combined into a single table, but the data update process is significantly complicated because of the increase in redundant data.

For example, consider combining the sample tables DSN8810.DEPT and DSN8810.EMP into a large table called DSN8810.EMP_WITH_DEPT. (Refer to Appendix A for a definition of the sample tables.) This new table would contain all the columns of both tables, except the DEPTNO column of DSN8810.DEPT. This column is excluded because it contains the same data as the WORKDEPT join column of DSN8810.EMP.

Each row therefore contains all the employee information, in addition to all the department information, for that employee. The department data is duplicated throughout the combined table because a department can employ many people. Tables of this sort should be considered pre-joined tables, not combined tables, and treated accordingly. Only tables with one-to-one relationships should be considered combined tables.
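As a sketch of a legitimate combined table, assume a hypothetical EMP_PARKING table in a strict one-to-one relationship with the employee table; its columns can simply be carried on the employee row:

 -- One row per employee in both source tables, so combining
 -- them introduces no redundant data
 CREATE TABLE EMP_COMBINED
    (EMPNO         CHAR(6)        NOT NULL,
     LASTNAME      VARCHAR(15)    NOT NULL,  -- from the employee table
     PARKING_LOT   CHAR(4),                  -- from EMP_PARKING
     PARKING_SPACE CHAR(5),                  -- from EMP_PARKING
     PRIMARY KEY (EMPNO)
    );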

Denormalization: Redundant Data

Sometimes one or more columns from one table are accessed whenever data from another table is accessed. If these columns are accessed frequently with tables other than those in which they were initially defined, consider carrying them in the other tables as redundant data. By carrying the additional columns, you can eliminate joins and increase the speed of data retrieval. Because this technique violates a tenet of database design, it should be attempted only if the normal access cannot efficiently support your business.

Consider, once again, the DSN8810.DEPT and DSN8810.EMP tables. If most employee queries require the name of the employee's department, this column could be carried as redundant data in the DSN8810.EMP table. (Do not remove the column from the DSN8810.DEPT table, though.)
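A sketch of the technique follows; the redundant column is populated once from DSN8810.DEPT, after which application code or a scheduled job must keep it synchronized:

 ALTER TABLE DSN8810.EMP
    ADD DEPTNAME VARCHAR(36);

 UPDATE DSN8810.EMP E
    SET DEPTNAME = (SELECT D.DEPTNAME
                    FROM   DSN8810.DEPT D
                    WHERE  D.DEPTNO = E.WORKDEPT);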

Columns you want to carry as redundant data should have the following attributes:

  • Only a few columns are necessary to support the redundancy.

  • The columns are stable, that is, updated infrequently.

  • The columns are used by many users or a few important users.

Denormalization: Repeating Groups

When repeating groups are normalized, they are implemented as distinct rows instead of distinct columns. This usually results in higher disk storage requirements and less efficient retrieval, because there are more rows in the table and more rows must be read to satisfy queries that access the entire repeating group (or a subset of the repeating group).

Sometimes you can achieve significant performance gains when you denormalize the data by storing it in distinct columns. These gains, however, come at the expense of flexibility.

For example, consider an application that stores repeating group information in the following normalized table:

 

 CREATE TABLE user.PERIODIC_BALANCES
    (CUSTOMER_NO        CHAR(11)        NOT NULL,
     BALANCE_PERIOD     SMALLINT        NOT NULL,
     BALANCE            DECIMAL(15,2),
     PRIMARY KEY (CUSTOMER_NO, BALANCE_PERIOD)
    )

Available storage and DB2 requirements are the only limits to the number of balances per customer that you can store in this table. If you decide to string out the repeating group, BALANCE, into columns instead of rows, you must limit the number of balances to be carried in each row. The following is an example of stringing out repeating groups into columns after denormalization:

 

 CREATE TABLE user.PERIODIC_BALANCES
    (CUSTOMER_NO        CHAR(11)        NOT NULL,
     PERIOD1_BALANCE    DECIMAL(15,2),
     PERIOD2_BALANCE    DECIMAL(15,2),
     PERIOD3_BALANCE    DECIMAL(15,2),
     PERIOD4_BALANCE    DECIMAL(15,2),
     PERIOD5_BALANCE    DECIMAL(15,2),
     PERIOD6_BALANCE    DECIMAL(15,2),
     PRIMARY KEY (CUSTOMER_NO)
    ) IN SAMPLE.BALANCE;

In this example, only six balances can be stored for each customer. The number six is not important, but the limit on the number of values is important; it reduces the flexibility of data storage and should be avoided unless performance needs dictate otherwise.
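The flexibility cost is easy to see in aggregation queries. Compare the two designs; the customer number shown is illustrative, and NULL balances would need additional handling in the denormalized version:

 -- Normalized: one simple aggregate, whatever the period count
 SELECT  AVG(BALANCE)
 FROM    user.PERIODIC_BALANCES
 WHERE   CUSTOMER_NO = '00000000001';

 -- Denormalized: the calculation must name every column and must
 -- be recoded whenever the number of periods changes
 SELECT  (PERIOD1_BALANCE + PERIOD2_BALANCE + PERIOD3_BALANCE +
          PERIOD4_BALANCE + PERIOD5_BALANCE + PERIOD6_BALANCE) / 6
 FROM    user.PERIODIC_BALANCES
 WHERE   CUSTOMER_NO = '00000000001';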

Before you decide to implement repeating groups as columns instead of rows, be sure that the data:

  • Rarely (preferably never) is aggregated, averaged, or compared in the row

  • Occurs in a statistically well-behaved pattern

  • Has a stable number of occurrences

  • Usually is accessed collectively

  • Has a predictable pattern of insertion and deletion

If any of the preceding criteria is not met, some SQL statements could be difficult to code, making the data less available due to inherently unsound data-modeling practices. This should be avoided, because you usually denormalize data to make it more readily available.

Denormalization: Derivable Data

If the cost of deriving data with complicated formulas is prohibitive, consider storing the derived data instead of calculating it. When the underlying values that compose the calculated value change, the stored derived data must be changed also; otherwise, inconsistent information could be reported.

Sometimes you cannot immediately update derived data elements when the columns on which they rely change. This can occur when the tables containing the derived elements are offline or are being operated on by a utility. In this situation, time the update of the derived data so that it occurs immediately after the table is available for update. Outdated derived data should never be made available for reporting and queries.
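One way to keep stored derived data synchronized is a trigger. The following is a sketch only, assuming a hypothetical EMP_TOTALS table that stores TOTAL_COMP derived from the sample employee table's SALARY, BONUS, and COMM columns:

 CREATE TRIGGER UPD_COMP
    AFTER UPDATE OF SALARY, BONUS, COMM ON DSN8810.EMP
    REFERENCING NEW AS N
    FOR EACH ROW MODE DB2SQL
    -- re-derive the stored value whenever a source column changes
    UPDATE EMP_TOTALS
    SET    TOTAL_COMP = N.SALARY + N.BONUS + N.COMM
    WHERE  EMPNO = N.EMPNO;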

Denormalization: Hierarchies

A hierarchy is easy to support using a relational database such as DB2, but difficult to retrieve information from efficiently. For this reason, applications that rely on hierarchies often contain denormalized tables to speed data retrieval. Two examples of these types of systems are a Bill of Materials application and a Departmental Reporting system. A Bill of Materials application typically records information about parts assemblies, in which one part is composed of other parts. A Department Reporting system typically records the departmental structure of an organization, indicating which departments report to which other departments.

An effective way to denormalize a hierarchy is to create speed tables. Figure 5.6 depicts a department hierarchy for a given organization. The hierarchic tree is built so that the top node is the entire corporation. The other nodes represent departments at various levels in the corporation.

Figure 5.6. A department hierarchy.



Department 123456 is the entire corporation. Departments 1234 and 56 report directly to 123456. Departments 12, 3, and 4 report directly to 1234 and indirectly to department 123456, and so on. This can be represented in a DB2 table as follows:

Department Table

DEPTNO    PARENT_DEPTNO    other columns
123456    ------
1234      123456
56        123456
12        1234
3         1234
4         1234
1         12
2         12
5         56
6         56

This DB2 table is a classic relational implementation of a hierarchy. There are two department columns: one for the parent and one for the child. The table's data is an accurately normalized version of this hierarchy, containing everything represented in Figure 5.6. The complete hierarchy can be rebuilt with the proper data retrieval instructions.

Even though the implementation effectively records the entire hierarchy, a query to report all the departments under any other department is time consuming to code and inefficient to process. A sample query that returns all the departments reporting to the corporate node, 123456, is illustrated by this rather complex SQL statement:

 

 SELECT  DEPTNO
 FROM    DEPARTMENT
 WHERE   PARENT_DEPTNO = '123456'
 UNION
 SELECT  DEPTNO
 FROM    DEPARTMENT
 WHERE   PARENT_DEPTNO IN
         (SELECT  DEPTNO
          FROM    DEPARTMENT
          WHERE   PARENT_DEPTNO = '123456')
 UNION
 SELECT  DEPTNO
 FROM    DEPARTMENT
 WHERE   PARENT_DEPTNO IN
         (SELECT  DEPTNO
          FROM    DEPARTMENT
          WHERE   PARENT_DEPTNO IN
                  (SELECT  DEPTNO
                   FROM    DEPARTMENT
                   WHERE   PARENT_DEPTNO = '123456'));

This query can be built only if you know in advance the total number of possible levels the hierarchy can achieve. If there are n levels in the hierarchy, you need n-1 UNIONs. The previous SQL statement assumes that only three levels are between the top and bottom of the department hierarchy. For every possible level of the hierarchy, you must add a more complex SELECT statement to the query in the form of a UNION. This implementation works, but is difficult to use and inefficient.

A faster way to query a hierarchy is to use a speed table. A speed table contains a row for every combination of the parent department and all its dependent departments, regardless of the level. Data is replicated in a speed table to increase the speed of data retrieval. The speed table for the hierarchy presented in Figure 5.6 is:

PARENT_DEPTNO    CHILD_DEPTNO    LEVEL    DETAIL    ...other columns
123456           1234            1        N
123456           56              1        N
123456           12              2        N
123456           1               3        Y
123456           2               3        Y
123456           3               2        Y
123456           4               2        Y
123456           5               2        Y
123456           6               2        Y
1234             12              1        N
1234             1               2        Y
1234             2               2        Y
1234             3               1        Y
1234             4               1        Y
3                3               1        Y
4                4               1        Y
12               1               1        Y
12               2               1        Y
1                1               1        Y
2                2               1        Y
56               5               1        Y
56               6               1        Y
5                5               1        Y
6                6               1        Y

Contrast this to the previous table, which recorded only the immediate children for each parent. The PARENT_DEPTNO column is the top of the hierarchy. The CHILD_DEPTNO column represents all the dependent nodes of the parent. The LEVEL column records the level in the hierarchy. The DETAIL column contains Y if the row represents a node at the bottom of the hierarchy, or N if the row represents a node that is not at the bottom. A speed table commonly contains other information needed by the application. Typical information includes the level in the hierarchy for the given node and, if the order within a level is important, the sequence of the nodes at the given level.

After the speed table has been built, you can write speed queries. The following are several informative queries. They would be inefficient if executed against the classical relational hierarchy, but are efficient when run against a speed table.

To retrieve all dependent departments for department 123456,

 

 SELECT   CHILD_DEPTNO
 FROM     DEPARTMENT_SPEED
 WHERE    PARENT_DEPTNO = '123456';

To retrieve only the bottom-most, detail departments that report to department 123456,

 

 SELECT   CHILD_DEPTNO
 FROM     DEPARTMENT_SPEED
 WHERE    PARENT_DEPTNO = '123456'
 AND      DETAIL = 'Y';

To return the complete department hierarchy for department 123456,

 

 SELECT   PARENT_DEPTNO, CHILD_DEPTNO, LEVEL
 FROM     DEPARTMENT_SPEED
 WHERE    PARENT_DEPTNO = '123456'
 ORDER BY LEVEL;

A speed table commonly is built using a program written in COBOL or another high-level language. SQL alone usually is too inefficient to handle the creation of a speed table.

NOTE


With the advent of recursive SQL support in DB2 V8, the need to denormalize for hierarchies is greatly diminished. The nature of recursive SQL makes it ideal for traversing hierarchical data stored in normalized DB2 tables. Refer back to Chapter 2 for more information on recursion.
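For example, a recursive common table expression such as the following sketch, run against the DEPARTMENT table shown earlier (the column name LVL is an illustrative choice), returns every department under the corporate node without any speed table:

 WITH DEPT_HIER (DEPTNO, LVL) AS
    (SELECT  DEPTNO, 1
     FROM    DEPARTMENT
     WHERE   PARENT_DEPTNO = '123456'
     UNION ALL
     SELECT  D.DEPTNO, H.LVL + 1
     FROM    DEPARTMENT D, DEPT_HIER H
     WHERE   D.PARENT_DEPTNO = H.DEPTNO)
 SELECT  DEPTNO, LVL
 FROM    DEPT_HIER;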


Denormalization to Avoid Large Page Sizes

You can denormalize your tables to avoid using page sizes greater than 4KB. If a table's row length is too large to fit on a 4KB page, DB2 will force the use of a larger page size (and buffer pool). DB2 supports 8KB, 16KB, and 32KB pages in addition to the standard 4KB page size.

A larger page size can increase overhead. For example, DB2 arranges a table space assigned to the BP32K buffer pool as eight 4KB physical pages per 32KB logical page, so every logical I/O to a 32KB table space requires eight physical I/Os. You can use the vertical split technique to denormalize tables that would otherwise require pages greater than 4KB.


With DB2 V8 you have the ability to specify the actual CI size of the VSAM file to synchronize it with the DB2 page size. Using this approach is generally preferable to, and easier than, vertically splitting the table.


Denormalization Guidelines

Be sure to follow these tips and techniques when making your denormalization decision, and afterward, to manage your databases effectively.

Look for Ways to Avoid Denormalization

Whenever possible, you should avoid denormalizing your DB2 databases. Look for DB2 features and functionality that can be used in place of a denormalized implementation.


For example, DB2 V8 introduces recursive SQL. Therefore, as of DB2 V8, you should try to implement fully normalized hierarchies and query them using recursive queries instead of using a denormalization technique like speed tables. More information on recursion is provided in Chapter 2.


Periodically Test the Validity of Denormalization

The decision to denormalize never should be made lightly: Denormalization involves a lot of administrative dedication. This dedication takes the form of documenting denormalization decisions, ensuring valid data, scheduling migration, and keeping end users informed about the state of the tables. An additional category of administrative overhead is periodic analysis.

When an application has denormalized data, you should review the data and the environment periodically. Changes in hardware, software, and application requirements can alter the need for denormalization. To verify whether denormalization still is a valid decision, ask the following questions:

  • Have the application-processing requirements changed such that the join criteria, the timing of reports, or the transaction throughput no longer require denormalized data?

  • Did a new software release change performance considerations? For example, does the introduction of 8KB and 16KB page sizes in DB2 Version 6 alleviate the need for denormalization to avoid 32K page sizes? Or, did the introduction of a new join method or faster join processing undo the need for pre-joined tables?

  • Did a new hardware release change performance considerations? For example, does a CPU upgrade reduce the amount of CPU consumption such that denormalization no longer is necessary?

In general, periodically test whether the extra cost related to processing with normalized tables justifies the benefit of denormalization. Monitor and reevaluate all denormalized applications by measuring the following criteria:

  • I/O saved

  • CPU saved

  • Complexity of update programming

  • Cost of returning to a normalized design

NOTE

To summarize, remember these basic rules:

  • All things being equal, always favor a normalized design over a denormalized design.

  • Normalization optimizes data modification at the expense of data access.

  • Denormalization optimizes data access at the expense of data modification.

