2.4 Undoing Normalization

Undoing Normalization, or removing excessive data model granularity, is essential to the performance of data retrieval from a relational data model. This is true not only for reporting and data warehouse databases but also, quite significantly, for OLTP databases with small transactional units.

I call the removal of Normalization for performance a process of undoing because Denormalization in its purest form, being Normalization in reverse, does not cover all practical possibilities. There are several areas of interest when it comes to undoing granularity.

  • Denormalization:   undoing various Normal Forms, some more than others.

  • Some useful tricks:   various tricks to speed up data access.

  • Denormalization using unusual Oracle Database objects:   creation of special types of database objects used to cluster, pre-sort and pre-construct data, avoiding excessive amounts of complex repetitive SQL.

Normalization is removed or degraded for one reason: to improve performance. As already stated, it is important to note that removal of Normalization granularity is not required only in data warehousing and reporting environments. In my experience, even OLTP transactional data models have often required reduction of the granularity produced by over-Normalization. Any application retrieving data from an over-Normalized database can potentially have performance problems. A very deeply Normalized database is most effective only for single-row insert, update, and delete transactions, i.e. changing single data items.

In the past I have seen some very poorly performing OLTP applications, even ones with small transactional units. This poor performance is often the result of severely granular data models. These systems did not contain large numbers of rows or occupy a lot of physical space. Even so, brief data selection listings filling a single browser page more often than not executed join queries comprising many tables. Applications of this type sometimes fail, their creators fail, or both. In these cases there is a very strong argument for designing a data model with the functionality of the application in mind, and not just the beauty of the granularity of an enormously over-Normalized relational structure.

One particular project I have in mind had queries with joins of more than 15 tables, a database under 10 MB, and query return times of over 30 seconds in initial testing. These times were reduced to a few seconds after extensive changes to embedded SQL code. However, by then it was too late to change the data model without extensive application recoding, so some very highly tuned and convoluted embedded SQL code remained. Maintenance will probably be a nightmare, if it is even possible.

The conclusion is this: Normalization granularity should be degraded in order to speed up SQL data retrieval. How do we do this? The easiest and often the most effective method, from the perspective of the data model, is a drastic reduction in the number of tables in joins. It is possible to tune an SQL statement joining 15 tables, but it is extremely difficult. If the data model can possibly be changed, then begin with the data model. Changing the data model obviously affects any SQL code, be it in stored procedures or embedded in applications.

So now we know why we might need to do the unexpected with a nicely Normalized data model. Let's start with Denormalization.

2.4.1 Denormalization

What is Denormalization? Simply put, Denormalization is the opposite of Normalization. Where Normalization is an increase in granularity by removing duplication, Denormalization is an attempt to remove granularity by reintroducing duplication previously removed by Normalization. Denormalization is usually required to assist performance because a highly granular structure is useful only for retrieving very precise, small amounts of information rather than large amounts of information. Denormalization assists analysis and reporting rather than the changing of specific data items. In simple terms, Denormalize to decrease the number of tables in joins; joins are slow! Simple SQL statements are fast and easy to tune and should be the order of the day wherever possible.

It is sometimes the case that table structure is much too granular or possibly even incompatible with structure imposed by applications. This particularly occurs when the data model is designed with perfection of Normalization in mind without knowledge of or perhaps even consideration for realistic application requirements. It is a lucky development team that understands application requirements completely when the data model is being built. Denormalization is a solution in this case.

Denormalization is not rocket science. Denormalization is essentially Normalization in reverse; both are common sense.

Reminding Ourselves about Normalization

A relational database relates subsets of a dataset to each other. How do we relate these subsets? What do we mean by a subset and what is a data set?

  • A data set is the equivalent of a database in Sybase or Ingres and a schema in Oracle Database. A data set is a set of tables.

  • A subset is a table. A table defines the structure and contains the row and column data for each subset.

  • Tables are related to each other by linking them based on common items and values between two tables, namely primary and foreign keys.

At this point, if you have not yet read the previous chapter on Normalization, please do so. I have an unorthodox approach to Normalization, and much of that approach is simply reversed in my approach to Denormalization. My perspective is that Normalization is mostly much too complex for commercial practicality.

Why Denormalize?

Denormalize to speed up a poorly performing database by removing potential for mutable and complex SQL table joins. Joins are difficult and sometimes even impossible to tune.

What to Look for to Denormalize

Mutable and Complex Joins

A mutable join is a join of more than two tables. A complex join is a mutable join including filtering. The more tables in a join the slower it will execute and the more difficult it will be to tune. If these types of joins are heavily executed, simplification by Denormalizing offending tables could help performance. Following is an ANSI format SQL statement showing a mutable join of eight tables in the Accounts data model (see Appendix A). This query is completely ridiculous but it is the sort of complexity that you might want to search for.

SELECT cu.customer_id, o.order_id, ol.seq#, ca.category_id
FROM customer cu JOIN orders o ON (cu.customer_id = o.customer_id)
   JOIN transactions t ON (o.order_id = t.order_id)
      JOIN transactionsline tl ON (t.transaction_id = tl.transaction_id)
   JOIN ordersline ol ON (o.order_id = ol.order_id)
      JOIN stockmovement sm ON (tl.stockmovement_id = sm.stockmovement_id
         AND ol.stockmovement_id = sm.stockmovement_id)
         JOIN stock s ON (s.stock_id = sm.stock_id)
            JOIN category ca ON (ca.category_id = s.category_id)
WHERE ca.text = 'Software';

The WHERE clause adds filtering, making the mutable join a complex mutable join. The ANSI format of this query is no more difficult to tune than the Oracle Proprietary format. This example shows the number of tables joined very clearly.

Mutable Joins to Find Few Columns

When constructing SQL statement joins, are you finding many tables in joins where those tables are scattered throughout the entity relationship diagram? When retrieving those columns, are you passing through one or more entities from which no columns are retrieved? This is inefficient because every table passed through adds another table to the join. This problem can be resolved in two ways. The first is Denormalization, which may be difficult because so much Denormalization could be required that it affects functionality in too much application code. The second possible solution is to maintain copies of the offending column values in both entities. Refer to the section later in this chapter entitled Copying Columns between Entities (p. 75).

Tip 

This type of joining is a potential indicator that a single data model services multiple loosely connected applications. Loosely connected applications should not always be placed into the same dataset (Oracle Database schema).

Following is the same ANSI format query as above, but with a small change: only the CUSTOMER_ID and CATEGORY_ID columns are retrieved. Of course the join is still ridiculous and would probably never be a requirement for an application, but this is the only way that the stock category can be linked to a customer. This is definitely a problem. In this example Denormalization would be nonsensical, but some type of relationship could possibly be established between the Customer and the Stock tables. On the other hand, new entity relationships would only serve to complicate the data model further.

SELECT cu.customer_id, ca.category_id
FROM customer cu JOIN orders o ON (cu.customer_id = o.customer_id)
   JOIN transactions t ON (o.order_id = t.order_id)
      JOIN transactionsline tl ON (t.transaction_id = tl.transaction_id)
   JOIN ordersline ol ON (o.order_id = ol.order_id)
      JOIN stockmovement sm ON (tl.stockmovement_id = sm.stockmovement_id
         AND ol.stockmovement_id = sm.stockmovement_id)
         JOIN stock s ON (s.stock_id = sm.stock_id)
            JOIN category ca ON (ca.category_id = s.category_id)
WHERE ca.text = 'Software';

Adding Composite Keys

Do entities have composite keys? It is possible to partially Denormalize by adding composite key elements to the primary keys of subset tables. Composite keys are totally contrary to object structure and more compatible with reporting. Java applications may perform poorly when data is accessed using composite keys. Object applications perform best when accessing precise objects containing collections of other precisely defined objects.

Tip 

Composite indexes not constructed as primary keys fall into the realm of alternate indexing. Refer to the section earlier in this chapter entitled Optimizing with Alternate Indexes (p. 59).

A purist's relational database structure will contain composite primary key columns in subset tables for all parent tables above them in the hierarchy. Java applications typically add a single-column unique identifier to each table as a primary key and exclude parent table foreign key columns from the primary key. The only problem with this type of top-down, object-to-relational mapping Java design is that reporting, and even on-screen listings of less than a page, can result in mutable joins: extremely hard-hitting SQL join statements. This is where those 15-table joins start to appear. The root cause of these types of mutable joins is usually extreme over-Normalization, particularly with 3rd, 4th, and 5th Normal Forms. Normal Forms are a relational and not an object methodology. Figures 2.2 and 2.3 contain composite primary keys. Figure 2.4 contains a Java top-down object-relational mapping structure.
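As a minimal sketch of the two approaches, using hypothetical order and order-line tables (the table and column names here are illustrative and are not taken from the Accounts schema):

-- Purist relational form: the child primary key is a composite of the
-- parent key plus a line sequence (illustrative names).
CREATE TABLE orders_rel (
   order_id    NUMBER NOT NULL,
   customer_id NUMBER NOT NULL,
   CONSTRAINT pk_orders_rel PRIMARY KEY (order_id)
);

CREATE TABLE ordersline_rel (
   order_id NUMBER NOT NULL,
   seq#     NUMBER NOT NULL,
   amount   NUMBER(10,2),
   CONSTRAINT pk_ordersline_rel PRIMARY KEY (order_id, seq#),
   CONSTRAINT fk_ordersline_rel FOREIGN KEY (order_id)
      REFERENCES orders_rel (order_id)
);

-- Typical Java top-down form: a single-column surrogate primary key,
-- with the parent key carried only as a non-key foreign key column.
CREATE TABLE ordersline_obj (
   ordersline_id NUMBER NOT NULL,
   order_id      NUMBER NOT NULL,
   amount        NUMBER(10,2),
   CONSTRAINT pk_ordersline_obj PRIMARY KEY (ordersline_id),
   CONSTRAINT fk_ordersline_obj FOREIGN KEY (order_id)
      REFERENCES orders_rel (order_id)
);

The composite form reads naturally for reporting by parent key ranges; the surrogate form suits object access by single identifier but pushes reporting toward multiple-table joins.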

One-to-One Relationships

Look for one-to-one relationships. These may be unnecessary if the removal of null values they achieve comes at the cost of expensive joins. Disk space is cheap; mutable SQL join statements can destroy performance. These one-to-one relationships are typical of 4th and 5th Normal Forms. An example of this problem is shown in Figure 2.8.

Figure 2.8: 4th Normal Form Denormalization

Many-to-Many Join Resolution Entities

Do you have many-to-many join resolution entities? Are they all necessary? Are they all used by applications? Many-to-many join resolution entities are typical of 3rd Normal Form. Quite often when a data set is first designed there may be overuse of 3rd Normal Form, leaving some superfluous, performance-hitting entities which become more and more difficult to remove due to the application code changes required. Two things are important about entities created by 3rd Normal Form:

  1. Are the new entities used by the application? Quite often these entities are created from the perspective of the data model, may never be used by the application, and may not need to be used either.

  2. A clue to the usefulness of these new entities is whether they have meaning. Are the names of these entities simply a combination of the parent entity names, or do they have a name distinct to the new entity itself? The Assignment entity in Figure 2.10 is a good example of a meaningful many-to-many relationship entity, and the CourseStudent entity in Figure 2.9 is a good example of a potentially meaningless many-to-many join resolution entity.

    Figure 2.9: 3rd Normal Form Denormalization

Application Functions versus Entities

Compare the number of functions in the application and the number of tables. If you have far more tables than functional units you might want to Denormalize or simply remove some of the entities because they may have become redundant.

Static Data in Multiple Entities

When searching for static data items such as customer details, are you querying a single table or multiple tables? Querying a single table is much more efficient than querying multiple tables. Static data does not need to be Normalized into separate entities unless those separate entities are accessed individually, using separate SQL statements. Even so, static tables are often small and will generally be full table scanned by the Optimizer regardless of any indexing. A good example of this is the Listed and Exchange entities in Figure 2.11; they do not need to be separate entities.

Intermediary Tables Covering Summary Groupings and Calculations

Extra tables can be created, and perhaps regenerated periodically, to contain summaries of larger tables. These types of tables are more akin to reporting and data warehouse read-only functionality, but they can sometimes be useful in OLTP applications. Such tables are useful as long as less-than-real-time accuracy is acceptable for the summarized information. The script shown creates a table in the Accounts schema (see Appendix A) which is a grouping summary of a join between the COA and GeneralLedger tables.

CREATE TABLE ChartOfAccounts AS
   SELECT coa.coa#, coa.type, coa.subtype, coa.text,
      SUM(gl.dr) AS dr, SUM(gl.cr) AS cr
   FROM coa, generalledger gl
   WHERE coa.coa# = gl.coa#
   GROUP BY coa.coa#, coa.type, coa.subtype, coa.text;
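Because the GeneralLedger table continues to grow, a summary table like this goes stale and would typically be regenerated periodically. A minimal refresh sketch, assuming the table created above and a low-activity batch window (the scheduling mechanism is left out), might be:

-- Periodic batch refresh of the summary table (illustrative only).
TRUNCATE TABLE ChartOfAccounts;

INSERT INTO ChartOfAccounts (coa#, type, subtype, text, dr, cr)
   SELECT coa.coa#, coa.type, coa.subtype, coa.text,
      SUM(gl.dr), SUM(gl.cr)
   FROM coa, generalledger gl
   WHERE coa.coa# = gl.coa#
   GROUP BY coa.coa#, coa.type, coa.subtype, coa.text;

COMMIT;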

Denormalizing by Reversing Normal Forms

Let's explain Denormalization by example. Let's take a look at the reverse of the 3rd, 4th, and 5th Normal Forms. It is unlikely you will ever want to reverse 1st and 2nd Normal Forms. The only case would be where the difference between the number of rows in master and detail tables is extremely small; even so this strategy is not advisable.

A very common cause of performance problems is the over-application of 4th Normal Form to remove null values from entities. Examine the diagram in Figure 2.8. Note the one-to-one or zero relationships between the Customer and Listed entities and between the Customer and Address entities. The zero part of these relationships implies that a customer does not have to have either a stock ticker symbol or an address. Thus 4th Normal Form allows the removal of these two sections to new entities, the Listed and Address entities.

Removal of null-valued columns to new entities can save space, but this is a debatable issue. Since null values are empty and Oracle Database has variable record lengths, the space saving may be negligible. Retrieving all of the customer information from these three entities requires a three-table join. When it is taken into account that other information will be retrieved with customers, such as orders and invoices, very large joins could result. The 4th Normal Form detail shown in the top part of Figure 2.8 could cause very complex mutable joins, difficult to tune, and thus lead to severe performance problems. These types of performance problems are often detected only when applications are placed onto realistically sized production databases at the end of the development cycle. Making data model changes in the maintenance cycle can be so costly as to make them impossible. It is best not to do this sort of thing in the first place.
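Where a 4th Normal Form split like this already exists, reversing it amounts to folding the optional columns back into Customer and simply allowing nulls. A minimal sketch, assuming hypothetical column names for the Figure 2.8 entities (the real Accounts schema columns may differ), might be:

-- Fold the optional 4th Normal Form entities back into the parent entity
-- (assumed column names; the old entities are dropped only after
-- application code has been changed accordingly).
ALTER TABLE customer ADD (
   ticker  VARCHAR2(10),
   address VARCHAR2(256)
);

UPDATE customer cu
   SET ticker  = (SELECT l.ticker FROM listed l
                  WHERE l.customer_id = cu.customer_id),
       address = (SELECT a.address FROM address a
                  WHERE a.customer_id = cu.customer_id);

COMMIT;

DROP TABLE listed;
DROP TABLE address;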

Now let's take a look at 3rd Normal Form. One cannot really present the argument that 3rd Normal Form is for academics and is commercially unusable because 3rd Normal Form is often usable. However, the point to make about 3rd Normal Form is that every 3rd Normal Form many-to-many join resolution entity should be closely examined. Is it really needed and is it meaningful?

Look at the diagram in Figure 2.9. Would you really need to locate the details of a single course for a single student, or vice versa? Think about it from the perspective of how your application retrieves this type of data and how it presents it to the user. Does it ever require you to list the details of a single course on which a single student is enrolled? Perhaps yes. If so, then that information may already have been pre-selected from the database. The point is this: 3rd Normal Form many-to-many join resolution entities should be created only when they have meaning or when they are absolutely required.

So 3rd Normal Form should create meaningful entities from many-to-many relationships. 3rd Normal Form is best not used to create many-to-many join resolution entities such as the CourseStudent entity in Figure 2.9. Why?

Look at the schema shown in Figure 2.10. Note how the Assignment entity does two things. Firstly, it does resolve a many-to-many join between the Task and Employee entities. Secondly, and more importantly, the Assignment entity has meaning in itself since it defines the assignment of tasks to employees.

Figure 2.10: The Employees Schema

The practical application of 3rd Normal Form is that many-to-many joins resolved into new entities should produce a meaningful entity, not meaningless unique identifiers that may or may not ever be used. More often than not, CourseStudent-type many-to-many join resolution entities such as that shown in Figure 2.9 are not used, and if they are, perhaps they should not be.

Now let's retrogress a little. Let's go back to the 4th Normal Form entity structure as presented in Figure 2.8 but include an extra entity. Examine the diagram in Figure 2.11. Note the new entity Exchange and the addition to the Listed table of the EXCHANGE_ID foreign key column. The relationship between the Listed and Exchange entities is a many-to-one relationship, placing the tables into 2nd Normal Form. Why is this 2nd Normal Form? This is because the Exchange table will contain items such as NYSE (New York Stock Exchange) or NASDAQ. The names of exchanges are static relative to the companies listed on those exchanges. Ticker symbols are used to represent companies on stock exchanges.

Figure 2.11: 2nd Normal Form Denormalization

The big arrow in Figure 2.11 shows the removal of the Exchange entity by placing the name of the exchange into the Listed entity. This is effective Denormalization since there is no information on the Exchange entity other than its unique identifier (EXCHANGE_ID) and the name of the exchange. In short, the Exchange entity in this situation is completely unnecessary and will probably ultimately hurt performance by requiring more complex mutable join SQL statements.
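A sketch of this change, assuming hypothetical column names for the Listed and Exchange entities of Figure 2.11, might look as follows:

-- Move the exchange name directly into Listed and remove the Exchange
-- entity (assumed column names; applied only once application code is updated).
ALTER TABLE listed ADD (exchange VARCHAR2(32));

UPDATE listed l
   SET exchange = (SELECT e.exchange FROM exchange e
                   WHERE e.exchange_id = l.exchange_id);

COMMIT;

ALTER TABLE listed DROP COLUMN exchange_id;
DROP TABLE exchange;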

2.4.2 Some Useful Tricks

There are many tricks to Denormalize data that are not effectively reversals of the steps of Normalization. Here are some of them.

  • Copying columns between entities:   making copies of columns between unrelated entities to avoid mutable joins.

  • Placing summary columns into parent entities:   avoiding grouping joins; however, continual real-time updating of these columns can cause hot block issues, so they are sometimes better updated in batch mode during low-activity cycles.

  • Separating active and inactive data:   avoiding searches through rarely used data, reducing the amount of physical space that queries must scan.

  • Mixing heavily and lightly accessed columns:   similar to the above, separating heavily and lightly accessed columns into separate entities, avoiding unnecessary access to rarely used data.

  • Focus on heavily used functionality:   tune specific areas of a data model.

  • Using views:   using views is often useful for application coding but often detrimental to database performance.

  • Local application caching:   store static values on client or middle-tier machines to avoid repetitive database access.

    Tip 

    Any other related useful Denormalization tricks that readers would like to see added to this section can be included in the next edition of this book. Please email the author with details.

Copying Columns between Entities

This is a case where two entities, generally somewhat disassociated from each other in a data model, can be made to contain a copy of the same value. This would be done in order to avoid joining between multiple tables, through a hierarchy, to obtain a single value.

Figure 2.12 shows a section of the Accounts data model (see Appendix A). In a fully Normalized model the date field DTE would be maintained only in the StockMovement entity, but here it is duplicated on the Orders and Transactions entities as well. This avoids joins from Orders and Transactions through StockMovement entries simply to retrieve the date.

Figure 2.12: Duplicating Individual Columns Between Entities

One could use event triggers to copy column values between entities but this activity is better implemented at the application level. Use of triggers can have a dire effect on performance in highly active concurrent databases.
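As a simple illustration of the benefit (using the Accounts tables of Figure 2.12, with the exact column names and an order number assumed for illustration), the duplicated date allows the first query below to replace the second:

-- With the duplicated DTE column the date is read directly from Orders.
SELECT o.order_id, o.dte
FROM orders o
WHERE o.order_id = 1234;

-- Without the duplicate, the date must be fetched through the
-- StockMovement hierarchy via OrdersLine.
SELECT DISTINCT o.order_id, sm.dte
FROM orders o
   JOIN ordersline ol ON (ol.order_id = o.order_id)
   JOIN stockmovement sm ON (sm.stockmovement_id = ol.stockmovement_id)
WHERE o.order_id = 1234;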

Placing Summary Columns into Parent Entities

Summary columns can be placed into parent entities to summarize information held in child entities. Obviously these summaries would be commonly accessed summary amounts; we would not want to be continually joining many tables and re-executing grouping functions simply to find a simple summary. The problem with using summary columns is potential locking and concurrency hot block problems on the summary column table rows, if those summary columns are updated in real time. Typically these types of Denormalized structures are batch updated at times of low activity.

Figure 2.13 shows another section of the Accounts data model (see Appendix A). The balance columns on the Customer and Supplier entities summarize their respective transaction and cash-book amount entries; summary columns of this nature may be best updated in batch mode rather than maintained in real time. Using event triggers on the child tables would cause even greater performance problems.

Figure 2.13: Summary Columns in Parent Entities
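A minimal sketch of such a batch update, assuming the Customer balance is derived from an amount column on the transactions table (the column names here are assumptions, not necessarily those of the Accounts schema):

-- Batch recalculation of the denormalized Customer balance column
-- (assumed column names; run during a low-activity window).
UPDATE customer cu
   SET balance = (SELECT NVL(SUM(t.amount), 0)
                  FROM transactions t
                  WHERE t.customer_id = cu.customer_id);

COMMIT;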

Separating Active and Inactive Data

Separation of active and inactive data effectively creates further granularity but it has the benefit of assuring unused data does not interfere with highly accessed data. This can help immensely with performance, especially when the amount of historical or archived data far outweighs that of current data, as is often the case. This tactic is a primitive form of partitioning or data warehousing without the data warehouse. A prime candidate for this type of separation in the Accounts schema (see Appendix A) would be the GeneralLedger entity; general ledger entries in accounting systems can often become large very rapidly. Splitting an entity such as this into active and archived entities based on some sort of accounting period such as years, quarters, or even months would be expedient.

Separation of active and inactive data could affect application design. Since the data is separated, the application must be constructed such that it can access both types of data, separately or together. Obviously, if the data is always accessed together at the application level then there is probably not much point in separating it in the first place. An effective application-level solution is having multiple layers of functionality, and thus multiple programs. For example, reporting against the GeneralLedger table in the Accounts schema could use separate programs for current and previous periods. At the application level such a report could be written with date or year input parameters, allowing access to active or archived data, or both. The GeneralLedger table would also be a perfect candidate for Oracle Partitioning, which is covered in Part III on physical and configuration tuning.
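A minimal sketch of this kind of manual separation, assuming the GeneralLedger table has a DTE date column and using an arbitrary cutoff date, might be:

-- Split inactive GeneralLedger entries into an archive table
-- (assumed DTE column; the cutoff date is purely illustrative).
CREATE TABLE generalledger_archive AS
   SELECT * FROM generalledger
   WHERE dte < TO_DATE('2003-01-01', 'YYYY-MM-DD');

DELETE FROM generalledger
   WHERE dte < TO_DATE('2003-01-01', 'YYYY-MM-DD');

COMMIT;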

Mixing Heavily and Lightly Accessed Columns

Mixing heavily and lightly accessed columns in the same entity can cause problems, because SQL statements reading the heavily accessed columns drag the rarely used columns along with them. Perhaps separate those columns into different entities. However, be very careful with this sort of thing: this kind of tuning is more or less the same as 4th Normal Form Normalization, which causes its own problems as well.

Focus on Heavily Used Functionality

This perspective simply stresses focusing on specific areas of the data model or providing better performance in heavily used areas of applications. In other words, specific tuning of particular tables or indexes. Sometimes this approach can involve use of exotic object types such as clusters, index-organized tables, partitioning, materialized views, amongst other Oracle software bells and whistles.

Using Views

Views are frequently used by developers to simplify application design and coding, a similar approach to that of using generic static entities as shown in Figure 2.1. In general, any type of simplification of application coding leads to poor data model structure. Views often lead to too many database objects and thus too much data model complexity. Views are effectively overlays producing more complexity than is necessary at the database level due to the constant resolving of view contents. Some applications even use multiple layers of views, making problems even worse. Views absolutely do not help SQL code and database performance.

Application Caching

Heavily used data that is not constantly changing can be cached on client or middle-tier machines. For instance, today's date (when no time component is required) needs refreshing only every 24 hours. Some of the most frequently executed SQL statements perform the simplest of tasks, such as retrieving today's date; a common statement of this type in Oracle SQL is SELECT SYSDATE FROM DUAL. Even the simplest of SQL statements can affect performance if executed often enough. Java object applications often make effective use of application-level caching, and object-relational mappings function best when static data is preloaded at application startup.

2.4.3 Denormalization using Special Purpose Oracle Database Objects

Denormalization using unusual Oracle Database objects allows creation of special types of database objects used to cluster, presort, and pre-construct data, avoiding excessive amounts of complex, repetitive SQL. In effect, Denormalization of a relational data model is a way of undoing granular complexity. The reason for this "undoing" process is to increase performance by minimizing hard-hitting SQL statements. There are numerous types of logical data objects available in Oracle Database that perform what could be called a form of Denormalization into pre-constructed, physically organized data sets. These objects are often used to precreate data sets to be retrieved at a later date, in the structure and form required by an application. These object types are generally as follows:

  • Clusters.   Places most commonly accessed index and some data columns together in the same place physically, in the desired order.

  • Index-Organized Tables.   Sorts both the index and the data columns of a table into a required sorted order.

  • Materialized Views.   Pre-constructs and stores the results of an SQL statement avoiding repetitive SQL code execution in order to produce the desired result. A Materialized View "materializes" the data. In other words, it stores data physically separately from the source tables. Unlike a materialized view a view is simply an overlay of another entity or entities, re-executing an SQL statement every time the view is accessed. Simple views are not helpful for performance, whereas materialized views are. Materialized views are often used in data warehouses.

All these solutions are a physical rather than a logical form of Denormalization. These objects do not reduce logical complexity but increase complexity by creating more data object entities. What they do accomplish is to place data sets into new purpose-built constructs, which can be used to access data in the required order without jeopardizing access speed to highly active concurrently used entities.
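For example, the ChartOfAccounts summary created earlier with CREATE TABLE ... AS SELECT could instead be built as a materialized view, letting Oracle Database store and refresh the pre-constructed result; the build and refresh options shown here are illustrative choices, not requirements:

-- A materialized view version of the earlier ChartOfAccounts summary.
CREATE MATERIALIZED VIEW chartofaccounts_mv
   BUILD IMMEDIATE
   REFRESH COMPLETE ON DEMAND
AS
   SELECT coa.coa#, coa.type, coa.subtype, coa.text,
      SUM(gl.dr) AS dr, SUM(gl.cr) AS cr
   FROM coa, generalledger gl
   WHERE coa.coa# = gl.coa#
   GROUP BY coa.coa#, coa.type, coa.subtype, coa.text;

-- Refresh on demand, for example from a scheduled batch job:
-- EXECUTE DBMS_MVIEW.REFRESH('CHARTOFACCOUNTS_MV');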

The next chapter will briefly discuss various different forms of the Relational Database model.


