Readers of this chapter probably fall into one of three categories. They have:
Never designed a database before
Designed a database for a transaction processing-type system
Built a data warehouse system
In the latter case, the reader could skip this text or use it as a refresher, especially if his or her last database used Oracle. Therefore, this chapter is aimed at readers who fall into categories one or two, which may surprise the person who has previously designed a nondata warehouse database. Why? Because the skills and techniques used to create a database for a data ware-house will be different from those required for a transaction processing-style system. Consequently, while you will have a head start because some of the techniques are the same, it is very important to say to yourself: I am designing a different type of database.
So what is different about designing a database in a data warehouse? In a transaction-processing system, the designer's goal is to make the transaction complete very, very quickly, and the designer also has the benefit of knowing how the business will interrogate and use the data. Contrast that with a data warehouse, where, although queries must complete as quickly as possible, they could still take hours.
Another major problem is determining what information should be held in the warehouse and at what level of granularity it should be retained. This book will not discuss the techniques that can be used to determine what should be included in the warehouse and how to go about collecting this data; there are already many books available that discuss this topic extensively.
However, the importance of trying to determine what should be included in the data warehouse cannot be stressed enough. It is so important because it may not be until a year after the warehouse is in production that you suddenly discover that the information is either not available or held at an inappropriate level. For example, a telephone company decides not to hold every call in its database but instead holds a total of what the customer spent by day. Then someone in the company decides to offer customers a discount when certain numbers are called. Now, if the warehouse had contained every single telephone call made by its customers, the company would be able to find out exactly what this scheme would have cost had it been implemented over the last 12 months. Instead, no data is available and one would either have to guess what the cost might be or postpone the planned new system until sufficient data is available to accurately determine the true cost to the company.
One of the difficult decisions for the designer is to determine at what level data will be stored in the warehouse. Often, storing every transaction, such as in our telephone example, may seem rather excessive, and, because it could easily mean the warehouse grows to many terabytes, there is a temptation to consolidate the data. Managing a terabyte warehouse requires careful and stringently controlled procedures. The bigger the database becomes, the harder it is to manage and query it. Therefore, there is a temptation to aggregate the information.
Since aggregation is a major design decision, the designer would be wise to seek approval from the users of the warehouse before adopting such a strategy. It should also be clearly explained to these users the limitations that are likely to occur due to aggregating the data. With disks being so cheap, hopefully most sites will store all of the data that they require.
The typical approach used to construct a transaction-processing system is to construct an entity-relationship (E-R) diagram of the business. This is then ultimately used as the basis for creating the physical database design, because many of the entities in our model become tables in the database. If you have never designed a data warehouse before but are experienced in designing transaction-processing systems, then you will probably think that a data warehouse is no different from any other database and that you can use the same approach.
Unfortunately, that is not the case, and warehouse designers will quickly discover that the entity-relationship model is not really suitable for designing a data warehouse. Leading authorities on the subject, such as Ralph Kimball, advocate using the dimensional model, and we have found this approach to be ideal for a data warehouse.
An entity-relationship diagram can show us, in considerable detail, the interaction between the numerous entities in our system, removing redundancy in the system whenever possible. The result is a very flat view of the enterprise, where hundreds of entities are described, along with their relationships to other entities. While this approach is fine in the transaction-processing world, where we require this level of detail, it is far too complex for the data warehouse. If you ask a database administrator (DBA) if he or she has an entity-relationship diagram, they will probably respond that there was one once, when the system was first designed. But due to its size and the numerous changes that have occurred in the system during its life-time, the entity-relationship diagram has not been updated, and it is now only partially accurate.
If we use a different approach for the data warehouse, one that results in a much simpler picture, it should be very easy to keep it up-to-date and also to give it to end users to help them understand the data warehouse. Another factor to consider is that entity-relationship diagrams tend to result in a normalized database design, where, as in a data warehouse, a denormalized design is often used.
An alternative to using the entity-relationship model is the dimensional model, which is different from the entity-relationship model because it views data from a different perspective. Instead of considering an entity, which represents a thing such as a product or a place and the relationships between those entities, a dimensional model describes data using dimensions and facts, which become actual tables in the database.
Dimensional models, as illustrated in Figure 2.1, despite sometimes looking quite simple, provide a very effective way of holding historical and current data in a form that makes this data accessible to business users and that enable them to make the right business decisions. A dimensional data warehouse can be viewed as containing data that:
Has been validated (i.e., no invalid product codes)
Is historical (i.e., the last 36 months)
Is integrated (the same key is used by all systems)
Is easily accessible
Figure 2.1: Dimensional modeling.
The fact table, of which there could be more than one, contains factual information, is usually the largest table in the data warehouse, and is typically fast growing. The fact tables are typically where all of the detail data that you want to keep in the data warehouse are stored, such as all of the telephone calls made by a customer or the orders placed by your customer, as shown in Figure 2.1.
Therefore, if a customer made 20 telephone calls, then it is likely that 20 rows will be stored in the fact table for this customer. Consequently, the fact tables will be by far the largest table in the database, possibly containing hundreds of millions of rows in a large data warehouse. If you are unsure as to whether data is factual, it is often numeric and is sometimes a value that may be computed, such as the value of an order or the number of items purchased.
The information contained in the fact table doesn't have to be at the finest level of detail; it could be summarized data, such as total telephone calls made by a customer today. The level at which data is held in the fact table is known as the granularity and is one of the important decisions the ware-house designer must make. In the example described here, the difference in the number of records stored over a 24-month period would be huge. Contrast the storage requirements between a record stored for every telephone call a customer made in a single day compared with a record for every telephone call a customer makes.
When designing a data warehouse, depending on your business, you may find that there are different types of fact tables, such as transaction level, transaction-item level, event based, status, or even summarized data.
When designing using the dimensional model, there may only be one or a small number of fact tables, but there could be many dimension tables. The dimension table can be seen as a reference table to the fact table, where descriptions and more static information about a piece of data are held. For example, product is considered a dimension because, in this table, everything about the product is held, such as full product name, suppliers, and pallet size. In the fact table, there would be a column called "product_key," which is used to retrieve all of the product information from this dimension table.
If you are uncertain as to whether data is a dimension or a fact, ask this question: Is the data relatively static? Typically, dimensions such as a product_id do not change frequently, whereas a fact table would contain the details of the products you had sold. There is also usually an order of magnitude of difference between the number of rows in the fact table, compared with the much fewer rows in the dimension table.
A fact table will contain millions of rows, whereas a dimension table could have only a few rows (e.g., the time dimension could have as few as 52 rows if data was stored weekly for one year). Or a region dimension could contain only 15 rows, if the country had only 15 regions. Dimensions don't have to be small in size, because you could sell 50,000 products or have a customer dimension with 5 million rows. All of these are examples of valid dimensions.
It is hard to say how many dimensions your design will require, but typically there will be less than 20 dimensions and at least four. Therefore, our data warehouse will comprise only a few tables, but it will have huge storage demands because of the number of rows in the fact table.
Data in the warehouse will most likely come from a variety of sources, and a product code in one system may not be the same as in another system. Another problem is that when data is being stored over a period of time, keys used in the production system could be reused. Therefore, the designer should seriously consider implementing surrogate keys, so that they have total control over how data is identified within the data ware-house. The conversion of the production key to the data warehouse key will be handled during the ETL process and incurs negligible overhead during data loading. All keys are candidates for being transformed into surrogate keys, and that even includes time keys. Your surrogate keys do not have to be very sophisticated and could simply start at one and increase sequentially using Oracle sequences.. There may also be data storage savings if surrogate keys are implemented.
When it comes to whether the data in the warehouse should be normalized, not everyone agrees on the same approach. Some experts believe that the warehouse should be normalized, while others think that using dimensional normal form is more appropriate.
Dimensional normal form is rather interesting, because it looks like a combination of normalization and denormalization. In Figure 2.2 we see the difference between the two approaches for the store dimension.
Figure 2.2: Normalized versus dimensional normal form.
The normalized case is also referred to as snowflaking. Although Oracle 9i will accept normalized dimensions, take care using this approach. One of the disadvantages is that it may impact performance because more joins will be needed in queries, which will take time. Snowflaking the dimension is a good example of a technique used in transaction-processing systems but is not always appropriate in the data warehouse.
At first glance, the dimensional normal form version appears to be duplicating the data. While this may seem an unacceptable storage over-head, in reality the number of rows in the dimensions is typically very small when compared with the size of the fact table. Therefore, you will probably be surprised to learn that storing these extra data may only cost you a few tens of megabytes. The advantage of this approach is that now only two levels of navigation in the model are required to access information, thus making it easier to construct queries that return data quickly.
An alternative to creating one large data warehouse is to create data marts, where a data mart contains a subset of the data in the warehouse. Data marts have the advantage of being focused into an area of the business, so they could contain regional or departmental data. However, care should be taken if the data mart approach is used, because instead of creating one data source, multiple data marts may be easy to manage and ideal for reporting purposes, but trying to integrate those data might be extremely difficult. Thus, the end result could be data marts containing duplicated data that cannot communicate with each other. However, it is not uncommon for an organization to first create data marts and then use those as the basis for creating the enterprise data warehouse.
Throughout this book, we will use an example based on a fictitious company called Easy Shopping Inc. It is an organization that has no retail out-lets and sells via its Internet site or via satellite or cable television. In Figure 2.3 we can see our dimensional model for Easy Shopping Inc.
Figure 2.3: Dimensional model for Easy Shopping Inc.
In this example, we have a fact table called "Purchases," where we record every item that our customers purchase. Four dimensions have been defined, where we hold customer, product, time, and details of our daily special offers. Although this may be a simple example for the purposes of this book, even the ones that you create will not be that much more complex. However, you will have more dimensions and many more columns in your fact table.
Warehouse schemas are sometimes called star schemas, and Figure 2.3 is an example of one. The center point is the fact table, and the dimensions sit around the fact table as the points of our star. As per our entity-relationship diagram, once you have drawn the dimensional model, it can easily be translated into a physical database design, since each box represents a table. Although in this book we refer to fact and dimension tables, inside the Oracle database they are all tables and are treated as such. However, before you jump in and create the physical database from this dimensional model, there are a few more decisions to make before the design is complete.