The data warehouse is the place where a consistent view of an organization's data can be published for users to be able to access it. The first characteristic of the data warehouse is that it is separate. If we are really serious about providing easy access to all the information, we have to create a separate relational database with a design and an operational approach that is optimized for queries rather than atomic transactionsthis is the data warehouse.
Data from all the source systems is loaded into the warehouse (see Figure 1-2) through a process of extraction, transformation, and loading that produces a clean, validated repository of information. This information is organized and presented to the users in a way that enables them to easily formulate their business questions, and the answers are returned orders of magnitudes faster than similar queries against the transaction systems so that the users can immediately reformulate their question and get more details.
Figure 1-2. Data warehouse loaded from source systems
The Data Warehouse Design
The data warehouse is still a relational database, but that doesn't mean we are constrained to stick to the fully normalized, entity-relationship (ER) schema that is so appropriate for OLTP systems. Over time, the various approaches to designing a database schema that is optimized for understanding and querying information have been consolidated into an approach called a dimensional model.
At the center of the dimensional model are the numeric measures that we are interested in understanding, such as sales revenue or profit margins. Related measures are collected into fact tables that contain columns for each of the numeric measures. Every time something measurable happens, such as a sales transaction, an inventory balance or when an event occurs, a new record is added to the fact table with these numeric values.
There are usually many different ways that people can look at these measures. For example, they could look at totals for a product category or show the totals for a particular set of stores. These different ways of looking at the information are called dimensions, where a dimension is a particular area of interest such as Product, Customer, or Time. Every dimension table has a number of columns with descriptive text, such as product category, color, and size for a Product dimension. These descriptive columns are known as attributes; the more interesting attributes you can make available to users, the better.
The resulting database schema consists of one or more central fact tables, and a number of dimension tables that can be joined to these fact tables to analyze them in different ways. This design is usually known as a star schema because of the shape, as shown in Figure 1-3.
Figure 1-3. Star schema
If you have a strong background in OLTP databases, the idea of not necessarily normalizing data is probably at this moment causing you to reconsider the money you just spent on this book. Rest assured: We are not advocating ditching normalization altogether, but this is just one tool in our kit. Dimensional databases have different purposes, and different constraints. We can make appropriate decisions about the correct design of a particular database by looking at the ways it will be used, rather than necessarily trying to apply standard OLTP designs to every database.
Time and the Data Warehouse
Probably the most important dimension in any data warehouse is the Time dimension. This is the dimension that allows users to summarize the information in the fact tables in a way that matches up to the real world. They can use this dimension to look at totals for the current calendar year or to compare the percentage improvement over the previous fiscal period, for example. Although modern query languages have many flexible functions for working with date values, the best way to accommodate all the real-world complexities of analyzing information by time is to add a Time dimension table to the data warehouse, loaded with records starting from the earliest fact record that is available.
An important characteristic of the data warehouse is that it stores history. This idea is often misinterpreted because OLTP systems also store transactions going back in time (some for many years), so why is this feature of the data warehouse so important? Actually, there is a lot more to storing history accurately than just keeping a set of transactions around. For example, if every sales manager in the OLTP system is related to a set of customers in a sales territory, what happens when the sales territories' boundaries have been updated and you try to run an analysis for previous calendar years? The data warehouse must be capable of accurately reproducing the state of the business in the past as well as the present.
Most measures in a fact table are additive. That is, all the numbers can be added up across any time period that a user selects, whether that is a single day or several months. The benefit of additive measures is that they can easily be used to create summaries by simply summing the numbers. Some measures may not be additive across time periods or some other dimension and are known as semi-additive. Examples of these include monthly balances such as inventory on hand or account balances.
Getting Data into the Data Warehouse
Because the data warehouse is separate from all the other systems, an important part of the data warehouse process is copying data from the various source systems, restructuring it as necessary, and loading it into the warehouse. This process is often known as ETL, or extraction, transformation, and loading, sometimes with an additional M on the end (ETLM) to remind us of the need to actively manage this process.
The exact approach that you take for a given data warehouse depends on a lot of factors such as the nature of the source systems and business requirements for timely data, but a typical ETL process is a batch process that is run on a daily or weekly basis. The first part of the process involves extracting data from the source systems, either through direct queries against the systems using a data access interface such as ODBC or OLE DB or through the export of data files from within the systems.
This source data is then transformed into the correct format, which involves the obvious tasks such as matching data types and formats but also more complex responsibilities such as checking that valid business keys are supplied. When the data is in the right format, it is added to the data warehouse tables. Fact table loading usually involves appending a new set of records to the existing set of records for a particular date range. Updates to fact records are relatively uncommon in practice, but you can accommodate them with some special handling.
Dimension table loading often involves appending new records, but sometimes takes the form of updates to the attributes on existing records. These updates can have the unfortunate side effect of destroying our ability to look at historical data in the context that existed at that time. If it is important for a particular dimension to preserve the ability to look at data using the attribute values that existed in the past, the dimension is known as a slowly changing dimension (SCD), and Chapter 8, "Managing Changing Data," describes some well-established techniques for dealing with this.
Some ETL processes include a temporary database called a staging database, which is used to store a copy of the data that is currently being processed on the way to the data warehouse. The data in the staging area can then be manipulated by very efficient SQL operations such as joins. The disadvantage of having a staging area is that the data needs to be written more than once on the way from the source system into the data warehouse, which can add a lot of overhead to the process. SQL Server's ETL facilities use a "pipeline" approach that can often address all the ETL requirements without requiring a data staging step.
The best way to think of ETL is not as a process of copying and transforming data from one system to another, but rather as a process of publishing data. The publishing process includes a great deal of focus on data quality and provides a management process to catch any errors or omissions and correct them before the users can access the information.
In summary, our proposed approach is to build a consistent relational data warehouse with a dimensional schema optimized for queries. Even so, real-world applications often involve millions or billions of transactions with complex ad-hoc queries, and even the best relational query engine is going to take some time to return information. Because our goals are to provide fast and intuitive access to information, is relational database technology the best we can do?