Data Warehouse Characteristics | MicrosoftВ® SQL Server(TM) 2005 Integration Services Step by Step

A data warehouse is usually the core component of a BI infrastructure for today's corporations. This chapter will not go into the details of building and maintaining a data warehouse. It is important, however, to look at the characteristics and design considerations that are crucial to BI infrastructure solutions.

The entity-relation diagram, shown in Figure 12-1, illustrates some fundamental characteristics of modern-day data warehouses. Note that the table names are prefixed with Fact and Dim to denote either a business fact or a dimension table.

image from book
Figure 12-1: Data warehouse database diagram

A data warehouse is implemented as a relational database model and can have different design variations to meet specific business needs. It's not unusual for a data warehouse to store more data than the source systems that feed it. Data cleansing, standardization, and derivation can add to the data volume. Many data warehouses are designed to maintain more historic data than traditional online transaction processing (OLTP) systems. How much historic data is required for comparison and trend analysis? Perhaps tens of millions of rows. Even though the data volume might be extremely large, the database model is designed for fast access for reporting.

Data warehouses, if designed and maintained appropriately, provide the most accurate and reliable data for effective reporting and BI solution support. Although the specific purpose and objectives can vary a bit by business need, in general, each data warehouse exhibits the following characteristics.

Providing Data for Business Analysis Processes

A data warehouse is designed for business data reporting. Analytical tools such as key performance indicator (KPI) dashboards, business scorecards, pre-defined reports, and ad hoc reporting systems will read data from either a relational data warehouse or online analytical processing (OLAP) cubes, which in turn aggregate data in the relational data warehouse. SQL Server Analysis Services provides a platform to define multidimensional cubes and pre-defined aggregations. OLAP structures are easy to navigate; queries run with optimal performance and support complex calculations that would otherwise be slow and difficult to perform.

Integrating Data from Heterogeneous Source Systems

Reading data from multiple sources can be difficult and inefficient from a client reporting tool. Not only can this be inefficient from a purely technical perspective, but one of the challenges when combining data from different systems is that the records rarely match up. Data entities from various data sources often have slightly different forms and meanings. For example, the customer records in a company's online shopping system can also contain company employees. The customer records in the customer relationship management (CRM) system might not contain employees but can be commingled with sales leads-contacts that have not yet purchased products. From an enterprise perspective, this raises questions about the definition of a business entity as simple as that of a customer. For example, is an employee a customer if he or she buys a product? Is someone a customer if he or she has expressed an interest but has not made a purchase? Is that person your customer if he or she has purchased a product through a vendor? Is a company a customer, or must a customer be a person? And, if a customer is a person (by your definition of the term), who from the purchasing company is the customer: the initial contact, the CEO, the administrative assistant who placed the order, or the financial controller who wrote the check? As you can see, there can be many questions, and the answer might be different for the users of different internal systems, all of which deal with customers in a slightly different way. You get the idea. When designing databases and data warehouse systems, similar conversations arise about dozens of different entities.

Combining Validated Source Data

A truly functional data warehouse contains a single location where all of the data, gathered from different sources, has been standardized, cleansed, and validated. A common problem plaguing most businesses is that important business facts are often captured in spreadsheets and small, department-specific databases. Effectively collecting this data is not a trivial feat, yet it can have tremendous business value. Unless data is generated within a unified business application, combining and validating records can also be challenging. Much of this process can be automated but will often require some manual intervention. Business rules must be established to define the criteria for exact matching of duplicate records. Perhaps the combination of first name, last name, and phone number is a reliable indicator that an imported record already exists. But it's the slight misspellings and abbreviations that can be challenging. This is where SSIS transformations such as Fuzzy Lookup can be helpful to populate a list of candidate matches to be reviewed and validated by hand.

Organizing Data into Nonvolatile, Subject-Specific Groups

One common technique for optimizing reporting solutions is to reduce the volume of data in the decision-support database. However, this might not be so simple. To provide the flexibility and detail necessary to satisfy all of the reporting requirements, a decision-support database can actually end up with more volume than some source systems.

One of the important objectives of traditional normalized database design is to reduce data volume size by eliminating redundancy. This is often at the cost of using several related tables to store business facts. In a data warehouse, to reduce complexity and eliminate multiple table joins, redundant records and values might be necessary. Unlike a normalized, transactional database, which might use only one table to manage transaction records, a data warehouse can have details divided into separate fact tables. Several supporting dimension tables surround each fact table to form a star or snowflake schema. A typical data warehouse might be divided into multiple clusters of related fact and dimension tables.

Storing Data in Structures Optimized for Extraction and Queries

Rather than organizing data into many tables with complex relationships, a data warehouse is organized into simple clusters of fact and dimension tables. Reporting data structures are optimized for speed of retrieval rather than to enforce data integrity rules or to reduce redundant values. The relational database engine retrieves data most efficiently when

Tables are structured and indexed to support queries.
Records are identified by using simple, numeric keys.
Records are related using few joins based on indexed, simple keys.
Record lengths are as short as possible to store necessary values.
Detail records store only numeric values used to perform aggregations.

A data warehouse is a lean query engine that applies these design principles in their purest forms. Its sole purpose is to allow data to be retrieved from read-only tables, using simple queries. It will support queries returning large volumes of data, grouped and aggregated for reporting.

A relational data warehouse can be used as a stand-alone data solution to address many of the issues normally present when the business relies on operational or transactional databases for reporting analytics. Business reports can be designed to query the relational data warehouse directly.

A relational data warehouse system can also be used as the backbone for an OLAP solution. As discussed briefly in this section, even a relational data warehouse has limits that can be over-come by exposing data through an OLAP system such as SQL Server 2005 Analysis Services, with data presented using business reports, dashboards, and specialized reporting tools.