This section provides an introduction to dimensional modeling rather than a misguided attempt to teach modeling in a few pages. Modeling is a practical discipline, and the reality is that you will only get good at it through practicethat is why each chapter in this book walks you through some of the data modeling decisions for a given business solution.
The primary difference between E/R modeling and dimensional modeling is that for E/R modeling, you mostly look at the data and apply normalization rules, whereas for dimensional modeling, you listen to the users and apply your common sense. The OLAP cubes and subsequent analyses that are built on top of a relational database schema are not the right places to transform a complicated schema into a simple, user-friendly modelyou will be building the simplicity right into your schema.
A well-designed dimensional model uses the names and concepts that the business users are familiar with rather than the often-cryptic jargon-ridden terms from the source systems. The model will consist of fact tables and their associated dimension tables and can generally be understood by even nontechnical people.
Designing a Dimensional Data Model
The first question you will probably have when starting your first BI project will be "Where on earth should I start?." Unlike most modeling exercises, there will probably be fairly limited high-level business requirements, and the only information you will probably have to work with is the schemas of the source systems and any existing reports. You should start by interviewing (and listening to) some of the users of the proposed information and collect any requirements along with any available sample reports before beginning the modeling phase.
After you have this information in hand, you can move on to identifying which business processes you will be focusing on to deliver the requirements. This is usually a single major process such as sales or shipments, often building on an existing data warehouse that supports solutions from previous iterations of the BI process.
The remaining steps in the modeling process are essentially to identify the dimensions, measures, and the level of detail (or grain) of every fact table that we will be creating. The grain of a fact table is the level of detail that is stored in the table and is determined by the levels of the dimensions we include. For example, a fact table containing daily sales totals for a retail store has a grain of Day by Store by Product.
This process is often described in sequence, but the reality of doing dimensional modeling is that you will cycle through these steps a number of times during your design process, refining the model as you go. It's all very well when books present a dimensional data model as if it sprang to life by itself; but when you are swamped by the information you have collected, it helps to have some concrete goals to focus on.
Making a List of Candidate Attributes and Dimensions
When you are reviewing the information you have collected, look for terms that represent different ways of looking at data. A useful rule of thumb is to look for words such as by (as in, "I need to see profitability by product category"). If you keep a list of all these candidate attributes when you find them, you can start to group them into probable dimensions such as Product or Customer.
One thing to be careful of is synonyms: People often have many different ways of naming the same thing, and it is rare that everyone will agree on the definition of every term. Similarly, people in different parts of the business could be using the same term to mean different things. An important job during the modeling process is to identify these synonyms and imprecise names and to drive the business users toward consensus on what terms will mean in the data warehouse. A useful by-product of this process can be a data dictionary that documents these decisions as they are made.
Making a List of Candidate Measures
At the same time that you are recording the attributes that you have found, you will be looking for numeric measures. Many of the candidate measures that you find will turn out to be derived from a smaller set of basic measures, but you can keep track of all them because they might turn out to be useful calculations that you can add into the OLAP cube later. The best candidates for measures are additive and atomic. That is, they can be added up across all the dimensions, including time, and they are not composed from other measures.
Grouping the Measures with the Same Grain into Fact Tables
Figuring out how to group measures into fact tables is a much more structured process than grouping related attributes into dimension tables. The key concept that determines what measures end up on a fact table is that every fact table has only one grain. After you have your list of candidate measures, you can set up a spreadsheet as shown in Table 1-1 with the candidate dimensions on the columns and the candidate measures on the rows.
For each measure, you need to figure out the grain or level of detail you have available. For example, for a specific sales amount from a sales transaction, you can figure out the customer that it was sold to, the product SKU that they bought, and the day that they made the purchase, so the granularity of the sales amount measure is Product SKU by Customer by Day. For budget amount, the business is only producing monthly budgets for each product category, so the granularity is Product Category by Month.
From the example in Table 1-1, we end up with two different fact tables. Because the Sales Amount and Quantity measures both have the same granularity, they will be on the Sales fact table, which will also include Product, Customer, and Date dimension keys. A separate Budget fact table will have Product Category and Date dimension keys and a Budget Amount measure.
Identifying the granularity of every measure sounds simple in principle but often turns out to be difficult in practice. So, how do you know when you have made a mistake? One common sign is when you end up with some records in the fact table with values for one set of measures and nulls for the remainder. Depending on how you load the data, you could also see that a given numeric quantity ends up being repeated on multiple records on a fact table. This usually occurs when you have a measure with a higher granularity (such as Product Category rather than Product SKU) than the fact table.
Fitting the Source Data to the Model
The output of the modeling process previously outlined is a set of dimension and fact table designs. It is important to recognize that you need actual data to validate the design, so we recommend creating the dimension and fact tables and loading some test data into them during the design process. SQL Server Integration Services (SSIS) is a great help here because you can easily use it to populate tables, as we show in Chapter 4, "Integrating Data."
Note that the reason for including real data in your modeling process is certainly not to bend your model to fit some data model issue in a source systemthe right place to correct those kinds of issues is in the ETL process that loads the data, not by messing up your design. Loading test data during the design phase is really a recognition that even skilled dimensional modelers don't expect to get it right the first time. You need to build in prototyping and early demonstrations to prospective users to gather feedback during the modeling process.