Case Study: Creating the Adventure Works Cycles Orders Dimensional Model

This fourth major section of this chapter is meant to impart a sense for how the modeling process might work in the real world. We draw from the Adventure Works Cycles business requirements from Chapter 1 and source system data in the AdventureWorks transaction database for examples of each of the deliverables. The initial design session starts with the knowledge that the modeling team is designing a dimensional model for the Orders business process. According to the four-step process for creating a high-level dimensional model, step two is to choose the dimensions.

Choosing the Dimensions

In order to give you a richer context for the process of choosing dimensions, well go all the way back to step (1) in Figure 2.9: the high-level dimensional model design session. Using the dimensional model shown in Figure 2.12 as our guide, well summarize the Adventure Works Cycles DW/BI teams discussion and conclusions for each dimension, starting with the Date dimension. The dimensional model shown in Figure 2.12 is based on the initial bus matrix we developed as part of the requirements definition process in Chapter 1. The model in Figure 2.12 provides a rough starting point for the high-level dimensional model discussion. At the end of this section, we show the resulting updated high level Orders dimensional model.

Date

There are two important dates in the orders business process: order date and due date. The Date dimension will play two roles in the Orders dimensional model. Recall that a role-playing dimension means the same table will be used multiple times, either through views, synonyms, or physical copies of the table. Analysis Services has the concept of role-playing dimensions built in and doesnt need views or copies.

Take care to include each of the role-playing dimensions in the logical design documentation to facilitate a clear and concise understanding with your business partners and DBAs.

Product

The discussion about product was fairly easy until it became clear that product means different things in different business processes. In Orders, it means the thing that was sold to the customerthe finished good. In manufacturing, it could mean a part or a finished good because manufacturing works with the entire bill of materials (BOM) that make up a given product. To them, every element in the BOM is a product of some process. In order for the Product dimension to be useful across the organization (that is, conformed), it must take the concept down to the lowest granularity. Therefore, the team decided that the Product table will include all product- related items. This decision was made easier when one of the source systems folks pointed out that that is how product data was kept on the source system, and there are only 504 rows in the table.

Sales Territory

The Sales Territory was confusing for the team. Initially, they felt there was no need to keep a separate Sales Territory dimension because Sales Territory is really an attribute of the Sales Rep and Customer. According to the requirements, the Sales organization needs to look at historical orders according to which sales rep got credit at the time of the order and which sales rep will get credit for current orders. The VP of Sales also might want to look at customers by territory in order to realign territories in the future. When they created the initial bus matrix, the team discussed different ways to handle the different requirements.

Adding a separate Sales Territory dimension ties the sales territory directly to the transaction. This locks in the territory to which the order was originally assigned, seemingly meeting the historical orders requirement.

Meanwhile, keeping a Type 1 attribute called Current Sales Territory associated with the sales rep allows the design team to provide current performance reports that show historical orders according to the current territory assignments.

This approach seemed to be the best way to meet the business needs, but the team felt further investigation was needed.

Employee

The Orders business process involves sales reps only. No other employees should appear in the Orders fact table (that cant happen, right?). However, rather than creating a limited use table just for sales reps, the modeling team decided to broaden it to include all employees. After all, building the Employee dimension supports the enterprise view of the bus matrix. A fully loaded Employee table can be used to support other business processes that involve employees .

The initial discussions categorized the employee table as a Type 1 table, containing current values only. The group decided this for two reasons. Changes in the Employee attributes really are a separate business process under Human Resources, so tracking the changes will be part of implementing that business process area and is out of scope of the Orders dimensional model. Second, it was unclear where the data to re-create the history of employee attribute changes would come from.

image from book
TRACKING HISTORYONCE MORE WITH FEELING

To emphasize the importance of tracking attribute changes over time, lets examine the impact of Type 1 versus Type 2 changes for Employee attributes. For example, if the VP of Sales wanted to do some headcount planning and needed to know the productivity of senior sales reps compared to junior sales reps, the information would be incorrect because several of the senior sales reps were hired as junior sales reps. When they got promoted, their old titles would have been overwritten in a Type 1 dimension. This is the same problem presented in the customer commute distance example earlier in this chapter.

The data in the following table shows the different answers Type 1 and Type 2 might give us in the Adventure Works Cycles case. Unfortunately, all the sales reps in the database have the same title: There is no indication of seniority . For the sake of analysis, we can assign seniority with a business rule that says sales reps are considered junior for their first 12 months. In a Type 1 tracking case, the VP of Sales would not be able to create a model because there are no sales for sales reps who were hired in the last 12 months. With Type 2 tracking, the title change generates a new row in the Employee dimension, and all new sales are associated with the new row. Type 2 tracking gives the VP the information he needs.

AVERAGE MONTHLY ORDERS BY SALES TITLE

TYPE 1 REPORT

TYPE 2 REPORT

Position

Avg Monthly

Position

Avg Monthly

Senior

166,354

Senior

195,153

Junior

Junior

119,447

image from book
 

One of the attributes of the Employee table will be the Current Sales Territory to which each employee is assigned (although all employees who are not sales people will be assigned to the HQ sales territory). Reporting orders with this Territory will show all the Sales Reps historical sales aligned by their current Territory.

At this point in the discussion, a few of the team members noted their concern about not having Employee be a Type 2 slowly changing dimension. Several business requirements had been voiced about tracking the impact of organizational changes, and about basic employee counts. (Note that these are questions about the Human Resources business process.) Without a Type 2 dimension, it would be impossible to relate the employee attributes that were in effect at the time a transaction occurred with the actual transaction itself. Only current attribute values could be used to analyze history.

Based on this discussion, the team decided to treat the Employee dimension as a hybrid SCD dimension, with most of the attributes tracked as Type 2 attributes. This decision had significant ripple effect on the model. First, it revealed a weakness in the Sales Territory dimension. Having a Type 2 attribute called Historical Sales Territory in the Employee dimension tracks the same information as the Sales Territory table, only better: It locks in both the sales territory and the sales rep that got credit at the time of the sale. Therefore, the Sales Territory dimension can be removed from the model. Second, there is still a need to apply the currently assigned sales territory to all of history, which means keeping the Current Sales Territory attribute in the Employee dimension as a Type 1 attribute. The ETL process will have to change all historical rows for the Current Sales Territory when the sales territory changes for a given employee. Fortunately, this is one of the standard options in Integration Services Slowly Changing Dimension Wizard.

Note 

The decision to include Type 2 attributes in the Employee dimension really means expanding the scope to include a second business process: Human Resources transactions. In a large organization, the ETL process will essentially create an employee transaction fact table that can then be used to build the Employee dimension. For Adventure Works Cycles, with fewer than 300 employees, this decision is probably not too onerous. For a larger organization with thousands of employees, this could be a lot of work.

The team also noted the issue of finding reliable historical information for employee changes will have to be researched, but lack of historical data is a bad reason to avoid implementing Type 2 tracking. The sooner you get started, the more history you will have.

Customer and Reseller

Next around the circle in Figure 2.12, we run into two related dimensions: Customer and Reseller. As the design team reviewed the business requirements related to customer, it became clear that there were multiple ideas of who the customer was, and that these ideas overlapped . Internet customers are direct-sale individuals, and Reseller customers are mostly wholesale or retail businesses. While they do share some attributes, like an address, the company has certain information about each that it doesnt have about the other. The sense when the bus matrix was created was that the two customer types are different enough to be split into separate dimensions.

During the design session, the design team recognized these heterogeneous customers as a variation on the heterogeneous products idea described earlier in this chapter. The heterogeneous products technique would say to create a separate MasterCustomer dimension that contained only the shared attributes for all customersboth Reseller and Internet. Then create two subset dimensions, one for Internet customers and one for Resellers , with all their unique attributes. These subset dimensions should have the same surrogate keys as their equivalent rows in the master customer dimension.

As it turns out, the design team modified the concept a bit because the reality they face is a bit different. First, there is a clear need to report total sales across both customer types in the same report, so the model must include an integrated MasterCustomer of some kind. Second, there are fewer than 20,000 customers in total, and the distribution is highly imbalanced : Less than 4 percent are resellers. Third, only 10 or so attributes are unique to Reseller (items such as StoreAnnualSales and NumberEmployees). The design team decided to combine all the attributes of the two customer types into a single, master Customer dimension. This decision allows much greater reporting and analysis flexibility and simplifies the dimensional model. However, it means the users need to understand and be able to work with the idea that both customer types are in the same table. When they want a count of Internet customers, they will need to limit the customer type field to Internet.

At the end of the customer discussion, there was still some concern for whether the single customer table would be flexible enough to hold new customer types as Adventure Works Cycles add new distribution channels. The team noted this on the issues list with a reminder to investigate both possible new customer types and usability of a single customer table with the business folks.

Promotion

The requirements document clearly identified promotions as an important element in the marketing mix. However, promotions have not been a major marketing tool for Adventure Works Cycles in the past. In fact, there are only 16 promotions in the source system promotion table. At this point, it is enough to identify promotion as its own dimension and note on the issues list that more research is needed to define promotions because Marketing is planning to do more promotions of greater complexity. The data steward pointed out that tracking more information on promotions may involve enhancements to the source system.

Currency

Every sale in the source system is captured in its original, local currency. The system relies on a currency conversion table for translating currencies and reporting in US dollars. This table tracks the conversion rate between the local currency and US dollars both at the end of each day and as an average for each day. Getting standardized reports in US dollars to compare across countries requires a fairly complicated query and has long been a sore point for most of the folks in headquarters. At the same time, sales people in the field want to create reports in local currency to show their customers. Finance, of course wants both, along with the conversion table so they can assess the impact of exchange rates on budget variances.

Based on these requirements, the design team decided to include both local currency and U.S. dollar fields in the fact table, with a Currency dimension to indicate the currency of the local data. This means the ETL process would have to bring in the currency conversion table to convert non-U.S. sales into U.S. dollars.

Although it was out of scope for the initial cycle, the design team also decided to include the Exchange Rate table. The Exchange Rate table is essentially a fact table, which when combined with the Date table as the Exchange Rate Date and the Currency dimension forms its own business process dimensional model. Making this available to the users was an easy political decision because it is incrementally very little work. The Exchange Rate table must be brought into the ETL staging area in order to support the currency conversion in the Orders table. Besides, the Director of Finance is particularly interested in getting access to this data.

Note 

This is already the second piece of scope creep in the design, the first being treating employee as a Type 2 dimension. We dont encourage this kind of scope creep in real life. In our experience, almost all DW/BI teams are overly ambitious in their first iteration. Were constantly coaching our clients not to over-commityour mantra should be under-promise and over-deliver.

That brings us full circle around the initial high-level dimensional model. We have enough information to generate the Adventure Works Cycles initial high-level dimensional model for Orders, which is the checkpoint labeled (A) in step (1) of Figure 2.9. Before we show the updated model, lets complete the second part of the design session because there still might be a few changes.

Identifying Dimension and Fact Attributes for the Orders Business Process

The second half of the initial design session involves creating an initial data element list. This list is an attribute list for each dimension and a list of fact-related data elements. The starting point for this list is the detailed requirements documentone of its appendices should be a list of key data elements (attributes) that people specifically identified as important.

Figure 2.13 shows what a portion of the attributes list might look like for the Adventure Works Cycles Orders business process dimensional model. The Sample Values column is helpful in identifying attributes.

Promotion

Attribute Name

Description

Alternate Names

Sample Values

Special Offer ID

Source system key

   

Special Offer Name

Name / description of the Special Offer

Promotion name, Special offer description

Volume Discount 11 to 14; Sport Helmet Discount 2002

Discount Percent

Percent item is discounted

   

Special Offer Type

Description of the type of promotion, special offer or discount.

Promotion Type

Volume Discount; Discontinued Product

Special Offer Category

Channel to which the Promotion applies

Promotion Category

Reseller; Customer

Start Date

First day the promotion is available

 

6/15/03

End Date

Last day the promotion is available

 

12/31/04

Minimum Quantity

Minimum quantity required to qualify for the promotion

 

Maximum Quantity

Maximum quantity allowed under the promotion

 

NULL

Notes: The main table that Promotion information comes from is called SpecialOffer in the source system, but the business users refer to the general class as promotions. We may need to replace the Null value in the Maximum Quantity field.

Attributes Not Elsewhere Classified

Attribute Name

Description

Alternate Names

Sample Values

SalesReasonID

Sales reason ID from source system

   

Sales Reason

Reason the customer bought the product, as reported by the customer (Internet only)

 

Demo Event; On Promotion; Price; Review; Sponsorship

Sales Reason Type

Grouping for Sales Reason

 

Marketing; Promotion; Other

Channel

Channel through which the item was ordered

 

Customer; Reseller

Notes: This attribute list is a good candidate to be a junk dimension.


Figure 2.13: Promotion dimension portion of the Adventure Works Cycles initial Orders attribute list
Note 

Creating a stand-alone attribute list can be helpful, but if you are using the modeling spreadsheet we described earlier in the chapter, you already have a place to keep your attribute lists.

The process of creating the attribute list can trigger changes to the initial high-level dimensional model. The Adventure Works Cycles modeling team came across two attributes that did not have an obvious home: Sales Reason and Channel. To borrow a heading from the taxonomy experts, these attributes are shown under the Attributes Not Elsewhere Classified table in Figure 2.13. Users mentioned a couple of fields that are entered by Internet customers to describe why they bought an Adventure Works Cycles bike: Sales Reason and Sales Reason Type. After some exploration, the team found that customers could select one or more reasons for their purchase from a list of reasons. This is an example of a many-to-many relationship, and a good candidate for a bridge table. However, discussions with the business users revealed that they were interested in the primary reason only, which can be identified in the ETL process. In an effort to avoid additional scope creep, the team decided to include only the primary sales reason.

Users also mentioned Sales Channel several timesusually referring to Resellers or the Internet. If Sales Channel refers only to Reseller and Internet, the Customer Type field can handle this distinction. However, it was clear during the requirements interviews that there is a drive to open up new sales channels, including opening Adventure Works Cycles retail stores and providing private label bikes for large retailers.

A quick query of the source system revealed there are only ten sales reasons and two sales channels so creating two separate dimensions seemed inefficient to the design team. They opted to create a junk dimension called OrderInfo that would contain both concepts. The ETL process will have to manage the assignment of surrogate keys and watch for new entries in the source systems.

The Adventure Works Cycles attribute list included several versions of something called Region. People in different parts of the organization expressed a need to track orders by various groupings of the state the order came from, which they called region. The different organizational units each had different definitions of which states made up any given region. After some discussion, the team decided to incorporate the different regions directly in the dimension tables. The data steward needs to push for conformed definitions of the different regions .

Its easiest to start the list of Date dimension attributes from a standard set of attributes and adjust them according to your business requirements. The Analysis Services business intelligence wizard provides a checkbox list of dozens of potential attributes.

One of the groups on the attribute list represents the fact table. This list includes all the ways people measure the business, from dollars to counts to weight to minutes. The fact list should capture any known information about the nature of the measure, such as its aggregation type (such as sum, count, or average). Youll hear users describe a frequently used measure thats actually a derived measure. In other words, its calculated from other measures and formulas. Note that its a derived field and identify its derivation rules if possible. Whatever is not completed here will be filled in during the detailed modeling process.

Revisit the source system tables at this point, as a final check to make sure you havent left anything useful behind. This is meant to be a validation step, not a starting point.

Not all of the data elements on the attribute list will necessarily be attributes of the final dimensional model. Some of them are not really attributes; rather theyre aggregates or constraints. Others are the same attribute masquerading under a different name. Still other attributes are missing altogether, either because they were so obvious people didnt think to mention them, or they were so little used, people didnt know to mention them. As you build the lists, keep an eye out for these kinds of redundancies and omissions. Start to boil down all this information to create the master attribute list for each table.

The Final Draft of the Initial Model

At the end of the initial design session, the team has created a good high-level dimensional model. The high-level model shown in Figure 2.14 is the result of merging the changes identified in the process of creating the attributes list (adding the OrderInfo dimension) along with the changes from the model design session itself (merging Reseller and Customer, removing the Sales Territory dimension, and adding the Exchange Rate tracking business process). This dimensional model contains all the elements needed to meet Adventure Works Cycles orders-related business requirements in a simple, powerful, flexible format least as far as the team understands them at this point. This model will change, but it is a strong first pass.

image from book
Figure 2.14: The high-level dimensional model from the initial design session

We encourage you to compare this model with the initial model in Figure 2.12 to see how it evolved during the initial design session. We also encourage you to review the business requirements described in Chapter 1 to get a sense for how well the model will meet the needs.

The Issues List

The last checkpoint from the initial modeling session is the issues list. Figure 2.15 shows an example of the issues list from the Adventure Works Cycles orders business process design session.

Detailed Dimensional Model Development

Develop the detailed dimensional model one dimension at a time. Begin with the attribute information you captured in your modeling spreadsheet in the initial design session (see Figure 2.13). Start with an easy dimension such as OrderInfo or Date, and fill in as much of the spreadsheet as possible based on current information. Source tables and columns are often fairly clear for most of the attributes. Transformations for most of the attributes are direct copies from the source. Target data types can be inferred based on the source system as well, although the data warehouse DBA will have the final say in determining the data types.

Once the known information is filled in, the open issues are more obvious. At this point, its time to continue with the data exploration/data profiling process described earlier in this chapter.

Identifying SCD Change Types

One of the columns in the modeling spreadsheet is the SCD Change Type. The data modeler will use this to identify how each attribute needs to be tracked over time and flag it appropriately. Remember, this is a business question. Its okay to make a first pass and flag all the attributes whose changes obviously must be tracked over time, or whose changes have no impact on the business whatsoever. All of the less obvious attributes should be discussed with the modeling team. Review all of these change-tracking decisions with the core business users before you make a final decision.

The ETL process must re-create the historical changes for every Type 2 SCD attribute in each dimension, at least as far back in time as the oldest fact table rows the dimension will support. The ETL process will load historical fact rows with the dimension surrogate keys that were in effect when the fact row occurred. The ETL developer has to go back into the transaction system to find all relevant historical transactions that apply to the dimension in question. While this is not the data modelers task, its helpful to look for indicators as to whether or not re-creating historical dimension data will be difficult or even possible.

Reviewing the Issues

Even though you may have filled in most of the spreadsheet, there will still be several issues that require a second or third opinion. In some cases, you can resolve these issues using the research tools described earlier in the chapter. In other cases, you need a sounding board to explore alternative solutions. Bring up open issues in the next data modeling team meeting and work them through.

Then move on to the next dimension.

Identifying the Facts

Filling in the detailed fact table description is much like filling in the dimensions. Start by copying in the list of measures and filling in all the easy items. Then use the research tools to address as many of the open items as possible. Finally, work with the data modeling team to resolve the remaining issues.

There are several issues that are specific to fact tables. These include:

  • Derived columns: Identify the formula and indicate whether the derivation is additive or semi-additive, as in a month end account balance.

  • Allocations : In the case of the Adventure Works Cycles Orders dimensional model, the grain is at the order line item level. The team must decide how to handle the handful of facts that are collected at the order level. Sales tax can easily be allocated to each line item. Other facts, such as shipping costs, might need to be allocated based on weight or size .

    Note 

    Dont avoid the allocations! If you leave shipping costs in a fact table at the order (not line item) level, all your product-related financial rollups will omit the shipping costs. Grit your teeth and allocate!

  • Conformed facts: The dollar sales field is a good example of creating a conformed fact. As the team discussed in the design session, theres a need to have all transactions stated in a single currency (US dollars), as well as the original local currency from the source system.

  • Degenerate dimensions: While no transformations need to be applied to any degenerate dimensions in the dimensional model, you do need to indicate which fields in the fact table are degenerate dimensions. In the Orders dimensional model, there are degenerate dimensions for Order number, Order Line Item Number, and the Sales Order Revision Number. The data modeler will identify another potential degenerate dimension that comes in as part of the order: the Customer PO Number.

Final Dimensional Model

When all the design reviews are finished, the user meetings over, the source systems carefully scrutinized, and the requirements reviewed, its time to physically instantiate the dimensional model. This is a job for the DBAs, but we usually set up a test database and run the script from the spreadsheet. (You will probably already have done this several times by now so you could reverse engineer the model into your modeling tool to create a presentable data model.) You can find the completed modeling spreadsheet we used to create the MDWT_AdventureWorks database at www.MsftDWToolkit.com.

At this point, youre ready to take on the real database physical design process and start thinking about designing the ETL system.



Microsoft Data Warehouse Toolkit. With SQL Server 2005 and the Microsoft Business Intelligence Toolset
The MicrosoftВ Data Warehouse Toolkit: With SQL ServerВ 2005 and the MicrosoftВ Business Intelligence Toolset
ISBN: B000YIVXC2
EAN: N/A
Year: 2006
Pages: 125

flylib.com © 2008-2017.
If you may any questions please contact us: flylib@qtcs.net