Flylib.com

Books Software

 
 
 

Dimensional Data Warehousing with MySQL: A Tutorial - page 46


Summary

In this chapter you learned source data mapping and the process called initial population to load historical data that your users specified. In the next chapter, you learn regular population that you need to operate your data warehouse.



Chapter 8: Regular Population

This chapter covers regular population. Unlike initial population that you perform once only before the start of your data warehouse operation, you schedule a regular population to load the source data regularly.

In this chapter I show you how to prepare data before running a script that does regular population in our dw database.

Identifying Data Sources and Loading Types

The first step to schedule a regular population is identify what source data is needed and available for every fact and every dimension of the data warehouse. Afterwards, you decide the extraction mode and the loading type suitable for the population. A sample document that summarizes this information is shown in Table 8.1.

Table 8.1: Data sources and loading types of regular population
Open table as spreadsheet

Source Data

Data Warehouse Table

Extraction Mode

Loading Type

Customer

customer_dim

Whole, Pull

SCD2 on address

SCD1 on name

Product

product_dim

Whole, Pull

SCD2

Sales order Transaction

order_dim

CDC (daily), Pull

Unique order number

sales_order_fact

CDC (daily), Pull

Daily sales orders

n/a

date_dim

n/a

Pre-populate

Note 

You learned about extraction mode in Chapter 5, “Source Extraction” and SCD in Chapter 2, “Dimension History.”

Another aspect of the source data that might have impact on your design is the window of time when a particular data is available for the regular population. This is especially important for transactional source data that is usually large, such as sales orders.

In addition, you need to know the detailed characteristics of every source data, such as its file type and record structure, down to the individual field or column.



Regular Population Script

I use the script in Listing 8.1. to explain how regular population works. You can use it to populate the data warehouse on a daily basis. The extraction mode and loading type used are as follows :

  • The customer.csv and product.txt flat files are loaded into the customer_dim and product_dim tables through the customer_stg and product_stg tables, respectively. Loading is achieved through the use of MySQL’s LOAD DATA INFILE utility.

  • SCD2 is applied to customer addresses, product names, and product groups. SCD1 is applied to customer names .

  • Only sales orders entered on the current date are loaded to the order_dim and sales_order_fact tables.

Listing 8.1: Daily dw regular population

image from book
/*****************************************************************/ /* */ /* dw_regular.sql */ /* */ /*****************************************************************/ USE dw; /* customer dimension loading */ TRUNCATE customer_stg ; LOAD DATA INFILE 'customer.csv' INTO TABLE customer_stg FIELDS TERMINATED BY ', ' OPTIONALLY ENCLOSED BY ' " ' LINES TERMINATED BY '\r\n' IGNORE 1 LINES ( customer_number , customer_name , customer_street_address , customer_zip_code , customer_city , customer_state) ; /* SCD2 on customer street addresses */ /* first, expire the existing customers */ UPDATE customer_dim a , customer_stg b SET expiry_date = SUBDATE (CURRENT_DATE, 1) WHERE a.customer_number = b.customer_number AND a.customer_street_address <> b.customer_street_address AND expiry_date = '9999–12–31' ; /* then, add a new row for the customer */ INSERT INTO customer_dim SELECT NULL , b.customer_number , b.customer_name , b.customer_street_address , b.customer_zip_code , b.customer_city , b.customer_state , CURRENT_DATE , '9999–12–31' FROM customer_dim a , customer_stg b WHERE a.customer_number = b.customer_number AND (a.customer_street_address <> b.customer_street_address) AND EXISTS( SELECT * FROM customer_dim x WHERE b.customer_number=x.customer_number AND a.expiry_date=SUBDATE (CURRENT_DATE, 1)) AND NOT EXISTS ( SELECT * FROM customer_dim y WHERE b.customer_number = y. customer_number AND y.expiry_date = '9999–12–31') ; /* SCD1 on customer name */ UPDATE customer_dim a, customer_stg b SET a.customer_name = b.customer_name WHERE a.customer_number = b.customer_number AND a.customer name <> b.customer name ; /* add new customer */ INSERT INTO customer_dim SELECT NULL , customer_number , customer_name , customer_street_address , customer_zip_code , customer_city , customer_state , CURRENT_DATE ,'9999–12–31' FROM customer_stg WHERE customer_number NOT IN( SELECT y.customer_number FROM customer_dim x, customer_stg y WHERE x.customer_number = y.customer_number) ; /* product dimension loading */ TRUNCATE product_stg ; LOAD DATA INFILE 'product.txt' INTO TABLE product_stg FIELDS TERMINATED BY ' ' OPTIONALLY ENCLOSED BY ' ' LINES TERMINATED BY '\r\n' IGNORE 1 LINES ( product_code , product_name , product_category) ; /* SCD2 on product

name

and

group

*/ /* first, expire the existing product */ UPDATE product_dim a , product_stg b SET expiry_date = SUBDATE (CURRENT_DATE, 1) WHERE a.product_code = b.product_code AND ( a.product_name <> b.product_name OR a.product_category <> b.product_category) AND expiry_date = '9999–12–31' ; /* then, add a new row for the product */ INSERT INTO product_dim SELECT NULL , b.product_code , b.product_name , b.product_category , CURRENT_DATE ,'9999–12–31' FROM product_dim a , product_stg b WHERE a.product_code = b.product_code AND ( a.product_name <> b.product_name OR a.product_category <> b.product_category) AND EXISTS( SELECT * FROM product_dim x WHERE b.product_code = x.product_code AND a.expiry_date = SUBDATE (CURRENT_DATE, 1)) AND NOT EXISTS ( SELECT * FROM product_dim y WHERE b.product_code = y.product_code AND y.expiry_date = '9999–12–31') ; /* add new product */ INSERT INTO product_dim SELECT NULL , product_code , product_name , product_category , CURRENT_DATE , '9999–12–31' FROM product_stg WHERE product_code NOT IN( SELECT y.product_code FROM product_dim x, product_stg y WHERE x.product_code = y.product_code) ; /* end of product_dim loading */ INSERT INTO order_dim ( order_sk , order_number , effective_date , expiry_date) SELECT NULL , order_number , order_date , '9999–12–31' FROM source.sales_order WHERE entry_date = CURRENT_DATE ; INSERT INTO sales_order_fact SELECT order_sk , customer_sk , product_sk , date_sk , order_amount FROM source.sales_order a , order_dim b , customer_dim c , product_dim d , date_dim e WHERE a.order_number = b.order_number AND a.customer_number = c.customer_number AND a.order_date >= c.effective_date AND a.order_date <= c.expiry_date AND a.product_code = d.product_code AND a.order_date >= d.effective_date AND a.order_date <= d.expiry_date AND a.order_date = e.date AND a.entry_date = CURRENT_DATE ; /* end of script */
image from book