Slowly Changing Dimension Type 2 (SCD2)


I discussed the surrogate key in Chapter 1, “Basic Components” and up to now it’s probably still not clear to you why you ever need one at all. You might still be wondering why you need another key when you already have the data source’s key (also called the natural key). You will soon learn that you use a surrogate key (as well as the effective date and the expiry date) to maintain dimension history using SCD2.

You cannot use a source key to implement SCD2. The source key of the product dimension, for example, is stored in the product_code column. In SCD2 if the product name changes, to maintain history you must keep the previous record and add a new record with the new name. However, you can’t have two records with the same key, the product code. That’s why you need a surrogate key.

The effective and expiry dates define the validity of a record. When a new version of an existing dimension record gets created, you expire the existing record by changing its expiry date to one day prior to the effective date of the new record. For instance, if the effective date of the new record is January 19, 2008, you set the expiry date of the existing record to January 18, 2008. You also set the expiry date of the new record to 9999–12–31. By contrast, SCD1 does not use effective dates and expiry dates.

The script in Listing 2.3 applies SCD2 to the product_dim table. As such, whenever there is a change in the product_name or product_category columns, SCD2 expires the existing row and adds a new row that describes the same product. Note that the script in Listing 2.3 assumes that new product information is available in a staging table called product_stg.

Listing 2.3: Applying SCD2 to product_name and product_category in the product_dim table

image from book
 /*****************************************************************/ /*                                                               */ /* scd2.sql                                                      */ /*                                                               */ /*****************************************************************/ /* default database to dw                                        */ USE dw; /* expire the existing product                                   */ UPDATE   product_dim a , product_stg b SET      expiry_date = SUBDATE (CURRENT_DATE, 1) WHERE      a.product_code = b.product_code AND ( a.product_name <> b.product_name      OR a.product_category <> b.product_category ) AND expiry_date = '9999-12-31' ; /* add a new row for the changing product                          */ INSERT INTO product_dim SELECT   NULL , b.product_code , b.product_name , b.product_category , CURRENT_DATE , '9999-12-31' FROM   product_dim a , product_stg b WHERE      a.product_code = b.product_code AND ( a.product_name <> b.product_name      OR a.product_category <> b.product_category ) AND EXISTS ( SELECT * FROM product_dim x WHERE b.product_code = x.product_code   AND a.expiry_date = SUBDATE (CURRENT_DATE, 1)       AND NOT EXISTS ( SELECT *    FROM product_dim y    WHERE    b.product_code = y.product_code          AND y.expiry_date = '9999-12-31' ) ; /* add new product                                              */ INSERT INTO product_dim SELECT   NULL , product_code , product_name , product_category , CURRENT_DATE , '9999-12-31' FROM product_stg WHERE product_code NOT IN( SELECT y.product_code FROM product_dim x, product_stg y WHERE x.product_code = y.product_code ) ; /* end of script                                                 */
image from book

Note 

In a production environment you schedule this script and related scripts to run regularly within the data warehouse load cycle. (Regular loading is discussed in Chapter 8, "Regular Population.")

Before you can run the script in Listing 2.3, you need to prepare the data first and create the product_stg table. In fact, there are two stages of preparation. The first stage is comprised of three steps:

  1. Preparing the product source file.

  2. Creating the product_stg table and loading the product source file by running the create_product_stg.sql script (given in Listing 2.4) and the load_product_stg.sql script (shown in Listing 2.5).

  3. Running the scd2.sql script to initially populate the product_dim table

The second stage consists of two steps:

  1. Changing the product source file and loading it into its staging

  2. Applying SCD2 by running the scd2.sql script again and confirming SCD2 has been applied correctly.

All the steps are discussed in the following subsections.

Preparing the Product Source File

You will load two products into the product_dim table from a fixed-width text file named product.txt included in the zip file accompanying this book. Here is the content of the file.

 PRODUCT CODE  PRODUCT NAME  PRODUCT GROUP 1             Hard Disk     Storage 2             Floppy Drive  Storage

All you need to do is copy this file to the c:\mysql\data\dw directory. In fact, you will need to copy all flat files to this directory.

Creating the Staging Table and Loading Data

You create the product_stg table using the script in the create_product_stg.sql script in Listing 2.4.

Listing 2.4: Creating the product_stg table

image from book
 /*****************************************************************/ /*                                                               */ /* create_product_stg.sql                                        */ /*                                                               */ /*****************************************************************/ /* default database to dw                                        */ USE dw; CREATE TABLE product_stg ( product_code INT , product_name CHAR (30) , product_category CHAR (30)) ; /* end of script                                                 */
image from book

Now run the create_product_stgsql script.

 mysql> \. c:\mysql\scripts\create_product_stg.sql

Next, you use MySQL’s LOAD DATA INFILE command to load the content of a flat file into a MySQL table. The load_product_stg.sql script in Listing 2.5 loads the product.txt file to the product_stg table.

Listing 2.5: Loading products to its staging table

image from book
 /*****************************************************************/ /*                                                               */ /* load_product_stg.sql                                          */ /*                                                               */ /*****************************************************************/ /* default database to dw                                        */ USE dw; /* clean up the staging table                                    */ TRUNCATE product_stg; /* use LOAD DATA INFILE                                          */ LOAD DATA INFILE 'product.txt' INTO TABLE product_stg FIELDS TERMINATED BY '' OPTIONALLY ENCLOSED BY '' LINES TERMINATED BY '\r\n' IGNORE 1 LINES ( product_code , product_name , product_category ) ; /* end of script                                                 */
image from book

You can run the script in Listing 2.5 by using this command.

 mysql> \. c:\mysql\scripts\load_product_stg.sql

The response should be similar to the following.

 Database changed Query OK, 1 row affected (0.09 sec) Query OK, 2 rows affected (0.09 sec) Records: 2  Deleted: 0  Skipped: 0 Warnings: 0

Next, query the product_stg table to confirm that you have loaded the two products successfully.

 mysql> select * from product_stg;

Here is how the content of the product_stg table should look like.

 +----------------+---------------+-------------------+ |  product_code  |  product_name |  product_category | +----------------+---------------+-------------------+ |             1  |  Hard Disk    |  Storage          | |             2  |  Floppy Drive |  Storage          | +----------------+---------------+-------------------+ 2 rows in set   (0.00 sec)

Running the Initial Population

Set your operating system date to February 3, 2007, then run the scd2.sql script in Listing 2.3 to initially populate the product_dim table with the two products.

 mysql> \. c:\mysql\scripts\scd2.sql

You should see this on your console.

 Database changed Query OK, 0 rows affected (0.16 sec) Rows matched: 0  Changed: 0  Warnings: 0 Query OK, 0 rows affected (0.02 sec) Records: 0  Duplicates: 0  Warnings: 0 Query OK, 2 rows affected (0.06 sec) Records: 2  Duplicates: 0  Warnings: 0

Then, query the product_dim table to confirm that you have loaded the two products successfully.

 mysql> select * from product_dim \G

You should see the following response.

 *************************** 1. row ***************************       product_sk: 1     product_code: 1     product_name: Hard Disk product_category: Storage   effective_date: 2007-02-03      expiry_date: 9999-12-31 *************************** 2 row ***************************       product_sk: 2     product_code: 2     product_name: Floppy Drive product_category: Storage   effective_date: 2007-02-03      expiry_date: 9999-12-31 2 rows in set (0.00 sec)

Changing the Product Source File

Next, using a text editor, update the product.txt file you used earlier. You need to change the name of product code 1 into "Hard Disk Drive" and add product code 3.

 PRODUCT CODE,PRODUCT NAME,PRODUCT GROUP 1          Hard Disk Drive               Storage 2          Floppy Drive                  Storage 3          LCD Panel                     Monitor

Applying SCD2

Set your MySQL date to February 5, 2007. This is necessary to simulate a later date than the date of the previous (initial) product population (February 3, 2007).

Now, run the load_product_stg.sql script in Listing 2.5 one more time and then run the scd2.sql script in Listing 2.3.

 mysql> \. c:\mysql\scripts\load_product_stg.sql

Here is what you should see on your console.

 Database changed Query OK, 2 rows affected (0.11 sec) Query OK, 3 rows affected (0.06 sec) Records: 3  Deleted: 0  Skipped: 0  Warnings: 0 mysql> \. c:\mysql\scripts\scd2.sql Database changed Query OK, 1 row affected (0.43 sec) Rows matched: 1  Changed: 1  Warnings: 0 Query OK, 1 row affected (0.08 sec) Records: 1  Duplicates: 0  Warnings: 0 Query OK, 1 row affected (0.05 sec) Records: 1  Duplicates: 0  Warnings: 0

To confirm SCD2 has worked correctly, query the product_dim table.

 mysql> select * from product_dim \G *************************** 1. row ***************************       product_sk: 1     product_code: 1     product_name: Hard Disk product_category: Storage   effective_date: 2007-02-03      expiry_date: 2007-02-04 *************************** 2. row ***************************       product_sk: 2     product_code: 2     product_name: Floppy Drive product_category: Storage   effective_date: 2007-02-03      expiry_date: 9999-12-31 *************************** 3. row ***************************       product_sk: 3     product_code: 1     product_name: Hard Disk Drive product_category: Storage   effective_date: 2007-02-05      expiry_date: 9999-12-31 *************************** 4. row ***************************       product_sk: 4     product_code: 3     product_name: LCD Panel product_category: Monitor   effective_date: 2007-02-05      expiry_date: 9999-12-31 4 rows in set (0.00 sec)

Analysing the Result

The following analysis on the product_dim table proves that SCD2 has been applied correctly.

  • Product 1 has two rows. One of the two rows (whose product_sk is 1) has expired. Its expiry date is February 4, 2007, that is one day earlier than the date you applied SCD2 (February 5, 2007). The other row (whose product_sk is 3) has a new name. Its effective date is February 5, 2007. Its expiry date is 9999–12–31, which means it has not expired.

  • The new product with product code 3 is added, its effective date is February 5, 2007.



Dimensional Data Warehousing with MySQL. A Tutorial
Dimensional Data Warehousing with MySQL: A Tutorial
ISBN: 0975212826
EAN: 2147483647
Year: 2004
Pages: 149

flylib.com © 2008-2017.
If you may any questions please contact us: flylib@qtcs.net