Savings Mart

 < Day Day Up > 

Savings Mart is a fictitious discount retailer operating in a single American state. It has been in business since 2001 and hopes to open new stores by achieving greater operational efficiencies. Since its inception, Savings Mart has relied on a system of adjusting product inventory thresholds to determine when shipments will be made to stores. Every time a product is purchased, the quantity for that product and store is updated. When the quantity dips below the minimum threshold allowed for that product and store, an order is automatically generated and a shipment is made three days later.

Although this process seemed like a good way to ensure that the stores were well stocked, it resulted in shipments being made to each store almost every day. This resulted in high overhead costs for each store. Management now wishes to replace the order/shipment strategy with a system designed to predict shipment dates rather than rely on adjustable thresholds.

A sample application presented in this chapter allows the reader to create a training dataset for Savings Mart based on randomly generated purchases. The reader can then step through the process of creating a mining model based on the dataset.

Loading the Savings Mart Database

To execute the sample code with SQL Server, you will need to create a database using a script file available on the book's Web site. The installation steps are as follows:

1.

Open SQL Server's Query Analyzer and connect to the server where you wish to install the database.

2.

Click File and Open. . .

3.

From the Open Query File dialog, browse to location of the InstallDB.sql file available on the book's Web site. Once selected, click OK.

4.

Click Query and Execute or hit F5.

5.

Once the script has completed executing, a new database named SavingsMart will be visible in the drop-down box on the toolbar.

Figure 5.1 is a diagram of the SavingsMart database. The Products table contains a record for every product available in the stores. Each product is associated with a certain vendor and assigned to a product type, such as beverage or medicine. A field named DiscontinuedDate will contain either a null, meaning it is not discontinued, or the date that it should no longer be available for order. Every product should have a UnitQuantity of one or greater, which indicates the number of items packaged with that product. The products UnitPrice represents the retail price before any discounts are applied. The UnitType and UnitAmount fields may or may not contain values, depending on the product. For instance, a bottled water product will have a UnitType of "oz" and a Unit-Amount of 16.4, indicating that it weighs 16.4 fluid ounces. It is not necessary to record the weight of a mop, so for this product these values would be null.

Figure 5.1. The SavingsMart database. The sample database contains five hundred products and five stores. Stores are stocked with all products according to threshold values contained in the ProductThresholds table. Each time a product is ordered and a shipment completed, the quantity field in ProductQty is updated.


The Purchases table is written to every time a customer makes a single purchase. It records information common to all purchases, such as when the purchase took place, what store it was made in, and what employee rang up the purchase. Purchases are made for products available within a particular store. Availability is determined by examining the Quantity field in the table ProductQty. A purchase can include multiple products and more than one unit of each product. The ProductDetail table is a child of products, and it contains a record for each product associated with a single purchase. If the product purchased is on sale during the time of purchase, then the discount percentage, specified in the ProductDiscounts table is applied.

Once a product is sold beneath the minimum threshold allowed for the store, as indicated by the ProductThresholds table, an order is automatically placed. The quantity for the order is based on the maximum amount found in the ProductThresholds table. Each shipment is the direct result of an order and is typically completed three days after the order is placed.

Populating the Database

Once the SavingsMart database is created, the next step is to populate the database. Unlike the sample databases in Chapters 2 and 3, the SavingsMart database needs to be populated with a large quantity of data. To facilitate this process and provide a method for generating unique training datasets, a sample data-loading application is included on the book's Web site at http://www.awprofessional.com/title/0321246268. The sample Windows application, named LoadSampleData, will allow you to simulate random purchases as well as to initiate orders and shipments needed to restock products.

Utilizing the LoadSampleData program ensures that you are dealing with a clean database. Very often, the most difficult and time-consuming part of successful data mining involves cleaning the historical database to remove or replace records holding invalid values. Refer to the next section, "Cleaning the Database," for more information about this.

The LoadSampleData program also gives you an opportunity to adjust factors affecting the mining model and therefore produce different results. For instance, the program allows you to select certain product types and vendor names that will be purchased more often.

The LoadSampleData application consists of one form, Form1.vb (see Figure 5.2). It utilizes the Microsoft Data Application block to handle data access and the Microsoft Exception Application block to handle writing exceptions to the event log. The Load Data button is used to populate tables with values from text files, available for download from the book's Web site. The following is a list of these text files along with a brief description of what they contain:

  • Stores.txt Data for a total of five stores.

  • Employees.txt Assigns three employees to each store.

  • Vendors.txt Data for a total of thirty-four vendors or product brands.

  • ProductTypes.txt A total of fourteen product types, including such items as Beverages and Kitchen Supplies.

  • Products.txt A total of five hundred products representing each of the product types and vendors.

Figure 5.2. Screenshot of the main form used in the LoadSampleData program. This program will be used to load initial data values into the SavingsMart database. It will also allow the reader to simulate random customer purchases in order to populate a large historical dataset.


The LoadData routine is also used to populate the ProductThresholds table with set values for minimum and maximum threshold amounts. The minimum field represents the minimum quantity of product that should be available in a certain store. The maximum field is the quantity that will be ordered for that store when the minimum threshold is broken. Initially, the minimum and maximum values will be set at ten and two hundred respectively. This will be the case for each product and each store, resulting in a total of twenty-five hundred records (500 products X 5 stores = 2500 records).

Finally, the LoadData routine will generate orders and shipments for each of the five stores. The initial orders will stock the stores with the maximum threshold for all five hundred products. The shipment date will occur three days after the order date to ensure that all stores are fully stocked on the first day of the year.

To begin loading data, execute the following steps:

1.

Copy the contents of the LoadSampleData directory, available for download from the book's Web site, to a location accessible by your development machine. Note the location because it will be used to set the sPath variable in step 4.

2.

Open the LoadSampleData project file with Visual Studio.NET.

3.

From Solution Explorer, right-click Form1 and select View Code.

4.

The top of the form contains two variables that will be unique to your installation. Ensure that the sConn and sPath variables are set correctly. sConn is a string variable containing the connection string used to connect to the SavingsMart database on SQL Server . sPath is a string variable containing the file path to the text files. The text files reside in a subdirectory name TextFiles. This subdirectory is located inside the LoadSampleData directory (created in step 1).

5.

Execute the application by selecting Start from the Debug Menu. Figure 5.2 is a screenshot of form1.

6.

To begin, click the Load Data button and ensure that the message box "Initial Data Load is complete" appears. Note that the form contains several combo and textboxes that will determine what and how data is loaded.

Tip

If you do not wish to load the database utilizing the sample application provided, you can instead attach the database file supplied on the book's Web site. To attach the file, execute the following steps:

1.

Copy the SavingsMart.mdf and SavingsMart.ldf files from the book's Web site to a local directory on the server where Microsoft SQL Server is installed.

2.

Open SQL Server's Enterprise Manager.

3.

Right-click the Databases node and click All Tasks. . . and Attach Database.

4.

From the Attach Database dialog, browse to the directory where you copied the database files in step 1 and select the SavingsMart.mdf file. Click OK.

5.

Click OK to attach the database.

6.

To see the newly added database, you will need to click Action and Refresh.


The next step is to generate purchases for each of the five stores. Data mining is most effective in dealing with large datasets. Therefore, the GeneratePurchases routine, seen as follows, will insert approximately 100,000 records in the PurchaseDetail table for each store and calendar year. Purchases are generated for one store one year at a time.

 'Maximum # of purchases per day Dim nMaxPurchases As Int16 = txtMaxPurchases.Text 'Maximum # of products per purchase Dim nMaxProducts As Int16 = txtMaxProducts.Text 'Maximum value of quantity per product Dim nMaxQuantity As Int16 = txtMaxQuantity.Text Dim sYear As String = cboYear.Text Dim nStoreID As Int16 = cboStoreID.Text 'These are the Product Types in which there is an 'increased chance of product selection 'The default of 1,2 represents snack foods and beverages Dim sProductTypeAdj As String = txtProductTypeAdj.Text 'These are the Vendors in which there is an increased 'chance of product selection 'The default of 20,24,27 represents Kamp, Notch, and PNuts as Vendors Dim sVendorAdj As String = txtVendorAdj.Text ProgressBar1.Minimum = 1 ProgressBar1.Maximum = 366 Try   Dim params(2) As SqlParameter   params(0) = New _    SqlParameter("@ID", SqlDbType.Int)   params(1) = New _    SqlParameter("@ProdTypeAdj", SqlDbType.VarChar, 50)   params(2) = New _    SqlParameter("@VendorAdj", SqlDbType.VarChar, 50)   params(0).Value = nStoreID   params(1).Value = sProductTypeAdj   params(2).Value = sVendorAdj   Dim ds As DataSet = _   SqlHelper.ExecuteDataset(sConn, _    CommandType.StoredProcedure, "GetProductIDS", params)   Dim i As Int16 = 1   Dim dtDate As DateTime   dtDate = Convert.ToDateTime("01/01/" + sYear)   'Loop through everyday of the year   'We assume the store is open every day   Randomize()   Do Until i = 366     'First thing is check to see if orders needs to     'be fulfilled for this day and store     'We assume that all orders are shipped 3 days     'after the orderdate in one shipment     Dim params1(1) As SqlParameter     params1(0) = New _       SqlParameter("@StoreID", SqlDbType.Int)     params1(1) = New _       SqlParameter("@Date", SqlDbType.SmallDateTime)     params1(0).Value = nStoreID     'order was placed 3 days ago     params1(1).Value = dtDate.AddDays(-3)     SqlHelper.ExecuteReader(sConn, _      CommandType.StoredProcedure, "InsertShipment", params1)     Dim x As Int16 = 1     'This will be the total number of purchases for this day     Dim nPurchases As Int16     nPurchases = CInt(Int((nMaxPurchases * Rnd()) + 1))     Do Until x = nPurchases + 1     Dim y As Int16 = 1     Dim nEmployeePos As Int16     nEmployeePos = CInt(Int((ds.Tables(1).Rows.Count * Rnd())))     Dim nEmployeeID As Integer = _        Convert.ToInt32(ds.Tables(1).Rows(nEmployeePos).Item(0))     Dim params2(2) As SqlParameter     params2(0) = New SqlParameter("@ID1", SqlDbType.Int)     params2(1) = New _       SqlParameter("@Date", SqlDbType.SmallDateTime)     params2(2) = New SqlParameter("@ID2", SqlDbType.Int)     params2(0).Value = nStoreID     params2(1).Value = dtDate     params2(2).Value = nEmployeeID     Dim nPurchaseID As Integer = _        SqlHelper.ExecuteScalar(sConn, _      CommandType.StoredProcedure, "InsertPurchase", params2)     'This is total number of products for this purchase     Dim nProducts As Int16 = _       CInt(Int((nMaxProducts * Rnd()) + 1))     Do Until y = nProducts + 1       'This is quantity for this purchase       Dim nQty As Int16 = _         CInt(Int((nMaxQuantity * Rnd()) + 1))       'This is the product for this detail record       Dim nProductPos As Int16 = _         CInt(Int((ds.Tables(0).Rows.Count * Rnd())))       Dim nProductID As Integer = _        Convert.ToInt32(ds.Tables(0).Rows(nProductPos).Item(0))        'Generate the detail record       Dim params3(4) As SqlParameter        params3(0) = New SqlParameter("@StoreID", SqlDbType.Int)       params3(1) = New _         SqlParameter("@ProductID", SqlDbType.Int)       params3(2) = New _         SqlParameter("@PurchaseID", SqlDbType.Int)       params3(3) = New SqlParameter("@Qty", SqlDbType.Int)       params3(4) = New _         SqlParameter("@Date", SqlDbType.SmallDateTime)       params3(0).Value = nStoreID       params3(1).Value = nProductID       params3(2).Value = nPurchaseID       params3(3).Value = nQty       params3(4).Value = dtDate          SqlHelper.ExecuteScalar(sConn, _            CommandType.StoredProcedure, _            "InsertPurchaseDetail", params3)          y += 1        Loop        x += 1       Loop             i += 1       ProgressBar1.Value = i       'Go to the next day       dtDate = dtDate.AddDays(1)      Loop      MessageBox.Show("Purchases for store " _       + Convert.ToString(cboStoreID.Text) + _       " were generated successfully")   Catch ex As Exception      MessageBox.Show(ex.Message)      ExceptionManager.Publish(ex)   End Try 

The amount of records is approximate because the routine utilizes a random number generator to determine the number of purchases per day along with the number of products per purchase. The number of records also varies depending on what input variables are chosen on form1.

The program utilizes default values specifying that purchases will be generated for Store 1 in the year 2001. The GeneratePurchases routine contains a main loop that will execute 365 times for each day of one calendar year. The variable max purchases defaults to 80 and is used to provide the maximum value for the random number generator when determining how many purchases will be generated for a single day.

The variable max products determines the number of distinct products that will be used for a single purchase. Max quantity is used to determine the quantity used in a single purchase detail record. By utilizing the random number generator and then adjusting these values for each store that is processed, we can simulate random purchase activity. In the section titled "Interpreting the Results," we will examine the results of one mining model. To ensure that your results are consistent with the explanations in this section, use the values in Table 5.2 when loading your database.

Table 5.2. Values to be used in the LoadSampleData application when generating purchases for all five stores.

Store

Field Caption

Value (only use the number values and not the literal values in parentheses)

1

Processing Year

2001

 

Max Purchases

80

 

Max Products

20

 

Max Quantity

5

 

Product Type Adjustment

1, 2 (Snack Foods and Beverages)

 

Vendor Adjustment

20, 24, 27 (Kamp, Notch, and Pnuts)

2

Processing Year

2001

 

Max Purchases

60

 

Max Products

12

 

Max Quantity

7

 

Product Type Adjustment

2, 6 (Beverages and Baking Goods)

 

Vendor Adjustment

13, 18 (Gombers and Joe's)

3

Processing Year

2001

 

Max Purchases

70

 

Max Products

12

 

Max Quantity

7

 

Product Type Adjustment

2 (Beverages)

 

Vendor Adjustment

34 (Store Brand)

4

Processing Year

2001

 

Max Purchases

100

 

Max Products

5

 

Max Quantity

2

 

Product Type Adjustment

(leave blank)

 

Vendor Adjustment

34, 18 (Store Brand and Joe's)

5

Processing Year

2001

 

Max Purchases

50

 

Max Products

15

 

Max Quantity

8

 

Product Type Adjustment

6 (Baking Goods)

 

Vendor Adjustment

24, 27, 34 (Notch, Pnuts, and Store Brand)


Of course, since we are using a random number generator, the purchases generated will represent all products equally well over the long run. The Product Type Adjustment and Vendor Adjustment variables are introduced because equal distribution of product purchases is not realistic. These variables contain a comma-delimited list of ProductTypeID and VendorID values. The stored procedure GetProductIDs uses these values when returning the dataset of available ProductID's. If a ProductTypeID is specified, then every product that relates to that product type will be included in the list of available product id's more than once. This will increase the chances that the product will be selected for the PurchaseDetail record. The Vendor Adjustment works similarly in that for each VendorID specified, all products assigned to that vendor will appear in the available product list more than once.

If you do not alter the values on form1, the GeneratePurchases routine will take approximately twenty minutes to load data for each store and calendar year. A progress bar is used to indicate the status of the data load because the process is somewhat time-consuming.

Tip

Although it is not necessary to execute the GeneratePurchases routine for each store and all three years, make sure to load at least one store for one calendar year before continuing. This will provide you with enough data to use for processing.

Once the data has loaded, you should see a dialog with the message "Purchases were generated successfully."

If you do not receive the successful message and instead receive an error message, you will first need to resolve the error. Then you will need to delete the database from SQL Server Enterprise Manager and repeat the steps used to load data.

If you continue to receive an error, consider attaching the database per the instructions in the previous tip.


Case Study: ComputerFleet

ComputerFleet, (www.computerfleet.com.au), based in Sydney, Australia, leases technology equipment to companies in a number of different industries and to government agencies. Its wide range of clients includes small, medium, and large organizations.

Since it was necessary to store a large amount of data (twelve years worth), ComputerFleet was using a data warehouse built on top of Microsoft SQL Server 2000. The company was already using analysis tools like Microsoft Data Analyzer and Microsoft Excel to report on the data.

What ComputerFleet needed was a way to predict when its clients would return leased equipment. Unfortunately, this does not always occur on the agreed lease-end date. Having this information would allow ComputerFleet to better manage its equipment and save money. The company would also be in a better position to value its assets.

ComputerFleet asked a consulting company named Angry Koala (www.angrykoala.com.au) and Microsoft Consulting Services to help it obtain this information. Within a few weeks the consultants had built a data-mining solution with Microsoft Analysis Services.

The data-mining solution uses such attributes as the type of industry, type of equipment, and whether the client is new as input variables to the mining model. The predicted results are stored back in the data warehouse as the predicted asset-arrival time.

ComputerFleet was pleased not only because it was able to make valuable predictions, but because it did not have to purchase any extra proprietary software to do so. Also, it was easy to integrate the data-mining solution with the existing data warehouse in Microsoft SQL Server 2000.

This is a real example of a large company that was able to quickly take advantage of the built-in functionality of Microsoft SQL Server to decrease its operational costs. For more information about ComputerFleet and this Microsoft case study, go to www.microsoft.com/resources/casestudies/casestudy.asp?CaseStudyID=14375.


Cleaning the Database

Cleaning the database is one of the most important tasks in successful data mining. Databases to be mined are often constructed from multiple data sources. These data sources often involve data that is prone to a variety of errors that can destroy any chance you have of making useful predictions. Most everyone has heard or used the phrase "Garbage in, Garbage out." This phrase applies more than ever to data mining.

Possible errors include records with impossible values or values that fall outside the expected range. Also, records that include null values where nulls will negatively impact the decision-making process. Most of these errors could be prevented with database restrictions or application code, but this does not always happen. Since our sample database was artificially created, we can be reasonably sure that these errors do not exist. The following is a list of the errors that could have occurred if our database existed in the real world:

  • Store sale in which the ending data occurs before the starting date.

  • A purchase handled by a store employee before their hire date or for a store they were not assigned to.

  • An order made for a product that was discontinued before the order date.

  • A shipment or order date that is invalid or outside the days of operation for the concerned store.

  • A negative quantity in either PurchaseDetail, OrderDetail, ShipmentDetail, or ProductQty.

  • A product not associated with a vendor or a purchase with no purchase date and quantity.

  • A maximum amount that is greater than the minimum.

The methods used to clean a database can vary. Often, values can be corrected with a few update queries made in Query Analyzer. The hardest part is determining what the correct values should be. More than likely, outside help will be needed from people intimately familiar with the data, such as a store manager.

Creating Views

In order to ease the process of building a mining model, a special view will be created. The view, vw_Shipments, combines fields from five different tables and will be used in the next section to create the mining model. The view utilizes the function fn_GetLastShipmentDate to calculate the number of days between shipments. The Transact-SQL code (viewable from the User Defined Function tab in Enterprise Manager) for this function is as follows:

 CREATE FUNCTION fn_GetLastShipmentDate  (  @ShipmentID int,  @ProductID int  ) RETURNS datetime AS BEGIN  DECLARE @ShippedDate smalldatetime, @TShipmentID int, @ret smalldatetime  DECLARE cursor1 CURSOR SCROLL  FOR select shippeddate, s.shipmentid from shipments s     left join shipmentdetails sd on s.shipmentid = sd.shipmentid     where s.storeid IN (SELECT StoreID FROM Shipments        WHERE shipmentid = @Shipmentid)     AND sd.productid = @ProductID     ORDER BY shippeddate  OPEN cursor1  FETCH NEXT FROM cursor1 INTO @ShippedDate, @TShipmentID    WHILE @@FETCH_STATUS = 0  BEGIN     IF @ShipmentID = @TShipmentID         BEGIN        FETCH PRIOR FROM cursor1 INTO @ShippedDate, @TShipmentID        SET @ret = @ShippedDate        GOTO Close_Cursor         END     FETCH NEXT FROM cursor1 INTO @ShippedDate, @TShipmentID END  Close_Cursor:  CLOSE cursor1  DEALLOCATE cursor1  RETURN(@ret) END 

The function accepts @ShipmentID and @ProductID as input variables. It then opens a scrollable cursor (similar to an ADO resultset) based on the following SQL statement:

 SELECT shippeddate, s.shipmentid FROM shipments s  LEFT JOIN shipmentdetails sd ON s.shipmentid = sd.shipmentid  WHERE s.storeid IN (SELECT StoreID FROM Shipments     WHERE shipmentid = @Shipmentid)  AND sd.productid = @ProductID  ORDER BY shippeddate 

The function loops through the cursor results until it locates the ShipmentID supplied as an input variable. Once located, it moves to the preceding record and returns that shipment date. This shipment date is used as the first variable for the built-in SQL function DATEDIFF. The resulting variable, DaysSinceLastShipped, will be an important column for the Analyze Shipments mining model.

     < Day Day Up > 


    Building Intelligent  .NET Applications(c) Agents, Data Mining, Rule-Based Systems, and Speech Processing
    Building Intelligent .NET Applications(c) Agents, Data Mining, Rule-Based Systems, and Speech Processing
    ISBN: N/A
    EAN: N/A
    Year: 2005
    Pages: 123

    flylib.com © 2008-2017.
    If you may any questions please contact us: flylib@qtcs.net