Generating New Shipment Schedules


The Mining-Model Results

In Chapter 5, Savings Mart used a system of minimum and maximum thresholds to indicate when shipments to stores should be generated. At the end of one year, management realized that 295 shipments were made for Store 1 alone. Thus, for every day of the year, there was an 81 percent probability (295/365 = .8082191) that a shipment would be made. Since each shipment involves substantial operating costs, management knew that a better system would have to be devised.

In Chapter 5, Analysis Services was used to build a mining model named "Analyze Shipments." This mining model attempted to predict days between shipments and the quantity to be shipped, identifying the most influential factors affecting each. When we examine the Content view in the Relational Mining Model Editor, we can see that the most significant factor affecting days since last shipped is the vendor name.

If we drill down on the vendor named Ables (see Figure 6.1), we see that the next most significant factor is the Store ID. The Store ID is used to distinguish predictions on two levels. The first split separates predictions for Store ID 1 from all the other stores. Beyond that, it also separates predictions for Store ID 3 from the remaining stores, in this case Stores 2, 4, and 5. For those remaining stores, there is a 90 percent probability that there will be 111 days between shipments.

Figure 6.1. Screenshot of the Content Editor as it displays the node path where vendor name is Ables and the Store ID is not 1 and not 3.


Note

Prediction results are dependent upon the data and will vary if you run the LoadSampleData program from Chapter 5 with parameters different from the ones listed in the book. If you generated purchases just for Store 1 and not the others, the most significant factor would be product type rather than vendor name.


For every vendor there are predictions of the time to ship products and what quantity should be shipped. The next step is to get this information into a usable format so that new shipping schedules can be estimated. This involves the ability to query the mining-model results for specific vendors and stores.

Refreshing the Mining Model

Once a mining model has been trained and tested, and you believe it will produce good results, you will need to set up a process that allows the mining model to be refreshed periodically. Refreshing a mining model involves retraining the mining model using new and/or updated data without recreating it. This can be done manually using Analysis Manager, or programmatically using Decision Support Objects (DSO).

The LoadSampleData project from Chapter 5 has been modified to include additional procedures for this chapter. The revised LoadSampleData project is available for download from the book's Web site. If you copy the files to your local machine, open the solution file, and expand the References node in Solution Explorer, you will see references to ADODB and DSO. The DSO reference is a COM-based reference to the Microsoft Decision Support Objects library, which allows us to process the mining model programmatically. The third reference is a .NET reference to the ADOMD library namespace, Microsoft.AnalysisServices.AdomdClient; it will only be available after you have successfully installed the ADOMD.NET SDK.
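With those references in place, the code files would presumably carry Imports statements along these lines. This is a sketch; the exact list in the downloadable project may differ:

 Imports System.Data
 Imports System.Data.SqlClient
 Imports Microsoft.AnalysisServices.AdomdClient
 Imports DSO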

Open the main form by double-clicking Form1.vb in Solution Explorer. You should see a new button titled Apply Shipment Schedules (see Figure 6.2). Clicking this button executes two routines named ProcessMiningModel and PopulateShipmentSchedulesTable, as sketched below. Note that the input variables available in the last chapter were removed; their values are now hard-coded within the program to speed processing.
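As a sketch, the button's Click handler simply calls the two routines in order. The handler and button names shown here are assumptions and may differ from the downloadable project:

 Private Sub btnApplySchedules_Click(ByVal sender As Object, _
       ByVal e As System.EventArgs) Handles btnApplySchedules.Click
    ' Retrain the model first, then write its predictions to SQL Server
    ProcessMiningModel()
    PopulateShipmentSchedulesTable()
 End Sub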

Figure 6.2. Screenshot of the main form used by the LoadSampleData project available with Chapter 6. This project is based on the same one in Chapter 5, but applies the predictions made by Analysis Services.


Since the data is not stored with the mining model, the model will need to be reprocessed periodically. Processing a mining model is accomplished by right-clicking the model name from Analysis Manager and selecting Process. When a mining model is processed, the processing method is either a Full process or Refresh Data (see Figure 6.3). Refreshing the data involves retraining the model without completely recreating it. This is the faster option and the one that should be chosen if no changes are made to the structure of the model.

Figure 6.3. Screenshot of the wizard that appears when a mining model is processed from within Analysis Manager. Unless the structure of the model is changed, Refresh Data is the option that will be chosen here.


The ProcessMiningModel procedure programmatically accomplishes the same thing as clicking Process... from the Analysis Manager interface: it reexamines the data using the specified mining model. It first opens a connection to the Analysis Server and sets a reference to the Analyze Shipments mining model (named oModel). It then initiates a refresh of the mining model by executing the following lines of code:

 Try
    oModel.LockObject(OlapLockTypes.olapLockProcess, _
        "Training the Shipments mining model")
    oModel.Process(ProcessTypes.processRefreshData)
    oModel.UnlockObject()
 Catch ex As Exception
    MessageBox.Show("Unable to refresh the mining model")
    Exit Sub
 End Try

A lock is placed on the object before processing occurs in order to prevent conflicts and is released after the data is refreshed. At this point, what is refreshed are the predictions themselves. This is because Analysis Services does not store the raw data used as input, but instead stores the predictions made as a result of applying the data-mining algorithm.
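For context, the connection and model lookup that precede this Try block would look something like the following sketch. The server name ("LocalHost") and database name ("SavingsMart") are assumptions based on the chapter's sample setup, not necessarily the exact names used in the downloadable project:

 ' Connect to the Analysis Server and locate the mining model
 Dim oServer As New DSO.Server
 oServer.Connect("LocalHost")
 Dim oDB As DSO.MDStore = oServer.MDStores.Item("SavingsMart")
 Dim oModel As DSO.MiningModel = oDB.MiningModels.Item("Analyze Shipments")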

Querying the Mining Model

The PopulateShipmentSchedulesTable procedure, shown here, is where we actually query the mining model for results.

 ' Open the connection to Analysis Services
 AScn.ConnectionString = sASConn
 AScn.Open()

 ' Get a list of vendors and stores from our SQL tables
 Dim ds As DataSet = SqlHelper.ExecuteDataset(sConn, _
     CommandType.StoredProcedure, "GetStoresAndVendors")

 ' Now we will loop through the vendors and stores to get the
 ' predicted quantity and days between shipments.
 ' Results will be stored in the ShipmentSchedules table
 Dim Vrow As DataRow
 Dim nStoreID As Int16
 Dim nVendorID As Int16
 Dim sVendor As String, sType As String
 For Each Vrow In ds.Tables(0).Rows()
    nVendorID = Vrow(0)
    ' Double embedded single quotes so vendor and type names
    ' survive inside the dynamic DMX statement
    sVendor = Convert.ToString(Vrow(1)).Replace("'", "''")
    sType = Convert.ToString(Vrow(2)).Replace("'", "''")
    Dim SRow As DataRow
    For Each SRow In ds.Tables(1).Rows()
       nStoreID = SRow(0)
       Dim cmd As New AdomdCommand
       cmd.Connection = AScn
       ' Version 1: predict by vendor name and store ID only
       Dim sSQL As String = "SELECT Predict("
       sSQL += "[Days Since Last Shipped]), "
       sSQL += "Predict(Quantity)"
       sSQL += " FROM [Analyze Shipments] PREDICTION JOIN "
       sSQL += "(SELECT '" + sVendor + "' as VendorName, "
       sSQL += Convert.ToString(nStoreID) + " as StoreID) as t "
       sSQL += "ON [Analyze Shipments].[Vendor Name] = t.VendorName"
       sSQL += " AND [Analyze Shipments].[Store ID] = t.StoreID "
       ' Version 2: adds product type (discussed later in the chapter)
       'Dim sSQL As String = "SELECT Predict("
       'sSQL += "[Days Since Last Shipped]), "
       'sSQL += "Predict(Quantity)"
       'sSQL += " FROM [Analyze Shipments] PREDICTION JOIN "
       'sSQL += "(SELECT '" + sVendor + "' as VendorName, "
       'sSQL += Convert.ToString(nStoreID) + " as StoreID, "
       'sSQL += "'" + sType + "' as Type) as t "
       'sSQL += "ON [Analyze Shipments].[Vendor Name] = t.VendorName "
       'sSQL += "AND [Analyze Shipments].[Store ID] = t.StoreID "
       'sSQL += "AND [Analyze Shipments].[Product Type] = t.Type "
       cmd.CommandText = sSQL
       Dim dr As AdomdDataReader = cmd.ExecuteReader()
       dr.Read()
       ' Write the two predictions to the ShipmentSchedules table,
       ' inserting on the first run and updating thereafter
       Dim params(3) As SqlParameter
       params(0) = New SqlParameter("@StoreID", SqlDbType.Int)
       params(1) = New SqlParameter("@VendorID", SqlDbType.Int)
       params(2) = New SqlParameter("@Days", SqlDbType.Int)
       params(3) = New SqlParameter("@Qty", SqlDbType.Int)
       params(0).Value = nStoreID
       params(1).Value = nVendorID
       params(2).Value = Convert.ToInt32(dr.Item(0))
       params(3).Value = Convert.ToInt32(dr.Item(1))
       Dim sProc As String
       If IsUpdate = False Then
          sProc = "InsertShipmentSchedule"
       Else
          sProc = "UpdateShipmentSchedule"
       End If
       Dim ret As Integer = SqlHelper.ExecuteScalar(sConn, _
           CommandType.StoredProcedure, sProc, params)
       dr.Close()
    Next
 Next

The first lines of code in this procedure open the AdomdConnection to Analysis Services. We then populate a regular ADO.NET DataSet with a listing of all the vendors and stores. For each vendor we loop through the listing of stores and build the mining-model query. Note that single quotes in vendor and type names are doubled so that, for example, a vendor named O'Hare would be embedded in the query as 'O''Hare'. The code includes a commented version of the dynamic SQL statement that adds product type as a query variable. For now, product type is not included, since initial processing did not identify it as a factor affecting the outcome; this will change after further processing and will be discussed later. Version 1 is the statement that will be executed.

Building a Prediction Query

The syntax for the query is very similar to that of a Transact-SQL statement executed against SQL Server. The following is what one such query looks like.

 SELECT Predict([Days Since Last Shipped]), Predict(Quantity)
 FROM [Analyze Shipments]
 PREDICTION JOIN
 (SELECT 'Ables' as VendorName, 1 as StoreID) as t
 ON [Analyze Shipments].[Vendor Name] = t.VendorName
 AND [Analyze Shipments].[Store ID] = t.StoreID

In this query we select the two predicted outcome variables from the Analyze Shipments mining model. The PREDICTION JOIN clause specifies that we want to return values for a specific vendor and store.
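In general terms, a singleton prediction query follows this skeleton, where the inner SELECT supplies the input case that the model predicts against:

 SELECT Predict(<predictable column>), ...
 FROM <mining model>
 PREDICTION JOIN
 (SELECT <value> AS <alias>, ...) AS t
 ON <mining model>.<input column> = t.<alias> [AND ...]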

This particular query is known as a prediction query because it is used to retrieve results from an already trained model. The other type of query that Analysis Services supports is a content query. This type of query is used to browse the contents of a mining model and can be useful if you are programmatically trying to identify trends in the processed data.
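As an illustration, a content query against our model might look like the following sketch; the column names come from the standard mining-model content schema defined by the OLE DB for Data Mining specification:

 SELECT NODE_CAPTION, NODE_SUPPORT, NODE_PROBABILITY
 FROM [Analyze Shipments].CONTENT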

Tip

If you would like to review the content of your mining model as raw XML, you can perform a query using Predictive Model Markup Language, or PMML. For instance, a query such as

SELECT * FROM [Analyze Shipments].PMML

will return a result set in which item 5 contains the raw XML stream. The stream contains the actual statement used to create the mining model as well as a data dictionary that lists the categories and attributes involved. The global statistics section lists the data distribution or clustered segments for each attribute. Finally, the tree model contains the actual predictions made by the model's algorithm.

To view the raw XML associated with the Analyze Shipments model, refer to the pmml.xml file, available for download from the book's Web site.
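Programmatically, the same PMML stream could be pulled through ADOMD.NET with a few lines such as these. This is a sketch that reuses the AScn connection from the earlier listing and reads the XML from item 5, as described above:

 ' Retrieve the PMML representation of the mining model
 Dim cmdPmml As New AdomdCommand("SELECT * FROM [Analyze Shipments].PMML", AScn)
 Dim drPmml As AdomdDataReader = cmdPmml.ExecuteReader()
 If drPmml.Read() Then
    Dim sPmmlXml As String = Convert.ToString(drPmml.Item(5))
 End If
 drPmml.Close()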


Our prediction query is useful for extracting the top prediction result for each vendor and store. We knew to restrict the query to these values because they were identified as the top factors in predicting days between shipments. If you were to browse the contents of the Analyze Shipments mining model from the data-mining Model Browser in Analysis Manager (see Figure 6.4), you would be able to drill down into a specific node path. The prediction query built in the PopulateShipmentSchedulesTable routine returns the same results as using Analysis Manager.

Figure 6.4. Screenshot of the data-mining model browser in Analysis Manager. This view is used to predict days between shipments for the node path where Store ID is 1 and the vendor name is Ables. The attributes section lists values for the four case sets along with their probability. The case with the highest probability (73%) has a value of 55, indicating that this is the predicted number of days between shipments.


Understanding the Query Results

For the Ables vendor and Store 1, Figure 6.4 lists the clustered node totals in the Attributes section. Here we see that the case with the highest probability predicts that the days between shipments will be 55. This is the same result returned by the prediction query listed previously.

Store 1, then, will receive 225 items every 55 days. All the other stores will receive 201 items every 111 days, except Store 3, which will receive 201 items every 80 days. Once the mining model is queried and the data is written to the ShipmentSchedules table, we should see the following records listed first:

ShipmentScheduleID   VendorID   StoreID   DaysBetweenShipments   Quantity
1                    1          1         55                     225
2                    1          2         111                    201
3                    1          3         80                     201
4                    1          4         111                    201
5                    1          5         111                    201

If you were to run the query "SELECT DISTINCT DaysBetweenShipments FROM ShipmentSchedules" in Query Analyzer, you should see these results: 22, 55, 80, and 111. These represent the clustered values predicted for days between shipments. The same query for Quantity might reveal values such as 201, 225, and 283. The astute reader may notice that the four distinct days between shipments exactly match the values in the global statistics section of the pmml.xml file.

The mining algorithms used by Analysis Services do not simply output averages of data values. They apply complex mathematical methods in an iterative fashion in order to cluster data into the most predictable groups. For a detailed explanation of the specifics of algorithms like decision trees, you can refer to the section in the latest MSDN library (available at msdn.microsoft.com) titled Microsoft Decision Trees.

Case Study: Web Usage Mining

Denis Susac, Webmaster and programmer for the AboutAI Web site (www.aboutai.net), used Microsoft SQL Server 2000 to add data-mining functionality to the Web site. AboutAI is a Web site dedicated to informing visitors and subscribing members about the different areas of AI. It acts as a portal and includes forums for connecting members.

The AboutAI Web site was built using Microsoft .NET and was based on the IBuySpy Portal application, available through MSDN. IBuySpy is a free ASP.NET Web development tool that Microsoft offers to give developers a head start on developing their own Web sites (www.ibuyspy.com).

Denis used Analysis Services to build mining models that examined data about Web usage. The usage data was stored in two specially created SQL Server tables named Hits and HitCount. A clustering algorithm was used to group users and Web pages into similar clusters. A decision tree algorithm was then used to make predictions about related pages that the user might like to visit. The results of the prediction were displayed to the user as suggestions.

Predictions would be generated each time a page was accessed on the Web site; the processing of the mining model, where historical data is fed into the mining algorithm, would take place at scheduled intervals.

I contacted Denis Susac in late 2004 to see if the Web usage mining engine was still useful to him. He informed me that the data-mining functionality was now being used inside another product, named the Mono .NET Portal Engine. The Mono .NET Portal Engine, available from http://www.mono-software.com/, is a customizable framework for building portal Web sites that is easy to use yet offers extensive functionality.

Denis feels that Analysis Services is a useful tool and had this to say about it:

I like this utility very much; it requires some time to get everything organized, but when it finally starts processing usage log data, it seems almost like magic. Results were pretty good and I think it can be used in a variety of real-world scenarios, especially market basket analysis.

For more information about how Denis created his data-mining solution, you can access a detailed article he wrote for ASP Today at www.asptoday.com/Content.aspx?id=1898

