Analysis Manager (see Figure 5.3) allows you to create and manage databases used by Analysis Services. A database node in Analysis Manager does not represent the physical storage of a large database. Instead, it represents the database that Analysis Services will use to hold the mining-model definitions and the results of processing these mining models. It will, however, reference a physical data source.
Each database in Analysis Manager is associated with one or more data sources. These data sources can be either relational databases or data warehouses. Data sources are created by right-clicking the
Data Sources
node and selecting
New Data Source
. From the
Data Link Properties
dialog, a data provider is selected along with the connection information. Analysis Services supports SQL Server, Access, and Oracle databases as data sources.
Once the model is processed, the data associated with the mining model represents what was learned from the data. The actual data from the training dataset is not stored in the Analysis Services database. The results of analyzing that data, however, are saved.
If you have not already done so, you will need to install Analysis Services. It is available as a separate install with the SQL Server 2000 setup. Make sure that you install the latest Service Pack release for SQL Server. Information about how to obtain the latest SQL Server service pack can be found at http://support.microsoft.com/default.aspx?kbid=290211.
If you have problems connecting with Analysis Services, refer to the article by Alexander Chigrik titled, "Troubleshooting OLAP Problems" at http://databasejournal.com/features/
mssql
/article.php/1582491. This online article, featured in the Database Journal, is a troubleshooting checklist for solving common problems.
To begin,
open
Analysis Manager
from the
Analysis Services
menu item. You will then need to create a new database and specify the data source by executing the following steps:
|
1.
|
Right-click the server
name
in the left-hand pane and select
New Database. . .
|
|
2.
|
Specify 'SavingsMart' as the Database Name and click
OK
|
|
3.
|
Expand the newly added SavingsMart node, right-click Data Sources, and select
New Data Source. . .
|
|
4.
|
From the Data Link Properties dialog box, select Microsoft OLE DB Provider for SQL Server and click
Next
|
|
5.
|
Enter the SQL connection information for your SQL Server and test the connection before closing the Data Link Properties dialog
|
The next thing to do is create the mining model using the mining-model wizard. To do so, execute the following steps:
|
1.
|
Right-click Mining Models in the left-hand pane and select
New Mining Model. . .
|
|
2.
|
Click
Next
on the Welcome Dialog
|
|
3.
|
Click
Next
on the Select Source Dialog because we will be using Relational Data
|
|
4.
|
Select the vw_Shipments view from the Available Tables list box in the Select Case Tables dialog and click
Next
. It would have been possible to select multiple tables, but
utilizing
the view allows access to a calculated field indicating the number of days between shipments.
|
|
5.
|
Click
Next
to accept the default of Microsoft Decision Trees as the data-mining technique.
|
|
6.
|
Click
Next
to accept ShipmentID as the default Case Key Column.
|
|
7.
|
Select the
Finish the mining model in the editor
checkbox and click
Next
.
|
|
8.
|
Name the model "Analyze Shipments" and click
Finish
.
|
|
9.
|
From the Relational Mining Model Editor, as seen in Figure 5.4, click
Insert
and
Column. . .
and then select the column named DaysSinceLastShipped. Once added, change the usage to Input and Predictable (note that a diamond icon now appears next to the column). Then go to the
Advanced Properties
and enter DISCRETIZED(CLUSTERS) as the content type.
|
|
10.
|
Click
Insert
and
Column. . .
and then select the column named StoreID. Once added, change the usage to Input and Predictable.
|
|
11.
|
Click
Insert
and
Column. . .
and then select the column named Quantity. Once added, change the usage to Predictable, and from the Advanced Properties tab, enter DISCRETIZED(CLUSTERS) as the content type.
|
|
12.
|
Click
Insert
and
Column. . .
and then select the column named VendorName.
|
|
13.
|
Click
Insert
and
Column. . .
and then select the column named ProductType.
|
|
14.
|
Click
Tools
and
Process Mining Model. . .
Click
OK
when asked to save the mining model. Then click OK to start a full process of the mining model. This process will take several minutes to run if you loaded data for all five stores. When complete, the message "Processing Complete" will appear in green text.
|
Note
Choosing discretized as the content type allows a continuous variable to be grouped discretely instead. Continuous
variables
are usually numeric-based values that have an infinite range of possibilities. Since we need predictable results, utilizing a discretization method allows for grouped results. DISCRETEIZED accepts two parameters, such as:
DISCRETIZED(<method>, <#buckets>)
where method could contain one of the following values:
EQUAL_AREAS—Divides into equal buckets
THRESHOLDS—Uses
inflection
points to estimate bucket boundaries
CLUSTERS—Uses a clustering algorithm to estimate buckets
AUTOMATIC (default)—Tries all algorithms and uses the first one that suggests number buckets
Training the Mining Model
Training the mining model is accomplished by processing the results of a mining model using Analysis Manager. Alternatively, the same thing could be accomplished using a scripting language known as Data Definition Language (DDL) and a connection to the Analysis Server. We can see what DDL commands are used to train the model through the Process dialog box, as shown in Figure 5.5.
DDL is useful in cases when you want to programmatically process a mining model. The language can be executed through a connection to the Analysis Server. It is also useful for demonstrating how Analysis Manager processes a mining model.
A mining model is created using the CREATE MINING MODEL syntax. The syntax is similar to Transact SQL and should be instantly familiar to SQL developers. The CREATE statement for this mining model is as
follows
:
CREATE MINING MODEL [Analyze Shipments](
[Shipment Id] LONG KEY,
[Days Since Last Shipped] LONG DISCRETIZED(CLUSTERS) PREDICT,
[Store Id] LONG DISCRETE,
[Quantity] LONG DISCRETIZED(CLUSTERS) PREDICT_ONLY,
[Vendor Name] TEXT DISCRETE,
[Product Type] TEXT DISCRETE)
USING Microsoft_Decision_Trees
With the
preceding
statement, we are creating a new mining model named Analyze Shipments. The model utilizes Shipment ID as the case key. Days Since Last Shipped, and Quantity are each defined as predictable columns, but Days Since Last Shipped also functions as an input column. The remaining columns, Store ID, Vendor Name, and Product Type, are input columns only. Mining-model columns are defined as either input, predictable, or input and predictable.
The process of training a model involves the insertion of data into the mining model using the INSERT INTO syntax, as follows:
INSERT INTO [Analyze Shipments]
(SKIP,[Days Since Last Shipped], [Store Id], [Quantity],
[Vendor Name], [Product Type])
OPENROWSET('SQLOLEDB.1','Provider=SQLOLEDB.1;Integrated
Security=SSPI;Persist Security Info=False;Initial
Catalog=SavingsMart;Data Source=(local)',
'SELECT DISTINCT "dbo"."vw_Shipments"."ShipmentID"
AS "Shipment Id", "dbo"."vw_Shipments"."DaysSinceLastShipped"
AS "Days Since Last Shipped", "dbo"."vw_Shipments"."StoreID"
AS "Store Id", "dbo"."vw_Shipments"."Quantity" AS "Quantity",
"dbo"."vw_Shipments"."VendorName" AS "Vendor Name",
"dbo"."vw_Shipments"."ProductType" AS "Product Type"
FROM "dbo"."vw_Shipments"')
The mining model will not store the actual data, but will store the prediction results instead once the mining algorithm is processed. In the preceding statement, the OPENROWSET keyword was used to specify the location of the physical data source.
Interpreting the Results
To examine the results from processing the model, select the
Content
tab. Figure 5.6 is a screenshot of the content detail when analyzing DaysSinceLastShipped. This screen indicates that VendorName was the most significant factor
affecting
DaysSinceLastShipped. We know this because it is the first split on the tree. For nodes that have additional branches, two lines will follow the node. To view the additional branches, double-click that node and the detail page will drill down to the next level.
The
Content Navigator
box—seen in the top-right corner—offers an easy way to see all the mining-model results and drill down into a certain
path
. The
Attributes
box shows the totals associated with each node, grouped according to a clustering algorithm. In Figure 5.6, the cursor is selecting the outermost node labeled All. In this example, the attributes are shown for all the cases analyzed.
Note
If you did not attach the database file and instead loaded the data using the LoadSampleData program, you will encounter slightly different statistical results. The results presented in this section are specific to the database file available on the book's Web site.
If you attached your database using the file provided, your processing results should be the same as the ones we are about to interpret. The first thing to notice is that the darkest-shaded node is the one where the Vendor Name is Store Brand. Nodes that resulted in a higher data density, or more cases analyzed, will be shaded in a
darker
color
. This result is not surprising, because 127 of the 500 products available, or 25 percent, are represented by the Store Brand. This can be confirmed in
Query Analyzer
with the following query:
SELECT v.VendorName,(COUNT(ProductID)/500.0) AS 'Percent'
FROM Products p
LEFT JOIN Vendors v ON p.VendorID = v.VendorID
GROUP BY v.VendorName
ORDER BY 'Percent' DESC
If the Store Brand node is double-clicked, the detail pane will show the next branching of the tree (see Figure 5.7). For the Store Brand Node, the first branching distinguishes between the different stores. If we click on the node Store ID = 2 and look at the attributes, the value with the highest probability is 119.33. This indicates that for all products where the Vendor name is Store Brand and the Store ID is 2, it is highly probable that there should be 119 days between shipments.
If we examine the attributes for the remaining nodes, we will see that predictions can be made for all the stores. For Store 1, there is one additional branching that distinguishes between a product type of Snack Foods versus all other product types. When the Store ID is 1, vendor name is Store Brand, and product type is snack foods, there is a 58 percent probability that there will be 60 days between shipments. When we examine the attributes where product type is not
snack
foods
, there is a 43 percent probability that there will be 119 days between shipments and a 53 percent probability that there will be 85 days between shipments. In this case, we could say that the 53 percent probability wins the toss, but that might not always be the best decision. This will be discussed further in the next chapter.
If you use the drop-down box above the Content Detail pane named
Prediction Tree
to select the quantity column, you will see that the main factor affecting quantity is the days since last shipped (see Figure 5.8). This is possible because the column DaysSinceLastShipped was defined as an input and a predictable column.
The next factor affecting quantity is the vendor name. In the case where the vendor name is NOT Kamp, Store ID is an additional factor. In Figure 5.8 we can see that when the days since last shipped is less than or equal to 48 and the Store ID is NOT 1 and the vendor name is NOT Kamp, there is a 98 percent probability that the quantity should be 200. When the Store ID is equal to 1, the prediction
drops
to a 72 percent probability that the quantity will be 200.
The next chapter will involve interpreting the results from the mining model and then applying the predictions to a new shipment strategy. The goal of the new shipment strategy will be to reduce Savings Mart's operational costs by reducing the total number of shipments.