Section 4.5. Building the Model


4.5. Building the Model

We need to limit the number of metrics. If we use too many, the model will over-train. If this happens, the model will be overly influenced by unusual or isolated data. There might be one or two dogs that win with a very high payout. Two out of seven hundred doesn't mean much. But regression is not a magic formula, and it is not guaranteed to find the relationships in the data. It is just a mathematical technique that draws a bunch of lines based on the best fit to the data. With too much flexibility it will, in effect, memorize the data rather than learn how to solve the problem.

To guard against this we test the results. The data is separated into two groups. One group is used to build the model and the other is used for testing. If we get good results when we build the model but worse results when we test, the model is over-trained and useless.

We have 710 data items. We will use 449 to build the model and the remaining 261 will be reserved for testing. We have already used WinCnt to limit the data. We now build a worksheet with just the columns needed for the model. We know that Dog (running position), SpeedRank, and WinRank are the metrics to use. But there is a problem with Dog. In Figure 4-16 we see that Dog produces two distinct profitable regions. This seems to mean that the middle positions are less profitable than one, two, seven, and eight. Since we know that this situation exists, we should make a change in the Dog metric. Figure 4-19 shows the resulting sheet.

Figure 4-19. Model data


We insert a new column named Dog-4 containing the absolute value of the difference between the dog's running position and 4.

We need to be sure the rows are assigned to the model and test groups randomly. Therefore, we add a new column called Random and fill it with random numbers. Next we sort the data on Random. This will ensure that each row has the same chance of being assigned to the model group or the test group. After sorting on the Random column, it is deleted.

We start the model like we did the stock example in Figure 4-2. The resulting sheet is shown in Figure 4-20.

Figure 4-20. Regression setup


The calculations in columns F-I could be handled in a single column using the SUMPRODUCT function, or even reduced to a single cell using an array formula. But, keeping the calculations separate makes this process easier to understand. We have 710 rows of data but in the Input Ranges we only use rows 2450. The Y Range is PlacePay in column E. The regression output is shown in Figure 4-21.

Figure 4-21. Regression output


In Item 1, the overall performance of the model is low. In general, the metrics are not great at predicting the payout. But R Square measures the model's performance across all 449 rows. We are interested in setting a threshold that divides the dogs into two groups, and only one group has to be profitable. The model can do this without a complete understanding of the relationships. A high value for R Square would be better, but this may be good enough.

The P-values in Item 2 for Dog-4 are encouraging.

Item 3 gives the weights and intercept. We copy and paste them into the model resulting in Figure 4-22. The values in column I are the scores. A high value in column I means our model predicts that the dog is more likely to be a profitable bet.

Is it? We find out by testing. We need to know if there is a score above which we can bet profitably. We used 449 rows to set the weights, so there are 449 scores to consider. Perhaps the top half is profitable.

To make testing easy, we start by creating logic to analyze the model's performance.



Analyzing Business Data with Excel
Analyzing Business Data with Excel
ISBN: 0596100736
EAN: 2147483647
Year: 2006
Pages: 101
Authors: Gerald Knight

flylib.com © 2008-2017.
If you may any questions please contact us: flylib@qtcs.net