3.1. The Procedure

First we look at the calculations. We need to make the best possible prediction, measure its accuracy, and manage anomalies. Excel provides the tools, but before we start entering formulas we need to understand the data.

3.1.1. Data

This chapter uses data from a call center. It is a five-day-a-week operation, and the data is simply the date and the number of incoming calls for each day.

The techniques used can be applied to most workload situations. The goal is to predict a periodic workload that could have an overall trend, and is subject to short-term ups and downs. The accuracy of predictions will be measured and used to set a prediction range with a known probability.

A workload model needs to handle both expected and unexpected shifts in volume. Holidays are expected, but December volumes are not a good predictor for January. Real-world workloads are subject to all kinds of unpredictable outside forces. The competition can raise its prices, a server can go down, or a snowstorm can shut down part of the country. As a result, workloads can go up or down with no warning.

The sample data in this chapter comes from an actual call center and is subject to all the uncertainty of the business environment.

3.1.2. Predictions

Time creates uncertainty: the further into the future we predict, the less accurate we are. This chapter starts with a weekly prediction. As more information becomes available, the weekly forecast is refined into an adjusted daily forecast. An hourly forecast is also made, and the daily forecast is adjusted further as actual hourly values are entered.

3.1.2.1. Find the lag

Lag is the number of observations in a cycle. Many business systems run on a weekly cycle. Mondays look like Mondays; Fridays look like Fridays. If the process runs five days a week, the lag is five. Of course, not everything runs on a weekly cycle. Sometimes the cycle is monthly, hourly, or some more exotic period. In all cases it is critical to know how many observations there are in a cycle, and usually this is known without examining the data. But if you are not sure, it's best to check.

This example looks at thirty days of call counts for a call center. This is a Monday through Friday operation, so we expect it to run on a five-day cycle. To find the lag we correlate the daily call counts with themselves, offset by different numbers of days. The offset with the highest correlation is the lag.

To do this we use the CORREL function. It takes two arguments, both ranges containing the same number of values, and returns the correlation between the values in the two ranges. The formula in the correlation column is =CORREL(B$2:B$21,B3:B22), and it is filled down for ten cells. Notice that the correlation is high only at five and ten. This confirms that the numbers have a five-day cycle. Correlating a list of numbers against itself is called autocorrelation.

The procedure is illustrated in Figure 3-1.
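
If you want to check the cycle outside of Excel, the same autocorrelation test is easy to sketch in Python. This is a minimal illustration with made-up call counts standing in for column B; the loop plays the role of filling the CORREL formula down ten cells.

    # Autocorrelation: correlate the series with itself at increasing
    # offsets and look for the peak. (Stand-in data, not the book's.)
    from statistics import correlation   # Python 3.10+

    calls = [14100, 13500, 13900, 14800, 15200] * 6   # 30 days, five-day pattern
    window = 20                                       # like the fixed range B$2:B$21

    for offset in range(1, 11):
        r = correlation(calls[:window], calls[offset:offset + window])
        print(offset, round(r, 2))                    # r peaks at offsets 5 and 10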

3.1.2.2. Find the average

The easiest prediction is that each day will have the same volume as the same day in the previous week. But this leads to trouble if last week's call volume was unusual. So, it is better to use a recent average. But even the average can be skewed by a really odd day, and a filtered average tends to do the best job. In a filtered average the highest and lowest values are eliminated, and then the remaining numbers are averaged. This gives a good estimate of the true average value.

This is done in two steps. First, build a table that contains values for one weekday (e.g., a list of just Monday values). Then take the sum, subtract the maximum and minimum values, and divide by two less than the number of values.

We use five weeks of data. The first day is Monday 3/2/1998, and there are five Mondays in the list. In practice, the number of weeks giving the most accurate result varies. A balance is required between having enough data to get a good estimate and avoiding seasonal shifts. In most business situations eight weeks works well.

Figure 3-1. Using correlation to find the lag


The INDEX function creates a list of values for one weekday by looking up values in a range. The first argument is a range of cells containing a list of values. The second argument is a number that tells which item in the list is wanted. In this case, we need the row numbers of the Mondays. The first Monday is in row 2, so the first value in the Rows column, cell C2, is 2. The formula in C3 is =C2+5. We add five because that is the lag. This formula fills down to C6. In D2 the formula is =INDEX(B$1:B$26,C2), which returns the second item in the list B1:B26. This formula also fills down.

The formula for the filtered average is =INT((SUM(D2:D6)-(MAX(D2:D6)+MIN(D2:D6)))/3). Five items are being used, but the highest and lowest are eliminated, so we divide by three. The INT function returns the value as an integer. We are dealing with calls, so it makes sense to work with integers rather than real numbers. There is no such thing as half a call! In the application, the filtered average is the weekly prediction. Notice that this technique gives us a prediction that is one week in the future. We predicted March 8th on March 1st.

Figure 3-2 shows the calculations.

Figure 3-2. Calculating the filtered average
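
The same arithmetic can be sketched in Python for readers who want to test it outside a worksheet. The call counts below are invented; the slice plays the role of the INDEX lookups, and the last line mirrors the filtered average formula.

    # Filtered average: drop the highest and lowest values and average
    # the rest. (Invented data standing in for columns B and D.)
    calls = [14350, 13980, 14120, 14800, 15210,
             15600, 13900, 14050, 14750, 15120,
             14210, 14110, 13990, 14680, 15340,
             14460, 13870, 14200, 14810, 15090,
             14530, 14020, 14150, 14700, 15180]   # five weeks, Monday first

    lag = 5
    mondays = calls[0::lag]   # every lag-th value: just the Mondays

    # Sum, remove max and min, divide by two less than the count,
    # and truncate to an integer like Excel's INT.
    filtered_avg = int((sum(mondays) - max(mondays) - min(mondays)) / (len(mondays) - 2))
    print(filtered_avg)       # the weekly prediction for next Monday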


3.1.2.3. Adjust for the trend

Trend is the change in the average over time. We might expect sales to go up year over year, or complaints to go down. But when the focus is on short-term predictions, these trends often do not mean much. A five percent year-over-year growth rate is less than one tenth of a percent per week. Business systems are not predictable within a range of a tenth of a percent, so this kind of trend adds no value to the forecast. There are other factors at work, however. In any operation volumes go up and down from week to week. These short-term trends have a significant impact on the accuracy of the forecast.

Understanding the trend is essential to building an accurate forecasting model. We calculate the ratio of the predictions (the filtered averages) to the actuals. In Figure 3-3 the errors are not random. The predictions run low for a few days, then high for a few days. There is no real trend, just oscillating high and low periods.

Figure 3-3. Analyzing the errors


This means that accuracy can be increased by adjusting based on the previous day's error. If the errors showed significant motion in one direction, it would mean a long-term trend is present. This would require a different approach, such as using a week-over-week growth rate. If the errors have no pattern, it is best to forget the trend and simply use the filtered average as the prediction. In this case we will adjust the filtered average by one half of the error ratio for the previous day. This moves the prediction in the right direction most of the time without overreacting.

Next we calculate the daily adjusted prediction. This value takes the current trend into account and provides improved accuracy. In Figure 3-4 the formula for the adjusted prediction (in D17) is =C16*(1+(B16/C16))/2. This formula multiplies C16 by the factor (1+(B16/C16))/2, in which B16/C16 is the ratio of the actual to the predicted. But we only want to use half of the error, so the ratio is averaged with one; the resulting factor moves the prediction halfway toward the previous day's ratio. Note that overall the adjusted prediction is 10% more accurate than the filtered average.

The formula for the average error for the Adjusted column is {=AVERAGE(ABS(B3:B16-D3:D16))}. This is an array formula. It creates a vector (a list) of the absolute values of the differences between the actual and predicted values, and returns the average of these values. In effect this is the same as creating a new column with =ABS(B3-D3) filled down to row 16, and then taking the average. By using the array formula we get the same answer without adding 14 unnecessary formulas to the worksheet.
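
As a cross-check, here is a hedged Python sketch of the adjustment and the accuracy measurement. The actuals and filtered averages are invented, and the sketch assumes the adjustment applies half of the previous day's error ratio to the current day's filtered average.

    # Adjusted prediction: move the filtered average halfway toward
    # yesterday's actual-to-predicted ratio. (Invented data.)
    actuals  = [14480, 13900, 14720, 15010, 14390, 14620]   # column B stand-in
    filtered = [14300, 14050, 14500, 14800, 14600, 14450]   # column C stand-in

    adjusted = [filtered[0]]                  # no prior error on the first day
    for i in range(1, len(filtered)):
        ratio = actuals[i - 1] / filtered[i - 1]             # actual / predicted
        adjusted.append(int(filtered[i] * (1 + ratio) / 2))  # half the error ratio

    # Same job as {=AVERAGE(ABS(B3:B16-D3:D16))}, without a helper column.
    mae = sum(abs(a - d) for a, d in zip(actuals, adjusted)) / len(actuals)
    print(adjusted, round(mae, 1))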

Figure 3-4. Making a better prediction


3.1.3. Determine the Confidence Interval

The average number of calls per day is about 14,300. Therefore, an average error of 520 equates to around 3.5%. This means that the prediction is really a range.

On the Settings worksheet the user can enter a value for Confidence Level. The application will give the range for the adjusted prediction at that probability. If the confidence level is set at 0.9, the application will display the prediction and a +/- range. There is a 90% probability that the actual value will be in that range.

In Figure 3-4, the adjusted prediction for the next day is 14,630. There is a 90% probability that the actual value will be 14,630 +/- 305. The array formula for this calculation is {=CONFIDENCE(0.1,STDEV(D3:D16-B3:B16),15)}. The CONFIDENCE function returns the confidence interval. It takes three arguments. The first is the desired confidence level, entered as the amount of expected error: if you want a confidence level of 0.9, the entry is 1-0.9, or 0.1. Next is the standard deviation. In this case it is the standard deviation of the difference between the actual and the adjusted prediction; it is calculated as part of the formula, which is why this is an array formula. The last argument is the number of values being used. There are 15 values.
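
Excel's CONFIDENCE function computes z * s / sqrt(n), where z is the standard normal point that leaves alpha/2 in each tail. A short Python sketch, with invented errors standing in for D3:D16-B3:B16, shows the same calculation.

    # Confidence interval for the prediction, like
    # {=CONFIDENCE(0.1,STDEV(D3:D16-B3:B16),15)}. (Invented errors.)
    from statistics import NormalDist, stdev

    errors = [120, -340, 510, -95, 260, -410, 180, 75,
              -220, 330, -150, 90, -480, 410, 240]   # 15 stand-in errors

    alpha = 0.1                                 # 1 - 0.9 confidence level
    z = NormalDist().inv_cdf(1 - alpha / 2)     # about 1.645
    interval = z * stdev(errors) / len(errors) ** 0.5
    print(round(interval))                      # the +/- range around the prediction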

3.1.4. Manage Anomalies

Holidays are a problem, as are any large short-term shifts in volume. Not only are they hard to predict, but since the techniques used in this chapter depend on the past, unusual days in the past make accurate prediction difficult. Filtered averaging helps take care of the normal ups and downs. But there are events that are so large and abnormal that they need to be eliminated from the data. Every year, the last two weeks of December are apt to be like this.

When an actual value is entered, the application looks at the prediction error and calculates the probability that a normally distributed error would be smaller than the current one. A very high probability means the current error is out on the fringe of the distribution and is likely an anomaly. This assumes, of course, that the errors are normally distributed.

In Figure 3-5, the actual value for 06/17/98 has just been entered. The average error amount over the last 15 days was 520.98. The error amount for 06/17/98 is 538. To determine the probability that this value is an anomaly we need the standard deviation of the recent errors. This is calculated in cell D23 by the array formula {=STDEV(D3:D16-B3:B16)}. Here again, we use an array formula to get the answer without creating an additional column of calculations. The array formula in D24 is {=AVERAGE(D3:D16-B3:B16)}, which gives the average error. This is different from the value in D19, which is the average error amount and is based on absolute values.

Next we need to know how many standard deviations from the mean the current error is. The STANDARDIZE function gives this value. The value B17-C17 is the current error, D24 is the average calculated above, and D23 is the standard deviation from above. The formula in D25, =ABS(STANDARDIZE(B17-C17,D24,D23)), returns the current error's distance from the mean in standard deviations.

We need to know what percentage of the distribution is closer to the average than the current error. This will let us calculate the probability that the current error is too large. Using the value returned by the STANDARDIZE function, the formula in D26, which is =NORMSDIST(D25), gives the portion of the distribution that falls below the value in D25. In the figure, 62% of the population of errors falls below a point 0.32 standard deviations above the mean.

A forecast can be high or low; therefore, the distribution of errors has two tails. So the formula in D27, which is =(D26-0.5)*2, gives the final answer. We subtract 0.5 to keep only the portion of the distribution between the mean and the current error, and multiply by two because the error can fall on either side of the mean. In this case, only 24% of errors are expected to be smaller than 538. Today is normal. The calculations are shown in Figure 3-5.
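
The whole anomaly test fits in a few lines of Python. The recent errors and today's error below are invented stand-ins for the ranges in Figure 3-5; STANDARDIZE becomes simple arithmetic and NORMSDIST becomes the normal CDF.

    # Probability that a normally distributed error would be smaller
    # than today's error. (Invented data, not the figure's values.)
    from statistics import NormalDist, mean, stdev

    recent = [120, -340, 510, -95, 260, -410, 180, 75,
              -220, 330, -150, 90, -480, 410]    # D3:D16-B3:B16 stand-in
    todays_error = 538                           # B17-C17

    z = abs((todays_error - mean(recent)) / stdev(recent))  # ABS(STANDARDIZE(...))
    smaller = (NormalDist().cdf(z) - 0.5) * 2    # like =(NORMSDIST(D25)-0.5)*2
    print(round(z, 2), round(smaller, 2))        # if smaller is near 1, flag the day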

If the probability of an anomaly is too high, the value for that day cannot be used for predictions. In the upcoming application, on the Settings worksheet, the user can set a value for Anomaly Detection. If the probability of an anomaly is higher than this setting, the application substitutes the value for the same day in the previous week in all calculations.
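
In code form, the substitution rule might look like the following sketch; the function name, arguments, and threshold are hypothetical, not taken from the application.

    # If the anomaly probability exceeds the Anomaly Detection setting,
    # fall back to the same weekday in the previous week.
    def usable_value(history, todays_actual, anomaly_prob, setting, lag=5):
        """Pick the value the prediction tables should use for today."""
        if anomaly_prob > setting:
            return history[-lag]     # same weekday, one week (one lag) back
        return todays_actual

    # e.g. usable_value(daily_calls, 21500, 0.98, 0.95) returns the
    # count from five business days earlier instead of 21500.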

Figure 3-5. Identifying anomalies



