Benchmark estimation models, such as those considered here, may also be called frontier regression models. Their general application in data mining is discussed in Troutt, Hu, Shanker, and Acar (2001). They are formed to explain boundary, frontier, or optimal behavior rather than average behavior, as ordinary regression models do. Such a model may also be called a ceiling model if it lies above all the observations, or a floor model in the opposite case. The cost-estimation model of this chapter is a floor model, since it predicts the best (i.e., lowest-cost) units.
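The floor-model idea can be illustrated with a minimal sketch. It is not the estimation approach of this chapter. It uses corrected ordinary least squares, a simpler, swapped-in technique: fit an ordinary least-squares line, then shift the intercept down until the line lies on or below every observation. The function name `fit_floor_line` and the output and cost figures are hypothetical, chosen only for illustration.

```python
def fit_floor_line(x, y):
    """Fit a least-squares line, then shift it down so that it lies
    on or below every observation (a simple floor model)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # The most negative residual belongs to the best (lowest-cost) unit;
    # shifting by it makes the line touch that unit from below.
    min_resid = min(yi - (intercept + slope * xi) for xi, yi in zip(x, y))
    return intercept + min_resid, slope


# Hypothetical data: output volume and total cost for five units.
output = [10, 20, 30, 40, 50]
cost = [28, 45, 70, 82, 110]

floor_intercept, slope = fit_floor_line(output, cost)

# Excess cost of each unit over the floor; the benchmark unit has zero excess.
excess = [c - (floor_intercept + slope * q) for q, c in zip(output, cost)]
for q, c, e in zip(output, cost, excess):
    print(f"output={q:3d}  cost={c:4d}  excess over floor={e:6.2f}")
```

A ceiling model would follow the same pattern, with the line shifted up by the largest residual instead. In either case the fitted line describes the boundary of the data rather than its center.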
The model considered here is a cross-sectional one. Although data mining is ordinarily thought of as mining data from within a single organization, benchmarking-type studies must often compare data across organizations. Benchmarking partnerships have been formed for this purpose, as discussed in Troutt, Gribbin, Shanker, and Zhang (2000). Such benchmarking-oriented data mining might be extended in a number of directions. Potential applications include comparisons of quality and other costs, processing and set-up times, and employee turnover. More generally, benchmarking comparisons could extend to virtually any measure of common interest across firms or other entities, such as universities, states, and municipalities. In the example in this chapter, a simple cost model was used to explain best-practice performance. More generally, best-practice performance may depend on other explanatory variables or categories of interest to the firm. Discovery of such models, variables, and categories might be regarded as the essence of data mining. What distinguishes the techniques discussed here is that they predict frontier rather than average performance. For example, interest often centers on the best instances, such as the customers most responsive to mailings or the safest drivers.
However, cross-sectional applications of benchmark performance models do not necessarily depend on a multiple-firm setting. Mining across all of a firm's customers can also be of interest. Consider a mail solicitation planned by a sales firm. For mailings of a given type, it is desirable to predict the set of most responsive customers so that this set can be targeted. Similarly, a charitable organization may be interested in discovering how to characterize its best or worst supporters according to a model.
As noted above under the topic of time-series data, such frontier models can also be used within a single organization, where the benchmarking is done across time periods. Models of this type might be used to mine for explanatory variables or conditions that account for the best or worst performance periods. Some of the same subjects noted above for cross-sectional studies may be worthwhile targets. For example, it would be of interest for quality assurance to determine the correlates of the best and worst defect production rates.
Ordinary regression is one of the most important tools for data mining. Frontier models, such as those considered here, may be desirable alternatives in data-mining applications. This is especially so when the aim is to characterize and model the best and/or worst cases in the data. Such data are typically of the managed kind. In general, managed data, or data arising from purposeful or goal-directed behavior, will be amenable to frontier modeling.