2.11 Data Preparation

The following are some basic steps and issues likely to be considered during the course of normalizing a data set prior to analysis; they are by no means all-inclusive. Adequate time needs to be provided for this process, as it is not uncommon for data preparation to take up to 80% of the effort during a data mining project:

Clean: Prior to any data analysis, the removal of extraneous characters from the data stream might be required, such as the removal of ($) from a string field that represents an actual monetary variable, which will be analyzed as a numeric value. Other characters warranting similar cleaning and removal include @, #, %, *, <, >, " as well as such strings as Mr., Mrs., unk, n/a, etc.
Convert: The data needs to be converted into a standardized format, (e.g., YYYYMMDD for dates) so that other temporally related fields can be compared. The objective is to have a single uniform standard of conversion for fields that represent currency, weight, distance, time, and other methods of measurement and comparison. Some algorithms, such as CART, can work only with numeric fields; this may require converting string fields, such as MALE to 1, FEMALE to 0, or LEGAL to 0, FRAUD to 1, etc.
Concatenate: Transactional data is usually organized and stored in discrete fields, such as lastname, firstname, address, city, state, and zip. These units can be combined into a single data object that represents an individual or suspect. In some instances, some of these fields can be excluded altogether. For example, in most cases, the price of a ticket purchased has no measurable impact on a profile or a model.

As part of the data preparation process, the data will need to be split into training and testing data sets when you are constructing and evaluating a predictive model. Most of today's data mining tools use a random number selection feature, which automates this process. The idea of course is to construct a model using a subset of the data to train it and then use an unseen section of the data to test it. Moving the data from original sources to the system used for preparing and converting the data for mining should be identified, and the whole process should be well documented.

As part of this process, ad hoc standard query language (SQL) extracts from relational databases of host systems may be performed, developing discrete tables for the actual analysis. In some instances the assistance of a database administrator (DBA) will be available or required. Most mature data mining tools support various data formats or allow for ODBC connection to various tables. As part of the data preparation process, an analyst or a team of analysts may first generate a summary count of every value of every field in a file. This may require simply loading the data into a spreadsheet such as Excel or a statistical software package such as SAS or SPSS. It is important to remember that numbers are not always numeric, such as zip or vehicle ID numbers. They represent category values, which at times can be clustered into segments of pseudo-categories, such as suspects versus nonsuspects, or low, moderate, or high alert.

Consideration of the structure of the data is a very important issue that may influence how the data mining process will evolve. A study (STATLOG) conducted by several universities in multiple countries during the 1990s with a diverse number of data sets from different industries was designed to compare the accuracy of various classification algorithms, such as regression, neural networks, machine-learning algorithms, etc. The four-year study concluded that the structure of the data was the most critical factor in determining the most accurate classification system. The STAGLOG findings can be summarized to these two observations: Use a symbolic classifier, such as CART or other machine-learning-algorithm—based tools if categorical fields outnumber the continuous value fields in a file, and use neural networks if continuous value fields outnumber categorical fields.