The initial analysis should start with a subset of the available data rather than the entire population. Using this subset, an initial model can be constructed and tested on a specific segment to evaluate its functionality and accuracy. Tests can be conducted using a particular region, date segment, district or office, dollar range, and the like. One of the first decisions, therefore, is which segments and samples of the entire data set will be used for the initial analysis and testing.
For example, to detect and profile vehicles likely to be used for smuggling contraband or weapons, the initial analysis can start with data from a single point of entry or be limited to trucks only. Once the initial model has been developed and tested on this specific segment, the project can be expanded so that multiple models, if necessary, can be developed to cover jurisdictions across an entire department or agency or, as in this case, all of the points of entry along a border for all types of vehicles.
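As a minimal sketch of this kind of segment selection, the Python below filters a small set of made-up crossing records down to trucks at one point of entry and then draws a reproducible random sample from that segment. The field names (`entry_point`, `vehicle_type`, `declared_value`), the record values, and the helper functions `select_segment` and `draw_sample` are all hypothetical, chosen only for illustration; a real project would read from its own database and might prefer a different sampling scheme.

```python
import random

# Hypothetical crossing records; field names and values are illustrative only.
crossings = [
    {"entry_point": "POE-01", "vehicle_type": "truck", "declared_value": 12000},
    {"entry_point": "POE-01", "vehicle_type": "truck", "declared_value": 800},
    {"entry_point": "POE-01", "vehicle_type": "car", "declared_value": 0},
    {"entry_point": "POE-02", "vehicle_type": "truck", "declared_value": 4500},
    {"entry_point": "POE-01", "vehicle_type": "truck", "declared_value": 23000},
    {"entry_point": "POE-01", "vehicle_type": "truck", "declared_value": 150},
]


def select_segment(records, **criteria):
    """Return the records whose fields equal every given criterion."""
    return [
        r for r in records
        if all(r.get(field) == value for field, value in criteria.items())
    ]


def draw_sample(records, fraction, seed=0):
    """Draw a reproducible simple random sample of about `fraction` of the records."""
    if not records:
        return []
    rng = random.Random(seed)  # fixed seed so the pilot sample can be re-created
    k = max(1, round(len(records) * fraction))
    return rng.sample(records, k)


# Pilot segment: trucks at a single point of entry.
pilot = select_segment(crossings, entry_point="POE-01", vehicle_type="truck")

# Sample half of the pilot segment for initial model building.
sample = draw_sample(pilot, fraction=0.5)
```

Fixing the random seed is a design choice that lets the same pilot sample be rebuilt later, which helps when the model is retested before the project is widened to other segments.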
In data mining, it is important to start with a clear objective. This objective will guide the project and determine which data will be accessed and used. To a very large extent, the success of any data mining project depends on the quality of the data. Once the data has been accessed or received, the next challenges are integrating it and preparing it for mining and modeling, so that composites of individuals and companies can be configured and analyzed for investigative applications. Data sources include commercial, financial, medical, demographic, utility, telecom, real estate, vehicle, licensing, credit, criminal, Internet, and retailing records, among others, and there are tools for preparing and integrating them. Unfortunately, data is usually housed in databases built for applications other than data mining; it is commonly stored for processing, billing, tracking, and reporting. Seldom is the data created with the intent of modeling and analysis. There are many sources of information on individuals and companies, and the data is likely to arrive in many formats.
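The integration step can be illustrated with a small sketch that combines two differently structured extracts into one composite record per individual. It assumes, optimistically, that both sources share a common identifier; in practice matching records across systems is usually harder than this. The source layouts, the field names (`id`, `owner_id`, `plate`, `license_class`), and the function `build_composites` are hypothetical.

```python
from collections import defaultdict

# Hypothetical extracts from two source systems with different layouts.
licensing = [
    {"id": "P-100", "name": "A. Example", "license_class": "C"},
]
vehicles = [
    {"owner_id": "P-100", "plate": "ABC123"},
    {"owner_id": "P-100", "plate": "XYZ789"},
    {"owner_id": "P-200", "plate": "JKL456"},
]


def build_composites(licensing_rows, vehicle_rows):
    """Merge both sources into one profile per person, keyed by a shared ID."""
    composites = defaultdict(
        lambda: {"name": None, "license_class": None, "plates": []}
    )
    for row in licensing_rows:
        profile = composites[row["id"]]
        profile["name"] = row["name"]
        profile["license_class"] = row["license_class"]
    for row in vehicle_rows:
        # The vehicle system calls the same identifier `owner_id`.
        composites[row["owner_id"]]["plates"].append(row["plate"])
    return dict(composites)


composites = build_composites(licensing, vehicles)
```

Profiles that appear in only one source keep empty fields (here, `P-200` has a plate but no licensing data), which makes gaps in coverage visible before any modeling begins.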