2.2 Data Testing
It is highly recommended that the initial analysis start with a subset of the available data rather than the entire population. Using this subset, an initial model can be constructed and tested on a specific segment to evaluate its functionality and accuracy. Tests can be conducted using a particular region, date segment, district or office, dollar range, and the like. One of the first decisions, therefore, will be which segments and samples of the entire data set to use for the initial analysis and testing.
For example, to detect and profile vehicles likely to be used for smuggling contraband or weapons, an initial analysis can be started with the data from a single point-of-entry or limited to trucks only. Once the initial model has been developed and tested on this specific segment, then the project can be expanded so that multiple models, if necessary, can be developed to cover jurisdictions across an entire department or agency or, as in this case, to cover all of the points-of-entry along a border for all types of vehicles.
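The segment-then-sample approach described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the record layout, the field names (`entry_point`, `vehicle_type`), the entry-point labels, and the 50 percent sampling fraction are all hypothetical and do not come from any real border-crossing system.

```python
import random

# Hypothetical crossing records: 90 rows spread evenly over three
# made-up points-of-entry and three vehicle types.
ENTRY_POINTS = ["POE-A", "POE-B", "POE-C"]
VEHICLE_TYPES = ["truck", "car", "bus"]
crossings = [
    {
        "id": i,
        "entry_point": ENTRY_POINTS[i % 3],
        "vehicle_type": VEHICLE_TYPES[(i // 3) % 3],
    }
    for i in range(90)
]


def select_segment(records, entry_point, vehicle_type):
    """Keep only the records for one point-of-entry and one vehicle type."""
    return [
        r for r in records
        if r["entry_point"] == entry_point and r["vehicle_type"] == vehicle_type
    ]


def draw_sample(records, fraction, seed=0):
    """Draw a reproducible random sample (without replacement) of a segment."""
    if not records:
        return []
    rng = random.Random(seed)  # fixed seed so the test set can be rebuilt
    size = max(1, round(len(records) * fraction))
    return rng.sample(records, size)


# Start small: trucks at a single point-of-entry, then half of that segment.
segment = select_segment(crossings, "POE-A", "truck")
sample = draw_sample(segment, fraction=0.5)
```

Once a model built on `sample` behaves acceptably, the same two functions can be pointed at other entry points or vehicle types to widen the project step by step.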
In data mining, it is important to start with a clear objective. This will guide the project and lead to the selection of the data that will be accessed and used. To a very large extent, the success of any data mining project depends on the quality of the data. Once the data has been accessed or received, the next challenges are integrating it and preparing it for mining and modeling, so that a composite of individuals and companies can be assembled and analyzed for investigative applications. Data sources include commercial, financial, medical, demographic, utility, telecom, real estate, vehicle, licensing, credit, criminal, Internet, and retailing records, among others, and there are tools for preparing and integrating them. Unfortunately, data is usually housed in databases built for applications other than data mining; it is commonly stored for processing, billing, tracking, and reporting. Seldom is the data created with the intent of modeling and analysis. There are many sources of information on individuals and companies, and the data is likely to arrive in many formats.
2.3 The Data Warehouse
The concept of data warehousing—that is, assembling a cohesive view of customers from multiple internal databases coupled with external demographic data sources—has been an accepted practice for several years by large companies, especially retailers. The idea of the data warehouse is to have a multidimensional picture of customers, mixing information about their spending habits with insightful lifestyle demographics. While the concept of this type of consumer data warehouse is not directly applicable to law enforcement and counter-intelligence, its data architecture does have merits: the assembling of information about individuals from disparate databases into a composite to gain a comprehensive view of their identities and behaviors.
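As a rough illustration of assembling a composite from disparate databases, the sketch below merges records from several sources on a shared identifier. It assumes, purely for the example, that every source already carries the same `person_id` key; real sources rarely do, and matching identities across them is a separate problem not shown here. The source names and fields are hypothetical.

```python
from collections import defaultdict


def build_composites(sources, key="person_id"):
    """Merge records from several sources into one composite per entity.

    `sources` maps a source name to a list of record dicts. Each field is
    stored as "<source>.<field>" so its origin stays visible. If a source
    holds several records for one entity, later records overwrite earlier
    ones in this simplified sketch.
    """
    composites = defaultdict(dict)
    for source_name, records in sources.items():
        for record in records:
            entity = composites[record[key]]
            for field, value in record.items():
                if field != key:
                    entity[f"{source_name}.{field}"] = value
    return dict(composites)


# Hypothetical inputs from three unrelated databases.
sources = {
    "dmv": [
        {"person_id": "P1", "vehicle": "pickup"},
        {"person_id": "P2", "vehicle": "sedan"},
    ],
    "utility": [{"person_id": "P1", "monthly_kwh": 0.0}],
    "credit": [{"person_id": "P2", "accounts": 3}],
}

composites = build_composites(sources)
```

Prefixing each field with its source name is one possible design choice; it keeps provenance visible when the composite is later reviewed or fed into a model.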
The most common analyses applied to data warehouses in the private sector are online analytical processing (OLAP) and data mining. OLAP tools are used to extract data cubes, which are reports segmenting customer or sales information by area, for example by zip code, city, state, and region. They are a fairly straightforward, analysis-driven type of reporting. While OLAP reports are valuable in summarizing customer activity, data mining is more valuable because it often identifies the hidden patterns of customer behavior.
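The kind of segmentation an OLAP report performs can be approximated with a simple roll-up. The sketch below is not an OLAP engine; it only shows the idea of totaling a measure at different levels of a geographic hierarchy. The sales rows, field names, and amounts are invented for illustration.

```python
from collections import defaultdict

# Invented sales rows; each carries the full geographic hierarchy.
sales = [
    {"region": "West", "state": "CA", "city": "San Diego", "zip": "92101", "amount": 120.0},
    {"region": "West", "state": "CA", "city": "San Diego", "zip": "92101", "amount": 80.0},
    {"region": "West", "state": "CA", "city": "Fresno", "zip": "93650", "amount": 50.0},
    {"region": "West", "state": "AZ", "city": "Tucson", "zip": "85701", "amount": 200.0},
    {"region": "South", "state": "TX", "city": "El Paso", "zip": "79901", "amount": 75.0},
]


def roll_up(records, dimensions, measure="amount"):
    """Total `measure` for every combination of the given dimensions."""
    totals = defaultdict(float)
    for record in records:
        key = tuple(record[d] for d in dimensions)
        totals[key] += record[measure]
    return dict(totals)


# The same data viewed at three levels of detail.
by_region = roll_up(sales, ["region"])
by_state = roll_up(sales, ["region", "state"])
by_zip = roll_up(sales, ["region", "state", "city", "zip"])
```

Each call answers a question the analyst already knows to ask, which is why such reports are analysis-driven; a roll-up like this summarizes activity but does not by itself surface unexpected patterns.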
The ability of companies to use these types of analyses on their data warehouses has led to the practice of customer-relationship management (CRM). In CRM, firms integrate all point-of-contact customer data, including Web site forms, e-mail, dealership sales data, phone call site data, and transactional data, in order to provide better service and retain their customers. While the concept of CRM does not apply directly to law enforcement either, the lessons about integrating data from multiple sources in order to assemble a picture of an individual are applicable, because, again, a cohesive view of perpetrators and suspects can be obtained.
September 11 demonstrated the need to share and access multiple data sets containing critical strategic information, as well as to be more effective in the use of data mining techniques normally used for profiling individuals in marketing, call centers, insurance, telecommunications, utilities, retailing, and e-commerce. The same type of CRM analysis, which uses data warehousing and analytic techniques, can be applied to counter-intelligence and criminal detection applications. This is not to suggest the use of the simplistic type of racial profiling that has been used in the past, but a more effective methodology of using data mining as a modeling tool for sorting through vast databases to identify perpetrators based on behavioral patterns and socioeconomic, Internet, consumer, credit, criminal, lifestyle, and other commercial and government data sources.
As was mentioned in the preceding chapter, individuals cannot exist without leaving a trail of digital data in commercial and government databases and online and offline information. Appendix A includes a partial listing of several hundred Web sites that provide links to some of these files. However, the sites listed in Appendix A are just a start; there are many more potential data sources for enhancing the value of an investigative data mining analysis. Users of data mining tools and techniques from industries in financial services, retailing, marketing, and the like have long employed the concept of overlaying information about their customers and prospects with external lifestyle, socioeconomic, and demographic data.
For example, an e-commerce site can mine not only the clickstream data of its most loyal and profitable online customers, but it may also look at their zip-code and geo-code demographics in an attempt to obtain a profile of them. It can also look at the geographic location of their Internet Protocol (IP) address. Using a similar method, perpetrators may be profiled via data appends from diverse and unrelated databases. Unexpected results may occur when this is done; for example, the German authorities used utility-power usage records to identify potential dormant terrorists: foreign students who rented (safe) houses and used no electricity.
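A data append of the kind described above can be sketched as a lookup that attaches a field from one database to records in another, followed by a simple filter. The example below is a toy illustration only and is not a reconstruction of the method the German authorities used; the addresses, field names, and the 1 kWh threshold are assumptions made for the sketch.

```python
# Hypothetical rental records and a separate utility database keyed by address.
rentals = [
    {"address": "12 Elm St", "tenant_id": "T-001"},
    {"address": "34 Oak Ave", "tenant_id": "T-002"},
    {"address": "56 Pine Rd", "tenant_id": "T-003"},
]
usage_by_address = {
    "12 Elm St": 0.0,
    "34 Oak Ave": 310.5,
    # "56 Pine Rd" has no utility record at all.
}


def append_usage(rental_records, usage_lookup):
    """Append monthly electricity usage to each rental record.

    Records with no match get None, so a missing record is never
    confused with a true reading of zero.
    """
    return [
        {**record, "monthly_kwh": usage_lookup.get(record["address"])}
        for record in rental_records
    ]


def flag_low_usage(records, threshold_kwh=1.0):
    """Return records whose known usage is at or below the threshold."""
    return [
        r for r in records
        if r["monthly_kwh"] is not None and r["monthly_kwh"] <= threshold_kwh
    ]


appended = append_usage(rentals, usage_by_address)
flagged = flag_low_usage(appended)
```

Keeping unmatched records as `None` rather than zero matters here: an address absent from the utility database is a data-quality gap, not evidence of an occupied house using no power.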