9.9 A Data Mining Methodology for Detecting Crimes

The following methodology is proposed for detecting and deterring the various types of digital and entity crimes, not only in the industries covered in this chapter but also in other market sectors. First, to be cost-effective and accurate, the methodology should adhere to the CRoss-Industry Standard Process for Data Mining (CRISP-DM). CRISP-DM is a nonproprietary, documented, and freely available data mining process model.

CRISP-DM encourages best practices and offers a set structure for obtaining better and faster results from data mining. The methodology was developed in the late 1990s by a consortium of companies and organizations with the aim of standardizing the data mining process across multiple and diverse objectives, including crime detection.

CRISP-DM divides the life cycle of a data mining project into six major phases. The sequence of the phases is not strict; moving back and forth between phases is always required. The outcome of each phase determines which phase, or which particular task within a phase, needs to be performed next. The arrows in the CRISP-DM diagram in Figure 9.1 indicate the most important and frequent dependencies between phases. The outer circle in the diagram symbolizes the cyclic nature of the data mining process. The central concept of CRISP-DM is that data mining is a process that continues well after a solution has been deployed. The lessons learned during this process can trigger new and often more focused investigative questions and queries, leading to subsequent data mining projects that benefit from the experience of previous ones.

Figure 9.1: The CRISP-DM process.

The objective of CRISP-DM is to establish a standard data mining process that is applicable in diverse industries, including criminal detection, and that makes data mining projects faster, more efficient, more reliable, more manageable, and less costly. The CRISP-DM model defines the following six phases.

One: Understand the Investigation's Objective

Understand the insight or outcome sought by the agency, department, or business. During the course of the investigative data mining project, the analysts need to understand the type of crime they are trying to detect, the costs of positive and negative predictions, and the enforcement and preemptive actions that can be taken in each case. This initial phase focuses on understanding the project objectives and requirements from a business or law enforcement perspective, and then converting this knowledge into a data mining problem definition and a preliminary plan designed to achieve these objectives. The overall costs and benefits of the entire data mining project should be estimated and quantified.
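
As a rough illustration of how the costs of positive and negative predictions might be quantified, the following Python sketch estimates the net benefit of acting on a model's alerts. The counts, the average loss per crime, and the cost per investigation are all hypothetical figures chosen for the example, and the function names are illustrative rather than part of CRISP-DM.

```python
from dataclasses import dataclass


@dataclass
class ConfusionCounts:
    """Outcome counts for a detector over some review period (hypothetical)."""
    true_pos: int   # crimes correctly flagged
    false_pos: int  # legitimate cases flagged in error
    false_neg: int  # crimes the model missed
    true_neg: int   # legitimate cases correctly passed


def net_benefit(counts, avg_loss_per_crime, cost_per_investigation):
    """Losses prevented by true positives minus the cost of every alert."""
    # Assumption: each alert, right or wrong, triggers one investigation.
    investigation_cost = (counts.true_pos + counts.false_pos) * cost_per_investigation
    prevented = counts.true_pos * avg_loss_per_crime
    return prevented - investigation_cost


def missed_loss(counts, avg_loss_per_crime):
    """Loss that still occurs because of false negatives."""
    return counts.false_neg * avg_loss_per_crime


# Hypothetical figures for illustration only.
counts = ConfusionCounts(true_pos=40, false_pos=160, false_neg=10, true_neg=9790)
print(net_benefit(counts, avg_loss_per_crime=2000.0, cost_per_investigation=50.0))
print(missed_loss(counts, avg_loss_per_crime=2000.0))
```

Comparing these two numbers for candidate models is one simple way to turn the investigation's objective into a measurable data mining target.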

Two: Understand the Data

Understand the crime that needs to be detected. This will ensure that the appropriate data is collected and used to achieve a strong prediction. The data-understanding phase starts with an initial data collection and proceeds with the exploratory activities needed to become familiar with the data, to identify data-quality problems, to gain initial insight into the data or detect interesting subsets, and to form hypotheses about hidden information. During this phase, consideration is given to the quality of the data and how it will affect the results obtained. Consideration is also given to how the data will be accessed and to confidentiality and privacy issues. At this juncture, it may be warranted to consider appending additional information, such as demographics or data from another governmental agency or a commercial database.
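
One simple exploratory step for identifying data-quality problems is a per-field profile of missing rates and distinct values. The sketch below uses only the Python standard library; the field names, the sample records, and the set of tokens treated as missing are assumptions made for illustration.

```python
def profile(records, missing_tokens=(None, "", "NA")):
    """Return the missing rate and distinct-value count for each field."""
    # Collect every field name that appears in any record.
    fields = sorted({key for rec in records for key in rec})
    report = {}
    for field in fields:
        # A field absent from a record counts as missing (rec.get returns None).
        values = [rec.get(field) for rec in records]
        present = [v for v in values if v not in missing_tokens]
        report[field] = {
            "missing_rate": 1 - len(present) / len(values),
            "distinct": len(set(present)),
        }
    return report


# Hypothetical claim records for illustration only.
records = [
    {"claim_id": 1, "amount": 1200.0, "zip": "94110"},
    {"claim_id": 2, "amount": None, "zip": "94110"},
    {"claim_id": 3, "amount": 560.0, "zip": ""},
    {"claim_id": 4, "amount": 1200.0},
]

for field, stats in profile(records).items():
    print(field, stats)
```

Fields with a high missing rate or an unexpectedly low number of distinct values are natural candidates for closer inspection before modeling.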

Three: Data Preparation

The data-preparation phase covers all activities needed to construct the final dataset (the data that will be fed into the data mining tools) from the initial raw transactional data. Data-preparation tasks are likely to be performed multiple times and not in any prescribed order. Tasks include table, record, and attribute selection, as well as transformation and cleaning of data for the modeling tools. In this phase, data-quality issues must be addressed, and a determination must be made on how much data will be needed and in what format. For example, decisions need to be made on how to handle missing values, which in fraud detection takes on an especially important dimension. A process and a plan for obtaining and preparing the data as efficiently as possible must be drawn up at this stage, with consideration given to other applications accessing the same data and to server and network restrictions.
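
The sketch below shows one possible way to handle missing values, under the assumption that the absence of a value may itself be informative in fraud detection. It fills gaps with the median of the observed values and records a separate indicator so that the fact of missingness is not lost. The helper name `impute_with_flag` and the sample rows are hypothetical.

```python
from statistics import median


def impute_with_flag(rows, field):
    """Fill missing values of `field` with the median and add a 0/1 flag.

    Returns new dictionaries and leaves the input rows unchanged.
    Raises statistics.StatisticsError if no value is observed at all.
    """
    observed = [r[field] for r in rows if r.get(field) is not None]
    fill = median(observed)
    prepared = []
    for r in rows:
        missing = r.get(field) is None
        new_row = dict(r)
        new_row[field] = fill if missing else r[field]
        # Keep the fact that the value was missing as its own attribute.
        new_row[field + "_missing"] = int(missing)
        prepared.append(new_row)
    return prepared


# Hypothetical transaction rows for illustration only.
rows = [{"amount": 100.0}, {"amount": None}, {"amount": 300.0}, {"amount": 500.0}]
for row in impute_with_flag(rows, "amount"):
    print(row)
```

Whether median imputation is appropriate depends on the data and the modeling tool; the point of the sketch is only that the missing-value decision should be explicit and preserved in the prepared dataset.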

Four: Modeling

In this phase, various modeling techniques are selected and applied, and their parameters are calibrated to optimal values. Typically, there are several techniques for the same data mining problem type. Some techniques have specific requirements regarding the structure of the data; therefore, stepping back to the data-preparation phase is often needed. For example, some algorithms, such as neural networks, can work only with numeric inputs, so categorical fields must first be encoded. During this phase, multiple models should be constructed so that their error rates can be compared. For crime detection it is essential that a number of techniques and models be employed and used in cooperation, via an ensemble. This is fundamental to a viable methodology and to success in achieving a cost-effective solution.
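
To make the idea of an ensemble concrete, the following sketch combines several detectors by majority vote. The three rule-based detectors are hypothetical stand-ins for trained models, and their thresholds and field names are invented for the example; majority voting is only one of several ways an ensemble can be combined.

```python
from typing import Callable, Dict, List

# A detector takes one transaction and returns True if it looks suspicious.
Detector = Callable[[Dict[str, float]], bool]


def high_amount(tx: Dict[str, float]) -> bool:
    return tx["amount"] > 5000           # hypothetical threshold


def odd_hour(tx: Dict[str, float]) -> bool:
    return tx["hour"] < 5                # hypothetical: activity before 5 a.m.


def new_account(tx: Dict[str, float]) -> bool:
    return tx["account_age_days"] < 30   # hypothetical threshold


def majority_vote(detectors: List[Detector], tx: Dict[str, float]) -> bool:
    """Flag the transaction when more than half of the detectors agree."""
    votes = sum(1 for detect in detectors if detect(tx))
    return votes * 2 > len(detectors)


detectors = [high_amount, odd_hour, new_account]
suspicious = {"amount": 9000, "hour": 3, "account_age_days": 400}
routine = {"amount": 9000, "hour": 14, "account_age_days": 400}

print(majority_vote(detectors, suspicious))
print(majority_vote(detectors, routine))
```

In practice each detector would be a model built with a different technique, so that the weaknesses of one are offset by the others.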

Five: Evaluation

At this stage in the project, the models have been constructed and appear to have a high degree of quality from a data-analysis perspective. However, before proceeding to final deployment of the models, it is important to evaluate them thoroughly and review the steps executed to construct them to be certain they properly achieve the detection objectives. A key objective is to determine if there is some important issue that has not been sufficiently considered, such as a high number of false positives, which can impact the total cost of deployment. At the end of this phase, a decision on the use of the data mining results should be reached. The entire process is iterative, and this evaluation phase should ensure and validate the results before final deployment.
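
A minimal sketch of such an evaluation, assuming labeled outcomes are available for a held-out set of cases, computes precision, recall, and the false-positive rate from actual and predicted labels. The label vectors below are hypothetical.

```python
def evaluate(actual, predicted):
    """Compute precision, recall, and false-positive rate for binary labels."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a and p)          # crimes flagged
    fp = sum(1 for a, p in pairs if not a and p)      # false alarms
    fn = sum(1 for a, p in pairs if a and not p)      # crimes missed
    tn = sum(1 for a, p in pairs if not a and not p)  # correctly passed
    return {
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
        "false_positive_rate": fp / (fp + tn) if fp + tn else 0.0,
    }


# Hypothetical labels: 1 = crime, 0 = legitimate.
actual = [1, 0, 0, 1, 0, 0, 0, 1]
predicted = [1, 1, 0, 0, 0, 1, 0, 1]

print(evaluate(actual, predicted))
```

A low precision means that many alerts are false positives, each of which carries an investigation cost, so these figures feed directly into the deployment decision.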

Six: Deployment

Deployment is often the most neglected stage of a data mining project. For crime detection, this phase requires continuous learning, automated monitoring and evaluation, and prompt refreshing of the models to capture the ever-changing characteristics of criminal evasion. Creation of the model is generally not the end of the project. Even if the purpose of the model is to increase knowledge of the data, the knowledge gained will need to be organized and presented in a way that can be used in a production environment. Code or rules may need to be exported into a production system, such as a call site, intranet, network, servers, or Web site. Depending on the requirements, the deployment phase can be as simple as generating a report or as complex as implementing a repeatable data mining process with streams of code generated on a regular basis.
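
As one hedged illustration of automated monitoring, the sketch below tracks the share of recent cases that a deployed model flags and signals when that share drifts away from the rate seen at evaluation time. The class name `AlertRateMonitor`, the window size, and the tolerance are assumptions made for the example; a production system would likely monitor several indicators, not just the alert rate.

```python
from collections import deque


class AlertRateMonitor:
    """Signal when the recent alert rate drifts from the expected baseline."""

    def __init__(self, baseline_rate, window=100, tolerance=0.5):
        self.baseline_rate = baseline_rate  # alert rate observed at evaluation
        self.tolerance = tolerance          # allowed relative deviation
        self.recent = deque(maxlen=window)  # keeps only the latest outcomes

    def record(self, flagged):
        self.recent.append(1 if flagged else 0)

    def needs_review(self):
        # Wait until the window is full before judging drift.
        if len(self.recent) < self.recent.maxlen:
            return False
        rate = sum(self.recent) / len(self.recent)
        return abs(rate - self.baseline_rate) > self.tolerance * self.baseline_rate


# Hypothetical stream of model decisions (1 = flagged).
monitor = AlertRateMonitor(baseline_rate=0.10, window=10, tolerance=0.5)
for flagged in [0, 0, 0, 1, 0, 0, 0, 0, 0, 0]:
    monitor.record(flagged)
stable = monitor.needs_review()

for flagged in [1, 1, 1, 1, 0]:
    monitor.record(flagged)
drifted = monitor.needs_review()

print(stable, drifted)
```

A positive signal from such a monitor could be used to trigger re-evaluation or rebuilding of the models, which returns the project to the earlier phases of the cycle.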