7.2 How Machine Learning Works

Machine-learning algorithms are engines of insight. Investigative data miners can use them to predict the probability of crimes and to segment the criminals recorded in large databases into statistically significant clusters, or profiles. They automate statistical operations and can provide a graphical breakdown of a database by mapping its key clusters. Like neural networks, they must be trained on labeled samples, such as legal versus illegal transactions or criminals versus legitimate individuals. Unlike neural networks, however, their output takes the form of rules or graphical decision trees, which are easy for humans to comprehend.

Each machine-learning algorithm operates somewhat differently on the data. The process, however, is basically the same: segment and classify the data based on a desired output. The algorithm breaks down a data set through a process similar to the game of 20 Questions. For example, in the potential smuggler rule, a series of questions is posed against the data so that the algorithm can determine which attributes matter most in profiling a potential high alert. In this case, they are the vehicle type, the year of the vehicle, the lack of insurance, ownership, and so on. The algorithms differ somewhat in how they interrogate the data, but the end result is almost always the same: they cut down the features or conditions to a few precise clues.
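
The 20 Questions process can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the inspection records, the attribute names (vehicle, insured, owner), and the alert labels are invented for the example, and the "best question" is chosen by a simple count of misclassified records rather than by the measure any particular commercial algorithm uses.

```python
from collections import Counter

# Hypothetical border-inspection records, invented for illustration only.
RECORDS = [
    {"vehicle": "van",   "insured": "no",  "owner": "no",  "alert": "high"},
    {"vehicle": "van",   "insured": "no",  "owner": "yes", "alert": "high"},
    {"vehicle": "van",   "insured": "yes", "owner": "yes", "alert": "low"},
    {"vehicle": "sedan", "insured": "yes", "owner": "yes", "alert": "low"},
    {"vehicle": "sedan", "insured": "yes", "owner": "no",  "alert": "low"},
    {"vehicle": "sedan", "insured": "no",  "owner": "yes", "alert": "low"},
    {"vehicle": "truck", "insured": "no",  "owner": "no",  "alert": "high"},
    {"vehicle": "truck", "insured": "yes", "owner": "yes", "alert": "low"},
]
ATTRIBUTES = ["vehicle", "insured", "owner"]


def misclassified(records, attribute, target="alert"):
    """Errors made if each value of `attribute` predicts its majority label."""
    groups = {}
    for record in records:
        groups.setdefault(record[attribute], []).append(record[target])
    return sum(len(labels) - Counter(labels).most_common(1)[0][1]
               for labels in groups.values())


def best_question(records, attributes, target="alert"):
    """Pick the attribute whose answers separate the labels most cleanly."""
    return min(attributes, key=lambda a: misclassified(records, a, target))


def build_tree(records, attributes, target="alert"):
    """Keep asking questions until a segment is pure or questions run out."""
    labels = [record[target] for record in records]
    if len(set(labels)) == 1 or not attributes:
        return Counter(labels).most_common(1)[0][0]  # leaf: majority label
    attribute = best_question(records, attributes, target)
    remaining = [a for a in attributes if a != attribute]
    branches = {}
    for value in sorted({record[attribute] for record in records}):
        subset = [r for r in records if r[attribute] == value]
        branches[value] = build_tree(subset, remaining, target)
    return (attribute, branches)


def rules(tree, conditions=()):
    """Flatten the tree into human-readable (condition, label) rules."""
    if isinstance(tree, str):
        yield " AND ".join(conditions) or "ALWAYS", tree
        return
    attribute, branches = tree
    for value, subtree in branches.items():
        yield from rules(subtree, conditions + (f"{attribute} = {value}",))


tree = build_tree(RECORDS, ATTRIBUTES)
for condition, label in rules(tree):
    print(f"IF {condition} THEN alert = {label}")
```

On this toy data the first question asked concerns insurance, and only the uninsured segment needs a second question about vehicle type, which is the sense in which the conditions are cut down to a few precise clues.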

Some of the most popular algorithms include CHAID, CART, and C5.0. CHAID (Chi-squared Automatic Interaction Detection) was designed to detect statistical relationships between variables in a database. It is restricted to categorical data attributes, such as low, medium, and high, so it might be used to rate the probability of a crime or the degree of match to a terrorist profile. CART (Classification and Regression Trees) measures the degree of diversity within the data to decide on its splits, looking for the variable that is the best splitter in a data set. For example, it may look at all the makes of automobiles in order to find which models are most likely to be used to smuggle contraband at a border point of entry. C5.0 measures the amount of information each variable in a data set provides and ranks the variables in order of importance. C5.0 was developed by Dr. J. Ross Quinlan; it descends from his earlier ID3 and C4.5 algorithms, which were among the first learning systems capable of generating rules in the form of a decision tree.
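
The three ways of scoring a candidate split can be compared side by side. The sketch below uses textbook definitions: a chi-squared statistic of the kind CHAID relies on, the Gini diversity index commonly associated with CART, and entropy-based information gain of the kind used in Quinlan's family of algorithms. It is a simplified illustration, not a reproduction of any product's implementation (C5.0 in particular adds refinements not shown here), and the insured and alert columns are invented for the example.

```python
import math
from collections import Counter


def gini(labels):
    """Gini diversity index: 0 for a pure segment, higher when mixed."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def entropy(labels):
    """Shannon entropy in bits: 0 for a pure segment."""
    n = len(labels)
    return -sum((count / n) * math.log2(count / n)
                for count in Counter(labels).values())


def impurity_reduction(values, labels, impurity):
    """Parent impurity minus the size-weighted impurity of the child segments."""
    n = len(labels)
    groups = {}
    for value, label in zip(values, labels):
        groups.setdefault(value, []).append(label)
    weighted = sum(len(g) / n * impurity(g) for g in groups.values())
    return impurity(labels) - weighted


def chi_square(values, labels):
    """Pearson chi-squared statistic for the attribute-by-label table."""
    n = len(labels)
    row_totals = Counter(values)
    col_totals = Counter(labels)
    observed = Counter(zip(values, labels))
    statistic = 0.0
    for value in row_totals:
        for label in col_totals:
            expected = row_totals[value] * col_totals[label] / n
            statistic += (observed[(value, label)] - expected) ** 2 / expected
    return statistic


# Hypothetical columns: insurance status and the resulting alert level.
insured = ["no", "no", "yes", "yes", "yes", "no", "no", "yes"]
alert = ["high", "high", "low", "low", "low", "low", "high", "low"]

print("chi-squared statistic:", round(chi_square(insured, alert), 3))
print("Gini reduction:", round(impurity_reduction(insured, alert, gini), 3))
print("information gain:", round(impurity_reduction(insured, alert, entropy), 3))
```

Each score would be computed for every candidate attribute, and the attribute with the strongest score becomes the next split. The measures often agree on the ranking but need not, which is one reason the algorithms can produce different trees from the same data.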