1.9 Machine Learning

Machine-learning algorithms are probably the most important technology for profiling terrorists and criminals via data mining. They are commonly used to segment a database, automating the manual process of searching for and discovering key features and intervals. For example, they can be used to answer such questions as when fraud is most likely to take place or what the characteristics of a drug smuggler are. Machine-learning software can segment a database into statistically significant clusters based on a desired output, such as the identifiable characteristics of suspected criminals or terrorists. Like neural networks, these algorithms can be used to find the needles in the digital haystacks. However, unlike neural networks, they can generate graphical decision trees or IF/THEN rules, which an analyst can understand and use to gain important insight into the attributes of crimes and criminals.

Machine-learning algorithms such as CART, CHAID, and C5.0 operate somewhat differently, but the result is basically the same: they segment and classify the data based on a desired output, such as identifying a potential perpetrator. They operate through a process similar to the game of 20 questions, interrogating a data set to discover which attributes are the most important for identifying a potential customer, perpetrator, or piece of fruit. Let's say we have a banana, an apple, and an orange. Which data attribute carries the most information for classifying that fruit? Is it weight, shape, or color? Weight is of little help, since knowing that a piece of fruit weighs 7.8 ounces isn't going to discriminate very much. How about shape? Well, if it is round, we can rule out a banana, but we still cannot tell the apple from the orange. Color is really the best attribute and carries the most information for identifying fruit. The same process takes place in the identification of perpetrators, except in this case an analysis might incorporate hundreds, if not thousands, of data attributes.
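The fruit question can be made concrete with a small calculation. The sketch below is only an illustration: it scores each attribute by entropy-based information gain, one common splitting measure for decision trees. The individual algorithms named above differ in the exact measure they use, and none of them is reproduced here. The three-row table, the attribute values, and the function names are assumptions invented for the example; in particular, weight is placed in the same "medium" bucket for all three fruits to mirror the point that it does not discriminate.

```python
from collections import Counter
from math import log2

# Hypothetical toy data: one row per fruit, with made-up attribute values.
FRUIT = [
    {"fruit": "banana", "weight": "medium", "shape": "long", "color": "yellow"},
    {"fruit": "apple", "weight": "medium", "shape": "round", "color": "red"},
    {"fruit": "orange", "weight": "medium", "shape": "round", "color": "orange"},
]


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(rows, attribute, target="fruit"):
    """Reduction in entropy of `target` after splitting `rows` on `attribute`."""
    base = entropy([row[target] for row in rows])
    # Group the target labels by the value each row has for the attribute.
    groups = {}
    for row in rows:
        groups.setdefault(row[attribute], []).append(row[target])
    # Weighted average entropy of the groups produced by the split.
    remainder = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return base - remainder


gains = {attr: information_gain(FRUIT, attr) for attr in ("weight", "shape", "color")}
best = max(gains, key=gains.get)

for attr, gain in gains.items():
    print(f"{attr}: {gain:.3f} bits")
print("best attribute:", best)
```

Under these assumed values, color separates all three fruits and earns the highest gain, shape splits off only the banana, and weight earns none.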

The output of these algorithms can take the form of either IF/THEN rules or a graphical decision tree, with each branch representing a distinct cluster in a database. They can automate the process of stratification so that known clues can be used to "score" individuals as interactions occur in various databases over time, and predictive rules can "fire" in real time to detect potential suspects. The rules or "signatures" could be hosted on centralized servers, so that as transactions occur in commercial and government databases, real-time alerts would be broadcast to law enforcement agencies and other point-of-contact users. A scenario might play out as follows:

An event is observed (INS processes a passport), and a score is generated:

    RULE 1:
        IF social security number issued <= 89–121 days ago,
        THEN target 16% probability,
        Recommended Action: OK, process through.

However, if the conditions are different, a low alert is calibrated:

    RULE 2:
        IF social security number issued <= 89–121 days ago,
        AND 2 overseas trips during last 3 months,
        THEN target 31% probability,
        Recommended Action: Ask for additional ID, report on findings to this system.

Under different conditions, the alert is elevated:

    RULE 3:
        IF social security number issued <= 89–121 days ago,
        AND 2 overseas trips during last 3 months,
        AND license type = Truck,
        THEN target 63% probability,
        Recommended Action: Ask for additional information about destination, report on findings to this system.

Finally, the conditions warrant an escalated alert and associated action:

    RULE 4:
        IF social security number issued <= 89–121 days ago,
        AND 2 overseas trips during last 3 months,
        AND license type = Truck,
        AND wire transfers <= 3–5,
        THEN target 71% probability,
        Recommended Action: Detain for further investigation, report on findings to this system.
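The cascade of rules above can be sketched as a small rule engine. This is a hypothetical illustration, not a description of any deployed system: the field names (`ssn_age_days`, `overseas_trips_90d`, `license_type`, `wire_transfers`), the `Rule` class, and the `score` function are all invented for the example. The ranges in the rules are ambiguous as printed, so the code assumes the upper bound in each case (121 days, 5 wire transfers) and reads "2 overseas trips" as "at least 2". When several rules match, the one with the most conditions is taken as the most specific.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple

Event = dict
Condition = Callable[[Event], bool]


@dataclass(frozen=True)
class Rule:
    name: str
    conditions: Tuple[Condition, ...]
    probability: float
    action: str


# Assumed readings of the printed conditions (see the lead-in above).
def recent_ssn(e: Event) -> bool:
    return e.get("ssn_age_days", float("inf")) <= 121


def frequent_traveler(e: Event) -> bool:
    return e.get("overseas_trips_90d", 0) >= 2


def truck_license(e: Event) -> bool:
    return e.get("license_type") == "Truck"


def few_wire_transfers(e: Event) -> bool:
    return e.get("wire_transfers", float("inf")) <= 5


RULES = [
    Rule("RULE 1", (recent_ssn,), 0.16, "OK, process through"),
    Rule("RULE 2", (recent_ssn, frequent_traveler), 0.31,
         "Ask for additional ID"),
    Rule("RULE 3", (recent_ssn, frequent_traveler, truck_license), 0.63,
         "Ask for additional information about destination"),
    Rule("RULE 4",
         (recent_ssn, frequent_traveler, truck_license, few_wire_transfers),
         0.71, "Detain for further investigation"),
]


def score(event: Event) -> Optional[Rule]:
    """Return the most specific rule whose conditions all hold, or None."""
    matches = [r for r in RULES if all(cond(event) for cond in r.conditions)]
    return max(matches, key=lambda r: len(r.conditions), default=None)


# Example event with made-up values.
hit = score({
    "ssn_age_days": 100,
    "overseas_trips_90d": 2,
    "license_type": "Truck",
    "wire_transfers": 4,
})
if hit is not None:
    print(f"{hit.name}: {hit.probability:.0%} -> {hit.action}")
```

Picking the rule with the most satisfied conditions is one simple way to resolve overlaps. A production system would more likely take both the rules and their probabilities directly from the trained model rather than from a hand-written list.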

Presently, all of this information exists: it is sitting idly in government databases at the Social Security Administration and the Departments of State, Transportation, and the Treasury. Clearly, the future of homeland security is going to require the application of data mining models in real time, utilizing many different databases in support of multiple agencies and their personnel. Already the Visa Entry Reform Act of 2001 is addressing the modernization of the U.S. visa system in an effort to increase the ability to track foreign nationals. Amazingly, in the summer of 2000, a full year before the attacks of September 11, Representative Curt Weldon of Pennsylvania, who chairs the House Military Research and Development Subcommittee, proposed a government-wide data mining agency tasked with supporting the intelligence community in developing threat profiles of terrorists.

To quote Weldon, "In the 21st century, you have to be able to do massive data mining, and nobody can do that today." The data mining agency proposed in 2000 by Weldon was to be known as the National Operations and Analysis Hub (NOAH) and would support high-level government policy makers by integrating more than 28 intelligence community networks, as well as the databases from a vast array of federal agencies. However, simply aggregating the data is not enough; it must also be mined to extract digital signatures of suspected terrorists and criminals.