1.7 Text Mining

The explosion in the amount of data generated by government and corporate databases, e-mails, Internet survey forms, phone and cellular records, and other communications has created a need for new pattern-recognition technologies, including text mining tools that use clustering techniques to extract concepts and keywords from unstructured data. Based on a field of AI known as natural language processing (NLP), text mining tools can capture critical features of a document's content by analyzing its linguistic characteristics. One obvious application for text mining is monitoring multiple online and wireless communication channels for selected keywords, such as anthrax or the names of individual suspects or groups of suspects. Patterns in digital text files provide clues to the identity and features of criminals, which investigators can uncover with this evolving genre of specialized text mining tools.

Text mining has typically been used by corporations to organize and index internal documents, but police departments can use the same technology to organize criminal cases and to institutionalize their knowledge of the activities of perpetrators, organized gangs, and other groups. This is already being done in the United Kingdom using text mining software from Autonomy. More importantly, criminal investigators and counter-intelligence analysts can use the same technology and tools to sort, organize, and analyze gigabytes of text during their investigations and inquiries. Most of today's crimes are electronic in nature: perpetrators coordinate and communicate via networks and databases, leaving textual trails that investigators can track and analyze. An assortment of tools and techniques exists for discovering key concepts in narrative text that resides in multiple databases, in many formats, and in multiple languages.

Text mining tools and applications focus on discovering relationships in unstructured text and can be applied to the problem of searching for and locating keywords, such as names or terms used in e-mails, wireless phone calls, faxes, instant messages, chat rooms, and other methods of human communication. Traditional data mining deals with databases that follow a rigid structure of tables, in which each record represents a specific instance of an entity and holds its values in a fixed set of columns. Text mining, by contrast, deals with unstructured data (Figure 1.3).

Figure 1.3: Text mining can extract the core content from millions of records.

Text mining can be used to extract and index all the words in a database or on a network, as the example in Figure 1.3 demonstrates, in order to find key intelligence for criminal and counter-intelligence purposes. Text analysis software developed at the University of Texas can detect when a person is lying three times out of four. The program looks at the words used and the structure of the message, which could be an e-mail.
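The idea of extracting and indexing every word so that keywords can be located quickly can be illustrated with a small inverted index. The following is only a minimal sketch in Python, not the method of any tool named in this chapter; the function names (`tokenize`, `build_index`, `search`) and the sample messages are hypothetical and invented for illustration.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def build_index(documents):
    """Map each word to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for word in tokenize(text):
            index[word].add(doc_id)
    return index


def search(index, keywords):
    """Return the ids of documents that contain every keyword."""
    sets = [index.get(k.lower(), set()) for k in keywords]
    return set.intersection(*sets) if sets else set()


# Hypothetical messages, invented for illustration only.
messages = {
    "msg1": "Meet at the warehouse on Friday.",
    "msg2": "The shipment leaves the warehouse tonight.",
    "msg3": "Lunch on Friday?",
}

index = build_index(messages)
hits = search(index, ["warehouse", "friday"])
print(hits)
```

Here `hits` contains only `msg1`, the one message that mentions both terms. A real text mining tool would add linguistic processing (stemming, name recognition, multiple languages) on top of this kind of index.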

1.8 Neural Networks

Probably one of the most powerful tools for investigative data miners, in terms of detecting, identifying, and classifying patterns of digital and physical evidence, is the neural network, a technology that has been in practical use for about 20 years. Although neural networks were proposed in the late 1950s, it wasn't until the mid-1980s that software became sufficiently sophisticated and computers became powerful enough for actual applications to be developed. During the 1990s, commercial neural network tools and applications from such firms as Nestor, NeuralWare, and HNC became reliable enough to enable their widespread use in the financial, marketing, retailing, medical, and manufacturing sectors. Notably, one of the first and most successful applications was the detection of credit card fraud.

Today, neural networks are being applied to an increasing number of real-world problems of considerable complexity. Neural networks are good pattern-recognition engines and robust classifiers with the ability to generalize in making decisions about imprecise and incomplete data. Unlike traditional statistical methods such as regression, they are able to work with a relatively small training sample in constructing predictive models; this makes them ideal in criminal detection situations because, for example, only a tiny percentage of transactions is fraudulent.

A key concept in working with neural networks is that they must be trained, just as a child or a pet must, because this type of software is really about remembering observations. If provided with an adequate sample of fraud or other criminal observations, a neural network will eventually be able to spot new instances of similar crimes. Training involves exposing a neural-network algorithm to a set of example transaction patterns; often the examples are recycled through thousands of sessions until the network learns the pattern. As a neural network is trained, it gradually becomes skilled at recognizing the patterns of criminal behavior and the features of perpetrators. This is done by repeatedly adjusting the numerical weights in its formulas, which gradually converge to a set of weights that can be used to detect new criminal behavior or other criminals (Figure 1.4).

Figure 1.4: A neural net can be trained to detect criminal behavior.
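The training process described above, in which examples are presented repeatedly and the weights are adjusted a little each time, can be sketched with the simplest possible network: a single artificial neuron. This is a minimal illustration in Python under stated assumptions, not the algorithm of any commercial product mentioned here. The two features (a scaled transaction amount and a card-not-present flag), the toy data, and the learning rate are all hypothetical.

```python
import math


def sigmoid(z):
    """Squash any number into the range 0..1."""
    return 1.0 / (1.0 + math.exp(-z))


def predict(weights, bias, x):
    """Score one example; values near 1 suggest fraud."""
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)


def train(examples, labels, epochs=2000, rate=0.5):
    """Fit one neuron by recycling the examples and nudging the weights."""
    weights = [0.0] * len(examples[0])
    bias = 0.0
    for _ in range(epochs):
        for x, y in zip(examples, labels):
            error = predict(weights, bias, x) - y
            # Move each weight slightly in the direction that reduces the error.
            weights = [w - rate * error * xi for w, xi in zip(weights, x)]
            bias -= rate * error
    return weights, bias


# Hypothetical training data: [amount scaled to 0..1, 1 if card not present].
examples = [
    [0.10, 0], [0.20, 0], [0.30, 1], [0.20, 1],   # legitimate
    [0.90, 1], [0.80, 1], [0.95, 0], [0.85, 1],   # fraudulent
]
labels = [0, 0, 0, 0, 1, 1, 1, 1]

weights, bias = train(examples, labels)
for x, y in zip(examples, labels):
    print(x, y, round(predict(weights, bias, x), 3))
```

Real fraud-detection networks use many more inputs, hidden layers, and far larger samples, but the principle is the same: the weights start out uninformative and converge through repeated small corrections.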

Neural networks can be used to assist human investigators in sorting through massive amounts of data to identify other individuals with similar profiles or behavior. Neural networks have been used to detect and match the chromatographic signature of chemical components, such as kerosene in arson cases, by forensic investigators at the California Department of Justice.

One particular type of neural network, known as a Kohonen net or self-organizing map (SOM), can be used to find clusters in databases for the autonomous discovery of similarities. SOMs have been used to cluster and match unsolved crimes and criminals' modi operandi (MOs), or methods of operation. SOMs work through a process known as unsupervised learning: this type of neural network does not need to be trained on examples that have been labeled in advance. Instead, it automatically searches for and finds clusters hidden in the data. Police departments in the United Kingdom and in the state of Washington are already doing this type of clustering analysis. Investigators from the West Midlands Police in Birmingham used SOMs to model the behavior of sex offenders, while the Americans used clustering neural networks to map homicides in the CATCH project (Figure 1.5).

Figure 1.5: CATCH (Computer Aided Tracking and Characterization of Homicides).
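To make the idea of a self-organizing map concrete, the following is a minimal sketch of a one-dimensional SOM in Python. It is not the CATCH system or the West Midlands model; the MO features, the case data, the number of nodes, and the decay schedule are all hypothetical choices made for illustration. Note that no labels are supplied at any point: the map adjusts its node weights from the data alone.

```python
import math
import random


def nearest_node(weights, vector):
    """Index of the node whose weight vector is closest to the input."""
    return min(
        range(len(weights)),
        key=lambda i: sum((w - v) ** 2 for w, v in zip(weights[i], vector)),
    )


def train_som(data, n_nodes=3, epochs=200, seed=0):
    """Train a one-dimensional SOM on unlabeled feature vectors."""
    rng = random.Random(seed)
    dim = len(data[0])
    weights = [[rng.random() for _ in range(dim)] for _ in range(n_nodes)]
    for epoch in range(epochs):
        frac = 1.0 - epoch / epochs
        rate = 0.5 * frac                         # learning rate decays
        radius = max(n_nodes / 2.0 * frac, 0.1)   # neighborhood shrinks
        for vector in rng.sample(data, len(data)):  # shuffled pass
            winner = nearest_node(weights, vector)
            for i in range(n_nodes):
                # Nodes near the winner on the map are pulled along with it.
                influence = math.exp(-((i - winner) ** 2) / (2 * radius ** 2))
                weights[i] = [
                    w + rate * influence * (v - w)
                    for w, v in zip(weights[i], vector)
                ]
    return weights


# Hypothetical MO features per case:
# [forced entry, night-time, weapon used, vehicle involved]
cases = [
    [1, 1, 0, 0], [1, 1, 0, 1], [1, 1, 0, 0],
    [0, 0, 1, 1], [0, 1, 1, 1], [0, 0, 1, 1],
]

som = train_som(cases)
assignments = [nearest_node(som, c) for c in cases]
print(assignments)
```

Cases with similar MO vectors tend to be assigned to the same or neighboring nodes, so an unsolved case that lands on the same node as a group of solved ones is a candidate for closer comparison by an investigator.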