1.7 Text Mining

The explosion in the volume of data generated by government and corporate databases, e-mail, Internet survey forms, landline and cellular phone records, and other communications has created a need for new pattern-recognition technologies. Among them are text mining tools, which use clustering techniques to extract concepts and keywords from unstructured data. Based on natural language processing (NLP), a field of AI, text mining tools capture the critical features of a document's content by analyzing its linguistic characteristics. One obvious application is monitoring multiple online and wireless communication channels for selected keywords, such as anthrax or the names of individual suspects or groups of suspects. Patterns in digital text files provide clues to the identity and characteristics of criminals, which investigators can uncover with this evolving class of specialized text mining tools.

Text mining has typically been used by corporations to organize and index internal documents, but police departments can use the same technology to organize criminal cases and to institutionalize knowledge of the criminal activities of perpetrators, organized gangs, and other groups. This is already being done in the United Kingdom with text mining software from Autonomy. More importantly, criminal investigators and counter-intelligence analysts can use the same tools to sort, organize, and analyze gigabytes of text during their investigations and inquiries. Most of today's crimes are electronic in nature: perpetrators coordinate and communicate via networks and databases, leaving textual trails that investigators can track and analyze. An assortment of tools and techniques exists for discovering key concepts in narrative text that resides in multiple databases, in many formats, and in multiple languages.

Text mining tools and applications focus on discovering relationships in unstructured text. They can be applied to the problem of searching for and locating keywords, such as names or terms used in e-mails, wireless phone calls, faxes, instant messages, chat rooms, and other forms of human communication. Traditional data mining works on databases with a rigid structure: tables of records, each representing a specific instance of an entity described by values in fixed columns. Text mining, by contrast, deals with unstructured data (Figure 1.3).
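The keyword search described above can be illustrated with a minimal sketch. It is not the method of any particular commercial tool; the function name find_watchlist_hits, the message identifiers, and the sample messages are all hypothetical, and the matching is a simple case-insensitive whole-word search rather than true linguistic analysis.

```python
import re
from collections import defaultdict


def find_watchlist_hits(messages, watchlist):
    """Map each watchlist term to the ids of the messages that contain it.

    Matching is case-insensitive and anchored on word boundaries, so
    "anthrax" matches "Anthrax" but not a longer word that embeds it.
    """
    # Compile one pattern per term; re.escape guards against regex
    # metacharacters appearing in names or phrases.
    patterns = {
        term: re.compile(r"\b" + re.escape(term) + r"\b", re.IGNORECASE)
        for term in watchlist
    }
    hits = defaultdict(list)
    for msg_id, text in messages.items():
        for term, pattern in patterns.items():
            if pattern.search(text):
                hits[term].append(msg_id)
    return dict(hits)


# Hypothetical sample data, for illustration only.
messages = {
    "email-1": "The shipment of anthrax vaccine arrives Monday.",
    "chat-7": "Lunch at noon?",
    "im-3": "Tell John Doe the Anthrax report is ready.",
}
watchlist = ["anthrax", "John Doe"]

hits = find_watchlist_hits(messages, watchlist)
for term, msg_ids in hits.items():
    print(term, "->", msg_ids)
```

A production tool would go well beyond this, handling misspellings, aliases, and multiple languages, but the sketch shows the basic idea of scanning free text for selected terms.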

Figure 1.3: Text mining can extract the core content from millions of records.

Text mining can be used to extract and index all the words in a database or on a network, as the example in Figure 1.3 demonstrates, in order to find key intelligence that can serve criminal investigation and counter-intelligence purposes. Software developed at the University of Texas can detect, three times out of four, when a person is lying. The program looks at the words used and the structure of the message, which could be an e-mail.
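Extracting and indexing all the words in a collection, as mentioned at the start of the preceding paragraph, is commonly done with an inverted index, which maps each word to the documents that contain it. The following is a minimal sketch under simplifying assumptions: the function names build_inverted_index and search, the tokenizing rule, and the sample case notes are hypothetical and are not taken from any tool named in this section.

```python
import re
from collections import defaultdict

# Simplifying assumption: a "word" is a run of lowercase letters,
# digits, or apostrophes after the text has been lowercased.
TOKEN = re.compile(r"[a-z0-9']+")


def build_inverted_index(documents):
    """Map every word to the set of document ids in which it appears."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for word in TOKEN.findall(text.lower()):
            index[word].add(doc_id)
    return index


def search(index, query):
    """Return the ids of documents that contain every word in the query."""
    words = TOKEN.findall(query.lower())
    if not words:
        return set()
    # Start from a copy of the first word's postings, then intersect.
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result


# Hypothetical case notes, for illustration only.
documents = {
    "case-101": "Suspect seen near the warehouse on Friday.",
    "case-102": "Warehouse burglary reported; suspect unknown.",
    "case-103": "Vehicle theft near the harbor.",
}

index = build_inverted_index(documents)
print(sorted(search(index, "suspect warehouse")))
print(sorted(search(index, "near")))
```

Because the index is built once and then queried many times, lookups stay fast even as the collection grows, which is what makes indexing very large numbers of records practical.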