5.2 How Does Text Mining Work?

One obvious application of text mining is monitoring multiple online and wireless communication channels for selected keywords, such as anthrax, or for the names or aliases of individual suspects or groups of suspects. Text mining has typically been used by corporations to organize and index internal documents, but police departments can use the same technology to organize criminal cases and to institutionalize knowledge about the activities of perpetrators, organized gangs, and other groups. This is already being done in the United Kingdom using text mining software from Autonomy. More importantly, criminal investigators and intelligence analysts can use the same technology and tools to sort, organize, and analyze gigabytes of text during the course of their investigations and inquiries. Most of today's crimes are digital in nature, requiring perpetrators to coordinate and communicate via channels that leave text trails investigators can analyze. As we shall see, there is an assortment of tools and techniques for discovering key concepts in narrative text residing in multiple databases in many formats.
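The keyword monitoring described above can be illustrated with a minimal sketch. It assumes the messages have already been collected as plain text; the watch list, message identifiers, and function names are hypothetical and invented for illustration. It does not represent how any particular commercial product works.

```python
import re
from collections import defaultdict


def build_watchlist_pattern(terms):
    """Compile one case-insensitive regex matching any watch-list term as a whole word or phrase."""
    # Longest terms first, so a multi-word alias wins over a shorter term it contains.
    ordered = sorted(terms, key=len, reverse=True)
    alternation = "|".join(re.escape(term) for term in ordered)
    return re.compile(r"\b(?:" + alternation + r")\b", re.IGNORECASE)


def scan_messages(messages, terms):
    """Return {term: [message ids]} for every watch-list term found in the messages."""
    pattern = build_watchlist_pattern(terms)
    hits = defaultdict(list)
    for msg_id, text in messages.items():
        # Use a set so each term is reported at most once per message.
        found = {match.group(0).lower() for match in pattern.finditer(text)}
        for term in sorted(found):
            hits[term].append(msg_id)
    return dict(hits)


# Hypothetical sample data (assumption: terms are stored in lowercase).
WATCHLIST = ["anthrax", "the courier"]
MESSAGES = {
    "msg-001": "Shipment arrives Tuesday, ask The Courier for details.",
    "msg-002": "Lab report mentions anthrax spores.",
    "msg-003": "Lunch at noon?",
}

hits = scan_messages(MESSAGES, WATCHLIST)
for term, msg_ids in hits.items():
    print(term, msg_ids)
```

Exact matching of this kind finds only the literal terms on the list. It will miss misspellings, transliterations, and paraphrases, which is one reason the concept-based techniques discussed in this chapter go beyond single keywords.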

As with some of the other data mining technologies covered in this book, text mining is one of many new tools for combating digital crimes in today's world. Almost all of these technologies have their roots in the field of AI, and almost all have been developed to amplify and assist human endeavors in the fields of science, commerce, and now forensics. Recent developments and applications in mining text-based data draw on an assortment of algorithms and on various statistical and visualization schemes.

Text mining is similar to data mining in that both automate the analysis of large volumes of data and both can be used to profile individuals, groups, and companies. The techniques differ in the types of data they analyze and the methods they use to conduct their analyses. Data mining is primarily intended for discovering relationships or ratios in structured data, both numeric and categorical. Text mining, by contrast, works with unstructured textual information, searching for concepts and clusters across thousands of documents or Web pages.

Data mining uses technologies such as neural networks to detect patterns, or machine-learning algorithms to extract predictive rules, in order to automate the process of data analysis and profiling. A key differentiator between the two is that text mining makes extensive use of lexical processing and analysis, word and phrase parsing, and other natural language processing (NLP) techniques to highlight key concepts and relationships between words and clusters of documents, based on their content. In addition, text mining applications typically rely on advanced visualization to present an overview of document content; they also use XML and HTML to link large numbers of similar documents, which users can drill back through. Depending on the type of investigation and the format of the data, different tools will be required, including those based on text analysis.
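As a rough illustration of the lexical processing and word/phrase parsing mentioned above, the following sketch tokenizes a narrative, drops common function words, and counts the remaining terms and two-word phrases. The stop-word list, the sample narrative, and the function names are assumptions made for this example. Real NLP toolkits use far richer linguistic analysis than simple frequency counts.

```python
import re
from collections import Counter

# A deliberately tiny, hypothetical stop-word list; real lists are much longer.
STOP_WORDS = {"the", "a", "an", "of", "and", "to", "in", "was", "by", "on", "at"}


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def top_terms(text, n=3):
    """Return the n most frequent tokens that are not stop words."""
    counts = Counter(t for t in tokenize(text) if t not in STOP_WORDS)
    return counts.most_common(n)


def top_phrases(text, n=3):
    """Return the n most frequent adjacent word pairs containing no stop word."""
    words = tokenize(text)
    pairs = [
        (first, second)
        for first, second in zip(words, words[1:])
        if first not in STOP_WORDS and second not in STOP_WORDS
    ]
    return Counter(pairs).most_common(n)


# Hypothetical narrative text, invented for illustration.
NARRATIVE = (
    "The suspect forced the rear window. "
    "The rear window was forced at night."
)

print(top_terms(NARRATIVE))
print(top_phrases(NARRATIVE, n=1))
```

Counting adjacent pairs is a crude stand-in for phrase parsing, but it shows why phrases can carry more meaning than isolated words: "rear window" identifies a point of entry in a way that neither word does alone.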

5.3 Text Mining Applications

In the context of investigative data mining, text mining techniques and tools can be used to sort and organize large collections of text-based data, such as licenses, registrations, airline tickets, credit-card transactions, point-of-entry passport records, criminal files, transcripts of investigations, and any other type of text-based data set in which a name, word, or concept needs to be identified and tracked. However, as with every data mining project, the results returned from text mining depend heavily on the quality and relevance of the data and on the objective of the analyst.

For text mining to be effective, the content and focus of the documents and databases are very important. For example, applying text mining to a collection of random e-mail files probably won't generate much in the way of relevant findings or leads for an ongoing investigation or counter-intelligence analysis unless the e-mail files are specifically those of confirmed suspects. However, using text mining to analyze the e-mails of individuals who are related to, or have had some contact with, a group of suspects in a wide area is likely to provide some important leads to an ongoing discovery-and-detect investigation, where the objective is to identify, for example, unknown associates in a criminal ring or terrorist cell.

Text mining software can also be used to construct investigation dossiers or internal intranet directories by classifying hundreds of thousands of documents based on multiple, inherent concepts found in the source text. For example, criminal files can be organized based on modus operandi by applying NLP techniques and other advanced algorithms. Text mining software can automatically identify and extract key concepts from investigation-related documented records. These concepts can be automatically linked to a taxonomy that can meet an agency's, department's, or specific investigative team's information requirements. These taxonomies provide users with a directory structure for exploring further via link analysis tools or by browsing or searching for the information through an intranet.
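The idea of linking extracted concepts to a taxonomy can be sketched as follows. This is a simplified illustration, not the method used by any specific product: the taxonomy nodes, indicator terms, and case files are hypothetical, and simple word overlap stands in for the NLP techniques a real tool would apply. Note that a document may be filed under several nodes at once, reflecting the multiple concepts found in its text.

```python
import re
from collections import defaultdict

# Hypothetical taxonomy: each node path maps to a set of indicator terms.
TAXONOMY = {
    "modus operandi/forced entry": {"forced", "pried", "crowbar"},
    "modus operandi/fraud": {"forged", "counterfeit", "phishing"},
    "location/residential": {"house", "apartment", "garage"},
}


def word_set(text):
    """Return the set of lowercase alphabetic words in the text."""
    return set(re.findall(r"[a-z]+", text.lower()))


def link_to_taxonomy(documents, taxonomy, min_hits=1):
    """File each document under every node sharing at least min_hits indicator terms with it."""
    directory = defaultdict(list)
    for doc_id, text in documents.items():
        words = word_set(text)
        for node, indicators in taxonomy.items():
            if len(words & indicators) >= min_hits:
                directory[node].append(doc_id)
    return dict(directory)


# Hypothetical case narratives, invented for illustration.
CASE_FILES = {
    "case-17": "Rear door pried open with a crowbar; apartment ransacked.",
    "case-22": "Victim received a phishing e-mail with a forged invoice.",
    "case-31": "Garage window forced; tools taken from the house.",
}

directory = link_to_taxonomy(CASE_FILES, TAXONOMY)
for node in sorted(directory):
    print(node, directory[node])
```

The resulting directory is the kind of structure that could then be browsed through an intranet or passed to a link analysis tool.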

Because text mining extracts the key concepts in the documents rather than a single keyword, the taxonomies make it easier for investigators and analysts to find relevant case-related information existing in multiple, linked documents. Such concept-based indexing also eliminates the need to force documents into predefined categories. Text mining software also replaces manual categorization and tagging efforts that add to the costs and deployment/update times for agency- or department-wide portals. This type of organization of crime-related information allows for the institutionalization of modus operandi and of criminal detection procedures.

Text mining software uses the source text itself to automate portal taxonomy creation by extracting multiple key concepts from the documents, mapping the interrelationships between these concepts in the document collection, and creating a taxonomy database that references and links these concepts. For example, a text mining tool can organize criminal cases into distinct categories based on type, time, location, modus operandi, rate, cost, or any other characteristic or feature the user chooses, or the software can organize and cluster them automatically.
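The two steps described above, mapping interrelationships between concepts and organizing cases by a chosen characteristic, can be sketched in a few lines. The sketch assumes the key concepts have already been extracted from each case; the records, field names, and function names are hypothetical. Counting how often two concepts appear in the same case is only one simple way to measure an interrelationship.

```python
from collections import Counter, defaultdict
from itertools import combinations

# Hypothetical case records; "concepts" are assumed to be already extracted.
CASES = [
    {"id": "case-17", "type": "burglary", "location": "north",
     "concepts": {"crowbar", "night", "apartment"}},
    {"id": "case-31", "type": "burglary", "location": "north",
     "concepts": {"crowbar", "night", "garage"}},
    {"id": "case-22", "type": "fraud", "location": "south",
     "concepts": {"phishing", "invoice"}},
]


def concept_cooccurrence(cases):
    """Count how many cases mention each pair of concepts together."""
    pairs = Counter()
    for case in cases:
        # Sort so each pair has one canonical (alphabetical) ordering.
        for first, second in combinations(sorted(case["concepts"]), 2):
            pairs[(first, second)] += 1
    return pairs


def group_by(cases, field):
    """Organize case ids into categories by a user-chosen field."""
    groups = defaultdict(list)
    for case in cases:
        groups[case[field]].append(case["id"])
    return dict(groups)


links = concept_cooccurrence(CASES)
by_type = group_by(CASES, "type")

print(links.most_common(1))
print(by_type)
```

Pairs with high counts, such as a tool and a time of day recurring together, are candidates for linked entries in a taxonomy database. Automatic clustering would go a step further by grouping cases whose concept sets are similar, without the user naming a field in advance.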