5.2 How Does Text Mining Work?

An obvious application of text mining is monitoring multiple online and wireless communication channels for selected keywords, such as anthrax, or for the names and aliases of individual suspects or groups of suspects. Corporations have typically used text mining to organize and index internal documents, but police departments can use the same technology to organize criminal cases and to institutionalize their knowledge of the criminal activities of perpetrators, organized gangs, and other groups. This is already being done in the United Kingdom using text mining software from Autonomy. More importantly, criminal investigators and intelligence analysts can use the same technology and tools to sort, organize, and analyze gigabytes of text during their investigations and inquiries. Most of today's crimes are digital in nature: perpetrators coordinate and communicate through channels that leave text trails investigators can analyze. As we shall see, an assortment of tools and techniques exists for discovering key concepts in narrative text residing in multiple databases and in many formats.
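
To make the keyword-monitoring idea concrete, the following is a minimal sketch in Python, not a description of any particular product. The watchlist entries, the alias spellings, the sample messages, and the function names (`build_patterns`, `scan_messages`) are all hypothetical and chosen only for illustration. It shows one simple way to flag messages that mention a keyword or any alias of a suspect, matching whole words regardless of case.

```python
import re
from collections import defaultdict

# Hypothetical watchlist: labels and aliases are illustrative assumptions.
WATCHLIST = {
    "anthrax": ["anthrax"],
    "suspect_a": ["john doe", "j. doe", "johnny d"],
}

def build_patterns(watchlist):
    """Compile one case-insensitive, whole-word pattern per watchlist entry."""
    patterns = {}
    for label, terms in watchlist.items():
        alternatives = "|".join(re.escape(term) for term in terms)
        # Lookarounds keep "anthrax" from matching inside a longer word.
        patterns[label] = re.compile(
            r"(?<!\w)(?:" + alternatives + r")(?!\w)", re.IGNORECASE
        )
    return patterns

def scan_messages(messages, watchlist):
    """Return {label: [message ids]} for messages mentioning a watchlist term."""
    patterns = build_patterns(watchlist)
    hits = defaultdict(list)
    for msg_id, text in messages:
        for label, pattern in patterns.items():
            if pattern.search(text):
                hits[label].append(msg_id)
    return dict(hits)

# Made-up sample messages as (id, text) pairs.
messages = [
    (1, "Shipment of Anthrax spores discussed in the forum."),
    (2, "Meeting with J. Doe moved to Friday."),
    (3, "Lunch menu for the week."),
]
print(scan_messages(messages, WATCHLIST))
```

Exact matching of this kind misses misspellings and paraphrases, which is one reason the concept-based techniques discussed in this chapter go beyond simple keyword lists.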

Like some of the other data mining technologies covered in this book, text mining is one of many new tools for combating digital crime. Almost all of these technologies have their roots in the field of AI, and almost all were developed to amplify and assist human endeavors in science, commerce, and now forensics. Recent developments have applied an assortment of algorithms, together with various statistical and visualization schemes, to the mining of text-based data.

Text mining is similar to data mining in that both automate the analysis of large volumes of data and both can be used to profile individuals, groups of entities, and companies. The two differ in the types of data they analyze and in the methods they use. Data mining is primarily intended for discovering relationships or ratios in structured data, both numeric and categorical. Text mining, by contrast, works with unstructured textual information, searching for concepts and clusters across thousands of documents or Web pages.
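
One way to picture the search for clusters in unstructured text is to convert each document into a vector of word counts and then group documents whose vectors point in similar directions. The sketch below assumes a plain bag-of-words representation, cosine similarity, and a greedy single-pass grouping with an arbitrary threshold of 0.3. These are simplifying assumptions made for illustration, as are the function names and the sample documents; commercial tools use more elaborate methods.

```python
import math
import re
from collections import Counter

def term_vector(text):
    """Turn free text into a bag-of-words vector (term -> count)."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two sparse term vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def group_documents(docs, threshold=0.3):
    """Greedy grouping: a document joins the first cluster whose seed
    document is similar enough, otherwise it starts a new cluster."""
    clusters = []  # list of (seed_vector, [document indices])
    for index, text in enumerate(docs):
        vector = term_vector(text)
        for seed, members in clusters:
            if cosine(seed, vector) >= threshold:
                members.append(index)
                break
        else:
            clusters.append((vector, [index]))
    return [members for _, members in clusters]

# Made-up case notes, used only to exercise the sketch.
docs = [
    "wire transfer to offshore account flagged",
    "offshore account received second wire transfer",
    "stolen vehicle recovered near the harbor",
    "vehicle stolen from harbor parking lot",
]
print(group_documents(docs))  # [[0, 1], [2, 3]]
```

Note that no column definitions or record layouts are needed here, which is the practical difference from mining structured data: the structure is derived from the words themselves.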

Data mining uses technologies such as neural networks to detect patterns, or machine-learning algorithms to extract predictive rules, in order to automate data analysis and profiling. A key differentiator between the two is that text mining makes extensive use of lexical processing and analysis, word and phrase parsing, and other NLP techniques to highlight key concepts and the relationships between words and clusters of documents, based on their content. In addition, text mining applications typically rely on advanced visualization to present an overview of document content; they also use XML and HTML to link large numbers of similar documents, which users can drill back through. Depending on the type of investigation and the format of the data, different tools will be required, including those based on text analysis.
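
The lexical processing and phrase parsing described above can be illustrated with a very small example. The sketch below assumes a crude approach: lowercase the text, split it into word tokens, discard pairs that contain common function words, and count the remaining two-word phrases. The stopword list, the sample report, and the function names are assumptions made for illustration; real NLP toolkits perform far richer analysis.

```python
import re
from collections import Counter

# Tiny illustrative stopword list; production systems use much larger ones.
STOPWORDS = {"the", "a", "an", "of", "to", "in", "and",
             "was", "by", "on", "at", "for"}

def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())

def key_phrases(text, top_n=3):
    """Count two-word phrases in which neither word is a stopword.

    Simplification: sentence boundaries are ignored, so a phrase may
    span the end of one sentence and the start of the next.
    """
    tokens = tokenize(text)
    bigrams = [
        (first, second)
        for first, second in zip(tokens, tokens[1:])
        if first not in STOPWORDS and second not in STOPWORDS
    ]
    return Counter(bigrams).most_common(top_n)

# Made-up narrative text standing in for an incident report.
report = (
    "The credit card was used at the station. "
    "A stolen credit card was reported by the owner of the credit card."
)
print(key_phrases(report, top_n=1))  # [(('credit', 'card'), 3)]
```

Even this rough pass surfaces "credit card" as the dominant phrase in the report, which hints at how repeated phrases can point an analyst toward the key concepts in a much larger body of text.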