5.7 Text Mining Threats

Most of what people do with computers centers on text. Individuals, including criminals and terrorists, create, transfer, read, search, edit, and otherwise transform textual information in myriad ways. Text mining technology integrates multiple strategies (statistical, keyword, grammar-based, and pattern-based) with diverse information sources (linguistic, conceptual, and domain knowledge) to analyze text quickly and efficiently for investigations. As we have seen, text mining uses NLP and, in some cases, proprietary algorithms, coupled with neural-network or vector analyzers for mapping, clustering, and organizing concepts from unstructured documents and other textual data sources.

Information technology, and especially the Web, has changed the concept of war, forensics, defense, and security. The new concepts of cyber crime and terrorism have altered criminal investigations and counter-intelligence. Before, the major task for any criminal investigation and intelligence activity was first to gather the information; today, that information is available. In fact, analysts and investigators are drowning in it. The new enemy of any intelligence activity is the daily avalanche of information that analysts have to collect, read, filter, and report on. Text mining technology offers a solution to this otherwise impossible task. It allows raw data to be collected and analyzed within a few seconds from diverse sources such as the following:

  • The Web

  • News agencies

  • Press reports

  • CDs

  • E-mails

  • Chat rooms

  • Databases

  • Forums

  • Newsgroups

Using different text mining tools and techniques, it is possible to identify the context of a communication and the thematic relationships between documents. This approach not only captures the opinions, tastes, truth, and emotions embedded in the data, but also allows analysts to monitor trends and significant dynamics within the information flow.

It is clear that text mining will become a fundamental tool for any investigative and counter-intelligence activity. The Pentagon's 100-page document called Response to Transnational Threats describes how the military should respond to the threat of saboteurs and bombers aiming for violence, not victory. The proposed solution is to develop a set of agent-based tools that includes micro-robots, bio-sniffers, and sticky electronics, coupled with text mining technology and capabilities. The Office of Advanced Information Technology at the CIA is tackling the information overflow issue with a set of intelligent software agents, using tools like Oasis, which can convert audio signals from television and radio broadcasts into text, or FLUENT, which enables a user to conduct computer searches of documents that are in a language the user does not understand. The user can put English words into the search field, such as "nuclear weapons," and documents in languages such as Russian, Chinese, and Arabic will be returned.

There is a definite distinction between how text mining tools work with unstructured textual information and how data mining tools work with structured data sets; however, both types of technologies attempt to extract some insight that can be used by forensic investigators and analysts. Most text mining tools and techniques use all or some of the following processes:

  • Natural language processing for capturing critical features of a document's content based on the analysis of its linguistic characteristics

  • Information retrieval for identifying those documents in a document collection that match a set of criteria

  • Routing and filtering for automatically delivering information to the appropriate destination according to subject or content

  • Document summarization for producing a compressed version or summary of a document or collection of texts, such as e-mails, that condenses its content

  • Document clustering for grouping textual sources according to similarity of content, with or without predefined categories, as a way of organizing large collections of documents
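
Two of the processes listed above, information retrieval and document clustering, can be illustrated with a minimal sketch. It assumes a TF-IDF bag-of-words representation and cosine similarity, which is a common textbook choice and not necessarily the method used by any tool named in this chapter. The function names, the sample documents, and the 0.2 similarity threshold are all invented for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def build_idf(token_lists):
    """Smoothed inverse document frequency for every term in the collection."""
    n = len(token_lists)
    df = Counter(term for tokens in token_lists for term in set(tokens))
    return {term: math.log((1 + n) / (1 + count)) + 1.0 for term, count in df.items()}


def vectorize(tokens, idf):
    """Sparse TF-IDF vector (term -> weight); terms unknown to the collection are ignored."""
    tf = Counter(t for t in tokens if t in idf)
    return {term: count * idf[term] for term, count in tf.items()}


def cosine(a, b):
    """Cosine similarity between two sparse vectors; 0.0 if either is empty."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query, idf, vectors):
    """Information retrieval: indices of matching documents, best match first."""
    q = vectorize(tokenize(query), idf)
    scored = [(cosine(q, v), i) for i, v in enumerate(vectors)]
    return [i for score, i in sorted(scored, reverse=True) if score > 0]


def cluster(vectors, threshold=0.2):
    """Clustering without predefined categories: a greedy single pass in which
    each document joins the first cluster whose seed document is similar enough."""
    clusters = []  # each cluster is a list of document indices; index 0 is the seed
    for i, vec in enumerate(vectors):
        for members in clusters:
            if cosine(vec, vectors[members[0]]) >= threshold:
                members.append(i)
                break
        else:
            clusters.append([i])
    return clusters


# Invented sample collection, for illustration only.
docs = [
    "Shipment of chemical precursors delayed at the port",
    "Chemical precursors shipment arrives at port warehouse",
    "Football match ends in a draw after extra time",
    "Extra time goal decides the football match",
]
token_lists = [tokenize(d) for d in docs]
idf = build_idf(token_lists)
vectors = [vectorize(tokens, idf) for tokens in token_lists]

print("matches:", retrieve("chemical shipment", idf, vectors))
print("clusters:", cluster(vectors))
```

The greedy single pass is chosen only for brevity. Its result depends on document order and on the threshold, so a production system would more likely use an established clustering algorithm and tune it on real data.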

There are various applications related to the forensic criminal detection and intelligence-gathering fields that lend themselves aptly to text mining technology, including the surveillance and identification of terrorists. For example, terrorist groups using chemical weapons, biological weapons, explosives, or other nonconventional weapons will have experts with specific knowledge. These individuals will have had to study, perform research, and attend technical seminars and conferences. All of these activities would leave electronic traces scattered across the Web, universities, and other organizations' networks and the registration databases of organizers. Using a text-clustering tool, these connections can be uncovered and the names or aliases detected.
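
The idea of detecting the same name or alias across several sources can be sketched as follows. This is not a text-clustering tool as such: it swaps in a much simpler technique, fuzzy string matching with Python's standard `difflib.SequenceMatcher`, to group name variants. The record layout, the source labels, the names, and the 0.8 threshold are hypothetical and chosen only for illustration.

```python
import re
from difflib import SequenceMatcher


def normalize(name):
    """Lowercase a name, drop punctuation and digits, and collapse whitespace.
    Assumes unaccented Latin-alphabet names; other scripts would need different handling."""
    cleaned = re.sub(r"[^a-z ]", "", name.lower())
    return " ".join(cleaned.split())


def name_similarity(a, b):
    """Similarity ratio in [0, 1] between two normalized names."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def group_aliases(records, threshold=0.8):
    """Greedily group (source, name) records whose names resemble a group's first name."""
    groups = []
    for source, name in records:
        for group in groups:
            if name_similarity(name, group[0][1]) >= threshold:
                group.append((source, name))
                break
        else:
            groups.append([(source, name)])
    return groups


def cross_source_groups(groups):
    """Keep only the groups whose records come from more than one source."""
    return [g for g in groups if len({source for source, _ in g}) > 1]


# Hypothetical registration records, invented for illustration.
records = [
    ("conference_a", "Jon A. Smith"),
    ("seminar_b", "John Smith"),
    ("conference_a", "Maria Lopez"),
]

for group in cross_source_groups(group_aliases(records)):
    print(group)
```

Character-level matching of this kind produces false positives for common names and misses aliases that share no spelling, so any group it reports is a lead to verify, not a confirmed identity.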

Frequently, the only traces available after a terrorist attack are the letters or the communications claiming the act. The ability of text mining to analyze the style and the concepts expressed in the communications can be very helpful in establishing connections and patterns between documents. By finding similarities between the styles of these anonymous letters, individuals or groups can be identified and linked. Text mining analyses can spot connections, similarities, and patterns in declarations or statements that could suggest links between individuals who officially have no connection.
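
One simple way to compare writing style, shown here only as a sketch, is to profile each text by how often it uses common function words and then compare the profiles with cosine similarity. This is an assumed, deliberately small stand-in for the style analysis described above: the word list, the function names, and the sample letters are invented for illustration, and real stylometric work uses far richer features and much longer texts.

```python
import math
import re
from collections import Counter

# A tiny illustrative list of English function words (an assumption of this sketch).
FUNCTION_WORDS = (
    "the", "of", "and", "to", "a", "in", "that", "is",
    "it", "we", "our", "will", "not", "but", "for", "with",
)


def style_profile(text):
    """Relative frequency of each function word in the text."""
    words = re.findall(r"[a-z']+", text.lower())
    total = len(words) or 1  # avoid division by zero on empty input
    counts = Counter(words)
    return [counts[w] / total for w in FUNCTION_WORDS]


def style_similarity(text_a, text_b):
    """Cosine similarity between two style profiles; 0.0 if either has no function words."""
    a, b = style_profile(text_a), style_profile(text_b)
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Invented sample texts, for illustration only.
letter_a = "We will not stop. We will not rest. Our cause is just and our will is strong."
letter_b = "We will not yield. Our people will not forget, and our struggle is not over."
letter_c = "The delivery of the goods to the depot in the north is delayed for a week."

print("a vs b:", round(style_similarity(letter_a, letter_b), 3))
print("a vs c:", round(style_similarity(letter_a, letter_c), 3))
```

A high score between two short letters is weak evidence on its own; it indicates only that the texts lean on the same small words in similar proportions.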

This type of technology is best applied where a large volume of text-based data needs to be read, organized, and analyzed. Investigators who have to sort through large unstructured text data sources, such as e-mail, word processing files, PowerPoint presentations, Excel and Lotus spreadsheets, PDF files, Lotus Notes archives, intranet and Internet server log files, Web pages, chat files, newsgroup files, interrogation scripts, investigation questionnaires, live chat/IRC files, and online news feeds, can benefit from the use of text mining tools and techniques.

For counter-intelligence analysts, documents containing a high level of focused content, such as scientific, technical, and other research documents, are excellent sources for text mining because they are highly informative, which makes it easier to extract only the most important content. These tools can be used to search, sort, and discover key clues in large collections of textual databases. They can also be used by agencies and departments to organize internal investigation-related case files in order to distribute them effectively to field investigators and intelligence analysts.