5.4 Searching for Clues in Aviation Crashes: A Case Study
NASA developed a suite of data mining tools called Perilog, designed to retrieve and organize contextually relevant data from any sequence of terms. Perilog has been used to sort through thousands of narrative reports in order to extract key terms for identifying the root causes of air crashes. The software measures the degree of contextual association between large numbers of term pairs in text, or in any other sequence, and uses those measurements to build models that can be compared for similarity with a query model. It then ranks the results by relevance and presents them in a table.
Perilog was originally designed to support the FAA's Aviation Safety Reporting System (ASRS). The NASA software was used to analyze thousands of aviation accident and incident reports, which typically contain free-form narrative descriptions written by participants such as flight or ground crews, air traffic controllers, and other professionals. The aim was to extract the dominant causes of airline crashes, such as mechanical failure or pilot error, from this large volume of reports. Perilog relies on four methods for text mining:
Keyword-in-context search, which retrieves narratives that contain one or more user-specified keywords in a typical or selected context and ranks the narratives on their relevance to the keyword in context
Flexible, model-based phrase search, which retrieves narratives that contain one or more user-specified phrases and ranks them on their relevance to the phrases
Model-based phrase generation, which produces a list of phrases from documents that contain a user-specified word or group of words
Narrative-based phrase discovery, which finds phrases that are related to topics of interest by generating a list of narratives similar in meaning to the keyword or phrase query
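The first of these methods, keyword-in-context search, can be illustrated with a small sketch. The source does not describe how Perilog actually scores narratives, so the scoring rule here (count the user-chosen context terms that fall within a fixed window around each keyword occurrence) is an assumption, as are the function name kwic_rank and the sample narratives.

```python
import re

def kwic_rank(narratives, keyword, context_terms, window=5):
    """Rank narratives containing `keyword` by nearby context-term hits.

    Assumed scoring rule (not Perilog's): one point for every context term
    found within `window` tokens on either side of a keyword occurrence.
    Returns a list of (score, narrative_index), best match first.
    """
    keyword = keyword.lower()
    context = {t.lower() for t in context_terms}
    scored = []
    for idx, text in enumerate(narratives):
        tokens = re.findall(r"[a-z0-9']+", text.lower())
        if keyword not in tokens:
            continue  # only narratives that contain the keyword are retrieved
        score = 0
        for pos, tok in enumerate(tokens):
            if tok != keyword:
                continue
            lo = max(0, pos - window)
            nearby = tokens[lo:pos] + tokens[pos + 1:pos + 1 + window]
            score += sum(1 for t in nearby if t in context)
        scored.append((score, idx))
    # Highest score first; ties keep the original narrative order.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return scored

# Invented sample narratives, for illustration only.
reports = [
    "The engine failed during climb and the crew declared an emergency.",
    "Engine start was normal. Later, the crew noted fuel imbalance.",
    "Crew briefed the approach.",
]
for score, idx in kwic_rank(reports, "engine", ["failed", "emergency"]):
    print(score, reports[idx])
```

The third narrative is not retrieved at all because it never mentions the keyword, while the second is retrieved but ranked last because the keyword appears outside the selected context.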
Relevance ranking is a process of sorting a list of items so that those likely to be of greater relevance to one's concerns and interests appear closer to the top of the list. Relevance ranking can help an analyst read and interpret very large collections of narratives, reports, and text efficiently. Perilog can be used to sort through thousands of pages and rank pairs of terms by a relational metric value, which is highest when there is a match:
Probe Term    Term in Context    Relational Metric Value
FBI           crash              205
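A table of this kind could be produced along the following lines. This is only a sketch: the source does not give the formula behind Perilog's relational metric, so the code assumes a simple proximity-weighted co-occurrence count in which a context term that follows the probe term more closely contributes more. The names pair_metrics and rank_for_probe and the two sample texts are hypothetical.

```python
import re
from collections import defaultdict

def pair_metrics(texts, window=4):
    """Score ordered term pairs by proximity-weighted co-occurrence.

    Assumed metric (not Perilog's): a context term found `dist` tokens
    after the probe term adds `window - dist + 1` to the pair's score.
    """
    scores = defaultdict(int)
    for text in texts:
        tokens = re.findall(r"[a-z0-9']+", text.lower())
        for i, probe in enumerate(tokens):
            for dist in range(1, window + 1):
                j = i + dist
                if j >= len(tokens):
                    break
                scores[(probe, tokens[j])] += window - dist + 1
    return dict(scores)

def rank_for_probe(metrics, probe):
    """Return (context_term, score) rows for one probe term, best first."""
    rows = [(ctx, val) for (p, ctx), val in metrics.items() if p == probe]
    return sorted(rows, key=lambda r: (-r[1], r[0]))

# Invented sample texts, for illustration only.
metrics = pair_metrics(["FBI crash investigation", "FBI crash report"], window=2)
for term, value in rank_for_probe(metrics, "fbi"):
    print("fbi", term, value)
```

Because the pairs are ordered, the score for (probe, context term) can differ from the score for the reversed pair; a symmetric metric would be an equally reasonable assumption.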
Because Perilog operates on patterned or sequential symbols, data, items, objects, events, causes, time spans, actions, attributes, entities, relations, and representations, it can search any type of information repository, not just text. In particular, this NASA-developed software can perform smart retrieval of sound, voice, or audio data, which makes it a strong candidate for context search and retrieval in the investigative monitoring and analysis of multimedia. NASA is looking for a commercial developer to bring the government-developed software to market.
5.5 Clustering News Stories: A Case Study
We all have access to lots of information, but are seldom in a position to exploit it effectively for decision making. In times of crisis, this problem can be especially severe. Imagine you are a senior analyst besieged with news and intelligence reports of a hostage situation at an American embassy. Who is in charge of the terrorists? Is their group likely to attack other embassies?
How can computers help this process, which relies so critically on collective human understanding and insight, in the midst of the furor of a crisis? Genoa, a project of DARPA, is aimed at improving analysis and decision making in crisis situations by providing tools that allow analysts to collaborate in developing structured arguments in support of particular conclusions and to help predict likely future scenarios. Genoa also provides knowledge discovery tools to mine the information in these sources for important patterns, trends, and anomalies to discover nuggets of valuable information.
One of the challenges Genoa faces is to make it easy for analysts to take knowledge gleaned with these discovery tools and embed it, in a concise and useful form, in an intelligence product as evidence in support of structured arguments. The MITRE Corporation developed a console of text mining components that allows the analyst to select tools from a menu and, with just a few mouse clicks, assemble them into a complex filter that fulfills whatever information discovery function is currently needed. A filter here is a tool that takes input information and turns it into some more abstract and useful representation. Filters can also weed out irrelevant parts of the input information.
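The idea of assembling simple filters into a more complex one can be sketched as plain function composition. Nothing here reflects the actual design of the MITRE console; the helper compose and the two toy filters drop_short and headline are hypothetical stand-ins chosen only to show one filter weeding out input and another abstracting it.

```python
from functools import reduce

def compose(*filters):
    """Chain filters left to right: each filter's output feeds the next."""
    def pipeline(data):
        return reduce(lambda acc, f: f(acc), filters, data)
    return pipeline

def drop_short(docs, min_words=3):
    """Toy filter that weeds out documents with fewer than `min_words` words."""
    return [d for d in docs if len(d.split()) >= min_words]

def headline(docs):
    """Toy filter that abstracts each document to its first sentence."""
    return [d.split(".")[0].strip() for d in docs]

# Invented input, for illustration only.
docs = ["Embassy seized. Talks continue.", "No news"]
brief = compose(drop_short, headline)
print(brief(docs))  # ['Embassy seized']
```

Treating every filter as a function from a document collection to a document collection is what makes arbitrary chaining possible; a filter that changes the data type, such as one producing a visualization, would have to come last.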
For example, in response to the crisis situation discussed earlier, an analyst might use these mining tools to discover important nuggets of information in a large collection of news sources. This use of data mining tools can be illustrated by looking at TopCat, a MITRE-developed system that identifies different topics in a collection of documents and displays the key players for each topic. TopCat uses association rule mining technology to identify correlations among people, organizations, locations, and events, shown in Figure 5.1 in different shades and boxes. Clustering these correlations creates topics like the three in the following figure, built from six months of global news from several print, radio, and video sources—over 60,000 news stories in all.
Figure 5.1: Topics derived from clustering 60,000 news reports.
This allows the analyst to discover, say, an association between people involved in a bombing incident, which gives a starting point for further analysis (e.g., do McVeigh and Nichols belong to a common organization?). This, in turn, can lead to new knowledge that can be leveraged in the analytical model used to help predict whether this terrorist organization is likely to strike elsewhere in the next few days. Similarly, the third topic reveals the important players in an election in Cambodia. This discovered information can be leveraged to help predict whether the situation in Cambodia is going to explode into a crisis that affects U.S. interests.
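The association step behind such topics can be illustrated with a minimal sketch. It assumes each news story has already been reduced to a set of named entities and uses the standard support and confidence measures from association rule mining; it is not TopCat's implementation, it covers only pairwise rules, and it omits the clustering of rules into topics. The function name, thresholds, and placeholder entity names are all hypothetical.

```python
from collections import Counter
from itertools import combinations

def entity_associations(stories, min_support=2, min_confidence=0.6):
    """Find pairwise rules lhs -> rhs among entities that co-occur in stories.

    stories: list of sets of entity names, one set per story.
    support: number of stories mentioning both entities.
    confidence: support divided by the number of stories mentioning lhs.
    Returns (lhs, rhs, support, confidence) tuples, strongest first.
    """
    single = Counter()
    pair = Counter()
    for entities in stories:
        single.update(entities)
        pair.update(combinations(sorted(entities), 2))
    rules = []
    for (a, b), support in pair.items():
        if support < min_support:
            continue
        for lhs, rhs in ((a, b), (b, a)):
            confidence = support / single[lhs]
            if confidence >= min_confidence:
                rules.append((lhs, rhs, support, round(confidence, 2)))
    return sorted(rules, key=lambda r: (-r[2], -r[3], r[0], r[1]))

# Placeholder entities, invented for illustration only.
stories = [
    {"PersonA", "PersonB", "CityX"},
    {"PersonA", "PersonB"},
    {"PersonA", "CityY"},
]
for rule in entity_associations(stories):
    print(rule)
```

In this toy data every story that mentions PersonB also mentions PersonA, so that rule has confidence 1.0, while the reverse rule is weaker because PersonA also appears in a story without PersonB.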
Now, suppose an analyst wants to know more about the people in the last topic. Instead of reading more than 6,000 words of text from 10 articles on the topic, the analyst can compose a topic detection filter like TopCat with a biographical summarization filter that gleans facts about key persons from the topic's articles. The result of the composition is a short, 86-word summary, shown in Figure 5.2.
Figure 5.2: An 86-word summary of the news stories.
This summarization filter, developed under DARPA funding, identifies and aggregates descriptions of people from a collection of documents by means of an efficient syntactic analysis, the use of a thesaurus, and some simple natural language generation techniques. It also extracts from these documents salient sentences related to these people, weighting sentences by the presence of people's names, by the location and proximity of terms in a document, and by term frequency, among other things.
Together, TopCat and the summarization filter continuously collect broadcast news, extract named entities and keywords, and identify the transcripts and sentences that contain them. The summarization filter includes a parameter to specify the target length or the reduction rate, allowing summaries of different lengths to be generated. For example, allowing a longer summary would mean that facts about other people (e.g., Pol Pot) would also appear in the summary.
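A rough sketch of sentence weighting with a target length follows. It is not the DARPA-funded filter: the weights, the function name summarize, the max_words parameter, and the sample text are all assumptions, and the sketch leaves out the syntactic analysis, thesaurus, and language generation mentioned above. It only shows how name presence, term frequency, and sentence position can be combined into a score and how a length budget changes what is kept.

```python
import re
from collections import Counter

def summarize(text, names, max_words=30):
    """Pick the highest-scoring sentences that fit within `max_words`.

    Assumed score per sentence: 2 points for each tracked name it contains,
    plus the average document frequency of its words, plus a bonus that
    favors sentences near the start of the text.
    """
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    freq = Counter(re.findall(r"[a-z']+", text.lower()))
    scored = []
    for pos, sent in enumerate(sentences):
        toks = re.findall(r"[a-z']+", sent.lower())
        if not toks:
            continue
        name_hits = sum(1 for n in names if n.lower() in sent.lower())
        avg_freq = sum(freq[t] for t in toks) / len(toks)
        position_bonus = 1.0 / (pos + 1)
        scored.append((2.0 * name_hits + avg_freq + position_bonus, pos, sent))
    chosen, used = [], 0
    # Greedy selection: take sentences in score order, skipping any that
    # would push the summary past the word budget.
    for _, pos, sent in sorted(scored, key=lambda s: (-s[0], s[1])):
        n = len(sent.split())
        if used + n <= max_words:
            chosen.append((pos, sent))
            used += n
    # Emit the chosen sentences in their original document order.
    return " ".join(sent for _, sent in sorted(chosen))

# Invented text about a placeholder person, for illustration only.
text = ("Alpha leads the ruling party. The weather was mild. "
        "Voters queued early in the capital. "
        "Alpha addressed supporters after the vote.")
print(summarize(text, ["Alpha"], max_words=5))
print(summarize(text, ["Alpha"], max_words=11))
```

Raising max_words from 5 to 11 lets a second sentence about the tracked person into the summary, which mirrors the way a longer target length admits additional facts.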
This example illustrates how mining a text collection using a composed summarization filter can reveal important associations at varying levels of detail. The component-based text mining console allows these filters to be integrated easily into intelligence products such as reports and briefings. To help analysts present structured arguments and supporting information to decision makers, Genoa provides an electronic notebook briefing tool. Summarization filters can be associated with regions on a page in a briefing book that can be shared across a community of collaborating analysts. When a document or a folder of documents is dropped onto a region associated with a filter, the filter applies, and the textual summary or visualization appears in that region (http://www.mitre.org).