5.5 Clustering News Stories: A Case Study



We all have access to lots of information, but are seldom in a position to exploit it effectively for decision making. In times of crisis, this problem can be especially severe. Imagine you are a senior analyst besieged with news and intelligence reports of a hostage situation at an American embassy. Who is in charge of the terrorists? Is their group likely to attack other embassies?

How can computers help this process, which relies so critically on collective human understanding and insight, in the midst of the furor of a crisis? Genoa, a project of DARPA, is aimed at improving analysis and decision making in crisis situations by providing tools that allow analysts to collaborate in developing structured arguments in support of particular conclusions and to help predict likely future scenarios. Genoa also provides knowledge discovery tools to mine the information in these sources for important patterns, trends, and anomalies to discover nuggets of valuable information.

One of the challenges Genoa faces is making it easy for analysts to take knowledge gleaned with these discovery tools and embed it, in a concise and useful form, in an intelligence product as evidence supporting structured arguments. The MITRE Corporation developed a console of text mining components that lets the analyst select tools from a menu and, with just a few mouse clicks, assemble them into a complex filter that fulfills whatever information discovery function is currently needed. A filter here is a tool that takes input information and turns it into a more abstract and useful representation. Filters can also weed out irrelevant parts of the input.
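The idea of assembling simple filters into a composite one can be sketched in a few lines of Python. This is only an illustration of the concept, not the MITRE console: the `Filter` type, `compose`, `keep_mentions`, and `first_sentence` are hypothetical names, and the sample reports are invented.

```python
from functools import reduce
from typing import Callable, List

# Assumption: a filter maps a list of text items to a list of text items.
Filter = Callable[[List[str]], List[str]]

def compose(*filters: Filter) -> Filter:
    """Chain filters left to right into one composite filter."""
    def composite(items: List[str]) -> List[str]:
        return reduce(lambda acc, f: f(acc), filters, items)
    return composite

def keep_mentions(keyword: str) -> Filter:
    """Hypothetical relevance filter: drop items that do not mention keyword."""
    def apply(items: List[str]) -> List[str]:
        return [text for text in items if keyword.lower() in text.lower()]
    return apply

def first_sentence(items: List[str]) -> List[str]:
    """Hypothetical abstraction filter: reduce each item to its first sentence."""
    return [text.split(". ")[0].rstrip(".") + "." for text in items]

# Invented sample input, for illustration only.
reports = [
    "The embassy was evacuated at dawn. Staff were moved to a secure site.",
    "Markets closed higher on Tuesday. Analysts cited strong earnings.",
]

embassy_brief = compose(keep_mentions("embassy"), first_sentence)
print(embassy_brief(reports))
```

Because every filter shares the same input and output shape, any filter can be placed before or after any other, which is what makes menu-driven assembly possible.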

For example, in response to the crisis situation discussed earlier, an analyst might use these mining tools to discover important nuggets of information in a large collection of news sources. This use of data mining tools can be illustrated by looking at TopCat, a MITRE-developed system that identifies the topics in a collection of documents and displays the key players for each topic. TopCat uses association rule mining technology to identify correlations among people, organizations, locations, and events, which Figure 5.1 shows in different shades and boxes. Clustering these correlations creates topics like the three in Figure 5.1, built from six months of global news from several print, radio, and video sources (over 60,000 news stories in all).

Figure 5.1: Topics derived from clustering 60,000 news reports.

This allows the analyst to discover, say, an association between people involved in a bombing incident, which gives a starting point for further analysis (e.g., do McVeigh and Nichols belong to a common organization?). This, in turn, can lead to new knowledge that can be leveraged in the analytical model used to help predict whether this terrorist organization is likely to strike elsewhere in the next few days. Similarly, the third topic reveals the important players in an election in Cambodia. This discovered information can be leveraged to help predict whether the situation in Cambodia is going to explode into a crisis that affects U.S. interests.
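The correlation step can be approximated with a toy example. The sketch below assumes named entities have already been extracted for each story, mines pairwise association rules by support and confidence, and then groups linked entities into topics using connected components. The grouping is a simple stand-in for clustering, not TopCat's actual method, and the function names, thresholds, and sample stories are all invented for illustration.

```python
from collections import Counter
from itertools import combinations

def entity_associations(stories, min_support=2, min_confidence=0.6):
    """Return sorted (lhs, rhs, support, confidence) rules over entity sets."""
    item_counts, pair_counts = Counter(), Counter()
    for entities in stories:
        unique = sorted(set(entities))
        item_counts.update(unique)
        pair_counts.update(combinations(unique, 2))
    rules = []
    for (a, b), support in pair_counts.items():
        if support < min_support:
            continue
        for lhs, rhs in ((a, b), (b, a)):
            confidence = support / item_counts[lhs]
            if confidence >= min_confidence:
                rules.append((lhs, rhs, support, confidence))
    return sorted(rules)

def topics_from_rules(rules):
    """Group entities linked by any rule (connected components of the rule graph)."""
    neighbors = {}
    for lhs, rhs, _, _ in rules:
        neighbors.setdefault(lhs, set()).add(rhs)
        neighbors.setdefault(rhs, set()).add(lhs)
    seen, topics = set(), []
    for start in sorted(neighbors):
        if start in seen:
            continue
        stack, component = [start], set()
        while stack:
            node = stack.pop()
            if node not in component:
                component.add(node)
                stack.extend(neighbors[node] - component)
        seen |= component
        topics.append(sorted(component))
    return topics

# Invented entity lists, one per story, for illustration only.
stories = [
    ["McVeigh", "Nichols", "bombing"],
    ["McVeigh", "Nichols"],
    ["McVeigh", "bombing"],
    ["Cambodia", "election"],
    ["Cambodia", "election", "Pol Pot"],
]

rules = entity_associations(stories)
print(topics_from_rules(rules))
```

With these thresholds, an entity that appears in only one story forms no rule and so joins no topic. Lowering `min_support` would pull such entities in, at the cost of noisier topics.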

Now, suppose an analyst wants to know more about the people in the last topic. Instead of reading more than 6,000 words of text from 10 articles on the topic, the analyst can compose a topic detection filter like TopCat with a biographical summarization filter that gleans facts about key persons from the topic's articles. The result of the composition is a short, 86-word summary, shown in Figure 5.2.

Figure 5.2: An 86-word summary of the news stories.

This summarization filter, developed under DARPA funding, identifies and aggregates descriptions of people from a collection of documents by means of an efficient syntactic analysis, the use of a thesaurus, and some simple natural language generation techniques. It also extracts salient sentences about these people from the documents, weighting each sentence by, among other things, the presence of people's names and the location, proximity, and frequency of terms in the document.
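The sentence-weighting idea can be illustrated with a minimal sketch. It scores each sentence on three of the cues named above: mentions of people's names, position in the document, and term frequency. The weights, the scoring formula, and the function name are assumptions made for illustration. Term proximity is not modeled, and this is not the DARPA-funded filter's actual algorithm.

```python
import re
from collections import Counter

def score_sentences(sentences, person_names, name_weight=2.0, position_weight=1.0):
    """Return (score, sentence) pairs, highest score first."""
    tokenized = [re.findall(r"[a-z']+", s.lower()) for s in sentences]
    term_freq = Counter(word for words in tokenized for word in words)
    total_terms = sum(term_freq.values()) or 1
    scored = []
    for index, (sentence, words) in enumerate(zip(sentences, tokenized)):
        lowered = sentence.lower()
        # Cue 1: how many of the tracked names appear in the sentence.
        name_hits = sum(1 for name in person_names if name.lower() in lowered)
        # Cue 2: earlier sentences get a larger bonus (assumed 1/rank decay).
        position_bonus = position_weight / (index + 1)
        # Cue 3: mean relative frequency of the sentence's terms.
        frequency = sum(term_freq[w] for w in words) / (total_terms * max(len(words), 1))
        scored.append((name_weight * name_hits + position_bonus + frequency, sentence))
    return sorted(scored, key=lambda pair: pair[0], reverse=True)

# Invented sentences, for illustration only.
sentences = [
    "The election was held on Sunday.",
    "Pol Pot was not a candidate.",
    "Turnout was high.",
]

ranked = score_sentences(sentences, person_names=["Pol Pot"])
for score, sentence in ranked:
    print(f"{score:.2f}  {sentence}")
```

The name cue is weighted most heavily here so that a sentence about a tracked person outranks an earlier sentence that mentions nobody. A real system would tune such weights against data.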

Together, TopCat and the summarization filter continuously collect broadcast news, extract named entities and keywords, and identify the transcripts and sentences that contain them. The summarization filter includes a parameter that specifies either the target length or the reduction rate, so summaries of different lengths can be generated. For example, allowing a longer summary would mean that facts about other people (e.g., Pol Pot) would also appear in the summary.
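The length parameter can be sketched as a word budget applied to sentences that have already been scored. In this illustration, `reduction_rate` is assumed to mean the fraction of the source length the summary may occupy. The source does not define the term, and the function name, the greedy selection strategy, and the sample scores are likewise hypothetical.

```python
def summarize(scored, target_words=None, reduction_rate=None):
    """Pick the highest-scoring sentences that fit the word budget.

    scored: list of (score, sentence) pairs in document order.
    Exactly one of target_words or reduction_rate must be given.
    """
    if (target_words is None) == (reduction_rate is None):
        raise ValueError("specify exactly one of target_words or reduction_rate")
    lengths = [len(sentence.split()) for _, sentence in scored]
    if target_words is not None:
        budget = target_words
    else:
        # Assumption: reduction_rate is the fraction of source words to keep.
        budget = int(sum(lengths) * reduction_rate)
    by_score = sorted(range(len(scored)), key=lambda i: scored[i][0], reverse=True)
    chosen, used = [], 0
    for i in by_score:
        if used + lengths[i] <= budget:
            chosen.append(i)
            used += lengths[i]
    # Restore document order so the summary reads naturally.
    return " ".join(scored[i][1] for i in sorted(chosen))

# Invented scores and sentences, for illustration only.
scored = [
    (3.0, "Candidate A leads the ruling party."),
    (1.0, "Voting took place on Sunday."),
    (2.0, "Pol Pot was mentioned in two reports."),
]

print(summarize(scored, target_words=6))
print(summarize(scored, target_words=13))
```

Raising the budget from 6 to 13 words admits the next-ranked sentence, which is how a longer summary comes to include facts about additional people.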

This example illustrates how mining a text collection with a composed summarization filter can reveal important associations at varying levels of detail. The component-based text mining console allows these filters to be integrated easily into intelligence products such as reports and briefings. To help analysts present structured arguments and supporting information to decision makers, Genoa provides an electronic notebook briefing tool. Summarization filters can be associated with regions on a page of a briefing book that is shared across a community of collaborating analysts. When a document or a folder of documents is dropped onto a region associated with a filter, the filter is applied, and the textual summary or visualization appears in that region (http://www.mitre.org).




Investigative Data Mining for Security and Criminal Detection
ISBN: 0750676132
Year: 2005
Pages: 232
Authors: Jesus Mena
