DATA MINING FOR WHAT?

The use of data mining for information warfare (IW) is growing rapidly. The number of data-mining consultants, as well as the number of commercial tools available to the "nonexpert" user, is also increasing quickly. It is becoming easier than ever to collect datasets and apply data-mining tools to them. As more and more nonexperts seek to exploit this technology to help with their business, it becomes increasingly important that they understand the underlying assumptions and biases of these tools. There are a number of factors to consider before applying IW data mining to a database. In particular, there are important issues regarding the data that should be examined before proceeding with the data-mining process. Although these issues may be well known to the data-mining expert, the nonexpert is often unaware of their importance.

Now let's focus on three specific issues. Each issue is illustrated through the use of brief examples. Insight is also provided into how each issue might be problematic, and suggestions are made on which techniques can be used for approaching such situations.

The purpose here is to help the nonexpert in IW data mining better understand some of the important issues of the field. Particular attention is also paid to characteristics of the data that may affect the overall usefulness of the IW data-mining results. Some recent experiences, and the lessons learned from them, are described. These lessons, together with the accompanying discussion, will help both to guide the IW data-collection process and to clarify what kinds of results to expect.

One cannot blindly "plug-and-play" in IW data mining. There are a number of factors to consider before applying data mining to any particular database. This general warning is not new. Many of these issues are well known to both data-mining experts and a growing body of nonexpert data "owners." For instance, the data should be "clean," with consistent values across records and containing as few errors as possible. There should not be a large number of missing or incomplete records or fields. It should be possible to represent the data in the appropriate syntax for the required data-mining tool (attribute/value pairs).
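As a simple illustration of what such an attribute/value representation might look like, a single flight record could be encoded along the following lines (this Python sketch is purely illustrative; the field names and values are hypothetical and not drawn from any of the databases discussed here):

    # A hypothetical attribute/value representation of a single flight record.
    # Each record is a set of attribute/value pairs sharing a consistent schema.
    flight_record = {
        "aircraft_make": "Cessna",   # categorical attribute
        "pilot_hours": 350,          # numeric attribute
        "weather": "VMC",            # categorical attribute
        "time_of_day": "night",      # categorical attribute
    }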

As previously mentioned, this part of the chapter will discuss three specific, but less well-known, issues. Each will be illustrated through real-world experiences. The first is the impact of data distribution. Many IW data-mining techniques perform class or group discrimination, and rely on the data containing representative samples of all relevant classes. Sometimes, however, obtaining samples of all classes is surprisingly difficult. The second issue is one of applicability and data relevance. High-quality data, combined with good data-mining tools, does not ensure that the results can be applied to the desired goal. Finally, this part of the chapter will discuss some of the issues associated with using text (narrative fields in reports) in data mining. The current technology cannot fully exploit arbitrary text, but there are certain ways text can be used.

These three issues are not new to the field. Indeed, for many IW data-mining experts, these are important issues that are often well understood. For the nonexpert, however, these issues can be subtle or appear deceivingly simple or unimportant. It is tempting to collect a large amount of clean data, massage the representation into the proper format, hand the data tape to the consultant, and expect answers to the most pressing business questions. Although this part of the chapter does not describe all of the potential problems one might face, it does describe some important issues, illustrate why they might be problematic, and suggest ways to effectively deal with these situations.

Two Examples

The discussion of data distribution, information relevance, and use of text will be illustrated with examples from two current projects. The first involves a project with the Center for Advanced Aviation Systems Development (CAASD) in the domain of aviation safety. In this project, one of the primary goals is to help identify and characterize precursors to potentially dangerous situations in the aviation world. One particular way to do this is to mine accident and incident reports involving aircraft for patterns that identify common precursors to dangerous situations. For any type of flight—commercial, cargo, military, or pleasure—accidents (and often less serious incidents) are investigated. A report is filed containing a variety of information such as time of day, type of aircraft, weather, pilot age, and pilot experience. These reports often include the inspector’s written summary. One task involves using collections of these reports to try to identify and characterize those situations in which accidents occur. A source of such reports is the National Transportation Safety Board (NTSB).

The second project involves targeting vehicles for law enforcement. In this particular instance, vehicles (mostly passenger vehicles and small trucks) arrive at an inspection stop. At this primary stop, a brief inspection is conducted to decide if further examination is necessary. There is typically a constant flow of cars to be processed, so excessive time cannot be taken. This first inspection typically takes 20 to 30 seconds. If the primary inspector feels it is warranted (and there are any number of reasons that justify this), any vehicle can be pulled out for secondary inspection. This secondary inspection and background check is more thorough. If the driver/vehicle is found to be in violation of the particular laws under consideration, then various information concerning both driver and vehicle is collected and entered into the “violators” database. The goal of this project is to find a way to better profile these violating drivers and vehicles, so that the primary inspectors can more accurately identify likely suspects and send them for secondary inspection.

Data Distribution

Let’s first discuss the issue of data distribution. Of particular concern is the situation in which the data lacks certain types of examples. Consider the aviation safety domain. One goal of the project in this domain is to characterize situations that result in accident flights. An obvious source of information is the NTSB’s database of accident reports.

Note 

This database does not contain records about uneventful flights (the NTSB is an accident investigation agency). That is, the data are extremely unevenly distributed: there are many records of accident flights and none of uneventful flights.

This lack of reports about uneventful flights has important consequences for a significant class of data-mining techniques. When given data containing only accident flights, each of the approaches in this class concludes that all flights contain accidents. Such a hypothesis is clearly incorrect; the majority of flights are uneventful. It is also not useful, because it offers no new insight on how to differentiate the accident flights from the uneventful ones. Furthermore, some of the most popular IW data-mining tools, including decision tree inducers, neural networks, and nearest neighbor algorithms, fall into this class of techniques. (They assume that the absence of uneventful flights in the data implies that such flights do not exist in the world.)

To continue this discussion, it is necessary to first define some terms used in data mining. The “target concept” is that concept you are trying to learn. In the aviation domain, the target concept is accident flights. Consequently, each example of an accident flight (each accident report in the database) is called a member of the target concept, and each uneventful flight is a nonmember of the target concept. The NTSB data do not contain records of uneventful flights. That is, there are no descriptions of nonmembers of the target concept. The problem of learning to differentiate members from nonmembers is called a “supervised concept learning problem.”

Note 

It is called supervised because each example in the data contains a label indicating its membership status for the target concept.

A supervised concept learner takes a training sample as input. A training sample is a list of examples, labeled as members or nonmembers, that is assumed to be representative of the whole universe. The supervised concept learner produces hypotheses that discriminate the members from the nonmembers in the sample. Many IW data-mining tools use supervised concept learners to find patterns.

A supervised concept learner makes the closed-world assumption when it treats the absence of nonmembers in the data as implying that they do not exist in the universe. Why do some of the popular learners make the closed-world assumption? The case of decision tree learners provides a good illustration. These learners partition the training sample into pure subsamples, containing either all members or all nonmembers. The partitioning of the training sample drives the rule generation. That is, the learners introduce conditions that define partitions of the training sample; each outcome of a condition represents a different subsample. Ultimately, the conditions will become part of the discrimination rules. Unfortunately, if the input sample contains only members of the target class, the training sample is already pure and the decision tree learner has no need to break up the sample further. As a consequence, the rules commit to classifying all new data as members of the target class before conducting any tests. Thus, in the aviation project, all flights would be classified as accident flights, because the learner never saw any uneventful flights. This is not to say that learners employing the closed-world assumption are inappropriate in all, or even most, situations. For many problems, when representative data from all the concepts involved is available, these learners are both effective and efficient.
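The behavior described above can be demonstrated with a minimal sketch. Assuming the scikit-learn library and invented feature values (both are assumptions for illustration, not part of the aviation project), a decision tree fitted to a sample containing only members of the target concept will classify every new example as a member:

    # Sketch: a decision tree trained on one-class data makes the closed-world
    # assumption and labels every new example as a member of the target concept.
    # Assumes scikit-learn; the feature values below are invented.
    from sklearn.tree import DecisionTreeClassifier

    # Training sample: accident flights only (label 1 = "accident").
    # Features: [pilot_hours, night_flight (0 or 1)]
    X_train = [[120, 1], [350, 0], [80, 1], [2000, 0]]
    y_train = [1, 1, 1, 1]  # no uneventful flights in the data

    tree = DecisionTreeClassifier().fit(X_train, y_train)

    # New flights, including ones that were in fact uneventful, are all
    # classified as accidents because the learner never saw a nonmember.
    print(tree.predict([[5000, 0], [10, 1]]))  # -> [1 1]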

Applicability and Relevance of Data

Even when collected data is of high quality (clean, few missing values, proper form, etc.) and the IW data-mining algorithms can be successfully run, there still may be a problem of relevance. It must be possible to apply the new information to the situation at hand. For instance, if the data mining produces typical "if...then..." rules, then it must be possible to measure the values of the attributes in the condition (the "if" part) of those rules. The information about those conditions must be available at the time the rules will be used. Consider a simple example where the goal is to predict if a dog is likely to bite. Assume data are collected on the internal anatomy of various dogs, and each dog is labeled by its owner as either "likely" or "unlikely" to bite. Assume further that the data-mining tools work splendidly, and it is discovered that the following (admittedly contrived) rules apply:

  • Rule 1: If the rear molars of the dog are worn, the dog is unlikely to bite.

  • Rule 2: If the mandibular muscles are over-developed, the dog is likely to bite.

These may seem like excellent rules. However, if faced with a strange, angry dog late at night, these rules would be of little help in deciding whether you are in danger. There are two reasons for this: First, there is a time constraint in applying the rules. There are only a few seconds to check if these rules apply. Second, even without such a constraint, the average person probably can’t make judgments about molar wear and muscle development. The lesson here is that just because data are collected about biting (and nonbiting) dogs, it does not mean one can predict whether a dog will bite in every situation.
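The point can be made concrete with a small sketch: a rule simply cannot fire when its condition refers to an attribute that is not observable at decision time. The rule encoding and attribute names below are invented for illustration:

    # Sketch: rules are only usable if their conditions can be evaluated from
    # the attributes actually observable when a decision must be made.
    rules = [
        {"attribute": "rear_molars_worn", "value": True, "conclusion": "unlikely to bite"},
        {"attribute": "mandibular_muscles_overdeveloped", "value": True, "conclusion": "likely to bite"},
    ]

    # What a person can actually observe about a strange dog late at night:
    observed = {"size": "large", "barking": True}

    for rule in rules:
        if rule["attribute"] not in observed:
            print("Cannot apply rule:", rule["attribute"], "is not observable here")
        elif observed[rule["attribute"]] == rule["value"]:
            print("Conclusion:", rule["conclusion"])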

In the vehicle-targeting task described earlier, a similar situation occurred. The initial goal was very specific: develop a set of rules, a profile, that the primary inspectors could use to determine which vehicles to pull out for secondary inspection. As mentioned, much more information is collected concerning actual violators than for those that are just passed through the checkpoint. Thus, the initial goal was to profile likely violation suspects based on the wealth of information about that group. The problem, noticed before any analysis was done, was that the information that would make up the profiles would not be applicable to the desired task. As mentioned, the primary inspectors have only a short amount of time to decide whether a particular vehicle should be pulled out for secondary inspection. During that time, they have access to only superficial information. That is, the primary inspectors don't have quick access to much of the background knowledge concerning the driver and vehicle. Yet this is precisely the knowledge, collected during seizures, that was initially chosen to build the profiles. Thus, they have no way to apply classification rules that measure features such as "number of other cars owned," "bad credit history," or "known to associate with felons" (types of data collected on violation vehicles and drivers).

The problem here is not that the data is “bad,” or even that the data is all from the target concept. The problem is that the data cannot be applied to the initially specified task. How does this situation come about in general? The answer involves a fairly common situation. Often, IW data mining begins with data that has been previously collected, usually for some other purpose. The assumption is made that since the collected data is in the same general domain as the current problem, it must be usable to solve this problem. As the examples show, this is often not the case. In the vehicle-targeting task, the nature of the law enforcement system is such that a great deal of information is collected and recorded on violators. No one ever intended to use this information as a screening tool at stop points. Thus, it is important to understand the purpose for which a set of data was collected. Does it address the current situation directly? Similarly, when data is collected for the specific task at hand, careful thought must go into collecting the relevant data.

There are two primary ways to address this problem of data irrelevancy. The most obvious is to use additional data from another source. It may be that different data already exists to address the primary question. For instance, returning to the dogs example, general aggressiveness characteristics for different breeds of dogs have been determined. Using this data, rather than the original data, deciding how likely a dog is to bite is reduced to the problem of determining its breed (often done by quick visual inspection). When the necessary data does not already exist, it may be necessary to collect it. Some of this data collection will likely take place in the vehicle-targeting project. In this case, data must be collected that relates directly to the information available to the inspectors at the initial inspection. For example, the demeanor of the driver may be an important feature. Of course, collecting new data may be a very expensive process. First, the proper attributes to collect must be determined. This often involves discussions and interviews with experts in the field. Then the actual data-collection process may be quite costly. It may be that an inordinate amount of manpower is required, or that certain features are difficult to measure.

If additional data cannot be obtained, there is another, often less desirable way to address this issue. It may be possible to alter the initial goals or questions. This will clearly require problem-specific domain expertise to address a few simple questions: Is there another way to address the same issue? Is there another relevant issue that can be addressed directly with this data? In the vehicle-targeting domain, only those attributes that were directly accessible to the inspector were used. A good example would be looking at simple statistical patterns for time of day, weather, season, and holidays. This is not a deep analysis and doesn’t quite “profile” likely violators, but it makes progress toward the initial goal. Another alternative is to use the violator database to profile suspects for other situations. It may be that profiles of certain types of violators bear similarities to other criminal types. Perhaps this information can be used elsewhere in law enforcement. Admittedly, this latter solution does not address the initial issue: helping the primary inspectors decide who to pull out for secondary inspection. However, it may not be possible to achieve that goal with this data and the given time constraints. It is important to understand this potential limitation early in the process, before a great deal of time, effort, and money has been invested.
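As a rough sketch of the kind of simple statistical pattern mentioned above, violation counts can be tallied against an attribute the primary inspector can observe directly, such as the hour of day. The records below are invented for illustration; a real analysis would also need counts of nonviolating traffic at each hour to turn these tallies into rates:

    # Sketch: tallying violations by an attribute the primary inspector can
    # actually observe (hour of day). The records below are invented.
    from collections import Counter

    violator_records = [
        {"hour": 23, "weather": "clear"},
        {"hour": 2,  "weather": "rain"},
        {"hour": 23, "weather": "clear"},
        {"hour": 14, "weather": "clear"},
    ]

    by_hour = Counter(record["hour"] for record in violator_records)
    for hour, count in sorted(by_hour.items()):
        print("hour %02d: %d violations" % (hour, count))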

Combining Text and Structured Data

IW data mining is most often performed on data that is highly structured. Highly structured data have a finite, well-defined set of possible values, as is most often seen in databases. An example of structured data is a database containing records describing aircraft accidents that includes fields such as the make of an airplane and the number of hours flown by the pilot. Another source of valuable yet often unused information is unstructured text. Although text is more difficult to use immediately than structured data, data mining should make use of these available text resources.

Text is often not used during IW data mining because it requires a preprocessing step before it can be used by available tools such as decision trees, association rule methods, or clustering. These techniques require structured fields with clearly defined sets of possible values that can be quickly counted and matched. Such techniques sometimes also assume that values are ordered and have well-defined distances between them. Text is not so well behaved. A word may have multiple meanings depending on context (polysemy), multiple words may mean the same thing (synonymy), or words may be related as more general and more specific terms (hypernymy). These are difficult issues that are not yet fully solved, but useful progress has been made and techniques have been developed so that text can be considered a resource for data mining.

One way to exploit text, borrowed from information retrieval, is to use a vector-space approach. Information retrieval is concerned with methods for efficiently retrieving documents relevant to a given request or query. The standard method for doing this is to build weight vectors describing each document and then compare the document vector to the query vector. More specifically, this method first identifies all the unique words in the document collection. Then this list of words is used to build vectors of words and associated weights for the query and each of the documents. Using the simplest weighting method, this vector has a value of 1 at position x when the xth vocabulary word is present in the document; otherwise it has a value of 0. Every document and query is now described by a vector of length equal to the size of the vocabulary. Each document can now be compared to every other document by comparing their word vectors. A cosine-similarity measure (the normalized dot product of two vectors, which reflects the angle between them) then provides a measure of similarity between the two corresponding documents. Surprisingly, although this approach discards the structure in the text and ignores the problems of polysemy and synonymy altogether, it has been found to be a simple, fast baseline for identifying relevant documents.
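A minimal sketch of this vector-space comparison, using the simple 0/1 weighting just described, might look like the following (the narratives are invented stand-ins for report text, not actual NTSB records):

    # Sketch of the vector-space approach with 0/1 weights: build a vocabulary,
    # turn each document into a binary word vector, and compare vectors with
    # cosine similarity. The narratives are invented stand-ins for report text.
    import math

    docs = [
        "airplane veered to the left during takeoff roll",
        "pilot added full power and the airplane veered left",
        "engine lost oil pressure during cruise flight",
    ]

    vocab = sorted({word for doc in docs for word in doc.split()})

    def to_vector(doc):
        words = set(doc.split())
        return [1 if term in words else 0 for term in vocab]

    def cosine(u, v):
        dot = sum(a * b for a, b in zip(u, v))
        norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
        return dot / norm if norm else 0.0

    vectors = [to_vector(doc) for doc in docs]
    print(cosine(vectors[0], vectors[1]))  # higher: both describe veering left
    print(cosine(vectors[0], vectors[2]))  # lower: unrelated narratives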

A variant of this vector-space approach was used on the airline safety data to identify similar accidents based on a textual description of the flight history. The narrative description of each accident was represented as a vector and compared to all other narratives using the approach described earlier. One group of accidents identified by this technique can be described as planes that were “veering to the left during takeoff.” The following accident reports were found to be similar in this respect:

  • MIA01LA055: During takeoff roll he or she applied normal right rudder to compensate for engine torque. The airplane did not respond to the pilot input and drifted to the left.

  • ANC00LA099: Veered to the left during the first attempt to take off.

  • ANC00LA041: Pilot added full power and the airplane veered to the left.

Identifying this kind of group would be difficult using fixed fields alone. This technique can also be used to find all previous reports similar to a given accident, or to find records with a certain combination of words. This can be a useful tool for identifying patterns in the flight history of an accident so that the events leading up to different accidents can be more clearly identified.

The information stored in text can be extracted in other ways as well. For example, a collection of documents can be combined with a taxonomy of terms so that word or category associations can be calculated. In the airline safety domain, this approach could be used to calculate, for example, which class of mechanical malfunctions occurred most often in winter weather.
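A rough sketch of such an association count, assuming a small hand-built taxonomy that maps terms to categories, might look like this (the taxonomy and narratives below are invented, and real use would require proper tokenization rather than substring matching):

    # Sketch: counting which mechanical-malfunction terms co-occur with
    # winter-weather terms in accident narratives. The taxonomy and
    # narratives are invented; substring matching is used only for brevity.
    from collections import Counter

    taxonomy = {
        "carburetor icing": "mechanical",
        "gear collapse": "mechanical",
        "snow": "winter_weather",
        "ice covered": "winter_weather",
    }

    narratives = [
        "carburetor icing developed while flying through snow showers",
        "gear collapse on landing on an ice covered runway",
        "gear collapse after a hard landing in summer heat",
    ]

    winter_malfunctions = Counter()
    for text in narratives:
        categories = {cat for term, cat in taxonomy.items() if term in text}
        if "winter_weather" in categories:
            for term, cat in taxonomy.items():
                if cat == "mechanical" and term in text:
                    winter_malfunctions[term] += 1

    # Which malfunction terms appear most often alongside winter weather?
    print(winter_malfunctions.most_common())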

Another approach very relevant to IW data mining from text is information extraction (IE). Information extraction is concerned with techniques for extracting specific pieces of information from text and is the focus of the DARPA Message Understanding Conference (MUC). The biggest problem with IE systems is that they are time-consuming to build and domain specific. To address this problem, a number of tools have been (and continue to be) developed for learning templates from examples, such as CRYSTAL, RAPIER, and AutoSlog. IE tools could be used on the airline safety data to pull out information that is often more complete in the text than in the fixed fields. This work is geared toward filling templates from text alone, but often the text and structured fields overlap in content.

An example of just such an overlap can be found in the NTSB accident and incident records. The data in these records contain structured fields that together allow the investigator to identify human factors that may have been important at the accident scene. However, it was found that these fields are rarely filled out completely enough to make a classification: 95% of the records that were identified as involving people could only be classified as "unknown." IE methods could be used to reduce this large unknown rate by pulling information out of the narrative that describes whether a person in the cockpit made a mistake. Such an approach could make use of a dictionary of synonyms for "mistake" and a parser to confirm that the mistake was an action of the pilot or copilot and not, for example, part of a sentence describing maintenance procedures.
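A crude sketch of the narrative check described above, using a small synonym list for "mistake" and a simple test that the actor is the pilot or copilot rather than, say, maintenance personnel, is shown below. The word lists and narratives are invented, and a real system would use a proper parser rather than keyword matching:

    # Crude sketch: flag narratives in which a mistake is attributed to the
    # pilot or copilot. A real IE system would use a parser; here a sentence
    # is flagged only if a cockpit actor and a mistake synonym occur together
    # and no maintenance term is present. Word lists and narratives are invented.
    MISTAKE_TERMS = {"mistake", "error", "failed to", "forgot", "misjudged"}
    COCKPIT_ACTORS = {"pilot", "copilot", "first officer"}
    EXCLUDE_TERMS = {"maintenance", "mechanic"}

    def pilot_error_suspected(narrative):
        for sentence in narrative.lower().split("."):
            if any(term in sentence for term in EXCLUDE_TERMS):
                continue
            if (any(actor in sentence for actor in COCKPIT_ACTORS)
                    and any(term in sentence for term in MISTAKE_TERMS)):
                return True
        return False

    print(pilot_error_suspected("The pilot misjudged the flare and landed hard."))
    print(pilot_error_suspected("Maintenance failed to torque the bolt properly."))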

Although IW data mining has primarily concerned itself with structured data, text is a valuable source of information that should not be ignored. Although automatic systems that completely understand text are still a long way off, one of the surprising recent results is that simple techniques, which sometimes completely ignore or only partially address the problems of polysemy, synonymy, and the complex structure of text, still provide a useful first cut for mining information from text. Useful techniques, such as the vector-space approach and learned templates from information extraction, can allow IW data miners to make use of the increasing amount of text available on-line.


