10.7 Data Mining for Intrusion Detection: A Case Study from the MITRE Corporation

In the Network Operations Center of the future, the security analyst will come to work in the morning, sit down with a cup of coffee, and press the What's New? button on the network monitoring and analysis screen. A list of suspicious incidents and attempted intrusions (more commonly called attacks) on the network will appear. Perhaps there is a file transfer at 2:00 AM from a host that usually has activity only during business hours. The analyst will then investigate these incidents and classify each as, for example, an attack or a false alarm. The analyst will also be presented with distilled descriptions of the attacks that he or she identified the previous day.

An essential technology in this scenario will be data mining. Data mining analysis will determine the bounds for normal network activity, and data mining techniques will enable the software to spend the night determining which characteristics of a previously identified attack activity distinguish it from normal network usage.

To understand the improvement this will represent, it is necessary to understand the current network intrusion detection (ID) environment. Software sensors deployed along the network record activity: the initiation of an Internet connection from host A to host B, for example, or a single outside host trying to connect to every MITRE host. Each sensor records certain important pieces of information about this activity, such as the time of day and the duration of the connection. This information is stored in a database that easily accrues millions of records each day. On a regular basis, security analysts sift through this data looking for the most serious attacks. There can be thousands of suspicious activity alarms, and each requires further analysis to understand its purpose fully. Moreover, as commercial ID software currently favors heightened sensitivity, many of the alerts generated are false alarms and result in wasted time.

One of the most serious limitations in identifying and describing new attacks is that there is simply so much data that security experts are not able to examine thoroughly every single alerted activity. And, as data collection grows with increased network usage, little is being done to help mitigate this situation by performing analysis to determine which data is the most relevant and which data is unnecessary to collect.

This area of data overload is where data mining can make its most significant contribution. A number of MITRE research projects have begun to explore the use of data mining to address data overload in ID by taking one of two basic approaches: profiling or classification. In profiling, the goal is to establish some notion of normal and then look for deviations from that. In classification, we take known attacks and try to determine meaningful features that distinguish that set of traffic from the remainder of the traffic.

Of these two approaches, classification has been used less often in the ID environment. This is because it is crucial for classification analysis that there be adequate collections of data representing both attacks and nonattacks. Because this type of analysis is new to the ID world, rarely is this information collected in the proper form. Without explicit identifiers on identified attack records, it has been nearly impossible for classifiers to learn to discriminate between attacks and nonattacks.

MITRE's current Data Mining in ID project is starting to address this deficiency by enabling security analysts to tag important records in the database and assign them to meaningful classes (e.g., attack, probe, legitimate). By providing the necessary capabilities for labeling attacks and a better way to maintain the history of intrusion behavior, this work represents a significant enhancement to the existing security infrastructure. This labeled data will be used to explore and test various data mining classification techniques. This project has also begun to perform profiling on individual hosts. This profiling analysis can operate on the basic network traffic data that is already collected.

The hope is that by looking at the traffic to and from specific machines, unusual activity can be identified. The initial approach involves doing simple statistical analyses of isolated features. For example, Figure 10.1 shows a 30-day summary of the frequency of File Transfer Protocol (FTP) connections to a particular host for each hour of the day. Notice that the activity from 1 AM to 2 AM falls outside the hours during which the vast majority of connections are made; analysts should be alerted so they can investigate that activity further. The next stage of this project will use data clustering techniques to identify more sophisticated partitions of common activity for that host. Then, traffic that does not fit into any of the normal groups will be reported to the security analyst for further investigation.

Figure 10.1: Thirty-day summary of File Transfer Protocol connections.
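
The hour-of-day profiling described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not MITRE's implementation: the function names, the 1% threshold (min_share), and the synthetic connection history are all hypothetical.

```python
from collections import Counter

def hourly_profile(connection_hours):
    """Count connections per hour of day (0-23) over the profiling window."""
    counts = Counter(connection_hours)
    return [counts.get(hour, 0) for hour in range(24)]

def unusual_hours(profile, min_share=0.01):
    """Return hours that saw some activity but less than min_share of the total.

    min_share is an assumed threshold chosen for illustration.
    """
    total = sum(profile)
    if total == 0:
        return []
    return [hour for hour, n in enumerate(profile) if 0 < n / total < min_share]

# Synthetic 30-day history (made up): 60 connections in each business hour
# from 8 AM to 5 PM, plus two connections between 1 AM and 2 AM.
history = [hour for hour in range(8, 18) for _ in range(60)] + [1, 1]

profile = hourly_profile(history)
flagged = unusual_hours(profile)
print(flagged)
```

With this synthetic history, only the 1 AM hour is flagged, because it accounts for well under 1% of all connections. A real profile would need a threshold tuned to each host's traffic.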

In other emerging work, MITRE is addressing the issue of false alarms produced by current ID sensors. This work uses data mining to look for recurring sequences of alarms to help understand which alarms might be the result of legitimate usage. For example, alarm A may be frequently followed by alarm B as a result of legitimate operations. Once this is recognized, future occurrences of this sequence can be filtered out. MITRE is also working on an approach that filters out data capturing "common" connection activity. It uses association rule detection to identify frequent host pairings. For example, perhaps host X regularly connects to host Y four times a day. Once these common connections have been removed, the remaining data is fed to a classification system to detect attacks. This work has been successfully tested on synthetically generated data, and it will soon be applied to actual network data.
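
The idea of removing "common" host pairings before classification can be illustrated with a simple support count. This sketch is a deliberate simplification: it counts how often each (source, destination) pair occurs, which corresponds to the frequent-itemset step of association rule mining rather than a full rule-detection system. The function names, the min_support threshold, and the sample log are hypothetical.

```python
from collections import Counter

def frequent_pairs(connections, min_support):
    """Return the (src, dst) pairs that occur at least min_support times."""
    counts = Counter(connections)
    return {pair for pair, n in counts.items() if n >= min_support}

def filter_common(connections, min_support):
    """Drop connections whose host pairing is frequent enough to be 'common'."""
    common = frequent_pairs(connections, min_support)
    return [conn for conn in connections if conn not in common]

# Made-up log: host X connects to host Y four times, host Z connects once.
log = [("hostX", "hostY")] * 4 + [("hostZ", "hostY")]

remaining = filter_common(log, min_support=4)
print(remaining)
```

Here only the single hostZ-to-hostY connection survives the filter and would be passed on to the classifier. In practice the support threshold would have to be set relative to the observation window (for example, connections per day).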

MITRE, like other organizations and agencies, currently makes heavy use of human analysts to pick out real intrusion attacks from the large amount of log data collected. Standard procedure is to review the previous day's sensor events in the morning. The large number of raw sensor events, most of which are uninteresting, makes detecting real attacks or potential problems difficult. In this context, data mining is used not to replace the human analyst but to reduce the burden by allowing the analyst to focus on those alarms most likely to be real intrusion attacks. The following details the overall objective of the project:

      The Problem:
      Data consists of individual sensor events (sensorlog
      database records) that need to be both aggregated
      into an incident and classified, but which do we do
      first?

      The Approach:
      Construct features for individual records that capture
      their relationship to the aggregate
      * How many other records have the same srcip
        as this record?
      * How many other records have the same srcip
        and dstport as this record?

      An Intrusion Event:
      Base - collected by network sensors
      Examples: date, type of sensor, protocol, srcip, dstip,
                srcport, dstport
      Incident - relationship to known security incidents
      Example: has this srcip/dstip been listed in an
               incident recently?
      Record - data lookups specific to a single record
      Example: duration, endtime, starttime, highport,
               srczone, hostsrcip
      Host - data related to the source or destination host
      Example: #alarms with same srcip & dstip,
               #other alarms with same srcip
      Time Window - statistics gathered over time
      Example: avg. time between connections for a srcip
               or dstip

      The Goal: To automatically identify interesting
      anomalous behavior

      The Tasks:
      * Use sensor log events not identified as incidents
      * Filter attributes based on analyst feedback
      * Build Web interface for easy viewing of generated
        anomalies
      * Classify anomalies into incident categories

      Examples of Anomalies:

      Anomaly #14. 3 case(s). Significance level: 0.015
              highdstport = no (281 cases, 98.6% 'yes')
                  synflag = no
      RECORD:130330539,we1,log,2000/02/13,2000,02,13,14,
             38,46,sun,bus,?,?,?,?,?,?,3,netbios-ns,
             tcp,23,137,206.184.139.134,192.47.242.29,
             r,2451588,in,no,no,no,?,no,no,no,no
      Interpretation: This is a possible scan attempt to
      bypass firewall?

      Anomaly #32. 4 case(s). Significance level: 0.004
              srcmitre = no  (1692 cases, 99.65% 'yes')
                  dstip = 192.188.104.221
      RECORD:143722187,we1,log,2000/03/05,2000,03,05,02,
             53,05,sun,sleep,2000,03,05,02,53,23,18,1min,
             3,ftp,tcp,1098,1,195.145.0.130,
             195.145.0,192.188.104.22,192.188.104,
             s_[sa]_fa_[fa]_[fpa]_fa_[fa]_[fpa]_r,2451609,
             in,no,no,no,no,no,yes,no,no,no,no,no
      Interpretation: This is a scan for ftp servers

      Results:
      Machine-learning decision tree (99% training set
      accuracy) used here was trained on the same month
      as the data used for generating anomalies (September)

      Classification Model Accuracy:
      High predictive accuracy for initial model: 96%
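
The two aggregate features listed under "The Approach" (how many other records share a record's srcip, and how many share both its srcip and dstport) can be computed with simple counting. The sketch below is a hypothetical illustration, not the project's code: the function name, the output field names same_srcip and same_srcip_dstport, and the sample records (which use documentation-only IP addresses) are assumptions.

```python
from collections import Counter

def add_aggregate_features(records):
    """Attach to each record counts of the *other* records sharing its keys."""
    by_srcip = Counter(r["srcip"] for r in records)
    by_srcip_dstport = Counter((r["srcip"], r["dstport"]) for r in records)
    enriched = []
    for r in records:
        enriched.append({
            **r,
            # Subtract 1 so the record does not count itself.
            "same_srcip": by_srcip[r["srcip"]] - 1,
            "same_srcip_dstport": by_srcip_dstport[(r["srcip"], r["dstport"])] - 1,
        })
    return enriched

# Made-up sensor events, reduced to the two fields the features need.
events = [
    {"srcip": "192.0.2.1", "dstport": 21},
    {"srcip": "192.0.2.1", "dstport": 21},
    {"srcip": "192.0.2.1", "dstport": 137},
    {"srcip": "192.0.2.9", "dstport": 21},
]

features = add_aggregate_features(events)
for row in features:
    print(row)
```

Features of this kind let each individual record carry information about the aggregate it belongs to, so a classifier can be applied record by record without first grouping records into incidents.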

The fields of network intrusion detection and data mining are just beginning to work together. MITRE research is beginning to demonstrate that the network activity data, whose sheer quantity has been one of the primary challenges to current ID efforts, can be amenable to analysis via a variety of data mining techniques. The application of those techniques has already begun to prove useful in filtering out false alarms and characterizing normal connection pairs. In the near future, data mining should be able to help us understand what normal behavior is for individual host machines and better discriminate network attacks from innocuous activity.