2.12 Interrogating the Data
As previously mentioned, any analysis should start with a sample of the data that will eventually be used to construct a model or profile. Data mining is exploratory by its very nature; it is not a process of creating reports. The analyst is typically searching for clues, ratios, attributes, patterns, characteristics, features, trends, and other telltale intelligence stored in large databases.
The process is iterative: correlations and insights often lead to further mining. Once a particular area or finding is discovered, additional information may be required to confirm a hypothesis. Data mining is also an interactive process, in which one finding often leads to more interesting ones. As we mentioned in Chapter 1, it is very much like the process of criminal profiling, requiring some heuristic skills. It is not an exact science. One of the key items to consider during the data interrogation process is documenting how it was conducted, because the process may need to be duplicated by other investigators or analysts. In addition, as with other forensic analyses and findings, the results may end up in court.
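One way to document an interrogation so that it can be repeated is to keep a running log of each step. The following is only a minimal sketch under stated assumptions, not a prescribed method: the function name log_step, the field names, and the sample rows are all hypothetical. Each entry records what was asked, when, how many rows came back, and a SHA-256 digest of the result, so that another analyst who reruns the same query on the same data can check that the output matches.

```python
import hashlib
import json
from datetime import datetime, timezone


def log_step(log, description, query, rows):
    """Append one record describing an interrogation step to `log`."""
    # Hash the result rows so that a later rerun can be compared
    # against the original finding.
    digest = hashlib.sha256(
        json.dumps(rows, sort_keys=True, default=str).encode("utf-8")
    ).hexdigest()
    entry = {
        "step": len(log) + 1,
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "description": description,
        "query": query,
        "row_count": len(rows),
        "result_sha256": digest,
    }
    log.append(entry)
    return entry


# Hypothetical sample data and query text, for illustration only.
audit_log = []
sample_rows = [
    {"account": "A1", "amount": 950},
    {"account": "A2", "amount": 9900},
]
matches = [r for r in sample_rows if 9000 <= r["amount"] <= 9999]
log_step(
    audit_log,
    "Transactions just under a reporting threshold",
    "amount BETWEEN 9000 AND 9999",
    matches,
)
print(json.dumps(audit_log, indent=2))
```

The log itself can be saved alongside the findings. The digest changes whenever the returned rows change, which makes an unintended difference between two runs easy to spot.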
2.13 Data Integration
Data derived for criminal detection and security deterrence in most instances will be from distributed locations—from miscellaneous internal databases and third-party sources. Prior to the development of a profile or a model, the data will need to be sampled, extracted, moved, integrated, and converted into a format that can be imported into a data mining tool. During these processes, certain data mining preparations will need to be performed. Most of the data in today's databases is not designed for analysis, profiling, or modeling. It is generally maintained for reporting and queries or billing and accounting. There are also some data integration issues most data mining projects must cope with, such as the following:
The location of the data
The computer platform
The level of security
The access process
The type of media
The type of query
The data format
The data source
A key issue during the data integration process is how the connection to the data will be made: Will ODBC or CORBA be used? How will the tables residing in Sybase, Oracle, Informix, SAS, Excel, or IBM servers be accessed? Will the data be found in flat or fixed-length files? Even flat files have different features that need to be resolved. Some are delimited by commas while others are not. There is also the issue of multiple formats, such as relational databases, hierarchical structures, free text (such as e-mail data), ASCII, field-delimited, or fixed-length formats. All of these data integration issues must be considered at this juncture, prior to any data mining analyses.
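The difference between comma-delimited and fixed-length files can be illustrated with a short sketch. The field names and the column offsets in FIXED_LAYOUT are hypothetical; in practice the layout of a fixed-length file has to come from the documentation of the source system. The point of the example is that both readers return records of the same shape, so later steps need not know which file format the data came from.

```python
import csv
import io

# Hypothetical field names shared by both sources.
FIELDS = ["name", "city", "amount"]

# Hypothetical fixed-length layout: (start, end) character offsets per field.
FIXED_LAYOUT = {"name": (0, 10), "city": (10, 20), "amount": (20, 28)}


def read_delimited(text, delimiter=","):
    """Parse delimited text into a list of dicts keyed by FIELDS."""
    reader = csv.reader(io.StringIO(text), delimiter=delimiter)
    return [
        dict(zip(FIELDS, (value.strip() for value in row)))
        for row in reader
        if row  # skip blank lines
    ]


def read_fixed(text):
    """Parse fixed-length text into a list of dicts keyed by FIELDS."""
    records = []
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        records.append(
            {
                field: line[start:end].strip()
                for field, (start, end) in FIXED_LAYOUT.items()
            }
        )
    return records


# The same two records expressed in each format.
delimited = "Smith,Boston,1200\nJones,Denver,85\n"
fixed = (
    f"{'Smith':<10}{'Boston':<10}{'1200':<8}\n"
    f"{'Jones':<10}{'Denver':<10}{'85':<8}\n"
)

print(read_delimited(delimited))
print(read_fixed(fixed))
```

All values are left as strings here; converting amounts or dates to proper types is a separate preparation step.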
As part of the data integration process, it may be necessary to deal with multiple operating systems, such as UNIX, Linux, and NT, as well as multiple platforms, such as PCs, workstations, servers, and mainframes, that support different access protocols. All of this will require dealing with different interfaces. Additionally, the structure of the data can be affected by the operating system, for example in the end-of-line characters it uses. These data-integration issues can often constitute the bulk of the effort and time put into some data mining projects. Some of these issues can be further complicated when dealing with proprietary, customized information systems, such as those used by large governmental agencies.
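The end-of-line problem is easy to show concretely. Windows systems such as NT end each line with a carriage return followed by a line feed, UNIX and Linux use a line feed alone, and classic Mac OS used a carriage return alone. The sketch below, in which the function name and sample data are illustrative assumptions, converts all three conventions to a single one before the files are combined.

```python
def normalize_newlines(raw: bytes) -> bytes:
    """Convert CRLF and bare CR line endings to LF."""
    # Replace CRLF first so that it is not turned into two line feeds.
    return raw.replace(b"\r\n", b"\n").replace(b"\r", b"\n")


# The same two-line file as written on three different systems.
windows_file = b"id,name\r\n1,Smith\r\n"
unix_file = b"id,name\n1,Smith\n"
classic_mac_file = b"id,name\r1,Smith\r"

for raw in (windows_file, unix_file, classic_mac_file):
    print(normalize_newlines(raw).decode("ascii").splitlines())
```

Working on bytes rather than decoded text keeps the step independent of the character encoding, which may also differ between source systems.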
The data required for analysis may reside in multiple locations, which may mean that these sources must be accessed via LANs, WANs, intranets, dial-up, wireless, the Internet, or proprietary closed, secured networks. It may also mean that control and access rest with a third party and that the data is not in a centralized repository or data warehouse. This is yet another data-integration issue that may limit the type of information available for some investigative data mining projects.
Most data mining projects involve structured data in relational tables or flat-file formats. However, there may be situations where data has to be retrieved from terminal screen captures using special scripts or table creation software. Although uncommon, this is another integration issue, and it may require schemes for submitting multiple queries in order to retrieve all of the desired information. One example is capturing the screen data from a point-of-entry terminal used by immigration or customs personnel.
Last is the thorny issue of integrating multimedia formats, involving unstructured free-text data as well as images, audio, video, e-mail, wireless data, and other binary objects. For counter-intelligence analyses, which need to deal with these types of information objects and formats, this is a very real data-integration and analysis issue. The single best way of dealing with this obstacle is to establish a consistent framework for this type of data mining project, so that all objects within a given class are consistent with each other. This can be a real challenge when analyses are time-sensitive and solutions must be implemented in real time.