Because the data sets to be mined may arrive in various transactional formats, extensive data preparation may be required. Various commercial software products are designed specifically for data preparation and can facilitate organizing the data before importing it into a data mining tool. The following is a list of some of these products and their vendors:
Amadea from ISoft supports a methodology that splits a data transformation process into several steps: extraction, cleansing, loading, and reporting. Once the user is connected to data source tables, the tool supplies predefined cleansing functions, such as replacing missing values and setting aside outlying values. The transformation step itself is supported by a library of operators for normalization, denormalization, binarization, verticalization, aggregation, join, union, selection, and so on. Operators are assembled graphically, so the user does not need to write code. The resulting tables can be exported in most data formats currently on the market. According to ISoft, the tool can be used to create data marts or to extend existing data warehouses.
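The kind of cleansing operators described above, replacing missing values and setting aside outliers, can be sketched in plain Python. The rows, column names, and the MAD-based outlier rule here are illustrative assumptions, not Amadea's actual implementation:

```python
import statistics

# Hypothetical transaction rows: one missing "amount" and one outlier.
rows = [
    {"id": 1, "amount": 120.0},
    {"id": 2, "amount": None},      # missing value
    {"id": 3, "amount": 135.0},
    {"id": 4, "amount": 9000.0},    # outlying value
    {"id": 5, "amount": 110.0},
]

# Replace missing values with the median of the observed values.
observed = [r["amount"] for r in rows if r["amount"] is not None]
fill = statistics.median(observed)
for r in rows:
    if r["amount"] is None:
        r["amount"] = fill

# Set aside values far from the median relative to the typical spread
# (median absolute deviation), which is robust to the outlier itself.
med = statistics.median(r["amount"] for r in rows)
mad = statistics.median(abs(r["amount"] - med) for r in rows)
outliers = [r for r in rows if abs(r["amount"] - med) > 10 * mad]
clean = [r for r in rows if abs(r["amount"] - med) <= 10 * mad]
```

A MAD-based rule is used instead of a standard-deviation rule because a single extreme value inflates the standard deviation enough to hide itself.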
Figure 2.2: The iManageData interface.
BioComp iManageData supports access to delimited text files, Microsoft Excel workbooks, Microsoft Access databases, Oracle, SQL Server, and other relational databases. A comprehensive array of mathematical preprocessing functions is available, including trigonometric functions, moving averages, logarithms, ratios, and more. The tool can create new binary variables based on ranges and thresholds.
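Two of the preprocessing functions mentioned, a moving average and a threshold-based binary variable, are simple to sketch. The series and the threshold of 20 are hypothetical:

```python
# Hypothetical daily totals.
values = [10.0, 12.0, 11.0, 40.0, 13.0, 12.0]

# Moving average over a window of 3 observations.
window = 3
moving_avg = [
    sum(values[i - window + 1 : i + 1]) / window
    for i in range(window - 1, len(values))
]

# New binary variable from a threshold: 1 if the value exceeds 20, else 0.
over_threshold = [1 if v > 20.0 else 0 for v in values]
```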
DataManager is an inexpensive program that allows the user to process and manipulate data via a graphical interface. DataManager reads and writes delimited files, such as comma-separated value (CSV) files, and can also read data from ODBC data sources. It allows the user to construct a conceptual design for how to process and transform the data. For example, this tool could be used to prepare massive Web server log files.
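Reading a delimited file and summarizing a Web server log can be sketched with Python's standard csv module. The field names and the small log extract are hypothetical:

```python
import csv
import io

# A small comma-separated extract standing in for a Web server log export.
raw = """ip,status,bytes
10.0.0.1,200,512
10.0.0.2,404,0
10.0.0.1,200,1024
"""

# DictReader maps each row to the column names from the header line.
reader = csv.DictReader(io.StringIO(raw))
hits_by_ip = {}
for row in reader:
    hits_by_ip[row["ip"]] = hits_by_ip.get(row["ip"], 0) + 1
```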
Data-Audit is a diagnostic tool for market research and database evaluation. Key indicators reported are as follows:
Information content (quality of data representation): What is the information level contained in the dataset?
Data efficiency (real versus potential redundancy of variables): How many variables are needed to represent the information?
Information representation (univariate and multivariate): What variables can be used to represent the information contained?
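Redundancy between variables, the second indicator above, is often measured by correlation: a variable that is a rescaled copy of another carries no additional information. A minimal sketch with hypothetical variables, not Data-Audit's actual method:

```python
import statistics

# y is a rescaled copy of x (fully redundant); z is unrelated.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 6.0, 8.0, 10.0]
z = [5.0, 1.0, 4.0, 2.0, 3.0]

def corr(a, b):
    """Pearson correlation coefficient of two equal-length sequences."""
    ma, mb = statistics.mean(a), statistics.mean(b)
    cov = sum((u - ma) * (v - mb) for u, v in zip(a, b))
    return cov / (
        sum((u - ma) ** 2 for u in a) ** 0.5
        * sum((v - mb) ** 2 for v in b) ** 0.5
    )
```

A correlation near 1 (or -1) flags a candidate for removal, reducing how many variables are needed to represent the information.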
Datagration is a data cleansing tool that finds and uses data patterns. Datagration finds and fixes data errors at the source system. A configurable Data Discovery Message Gateway can be taught to identify common error patterns and then automatically fix them at the record or field level before moving the data to other enterprise systems or into a data mining tool.
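The idea of identifying a common error pattern and fixing it at the field level can be sketched with a regular expression. The phone-number format and sample records are hypothetical, not Datagration's rules:

```python
import re

# Records with a recurring error pattern: phone numbers entered with
# inconsistent punctuation.
records = ["555.123.4567", "(555) 123-4567", "5551234567"]

def fix_phone(value):
    """Normalize a 10-digit phone number to NNN-NNN-NNNN."""
    digits = re.sub(r"\D", "", value)   # strip everything but digits
    if len(digits) == 10:
        return f"{digits[0:3]}-{digits[3:6]}-{digits[6:]}"
    return value  # leave unrecognized patterns untouched

fixed = [fix_phone(r) for r in records]
```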
Dataskope is a department-level tool to map, transform, alarm, output, and view high volumes of binary or ASCII input data. Dataskope allows users to convert, decipher, and view any kind of event record, such as call data records or comma-separated value records. It can also interact with the converted data to set triggers on particular events and execute actions. These actions include replacing or modifying values, showing a dialog box, changing a color, and the like. Dataskope can read from files or directly from the telephone switch via a serial link or its bundled API. The files can be read one by one, or a polling feature can watch a directory and start reading files automatically as new ones are discovered. The converted files then transfer automatically into a backup directory for redundancy. This kind of tool can be used in investigations involving call-site files and toll-usage records.
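Setting a trigger on a particular event, a condition paired with an action, can be sketched as follows. The call-record fields, the one-hour condition, and the alert action are hypothetical:

```python
# Hypothetical call-detail records.
records = [
    {"caller": "A", "duration_s": 120},
    {"caller": "B", "duration_s": 4200},
]

alerts = []

# Each trigger pairs a condition with an action to execute when it fires.
triggers = [
    (lambda r: r["duration_s"] > 3600,                       # condition
     lambda r: alerts.append(f"long call from {r['caller']}")),  # action
]

for rec in records:
    for condition, action in triggers:
        if condition(rec):
            action(rec)
```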
Datawash provides online access to data enhancement services and cleans and dedupes databases for increased deliverability and professional presentation. Datawash is a service requiring no purchase of software and may fit the needs of some ad hoc investigations.
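Deduplication generally works by normalizing the fields used for comparison before matching. A minimal sketch with hypothetical mailing-list rows, not Datawash's matching logic:

```python
# The second row duplicates the first except for case and spacing.
rows = [
    {"name": "John Smith", "city": "Austin"},
    {"name": " john  smith ", "city": "AUSTIN"},
    {"name": "Ann Lee", "city": "Dallas"},
]

def key(row):
    # Normalize case and whitespace before comparing.
    return (" ".join(row["name"].split()).lower(), row["city"].lower())

seen, deduped = set(), []
for row in rows:
    k = key(row)
    if k not in seen:       # keep only the first row for each key
        seen.add(k)
        deduped.append(row)
```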
Digital Excavator simplifies and speeds up the preparation of data for modeling using an intuitive GUI for data assembly, exploration, and the transformation of data from multiple sources into a single analysis set. The analysis set consolidates the relevant data into a form required by the modeling techniques. The format consists of one row per object being modeled (e.g., suspect) and numerous columns that represent the variables the miner wants the model to consider in developing its prediction. The tool supports several activities:
Cleansing or implementing the desired approach to address the identified quality problems.
Consolidating or integrating the various raw data sets into a single set. This may involve joins or merges and denormalization. In addition, various aggregations may be calculated to summarize multiple events (represented as multiple records in different sets) to a number of variables represented on a single row.
Calculating variables, which are sometimes more useful in producing a good model than the individual components that exist in the raw data. The tool automates the process of creating ratios.
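The consolidation and ratio-calculation activities above can be sketched by rolling multiple event records up into one row per subject. The event fields, subject IDs, and the flag ratio are illustrative assumptions:

```python
# Hypothetical transaction events, several records per subject.
events = [
    {"subject": "S1", "amount": 100.0, "flagged": 1},
    {"subject": "S1", "amount": 300.0, "flagged": 0},
    {"subject": "S2", "amount": 50.0,  "flagged": 0},
]

# Aggregate to one row per subject (the analysis-set format).
analysis_set = {}
for e in events:
    row = analysis_set.setdefault(
        e["subject"], {"n_events": 0, "total": 0.0, "n_flagged": 0}
    )
    row["n_events"] += 1
    row["total"] += e["amount"]
    row["n_flagged"] += e["flagged"]

# Calculated variable: ratio of flagged events to all events.
for row in analysis_set.values():
    row["flag_ratio"] = row["n_flagged"] / row["n_events"]
```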
GritBot is a data preparation tool that tries to find anomalies in data as a precursor to data mining. It can be thought of as an autonomous data quality auditor that hunts for records having "surprising" values of nominal (discrete) and/or numeric (continuous) attributes. Values need not stand out in the complete dataset. GritBot searches for subsets of records in which the anomaly is apparent.
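The key idea, that a value can be unremarkable in the full dataset yet surprising within a subset, can be sketched by grouping records and testing each value against its own subset's spread. The records and the MAD-based rule are hypothetical, not GritBot's algorithm:

```python
import statistics

# 900 is unremarkable overall (rent is in that range), but surprising
# within the "coffee" subset.
records = [
    {"category": "coffee", "amount": 3.0},
    {"category": "coffee", "amount": 4.0},
    {"category": "coffee", "amount": 900.0},
    {"category": "rent", "amount": 850.0},
    {"category": "rent", "amount": 900.0},
    {"category": "rent", "amount": 875.0},
]

# Group by category, then flag values far from their subset's median.
by_cat = {}
for r in records:
    by_cat.setdefault(r["category"], []).append(r["amount"])

surprising = []
for cat, amounts in by_cat.items():
    med = statistics.median(amounts)
    mad = statistics.median(abs(a - med) for a in amounts)
    for a in amounts:
        if mad and abs(a - med) > 10 * mad:
            surprising.append((cat, a))
```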
IBM DataJoiner allows the user to view IBM, multivendor, relational, nonrelational, local, remote, and geographic data as if it were local, and to access and join tables without knowing the source location. With a single SQL statement, the user can access and join tables located across multiple data sources, such as IBM DB2, Informix, Microsoft SQL Server, Oracle, Sybase SQL Server, Teradata, and others.
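A single SQL statement joining tables from separate sources can be sketched with SQLite's ATTACH mechanism, using two in-memory databases to stand in for heterogeneous servers. The table and column names are hypothetical:

```python
import sqlite3

# "Local" database with one table.
main = sqlite3.connect(":memory:")
main.execute("CREATE TABLE suspects (id INTEGER, name TEXT)")
main.execute("INSERT INTO suspects VALUES (1, 'Doe'), (2, 'Roe')")

# Attach a second database as if it were a remote source.
main.execute("ATTACH DATABASE ':memory:' AS remote")
main.execute("CREATE TABLE remote.calls (suspect_id INTEGER, number TEXT)")
main.execute("INSERT INTO remote.calls VALUES (1, '555-0100')")

# One SQL statement joins tables from both sources.
rows = main.execute(
    "SELECT s.name, c.number FROM suspects s "
    "JOIN remote.calls c ON c.suspect_id = s.id"
).fetchall()
```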
Relational Tools is a testing toolset that allows the user to move, edit, and compare complex sets of related data. These tools support the leading database management systems in both the mainframe and client/server environments and always respect the data-access rights defined in the database. With Relational Tools the user can extract and move data between production and test environments. The user can edit and browse complex relational data and compare images of relational data.
Sagent provides a suite of data transformation and loading tools that can implement both Extract, Transform, and Load (ETL) and Extract, Transform, and Present (ETP) on large SMP servers running Solaris and Windows 2000. It includes a component-based development model that connects analytic objects together to easily develop scalable, reusable algorithms that perform address standardization, geo-coding, and spatial determination.
SyncSort is a sort and data manipulation product that provides speed, efficiency, the ability to handle a wide variety of data and file types, and a full set of versatile features. It speeds ETL applications and facilitates data mining and the processing of clickstream data from Web sites.
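The sort-then-process pattern that makes such tools useful for clickstream data can be sketched by ordering hypothetical click rows by session and timestamp, so each visitor's path comes out contiguous and in order:

```python
# Hypothetical clickstream rows, arriving out of order.
clicks = [
    {"session": "B", "ts": 2, "url": "/home"},
    {"session": "A", "ts": 1, "url": "/login"},
    {"session": "A", "ts": 3, "url": "/search"},
]

# Sort by a composite key: session first, then timestamp within session.
ordered = sorted(clicks, key=lambda c: (c["session"], c["ts"]))
```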