11 Data Mining Using PivotTable Service

A

accuracy: When referring to data, accuracy is the percentage of values that can be verified to be correct within the records. In data-mining models, it's the degree to which the data-mining model reflects the values of the underlying data, which should serve to measure how well the model can be used to predict new information.

algorithm: Computer logic designed to be used as a framework from which to build computer programs. These are combinations of mathematical formulas and methods designed to solve a particular problem. Data-mining algorithms seek to transform cases into a specific predictive model.

analytical model: The logical data structure that is the result of a process, usually an algorithm. Clustering and decision trees are examples of models used for the grouping and classification of data.

anomalous data: Case data that contains errors that are generally caused by bad data-entry validation or corrupted processes. These errors can cause inaccurate data-mining models and false or missing predictions.

antecedent: When one event is identified as the cause of or link to another event, the original event is described as the antecedent. In a department store, if 45 percent of customers who buy motor oil come back the next day to buy oil stain remover, the purchase of the motor oil is the antecedent. See also left-hand side.

API: See application programming interface.

application programming interface (API): Some applications or software tools contain internal functions that can be used by outside programs without the need to delve into the source. In order to provide this functionality, the software exposes these functions through an interface called the API, which allows outside software developers to incorporate these calls in their own software.

artificial neural network: Nonlinear predictive models that learn through training and are designed to imitate the function of biological neural networks in the human brain. Artificial neural networks are used in data-mining processes involving very large quantities of variables that each get analyzed in a separate node of the neural network. This is a directed form of data mining that uses inputs to predict future outcomes.

association algorithm: An algorithm that creates rules that identify how often events have occurred together. This fits in with the "beer and diapers" example in which a supermarket finds that the purchase of beer occurs at the same time as the purchase of diapers x percent of the time.

B

bias: This refers to the inclination of the cases to lean to a particular description because of the way the data was sampled. For instance, if data from a survey was used to build a data-mining model, it might be found that persons who answer surveys have built-in differences from those who don't. This can introduce bias in the results. Bias can also occur when data is extracted from a database using criteria that eliminates the randomness of the sample, such as using zip codes, income brackets, or even dates and names.

binning: A process that converts numerical data that would otherwise be continuous into discrete values that are placed in predefined "bins." For instance, if income were to be used as an output variable, the algorithm might need to bin the data because there are too many different values between the highest and the lowest incomes. For efficiency, the algorithm places the incomes in bins of 10,000–30,000, 30,000–50,000, 50,000–80,000, and so on.
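To make the idea concrete, here is a minimal Python sketch of binning; the bin edges, labels, and income figures are invented for illustration and are not taken from PivotTable Service.

```python
# Minimal sketch: discretize continuous incomes into predefined bins.
# The bin edges and labels below are illustrative only.
BIN_EDGES = [10_000, 30_000, 50_000, 80_000]
BIN_LABELS = ["under 10,000", "10,000-30,000", "30,000-50,000",
              "50,000-80,000", "80,000 and over"]

def bin_income(income: float) -> str:
    """Return the label of the bin that contains the given income."""
    for edge, label in zip(BIN_EDGES, BIN_LABELS):
        if income < edge:
            return label
    return BIN_LABELS[-1]

incomes = [8_500, 27_000, 42_000, 61_000, 120_000]
print([bin_income(x) for x in incomes])
# ['under 10,000', '10,000-30,000', '30,000-50,000',
#  '50,000-80,000', '80,000 and over']
```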

bootstrapping: When data sets are too large, a process called bootstrapping extracts samples from the original data set and treats each of the samples as the entire population. Because bias can be introduced this way, several samples are used to create several data models, which are then averaged into one.
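A minimal Python sketch of the sampling side of bootstrapping follows; the population values are generated at random, and a simple mean stands in for the full data-mining model that would normally be built on each sample.

```python
import random

# Minimal sketch of bootstrapping: draw several samples with replacement
# from the original data, build a (trivial) "model" on each sample, and
# average the results. Here the "model" is just a mean estimate.
random.seed(0)
population = [random.gauss(50_000, 12_000) for _ in range(100_000)]

def bootstrap_estimates(data, n_samples=10, sample_size=1_000):
    estimates = []
    for _ in range(n_samples):
        sample = random.choices(data, k=sample_size)   # sample with replacement
        estimates.append(sum(sample) / len(sample))    # per-sample "model"
    return estimates

estimates = bootstrap_estimates(population)
combined = sum(estimates) / len(estimates)             # average the sample models
print(round(combined, 2))
```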

C

CART: See classification and regression trees.

cases: Records containing data that are used to populate a data-mining model. The cases represent the inputs and outputs of the subsequent analysis or predictions that are to be made.

categorical data: Categorical data fits into a small number of discrete categories as opposed to continuous, numerical data. Categorical data can either be unordered (nominal), such as countries or zip codes, or ordered (ordinal), such as high, medium, or low values.

CHAID: See Chi-squared Automatic Interaction Detector.

chi-squared: A statistic that measures how far the observed distribution of values in a set departs from the distribution that would be expected by chance. This information is used to determine the best splits to use in a decision tree, usually one created with a CHAID-type algorithm.

Chi-squared Automatic Interaction Detector (CHAID): A decision tree algorithm that uses the chi-squared statistic to identify the most efficient values to use as splits in the data.
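As a hedged illustration of the chi-squared test described in the two entries above, the following Python sketch scores one candidate split from a 2x2 table of invented counts; a CHAID-style algorithm would compare such scores across many candidate splits.

```python
# Minimal sketch: the chi-squared statistic for one candidate split,
# computed from a 2x2 table of observed counts (invented numbers).
# Rows: split branches; columns: outcome classes ("Good"/"Bad" credit).
observed = [[90, 10],
            [40, 60]]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand_total = sum(row_totals)

chi_squared = 0.0
for i, row in enumerate(observed):
    for j, obs in enumerate(row):
        expected = row_totals[i] * col_totals[j] / grand_total
        chi_squared += (obs - expected) ** 2 / expected

print(round(chi_squared, 2))  # larger values suggest a more informative split
```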

classification: The process of dividing cases into mutually exclusive groups such that the members of each group are as close as possible to one another, and different groups remain as far as possible from one another. The distance is measured with respect to the specific variables you are trying to predict. Classification might seek to divide the entire data sample or population for a credit card company into two groups: those customers with "Good" credit and those with "Bad" credit.

classification and regression trees (CART): A decision tree technique used for classification of a data set. It provides a set of rules that can be applied to new cases to predict which records will have a given outcome. This technique uses two-way splits only, which increases depth but requires less preparation by the engine.

classification tree: A decision tree that places categorical variables into individual classes.

cleansing: The task of processing the data used to build cases in an effort to eliminate all errors and anomalies before the data actually gets used to build a data-mining model. This process can also complete missing information and normalize certain attributes so that they can be treated in the same manner.

clustering algorithm: An algorithm used to find groups of items with similar attributes. These groupings are used to draw conclusions about the behavior of a given group, from which accurate generalizations about the members of the group are made.

confidence: The confidence of a rule is a measure of how likely it is that the rule applies. This is usually based on a number of factors, including the number of cases used to derive the rule as well as the representation of that rule in relation to the others. For instance, a rule predicting an IRS audit might carry a confidence of 65 percent, while a rule predicting that a customer holds a Gold credit card might carry a confidence of 85 percent.
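A minimal Python sketch of computing the confidence of a single rule follows; the transactions and item names are invented, and the rule "motor oil implies stain remover" echoes the antecedent and consequent entries.

```python
# Minimal sketch: confidence of the rule "motor oil -> stain remover",
# i.e. the share of cases containing the antecedent that also contain
# the consequent. The transactions below are invented.
transactions = [
    {"motor oil", "stain remover"},
    {"motor oil"},
    {"motor oil", "stain remover", "rags"},
    {"beer", "diapers"},
]

with_antecedent = [t for t in transactions if "motor oil" in t]
with_both = [t for t in with_antecedent if "stain remover" in t]
confidence = len(with_both) / len(with_antecedent)
print(f"confidence = {confidence:.0%}")  # 67%
```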

confusion matrix: A table that compares the actual values vs. the predicted values of a data-mining process. This is most often used to gauge the accuracy of a given data-mining model.
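The following minimal Python sketch tallies a confusion matrix from invented actual and predicted labels and reads the model's accuracy off the diagonal.

```python
from collections import Counter

# Minimal sketch: tabulate actual vs. predicted class labels (invented data)
# into a confusion matrix, then read accuracy off the diagonal.
actual    = ["Good", "Good", "Bad", "Bad", "Good", "Bad"]
predicted = ["Good", "Bad",  "Bad", "Bad", "Good", "Good"]

matrix = Counter(zip(actual, predicted))
for (a, p), count in sorted(matrix.items()):
    print(f"actual={a:<4} predicted={p:<4} count={count}")

accuracy = sum(c for (a, p), c in matrix.items() if a == p) / len(actual)
print(f"accuracy = {accuracy:.0%}")  # 67%
```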

consequent: When one event is identified as the cause of or link to another event, the caused event is described as the consequent. In a department store, if 45 percent of customers who buy motor oil come back the next day to buy oil stain remover, the purchase of the oil stain remover is the consequent.

continuous data: Refers to numerical values that have relative values to each other. Time data, such as years, and income data are examples of continuous values as opposed to discrete or categorical values.

cross validation: A method of estimating the accuracy of a decision tree. The original data set is divided into parts, each playing a specific role in testing the decision tree model against the other parts.
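A minimal Python sketch of the k-fold splitting step follows; the training and testing of the decision tree itself is left as a placeholder, and the fold count and case count are invented.

```python
# Minimal sketch of k-fold cross validation: split the cases into k parts,
# hold out each part in turn for testing, and train on the rest.
# The "train" and "test" steps are placeholders, not a real mining algorithm.
def k_fold_indices(n_cases, k):
    fold_size = n_cases // k          # any remainder cases are ignored here
    for i in range(k):
        test = list(range(i * fold_size, (i + 1) * fold_size))
        train = [j for j in range(n_cases) if j not in test]
        yield train, test

for train, test in k_fold_indices(n_cases=10, k=5):
    print(f"train on {train}, test on {test}")
```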

D

data: Facts and events gathered through manual or automatic data entry. These facts are usually stored in a database in the form of records. Groups of related records are known as cases.

database management system (DBMS): An application that is designed to manage data, whether stored in relational tables or in simple flat file tables.

data format: Data format refers to the form of the data in the database. Data items can exist in many formats such as text, integer, and floating-point decimal.

data mining: An information extraction activity that uses a combination of machine learning, statistical analysis, modeling techniques, and database technology. Data mining finds patterns and subtle relationships in data and infers rules that allow predictions of future results to be made.

data-mining method: Procedures and algorithms used to analyze the data in databases.

data navigation: The process of viewing different dimensions and levels of detail of a multidimensional OLAP structure. This is often used to confirm or deny assumptions made about the data.

data visualization: Visual interpretation of complex relationships in multidimensional data.

data warehouse: A database designed for optimizing the storage and delivery of massive quantities of data, oftentimes in a different format than the transactional data.

DBMS: See database management system.

decision tree: A tree-shaped structure that represents a set of decisions expressed as nodes that contain rules and descriptions of their members. These nodes generate rules for the classification of a data set.

deduction: Deduces information that is a logical and irrefutable consequence of the data. For example, if A = B and B = C, then one can deduce that A = C.

degree of fit: A measure of how closely the data-mining model fits the training data. Usually a statistical measure known as r-squared is used.

dependent variable: The variable derived by the prediction task associated with the data-mining model. It's the outcome or output variable.

deployment: The implementation of a trained and completely validated model for use as a base for predictions.

dimension: Each attribute of a case or occurrence in the data being mined that affects the value of a given output variable. Usually represented by a field in a flat file record or a column of a relational database table.

discrete: A data item that has a finite set of values, such as colors or days of the week. Discrete is the opposite of continuous.

discriminant analysis: A statistical method used to determine the location of boundaries between clusters. Used mostly in clustering-type processes.

E

entropy: A way to measure variability other than with the variance statistic. Depending on the type of decision tree algorithm, this value can be used to decide upon a split in a given node.
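As a worked illustration, the following Python sketch computes the entropy of the class distribution in a node from invented class counts; a pure node scores 0 and an evenly mixed two-class node scores 1.

```python
import math

# Minimal sketch: entropy of the class distribution in one decision-tree
# node. A pure node has entropy 0; an evenly mixed node has the maximum.
def entropy(class_counts):
    total = sum(class_counts)
    probs = [c / total for c in class_counts if c > 0]
    return -sum(p * math.log2(p) for p in probs)

print(entropy([50, 50]))   # 1.0  (evenly mixed, two classes)
print(entropy([90, 10]))   # about 0.47
print(entropy([100, 0]))   # 0.0  (pure node)
```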

exploratory analysis: Looking at data to discover relationships not previously detected. This is usually a step done in preparation for data mining because it offers the chance to understand the patterns and trends in the original data.

external data: Data not collected by the organization, such as data available from database feeds originating from other companies.

F

feed-forward: A neural net in which the signals flow in only one direction, from the inputs to the outputs.

fuzzy logic: Applied to fuzzy sets where membership in a fuzzy set is a probability and not necessarily a definite "True" or "False." Fuzzy logic needs to be able to manipulate degrees of probability of truth in addition to true and false.

G

genetic algorithm: A method of generating and testing combinations of possible input parameters to find the optimal output. It uses processes based on the principles of natural selection as the criteria for determining truth.

graphical user interface (GUI): An environment in which the user issues commands by clicking buttons and choosing options from menus and lists instead of typing commands on the command line.

GUI: See graphical user interface.

H

hidden nodes: The nodes in the hidden layers in a neural network.

I

independent variable: The attributes and values used to determine what the prediction or output variables will be.

induction: A technique that infers generalizations from the information in the data. The process involves deriving a conclusion based on repetitive patterns rather than irrefutable logical principles.

interaction: Two independent variables interact when a change in the value of one changes the effect on the dependent variable of the other.

internal data: Data collected by an organization through its own data-entry procedures.

K

k-nearest neighbor: A clustering method that classifies a point in space by calculating the distances between the point and points in the training data set. The point is then assigned to the class that is most common among its k-nearest neighbors or clusters.
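A minimal Python sketch of k-nearest neighbor classification follows; the training points, labels, and query points are invented, and Euclidean distance is assumed.

```python
import math
from collections import Counter

# Minimal sketch of k-nearest neighbor classification: classify a point by
# the majority class among its k closest training points (invented data).
training = [((1.0, 1.0), "A"), ((1.5, 2.0), "A"),
            ((5.0, 5.0), "B"), ((6.0, 5.5), "B"), ((5.5, 4.5), "B")]

def classify(point, k=3):
    distances = sorted(
        (math.dist(point, p), label) for p, label in training
    )
    nearest = [label for _, label in distances[:k]]
    return Counter(nearest).most_common(1)[0][0]

print(classify((1.2, 1.5)))  # 'A' -- closest to the first group of points
print(classify((5.2, 5.0)))  # 'B'
```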

Kohonen feature map: A type of neural network that uses unsupervised learning to find patterns in data. In data mining, it is employed for cluster analysis.

L

layer: Nodes in a neural net are usually grouped into layers, with each layer performing an input, output, or hidden function. The input nodes match input variables and the output nodes match output variables.

leaf: In a decision tree, a leaf represents the very last node in the tree that has no more branches emanating from it. It's the part of the tree that is used to make the predictions.

learning: The process of training a data-mining model with existing data.

least squares: A method of training the parameters of a model by choosing the weights that minimize the sum of the squared deviations of the predicted values of the model from the observed values of the data.
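As a hedged illustration, the following Python sketch fits a one-variable line by least squares using the standard closed-form solution; the x and y values are invented, and the same calculation underlies the linear regression entry below.

```python
# Minimal sketch: fit y = a + b*x by least squares, choosing the slope and
# intercept that minimize the sum of squared deviations (invented data).
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 4.0, 6.2, 8.1, 9.9]

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n

# Closed-form solution for simple linear regression.
b = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / \
    sum((x - mean_x) ** 2 for x in xs)
a = mean_y - b * mean_x

print(f"y = {a:.2f} + {b:.2f} * x")  # roughly y = 0.15 + 1.97 * x
```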

left-hand side: See antecedent.

linear model: An analytical model that assumes linear relationships between the values of the variables being examined.

linear regression: A statistical technique used to find the most accurate linear relationship between a target variable and its input variables.

logistic regression: A linear regression that predicts the proportion of categorical target variables, such as the type of customer in a given population. Logistic regression is useful when the observed outcome is restricted to two values, which usually represent the occurrence or non-occurrence of an event.

M

massively parallel processing (MPP): A computer configuration that is able to use hundreds or thousands of CPUs simultaneously. The system usually consists of various computers processing different parts of a program at the same time. Each computer, or node, maintains control of its own resources while maintaining a high level of coordination with the other nodes. This structure is often used for neural networks.

maximum likelihood: A training method that estimates a parameter by choosing the value that maximizes the probability that the observed data came from the population defined by that parameter.

mean: The arithmetic average value of a collection of numbers.

median: The value in the middle of a collection of ordered data. In other words, the value with the same number of items above and below it.

missing data: Data values that were not measured, not entered into the system, or were simply unknown. Typically, data-mining methods ignore the missing values, omit any records containing missing values, replace missing values with the mode or mean, or infer missing values from existing values.
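The following minimal Python sketch illustrates two of the strategies listed above, replacing missing numeric values with the mean and missing categorical values with the mode; the records are invented.

```python
from statistics import mean, mode

# Minimal sketch of two strategies from the entry above: fill missing
# numeric values with the mean and missing categorical values with the
# mode. None marks a missing value; the records are invented.
incomes = [52_000, None, 48_000, 61_000, None]
regions = ["East", "West", None, "East", "East"]

income_mean = mean(v for v in incomes if v is not None)
region_mode = mode(v for v in regions if v is not None)

filled_incomes = [v if v is not None else income_mean for v in incomes]
filled_regions = [v if v is not None else region_mode for v in regions]

print(filled_incomes)  # missing incomes become the mean of the known ones
print(filled_regions)  # missing regions become 'East'
```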

mode: The most common value in a data set. If more than one value occurs the same number of times, the data is multimodal.

model: A logical structure that contains data that represents the patterns discovered in the cases used to build the model. This model is then used for browsing in an effort to better understand the underlying data or as a repository of rules that can be applied to new data in order to predict unknown outcomes.

MPP: See massively parallel processing.

multidimensional database: A database designed for OLAP. Structured as a multidimensional cube with aggregates precomputed per dimension.

multiprocessor computer: A computer that includes multiple processors that work at the same time.

N

nearest neighbor: A clustering technique that makes use of a k-means algorithm. This assumes that the number of clusters is known in advance, which then allows each record to be evaluated according to its distance from the center of a cluster.

neural network: See artificial neural network.

node: A decision point in a decision tree usually containing a specific set of rules governing membership in that node.

noise: Nonsensical data or erratic predictions that are often the result of false or even nonexistent patterns. A decision tree that predicts that anybody named Mary is a good credit risk is an example of noise.

nonapplicable data: Values that would be logically impossible, such as high-income-earning pets or married plants.

nonlinear model: An analytical model that does not attempt to establish any linear relationships between the variables being examined.

normalize: The process of removing from the cases all extreme values in the data that would adversely affect the predictive ability of the model.

O

object linking and embedding database (OLE DB): See OLE DB.

OLAP: See online analytical processing.

OLE DB: An acronym for object linking and embedding database. A COM interface that allows a program to query data without having to take the implementation and structure of the underlying data into consideration.

online analytical processing (OLAP): Tools that give the user the capability to browse through multiple levels of aggregation of the data with relative ease.

optimization criterion: A function of the difference between the predicted and observed values that is chosen so that optimizing it maximizes the predictive capability of the model. Least squares and maximum likelihood are examples of these functions.

outliers: Data that falls outside the boundaries of what's normally expected from the value. When predicting income, billionaires might be taken out of the case set because of their unusually large incomes and the low number of billionaires.

overfitting: A tendency of some modeling techniques to pursue the training process to the point where patterns contain very few occurrences yet represent very specific predictive recommendations, which, because of their low occurrence, can cause false outcomes.

overlay: Cases from outside sources of data that are combined with the organization's own data.

P

parallel processing: The coordination of multiple processors that are used to perform computational tasks with a common goal. Parallel processing can occur either on a multiprocessor computer or on a network of computers, each of which accomplishes a part of the task.

pattern: A relationship between two or more variables. Data-mining techniques include automatic pattern discovery that makes it possible to detect complicated relationships in data.

precision: The measure of variability of a given attribute when arriving at a certain outcome. Although the result might not be accurate, it can be precise in its ability to pinpoint the exact outcome of true or false.

predictability: See confidence.

prediction query: A SQL-like syntax that allows a program or a user to input cases with unknown outputs and compare them to a data-mining model, which then "fills in the blanks."

predictive model: A structure and process for predicting the values of specified variables in a data set.

prevalence: The percentage of times that a given association occurs in a node or a cluster. This can be expressed as "90 percent of members use Gold credit cards in node xyz."

prospective data analysis: Data analysis that predicts future trends, behaviors, or events based on historical data.

pruning: The process of eliminating nodes from a decision tree when it has been determined that those nodes do not reflect accurate predictive descriptions.

R

RAID: See Redundant Array of Inexpensive Disks.

range: The difference between the maximum value and the minimum value.

RDBMS: See Relational Database Management System.

Redundant Array of Inexpensive Disks (RAID): A technology for the efficient parallel storage of data for high-performance computer systems.

regression tree: A decision tree that predicts values of continuous, numerical variables.

Relational Database Management System (RDBMS): An application that is designed to manage data, but unlike a simple DBMS, it also manages the relations and constraints between tables.

resubstitution error: The estimate of error based on the differences between the predicted values of a trained model and the observed values in the training set.

retrospective data analysis: Data analysis that provides insights into trends, behaviors, or events that occurred in the past in an effort to perform exploratory analysis of the data.

right-hand side: See consequent.

r-squared: A number between 0 and 1 that measures how well a model fits its training data. A value of 1 represents a perfect fit, and 0 means the model has no predictive ability. The calculation squares the correlation between the predicted and observed values, that is, their covariance divided by the product of their standard deviations.
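Restating that description as a formula (this is the standard form, not notation quoted from the book), with predicted values written as y-hat and observed values as y:

```latex
r^{2} = \left( \frac{\operatorname{cov}(\hat{y},\, y)}{\sigma_{\hat{y}}\,\sigma_{y}} \right)^{2}
```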

rule induction: The extraction of useful "if-then" rules from data based on statistical significance.

S

sampling: Creating a subset of data from the whole population.

sensitivity analysis: Experimenting with the parameters of a model by changing their values and assessing the effect on the predictive output.

sequence discovery: The same as an association algorithm, except that the time sequence of events is also considered. So if an event occurs after another event, the model is also able to predict how long after the first event the second will occur.

significance: A probability measure of how strongly the data supports a certain output. For instance, if the significance of a result is said to be 10 percent, it means that there is only a 10 percent chance that the result could have occurred by coincidence. The lower the significance, the better the output is because it is less likely that the event was unrelated to the input variables.

SMP: See symmetric multiprocessor.

supervised learning: The process by which an algorithm trains on data and builds a model based on known input parameters. This is used when you know what you are looking for and need to apply the model to your data.

support: See confidence.

symmetric multiprocessor (SMP): A type of multiprocessor computer in which simultaneous processing tasks are shared equally among all the processors. Often this feature is a function of the operating system.

T

terabyte: One trillion bytes or 1000 gigabytes.

test data: A case set that was not used to build the data-mining model. Using this data, it's possible to assess the accuracy of the model.

test error: The estimate of error based on the results of applying the test data to the data-mining model.

time series analysis: The analysis of a sequence of measurements made at specified time intervals. Time is the most significant dimension in this type of analysis.

time series model: A model that forecasts future values of a time series based on past values. The predictions are made based on values as they change over time, so the predictions sought also require time values as a variable.

topology: For a neural net, topology refers to the number of layers and the number of nodes in each layer.

training: The process of creating the initial data-mining model is referred to as "training the model." The end result is a model that accurately represents the trends and patterns as they exist in the population of data.

training data: The set of cases used to train the model.

transformation: A structuring and re-expressing of data, such as aggregating it, normalizing it, changing its unit of measure, or cleansing it.

U

unsupervised learning: Refers to the collection of techniques in which groupings of the data are defined without the use of a dependent variable. This type of process is used when the outputs are essentially unknown. Cluster analysis is an example.

V

validation: The process of testing the models with a data set other than the original training cases. By using a different set of cases, the model is put to the test of accurately predicting values that were not initially known at training time.

variance: A statistical measure of dispersion. The individual deviations of each data item from the average value are squared, and then the average of those squared deviations is taken to obtain the overall measure of variability.
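Written as a formula (standard notation, not quoted from the book), the population variance of values x1 through xn with mean x-bar is:

```latex
\operatorname{Var}(x) = \frac{1}{n} \sum_{i=1}^{n} \left( x_i - \bar{x} \right)^{2}
```

The entry describes the population form; a sample variance divides by n - 1 instead of n.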

visualization: Data-mining models can be viewed by representations displayed in a variety of graphs and charts. These tools are designed to eliminate some of the complexity surrounding the understanding of the model. For instance, a decision tree can be viewed as a series of interconnected nodes. A clustering model can be viewed using a scatter point chart to see the quantity and density of clusters.

W

windowing: When training a model using time series data, the window refers to the period of time that will be used to make a subsequent prediction. To predict a value for next week, you might need to use a window of five weeks' worth of time-series data.
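A minimal Python sketch of windowing over a weekly series follows; the sales figures are invented, and a simple average of the window stands in for the time series model that would normally produce the prediction.

```python
# Minimal sketch of windowing: use the previous five weekly values as the
# inputs for predicting the next week. The series is invented, and the
# "prediction" is just the window average, a placeholder for a real model.
weekly_sales = [120, 132, 128, 140, 151, 149, 160, 158]
WINDOW = 5

for i in range(WINDOW, len(weekly_sales)):
    window = weekly_sales[i - WINDOW:i]          # five weeks of history
    prediction = sum(window) / len(window)       # placeholder "model"
    print(f"history {window} -> predict week {i + 1}: {prediction:.1f} "
          f"(actual {weekly_sales[i]})")
```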


