APPLYING DATA MINING TECHNIQUES TO HR INFORMATION SYSTEMS
Many organizations have hurriedly approached data mining as a solution to the problems presented by these large databases. However, caution must be exercised in using data mining. The blind application of data-mining techniques can easily lead to the discovery of meaningless and invalid patterns. If one searches long enough in any data set, it is possible to find patterns that appear to hold but are not necessarily statistically significant or useful (Fayyad, Piatetsky-Shapiro, & Smyth, 1996). There has not been any specific exploration of applying these techniques to human resource applications; however, there are some guidelines in the process that are transferable to an HRIS. Feelders, Daniels, and Holsheimer (2000) outline six important steps in the data-mining process: 1) problem definition, 2) acquisition of background knowledge, 3) selection of data, 4) pre-processing of data, 5) analysis and interpretation, and 6) reporting and use. At each of these steps, we will look at important considerations as they relate to mining human resources databases. Further, we will examine some specific legal and ethical considerations of data mining in the HR context.
Problem Definition and Acquisition of Background Knowledge
The formulation of the questions to be explored is an important aspect of the data-mining process. As mentioned earlier, with enough searching or the application of sufficiently many techniques, one might be able to find useless or ungeneralizable patterns in almost any set of data. Therefore, the effectiveness of a data-mining project is improved by establishing some general lines of inquiry before starting the project. To this extent, data mining and the more traditional statistical studies are similar. Thus, careful attention should be paid to the scientific method and to sound research methods. A widely respected source of guidelines on research methods is the book by Kerlinger and Lee (2000). A certain level of expertise is necessary to evaluate questions carefully. Data-mining and statistical expertise is an obvious requirement, but one must also have an intimate understanding of the data that is available, along with its business context. Furthermore, some subject matter expertise is needed to determine useful questions, select relevant data, and interpret results (Feelders et al., 2000). For example, a firm with an interest in evaluating the success of an affirmative action program needs to understand the Equal Employment Opportunity (EEO) classification system to know what data is relevant.
Subject matter involvement is also necessary. A compensation specialist, for example, is needed to data mine a payroll system. Once again, the specialist in the area will tend to understand the coding and organization of the information, which helps in setting up the problem to be evaluated. Typically, when an organization has a database large enough to be mined, there is not necessarily one individual in the organization who is an expert in every area. By seeking out a specialist, the firm can ensure that the proper expertise is available.
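The caution about searching until a pattern appears can be illustrated with a short simulation. The following Python sketch is an illustration added for this discussion and is not taken from the cited studies. It generates a purely random outcome for a hypothetical group of 20 employees, screens 200 equally random attributes against it, and reports the strongest correlation found. All names and sizes are assumptions chosen for the example.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def strongest_spurious_correlation(n_employees=20, n_attributes=200, seed=0):
    """Screen many random attributes against a random outcome.

    Every attribute is pure noise, so any correlation found is spurious.
    """
    rng = random.Random(seed)
    outcome = [rng.gauss(0, 1) for _ in range(n_employees)]
    correlations = []
    for _ in range(n_attributes):
        attribute = [rng.gauss(0, 1) for _ in range(n_employees)]
        correlations.append(pearson(attribute, outcome))
    return max(abs(r) for r in correlations), correlations


best, all_r = strongest_spurious_correlation()
print(f"Strongest correlation among {len(all_r)} noise attributes: {best:.2f}")
```

Although every attribute is noise by construction, the strongest of the 200 correlations is usually sizeable. This is one reason the questions of interest should be settled before the search begins.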
Another important consideration in the process of developing a question to look at is the role of causality (Feelders et al., 2000). A subject matter expert's involvement is important in interpreting the results of the data analysis. For example, a firm might find a pattern indicating a relationship between high compensation levels and extended length of service. The question then becomes, do employees stay with the company longer because they receive high compensation? Or do employees receive higher compensation if they stay longer with the company? An expert in the area can take the relationship discovered and build upon it with additional information available in the organization to help understand the cause and effect of the specific relationship identified.
Once the proper expertise is available, the next step is to formulate a question for the data mining to explore. There are many questions for which HR professionals and company management would like to find answers. Table 3 provides some examples of questions organizations may consider important to their business operations. These are questions to which an HRIS and data mining could provide answers. These are just a few examples; the possibilities are plentiful.
Every organization has different data stored in its HRIS and different organizational information needs. The interest in relationships may depend upon the market in which a company is competing or its current growth cycle stage. A company that is still growing may focus on different aspects than would a mature company. For example, a growing company may focus on identifying effective recruiting techniques. A mature company that is looking to retain its already experienced staff may focus more on benefits and quality of life for its personnel. Following this discussion, we will walk through some specific examples of HR practices about which an organization may want to seek additional information.
Selection and Pre-processing of Data
Selecting and preparing the data is the next step in the data-mining process. Some organizations have independent Human Resource Information Systems that feature multiple databases that are not connected to each other. This type of system is sometimes selected to offer greater flexibility to remote organizational locations or sub-groups with unique information needs (Anthony, Perrewe, & Kacmar, 1996). When multiple databases exist, possible inconsistencies in their design can make data mining difficult. Data warehousing can prevent this problem, and an organization may need to create a data warehouse before it begins a data-mining project. However, the work of reconciling and editing the data cannot be avoided altogether; it is simply carried out as a step in developing the data warehouse or datamart. The advantage gained in first developing the data warehouse or mart is that most of the data-editing work is done at the start.
Another challenge in mining data is dealing with missing or noisy data. Data quality may be insufficient if data is collected without any specific analysis in mind (Feelders et al., 2000). This is especially true for human resource information. Typically, HR data is collected to meet some administrative need such as payroll processing. The data required for the transaction is the only consideration in deciding what to collect. Future analysis needs and the value of the data collected are not usually considered. Missing data may also be a problem, especially if the system administrator does not have control over data input. Many organizations have taken advantage of web-based technology to allow employees to input and update their own data (McElroy, 1991). Employees may choose not to enter certain types of data, resulting in missing data. However, a data warehouse or datamart may help to prevent or systematize the handling of many of these problems.
If a data warehouse is not an economical solution for an organization, it still needs to properly prepare the data for analysis along the same lines. This step of data cleaning and pre-processing includes removing noise and deciding on strategies to handle missing data (Fayyad, Piatetsky-Shapiro, & Smyth, 1996). Cleaning data may include steps such as ensuring that proper coding is used or making sure that employee identification numbers are correct. The type of data to be cleaned depends on the question being asked. Cleaning the data can be a big project, and, therefore, consideration of the end result is important.
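The cleaning steps just described (checking identification numbers, enforcing proper coding, and choosing a strategy for missing values) might be sketched in Python as follows. This is a minimal illustration under stated assumptions, not a prescribed procedure: the field names, the six-digit identifier rule, the list of valid job codes, and the choice of median imputation are all hypothetical.

```python
import statistics


def clean_records(records, valid_job_codes):
    """Validate IDs and job codes; return (cleaned, rejected) record lists."""
    cleaned, rejected, seen_ids = [], [], set()
    for rec in records:
        emp_id = str(rec.get("employee_id", "")).strip()
        # Hypothetical rule: an ID is exactly six digits and must be unique.
        if not (emp_id.isdigit() and len(emp_id) == 6) or emp_id in seen_ids:
            rejected.append(rec)
            continue
        seen_ids.add(emp_id)
        job_code = str(rec.get("job_code", "")).strip().upper()
        if job_code not in valid_job_codes:
            job_code = None  # mark as missing rather than guess a code
        cleaned.append(
            {"employee_id": emp_id, "job_code": job_code, "salary": rec.get("salary")}
        )
    return cleaned, rejected


def impute_median(records, field):
    """One possible missing-data strategy: fill gaps with the field median."""
    known = [r[field] for r in records if r[field] is not None]
    if not known:
        return [dict(r) for r in records]
    median = statistics.median(known)
    return [dict(r, **{field: median if r[field] is None else r[field]}) for r in records]


raw = [
    {"employee_id": "100001", "job_code": "eng ", "salary": 50000},
    {"employee_id": "100002", "job_code": "HR", "salary": None},
    {"employee_id": "10003", "job_code": "ENG", "salary": 61000},   # malformed ID
    {"employee_id": "100001", "job_code": "ENG", "salary": 52000},  # duplicate ID
    {"employee_id": "100004", "job_code": "XX", "salary": 70000},   # unknown code
]
cleaned, rejected = clean_records(raw, valid_job_codes={"ENG", "HR"})
imputed = impute_median(cleaned, "salary")
```

Whether to reject, flag, or impute a problem value depends on the question being asked, so the rules above would need to be revisited for each project.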
Analysis and Interpretation
There are many types of algorithms in use in data mining. The choice of the algorithm depends on the intended use of the extracted knowledge (Brodley, Lane, & Stough, 1999). The goals of data mining can be broken down into two main categories. Some applications seek to verify a hypothesis formulated by the user. The other main goal is the systematic discovery of new patterns (Fayyad, Piatetsky-Shapiro, & Smyth, 1996). Within discovery, the data can be used either to predict future behavior or to describe patterns in an understandable form. A complete discussion of data-mining techniques is beyond the scope of this chapter. However, what follows is a fairly extensive survey of some techniques that have so far been used or have the potential to be applicable for data mining of human resources information.
Clustering and classification are examples of data-mining techniques borrowed from classical statistical methods that can help describe patterns in information. Clustering seeks to identify a small set of exhaustive and mutually exclusive categories to describe the data that is present (Fayyad, Piatetsky-Shapiro, & Smyth, 1996). This might be a useful application to human resource data if an organization were trying to identify a certain set of employees with consistent attributes. For example, an employer may want to find the main categories of top performers among its employees, with an eye towards tailoring various programs to the groups or studying such groups further. One category may be more or less appropriate for one type of training program. Another category may be similarly targeted for various kinds of corporate communication modes, and so on. A difficulty with clustering techniques is that no normative techniques are known that specify the correct number of clusters that should be formed. In addition, there exist many different logics that may be followed in forming the clusters. Therefore, the art of the analyst is critical.
Similarly, classification is a data-mining technique that maps a data item into one of several predefined classes (Fayyad, Piatetsky-Shapiro, & Smyth, 1996). Classification may be useful in human resources to classify trends of movement through the organization for certain sets of successful employees. A company is at an advantage when recruiting if it can point out some realistic career paths for new employees. Being able to support those career paths with information reflecting employee success can make this a strong resource for those charged with hiring in an organization.
Factor Analysis can also be mentioned here, as it is sometimes described as clustering of variables (Kerlinger & Lee, 2000) instead of observations. If many measures exist for some desirable employee trait, factor analysis may help to reduce them to a few manageable factors.
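As one concrete instance of the clustering idea discussed above, the following Python sketch applies k-means, a common clustering algorithm chosen here only for illustration. The employee attributes (tenure in years and a performance rating) and the data values are hypothetical. Consistent with the point that no normative rule fixes the number of clusters, `k` is simply supplied by the analyst.

```python
import random


def nearest(point, centers):
    """Index of the center closest to point (squared Euclidean distance)."""
    return min(
        range(len(centers)),
        key=lambda i: sum((a - b) ** 2 for a, b in zip(point, centers[i])),
    )


def kmeans(points, k, max_iterations=100, seed=0):
    """Basic k-means: returns (centers, clusters) for analyst-chosen k."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # start from k randomly chosen observations
    for _ in range(max_iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            clusters[nearest(p, centers)].append(p)
        # Move each center to the mean of its cluster; keep it if the cluster is empty.
        new_centers = [
            tuple(sum(col) / len(c) for col in zip(*c)) if c else centers[i]
            for i, c in enumerate(clusters)
        ]
        if new_centers == centers:
            break
        centers = new_centers
    # Final assignment against the final centers.
    clusters = [[] for _ in range(k)]
    for p in points:
        clusters[nearest(p, centers)].append(p)
    return centers, clusters


# Hypothetical (tenure_years, performance_rating) pairs.
employees = [(1, 2.0), (2, 2.5), (1.5, 2.2), (10, 4.5), (12, 4.8), (11, 4.6)]
centers, clusters = kmeans(employees, k=2)
```

In practice the variables would usually be put on a comparable scale first, and the analysis would be repeated for several values of `k` before the analyst settles on an interpretation.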
Decision Tree Analysis, also called tree or hierarchical partitioning, is a somewhat related technique but follows a very different logic and can be rendered somewhat more automatic. Here, a variable is chosen first in such a way as to maximize the difference or contrast formed by splitting the data into two groups. One group consists of all observations having a value higher than a certain value of the variable, such as the mean. Then, the complement, namely those lower than that value, becomes the other group.
Then, each half can be subjected to successive further splits with possibly different variables becoming important to different halves. For example, employees might first be split into two groups above and below average tenure with the firm. Then, the statistics of the two groups can be compared and contrasted to gain insights about employee turnover factors. A further split of the lower tenure group, say based on gender, may help prioritize those most likely to need special programs for retention. Thus, clusters or categories can be formed by binary cuts, a kind of divide and conquer approach. In addition, the order of variables can be chosen differently to make the technique more flexible. For each group formed, summary statistics can be presented and compared. This technique is a rather pure form of data mining and can be performed in the absence of specific questions or issues. It might be applied as a way of seeking interesting questions about a very large datamart. As another example, if applied to an HR datamart, one might notice that employees hired from a particular source have a higher average value on some desirable trait. Then, with a more careful random sampling and statistical hypothesis testing, the indicated advantage might be tested for validity. Also, the clusters or segments identified by either this approach or other clustering techniques can be further analyzed by another technique such as correlation and regression analysis.
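A single step of the partitioning logic described above can be sketched in Python. The sketch is an assumption-laden illustration rather than a full tree algorithm: each candidate variable is cut at its mean, and the variable whose two groups differ most on a chosen outcome is selected. The field names (`tenure_years`, `training_hours`, `satisfaction`) and the data are hypothetical, and observations exactly equal to the mean are placed in the lower group.

```python
from statistics import mean


def split_at_mean(rows, variable):
    """Cut rows into (low, high) groups at the mean of one variable."""
    cut = mean(r[variable] for r in rows)
    low = [r for r in rows if r[variable] <= cut]
    high = [r for r in rows if r[variable] > cut]
    return cut, low, high


def best_split(rows, candidates, outcome):
    """Pick the candidate variable whose mean split gives the largest
    difference in the average outcome between the two groups."""
    best = None
    for variable in candidates:
        cut, low, high = split_at_mean(rows, variable)
        if not low or not high:
            continue  # a one-sided split offers no contrast
        contrast = abs(mean(r[outcome] for r in high) - mean(r[outcome] for r in low))
        if best is None or contrast > best["contrast"]:
            best = {"variable": variable, "cut": cut, "contrast": contrast,
                    "low": low, "high": high}
    return best


rows = [
    {"tenure_years": 1, "training_hours": 10, "satisfaction": 2.0},
    {"tenure_years": 2, "training_hours": 40, "satisfaction": 2.5},
    {"tenure_years": 3, "training_hours": 20, "satisfaction": 3.0},
    {"tenure_years": 8, "training_hours": 30, "satisfaction": 4.0},
    {"tenure_years": 9, "training_hours": 15, "satisfaction": 4.5},
    {"tenure_years": 10, "training_hours": 35, "satisfaction": 4.0},
]
result = best_split(rows, ["tenure_years", "training_hours"], "satisfaction")
print(result["variable"], result["cut"], round(result["contrast"], 2))
```

Applying `best_split` again to `result["low"]` and `result["high"]` would produce the successive binary cuts described in the text.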
Regression and related models, also borrowed from classical statistics, permit estimation of a linear function of independent variables that best explains or predicts a given dependent variable. Since this technique is generally well known, we will not dwell on the details here. However, data warehouses and datamarts may be so large that direct use of all available observations is impractical for regression and similar studies. Thus, random sampling may be necessary to use regression analysis. Various nonlinear regression techniques are also available in commercial statistical packages and can be used in a similar way for data mining. Recently, a new model-fitting technique was proposed in Troutt, Hu, Shanker, and Acar (2001). In this approach, the objective is to explain the highest or lowest performers, respectively, as a function of one or more independent variables.
Neural Networks may be regarded as a special type of nonlinear regression model. Special purpose data-mining software typically provides this option for model building. One may apply it in much the same way that regression would be used in the case of one dependent variable. However, neural networks have the additional flexibility to handle more than one dependent variable to be predicted from the same set of independent variables.
Virtually any statistical or data analysis technique is potentially useful for data-mining studies. As noted above, however, it may be necessary to create a smaller sample by random sampling, rather than attempting to apply the technique directly to the entire set of available data.
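The combination of random sampling and regression mentioned above can be sketched briefly in Python. This is an illustration only: the "warehouse" is synthetic, generated under the assumption that salary rises by 2,000 per year of tenure from a base of 30,000 plus random noise, and the sample size of 500 is arbitrary. The fit uses ordinary least squares with a single independent variable.

```python
import random


def fit_line(xs, ys):
    """Ordinary least squares for one predictor: returns (slope, intercept)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Synthetic stand-in for a large datamart of (tenure_years, salary) records.
rng = random.Random(42)
warehouse = []
for _ in range(50_000):
    tenure = rng.uniform(0, 30)
    salary = 30_000 + 2_000 * tenure + rng.gauss(0, 5_000)
    warehouse.append((tenure, salary))

# Fit on a simple random sample instead of the full set of observations.
sample = rng.sample(warehouse, 500)
slope, intercept = fit_line([t for t, _ in sample], [s for _, s in sample])
print(f"Estimated salary = {intercept:.0f} + {slope:.0f} * tenure")
```

With a sample of this size the estimates should fall close to the values used to generate the data, which suggests that a well-drawn sample can stand in for the full datamart when fitting a model of this kind.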
Reporting and Use
The final step in the process emphasizes the value of the use of the information. The information extracted must be consolidated and resolved with previous information and then shared and acted upon (Fayyad, Piatetsky-Shapiro, Smyth, & Uthurusamy, 1996). Too often, organizations go through the effort and expense of collecting and analyzing data without any idea of how to use the information retrieved. Applying data-mining techniques to an HRIS can help support the justification of the investment in the system. Therefore, the firm should have some expected use for the information retrieved in the process.
As mentioned earlier, one use of human resource related information is to support decision making in the organization. The results obtained from data mining may be used for a full range of decision-making steps. They can be used to provide information to support a decision, or they can be fully integrated into an end-user application (Feelders et al., 2000). For example, a firm might be able to set up decision rules regarding employees based on the results of data mining. It might be able to determine when an employee is eligible for promotion or when a certain work group should be eligible for additional company benefits.
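A decision rule of the kind just mentioned might be embedded in an end-user application along the following lines. The Python sketch is hypothetical throughout: the thresholds stand in for values a firm might derive from its own analysis, and the fields (tenure and performance rating) are chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PromotionRule:
    """Eligibility thresholds, assumed here to come from a prior analysis."""
    min_tenure_years: float
    min_rating: float

    def is_eligible(self, tenure_years, rating):
        return tenure_years >= self.min_tenure_years and rating >= self.min_rating


# Hypothetical thresholds and employee records: (id, tenure_years, rating).
rule = PromotionRule(min_tenure_years=3.0, min_rating=4.0)
employees = [("E1", 5.0, 4.2), ("E2", 2.0, 4.8), ("E3", 6.0, 3.5)]

eligible = [emp_id for emp_id, tenure, rating in employees
            if rule.is_eligible(tenure, rating)]
print(eligible)
```

Keeping the thresholds in one small object makes it easier to revise the rule when the underlying analysis is repeated.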
Legal and Privacy Issues
Organizational leaders must be aware of legislation concerning legal and privacy issues when making decisions about using personal data collected from individuals in organizations (Hubbard, Forcht, & Thomas, 1998). By their nature, systems that collect employee information run the risk of invading the privacy of employees by allowing access to the information to others within the organization. Although there is no explicit constitutional right to privacy, certain amendments and federal laws have relevance to this issue as they provide protection for employees from invasion of privacy and defamation (Fisher, Schoenfeldt, & Shaw, 1999). Organizations can protect themselves from these employee concerns by having solid business reasons for any data collection from employees.
There are also some potential legal issues if a firm uses inappropriate information extracted from data mining to make employment-related decisions. Even if a manager has an understanding of current laws, he or she could still face challenges as laws and regulations constantly change (Ledvinka & Scarpello, 1992). An extreme example that may violate equal opportunity laws is a decision to hire only females in a job classification because the data mining uncovered that females were consistently more successful.
One research study found that an employee's ability to authorize disclosure of personal information affected their perceptions of fairness and invasion of privacy (Eddy, Stone, & Stone-Romero, 1999). Therefore, it is recommended that firms notify employees upon hire that the information they provide may be used in data analyses. Another recommendation is to establish a committee or review board to monitor any activities relating to analysis of personal information (Osborn, 1978). This committee can review any proposed research and ensure compliance with any relevant employment or privacy laws.
As noted in the above study (Eddy et al., 1999), the employee's reaction to the use of his or her information is often based upon personal perceptions. If there is a perception that the company is analyzing the data to take negative actions against employees, employees are more apt to object to the use. However, if the employer takes the time to notify employees and obtain their permission, even where there may be no legal consequences, the perception of negativity may be removed. Employee confidence is something that employers need to maintain.