Data Mining for Discovering Software Effort Patterns

Software development and maintenance represent a significant expense for an organization (Banker & Slaughter, 1997). Software development carries the risk of implementation delays and of an organization's inability to abandon a troubled project (Pendharkar, Rodger, & Kumar, 1997). Delays in a software development effort translate into a weakened competitive position for the company. Further, the strategic nature of software development, the pressure to succeed, and a short-term focus lead some software project managers to make risky budget and schedule decisions (Pendharkar et al., 1997).

Although rarely reported because of the confidential nature of software development, many software development projects fail. Table 1 lists some notable software project failures and the associated losses in U.S. dollars. Most failures in software development result from poor design decisions and from poor effort, schedule, and budget estimation. In the current research, we focus on using data mining for better software effort estimation.

Table 1: A few major software project failures in the US and UK

Project Name                          Loss (millions of US dollars)
Confirm CRS Project                   $213
PROMS Project                         $16.5
London Computer Dispatch Systems      $4.5
London Stock Exchange                 $112.5

Source: Flowers, S. (1996)

There are four essential aspects of software development: the tools, the methodology, the people, and the management (Foss, 1993; Subramanian & Zarnich, 1996). Since schedule and budget decisions are based on software effort estimates, we focus on the impact of the different aspects of software development on software effort estimation. Subramanian and Zarnich (1996) and Banker and Slaughter (1997) report that software effort depends on the software development tools, the software development methodology, the developers' experience with the development tools, and the project size. In reality, software effort depends on several complex variables (including the ones identified by Subramanian and Zarnich [1996] and Banker and Slaughter [1997]) whose relationships are often not clear. Given the lack of information on the interrelationships among these variables, it is difficult to establish any specific parametric form for software effort dynamics. Based on our review of the data mining literature, we believe that a connectionist model can be used to discover the nonparametric and nonlinear relationships among the various predictor variables.

We use the software project data from the Subramanian and Zarnich (1996) study to discover the nonlinear, nonparametric relationships between the predictor variables. Our nonlinear model is derived from that study and is defined as follows:

  • Software Effort = f(software development methodology, tools used, developers' experience with tools)

The Subramanian and Zarnich (1996) data set consists of 40 software projects obtained from two major companies in the northeastern U.S. The data set includes an assigned project number (1 through 40), tool name (IEF or INCASE), methodology (RAD or SDLC), tool experience (low, medium, or high), actual effort in man-months, adjusted function points (AFP), unadjusted function points (UFP), and technical complexity factor (TCF). Software productivity, measured as the number of function points per man-month (FPMM), is calculated as AFP divided by effort. The TCF values did not vary much across the projects [mean = 0.8923; S.D. = 0.1126], which suggests that the projects were comparable.
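
As a minimal illustration of how the productivity measure is derived, the sketch below builds a few hypothetical project records and computes FPMM as AFP divided by actual effort. The field layout mirrors the description above, but the numeric values are invented for illustration and are not taken from the Subramanian and Zarnich (1996) data.

    # Minimal sketch: hypothetical project records and the FPMM productivity measure.
    # The values below are invented for illustration only.
    projects = [
        # (project_no, tool,     methodology, tool_experience, effort_mm, afp,   ufp,   tcf)
        (1,            "IEF",    "RAD",       "high",          12.0,      420.0, 470.0, 0.89),
        (2,            "INCASE", "SDLC",      "low",           30.0,      510.0, 565.0, 0.90),
        (3,            "IEF",    "SDLC",      "medium",        18.0,      350.0, 400.0, 0.88),
    ]

    def fpmm(afp, effort_mm):
        """Software productivity: adjusted function points per man-month."""
        return afp / effort_mm

    for p in projects:
        print(f"Project {p[0]}: FPMM = {fpmm(p[5], p[4]):.1f}")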

We take the data on the 40 software projects and divide it into two sets of 30 and 10 projects. We used stratified sampling to break the data into the two sets so that, as far as possible, both sets contain similar types of projects. The small sample size precluded random sampling, as we believed that random sampling might hurt the representativeness of the two sets (i.e., the assumption that both sets come from a common distribution). Whenever a large sample is available, we suggest that random sampling would be more appropriate. We use the set of 30 projects to train the connectionist model, and the set of 10 projects serves as our test data. The interested reader is directed to Subramanian and Zarnich (1996) for more information on the data set used in the study.
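
The chapter does not spell out the exact stratification procedure, so the following is only one plausible reading: group the projects by type and allocate a fixed fraction of each group to training and the remainder to testing. It reuses the hypothetical `projects` list from the sketch above; the choice of (tool, methodology) as the stratification key and the 75/25 proportion are assumptions.

    import random
    from collections import defaultdict

    def stratified_split(records, key, train_fraction=0.75, seed=42):
        """Split records into train/test sets while preserving the mix of project types."""
        rng = random.Random(seed)
        groups = defaultdict(list)
        for rec in records:
            groups[key(rec)].append(rec)

        train, test = [], []
        for members in groups.values():
            rng.shuffle(members)
            cut = max(1, round(len(members) * train_fraction))
            train.extend(members[:cut])
            test.extend(members[cut:])
        return train, test

    # Stratify on the (tool, methodology) combination, one plausible choice.
    train_set, test_set = stratified_split(projects, key=lambda p: (p[1], p[2]))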

Since we had a small sample size for both training and testing, we carefully designed our network. Among the design issues were:

  1. Normalization of the Dependent Variable: Since we were using the logistic activation function f(x) = 1/(1 + e^(-x)), it can be shown that 0 < f(x) < 1 for every input x.

    We therefore normalized our dependent variable of software productivity (y) so that y ∈ (0, 1); see the training sketch after this list.

  2. Training Sample Size: Our initial decision on a training set size of 30 was based on the heuristic that the training set size should be at least 10 times the number of independent variables (Weiss & Kapouleas, 1989). Since we had 3 independent variables, we selected a training set size of 30. We understand that a larger set would be more appropriate, but the lack of larger data sets forced us to settle for satisfying the constraint with equality.

  3. Learning, Generalizability, and Overlearning Issues: The network convergence criterion and learning rate determine how quickly and how well a network learns. A lower learning rate increases the time the network takes to converge, but it tends to find a better solution. The learning rate was set to 0.08. The convergence criterion was set as follows:

     If |Actual_Effort - Predicted_Effort| < 0.1 Then
         Convergence = Yes
     Else If Training_Iterations >= Maximum_Iterations Then
         Convergence = Yes

    We selected the above convergence criterion to account for the high variability of the dependent variable. A stricter convergence criterion was possible, but it raises the issue of overfitting the network to the training data. Overfitting was a concern because there is evidence in the literature that overfitting minimizes the sum of squared errors on the training set at the expense of performance on the test set (Bhattacharyya & Pendharkar, 1998). We believed that the above convergence criterion would make the network's learning more generalizable; the stopping rule is illustrated in the sketch after this list.

  4. Network Structural Issues: The network structure that we chose for our study is shown in Figure 1. We used a two-layer network to model a nonlinear relationship between the independent variables and the dependent variable. The number of hidden nodes was set to twice the number of input nodes plus one, a common heuristic for smaller sample sizes (Bhattacharyya & Pendharkar, 1998). For larger sample sizes, a higher number of hidden nodes is recommended (Patuwo, Hu, & Hung, 1993). For our research, we tried two different numbers of hidden nodes, 7 and 4.

  5. Input and Weight Noise: One way to develop a robust neural network model is to add some noise to its input and weight nodes while the network is training. Adding random input noise helps the network avoid local minima and makes it less sensitive to changes in input values. Weight noise shakes the network and can help it jump out of a gradient direction that leads to a local minimum. In one of our experiments, we added input-variable and weight noise during the network training phase (see the sketch after this list).
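
To make the design choices above concrete, the following is a minimal backpropagation sketch in plain NumPy: the (0, 1) normalization from point 1, the 0.08 learning rate and tolerance-based stopping rule from point 3, the 2n + 1 hidden nodes from point 4, and the input/weight noise from point 5. It is an illustrative reconstruction rather than the authors' original implementation; the placeholder data, the encoding of the three predictors, the noise magnitudes, and the application of the 0.1 tolerance on the normalized scale are all assumptions.

    import numpy as np

    rng = np.random.default_rng(0)

    def logistic(z):
        # Logistic activation: output always lies in the open interval (0, 1).
        return 1.0 / (1.0 + np.exp(-z))

    def normalize(y):
        # Min-max scale the dependent variable into (0, 1) for the logistic output node.
        lo, hi = y.min(), y.max()
        return (y - lo) / (hi - lo + 1e-6), lo, hi

    # Placeholder training data: 30 projects, 3 encoded predictors (methodology, tool,
    # experience) and one target.  The encodings and values here are invented.
    X = rng.random((30, 3))
    y, y_lo, y_hi = normalize(rng.random(30) * 40.0)

    n_in = 3
    n_hidden = 2 * n_in + 1          # the 2n + 1 heuristic from design point 4 (7 nodes)
    W1 = rng.normal(0.0, 0.5, (n_in, n_hidden))
    W2 = rng.normal(0.0, 0.5, (n_hidden, 1))

    learning_rate = 0.08             # design point 3
    tolerance = 0.1                  # convergence band, applied here on the normalized scale
    max_iterations = 10_000
    input_noise_sd, weight_noise_sd = 0.01, 0.001   # assumed magnitudes (design point 5)

    for iteration in range(max_iterations):
        # Design point 5: jitter the inputs and the weights used in this pass.
        Xn = X + rng.normal(0.0, input_noise_sd, X.shape)
        W1n = W1 + rng.normal(0.0, weight_noise_sd, W1.shape)
        W2n = W2 + rng.normal(0.0, weight_noise_sd, W2.shape)

        # Forward pass through one hidden layer of logistic units.
        hidden = logistic(Xn @ W1n)
        pred = logistic(hidden @ W2n).ravel()

        # Design point 3: stop when every prediction is within the tolerance band,
        # otherwise continue until the iteration budget is exhausted.
        if np.all(np.abs(y - pred) < tolerance):
            break

        # Backpropagation of squared error with the 0.08 learning rate.
        delta_out = (pred - y).reshape(-1, 1) * (pred * (1 - pred)).reshape(-1, 1)
        delta_hid = (delta_out @ W2n.T) * hidden * (1 - hidden)
        W2 -= learning_rate * hidden.T @ delta_out
        W1 -= learning_rate * Xn.T @ delta_hid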

The evolutionary model used for forecasting software effort can be represented in general form as:

Software effort = w1 · (development methodology)^w2 + w3 · (tool)^w4 + w5 · (experience)^w6

where wi, i ∈ {1, 2, 3, 4, 5, 6}, are the coefficients learned by the evolutionary model. For the wi, the search space is extremely large, and it is hard to find a competitive solution (i.e., one that does as well as the ANN) without a very large computational effort. Thus, it is important to restrict the search space so that competitive solutions can be found in real time without significant computation. The design consideration for the evolutionary model therefore reduces to deciding on the search space of the solution vector. When w2, w4, and w6 are equal to 1, the forecasting model becomes a linear forecasting model. In the case of the linear forecasting model, it can be shown that at least one good solution exists when the search space for the solution vector is restricted to the real-valued closed interval [-1, 1].

The search space for the nonlinear model was also kept to the closed real-valued interval [-1, 1]. The reasons for selecting this interval were that 1) the nonlinear model will perform at least as well as the linear model, and 2) the real-valued interval allows the model to incorporate inverse relationships, such as a reduction in effort with an increase in programmer experience (which means that w6 was allowed to take negative values).
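
As a minimal sketch of how such an evolutionary search might look with all six coefficients confined to [-1, 1], the following uses a simple real-valued genetic algorithm over the power model above. The fitness function (negative sum of squared errors), population size, crossover, and mutation scheme are assumptions for illustration, not the authors' exact configuration, and the training data are invented placeholders.

    import numpy as np

    rng = np.random.default_rng(1)

    def predict(w, methodology, tool, experience):
        """Nonlinear forecasting model: w1*m^w2 + w3*t^w4 + w5*e^w6."""
        return (w[0] * methodology ** w[1]
                + w[2] * tool ** w[3]
                + w[4] * experience ** w[5])

    def fitness(w, X, y):
        """Negative sum of squared forecast errors (higher is better)."""
        preds = np.array([predict(w, *row) for row in X])
        return -np.sum((preds - y) ** 2)

    # Placeholder training data: positive encodings keep the power terms real-valued.
    X = rng.uniform(0.1, 1.0, (30, 3))
    y = rng.uniform(0.1, 1.0, 30)

    pop_size, generations, mutation_sd = 50, 200, 0.05
    population = rng.uniform(-1.0, 1.0, (pop_size, 6))   # coefficients restricted to [-1, 1]

    for _ in range(generations):
        scores = np.array([fitness(w, X, y) for w in population])
        order = np.argsort(scores)[::-1]
        parents = population[order[: pop_size // 2]]      # keep the better half

        # Offspring: uniform crossover between random parents plus Gaussian mutation,
        # clipped back into the [-1, 1] search interval.
        n_children = pop_size - len(parents)
        idx = rng.integers(0, len(parents), (n_children, 2))
        mask = rng.random((n_children, 6)) < 0.5
        children = np.where(mask, parents[idx[:, 0]], parents[idx[:, 1]])
        children = np.clip(children + rng.normal(0.0, mutation_sd, children.shape), -1.0, 1.0)
        population = np.vstack([parents, children])

    best = population[np.argmax([fitness(w, X, y) for w in population])]
    print("Best coefficient vector:", best)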


