You can use an approach known as fuzzy logic to estimate a project's size in lines of code (Putnam and Myers 1992, Humphrey 1995). Estimators are usually capable of classifying features as Very Small, Small, Medium, Large, and Very Large. We can then use historical data about how many lines of code the average Very Small feature requires, how many lines of code the average Small feature requires, and so on to compute the total lines of code. Table 12-1 shows an example of how such an estimate might be created.
Feature Size | Average Lines of Code per Feature | Number of Features | Estimated Lines of Code |
---|---|---|---|
Very Small | 127 | 22 | 2,794 |
Small | 253 | 15 | 3,795 |
Medium | 500 | 10 | 5,000 |
Large | 1,014 | 30 | 30,420 |
Very Large | 1,998 | 27 | 53,946 |
TOTAL | - | 104 | 95,955 |
The entries in the Average Lines of Code per Feature column in the table should be based on your organization's historical data and are fixed before the estimation begins. The Number of Features column is a count of how many features you have classified into each size category. The Estimated Lines of Code column is computed from the other two columns. As shown, the estimate has 5 significant digits, which is well beyond the accuracy of the underlying numbers. If I were presenting this estimate, I would present it as "96,000 lines of code" or even "100,000 lines of code" (that is, to one or two significant digits) to avoid using too much precision and conveying a false sense of accuracy.
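The arithmetic behind the table can be sketched in a few lines of Python. This is a minimal illustration rather than a prescribed tool: the dictionary names and the `estimate_total` and `round_to_significant` helpers are hypothetical, and the numbers are the illustrative values from the table above.

```python
from math import floor, log10

# Average lines of code per feature, by size category (illustrative table values).
AVERAGE_LOC = {
    "Very Small": 127,
    "Small": 253,
    "Medium": 500,
    "Large": 1014,
    "Very Large": 1998,
}

# Number of features the estimators placed in each category.
FEATURE_COUNTS = {
    "Very Small": 22,
    "Small": 15,
    "Medium": 10,
    "Large": 30,
    "Very Large": 27,
}


def estimate_total(average_per_feature, feature_counts):
    """Sum (average size x number of features) over all size categories."""
    return sum(average_per_feature[size] * count
               for size, count in feature_counts.items())


def round_to_significant(value, digits):
    """Round value to the given number of significant digits."""
    if value == 0:
        return 0
    magnitude = floor(log10(abs(value)))
    return round(value, digits - 1 - magnitude)


total_loc = estimate_total(AVERAGE_LOC, FEATURE_COUNTS)
print(total_loc)                           # 95955
print(round_to_significant(total_loc, 2))  # 96000
print(round_to_significant(total_loc, 1))  # 100000
```

Rounding is applied only when the number is presented; the unrounded total is kept for any further arithmetic.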
Fuzzy logic works best when the sizes are calibrated from your organization's historical data. As a rule of thumb, the differences in size between adjacent categories should be at least a factor of 2. Some experts recommend a factor of 4 difference (Putnam and Myers 1992).
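One way to check the rule of thumb against a set of calibrated averages is sketched below. The function names and the `tolerance` parameter are assumptions of this sketch, not part of the technique; the tolerance is there because averages from real data, like the illustrative ones in Table 12-1, tend to land at roughly rather than exactly a factor of 2.

```python
def adjacent_ratios(averages_in_order):
    """Return the size ratio between each pair of adjacent categories.

    averages_in_order lists the category averages from smallest to largest.
    """
    return [larger / smaller
            for smaller, larger in zip(averages_in_order, averages_in_order[1:])]


def weakly_separated(averages_in_order, min_factor=2.0, tolerance=0.05):
    """Return (index, ratio) for adjacent pairs separated by less than
    min_factor - tolerance. The tolerance is a hypothetical allowance."""
    return [(index, ratio)
            for index, ratio in enumerate(adjacent_ratios(averages_in_order))
            if ratio < min_factor - tolerance]


# Illustrative averages for Very Small through Very Large.
averages = [127, 253, 500, 1014, 1998]
print([round(ratio, 2) for ratio in adjacent_ratios(averages)])  # [1.99, 1.98, 2.03, 1.97]
print(weakly_separated(averages))                                # []
```

An empty result means every adjacent pair is separated by about a factor of 2 or more. Passing `min_factor=4.0` applies the stricter recommendation instead.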
You should create the initial size averages by classifying completed work from one or more completed systems. Go through the past system and classify each feature as Very Small, Small, Medium, Large, or Very Large. Then count the total number of lines of code for the features in each classification and divide that by the number of features to arrive at the average lines of code for each feature classification. Table 12-2 shows an example of how this might work out.
Size | Number of Features | Count of Total LOC | Average LOC |
---|---|---|---|
Very Small | 117 | 14,859 | 127 |
Small | 71 | 17,963 | 253 |
Medium | 56 | 28,000 | 500 |
Large | 169 | 171,366 | 1,014 |
Very Large | 119 | 237,762 | 1,998 |
The numbers in this table are purely for purposes of illustration. You should work out your own numbers by using your own organization's historical data.
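The calibration step can be sketched as follows, assuming you can export each completed feature as a pair of its size classification and its measured lines of code. The `calibrate_averages` function and the sample `history` records are hypothetical and exist only to show the calculation.

```python
from collections import defaultdict

SIZE_ORDER = ["Very Small", "Small", "Medium", "Large", "Very Large"]


def calibrate_averages(classified_features):
    """Return {size: average LOC} from (size, lines_of_code) pairs.

    Categories with no completed features are left out of the result.
    """
    total_loc = defaultdict(int)
    feature_count = defaultdict(int)
    for size, lines_of_code in classified_features:
        total_loc[size] += lines_of_code
        feature_count[size] += 1
    return {size: total_loc[size] / feature_count[size]
            for size in SIZE_ORDER if feature_count[size] > 0}


# Hypothetical records from a completed system: (classification, measured LOC).
history = [
    ("Small", 210), ("Small", 296),
    ("Medium", 450), ("Medium", 520), ("Medium", 530),
    ("Large", 1100),
]
print(calibrate_averages(history))
# {'Small': 253.0, 'Medium': 500.0, 'Large': 1100.0}
```

The same division reproduces the Average LOC column in the table above; for example, 14,859 total lines of code across 117 Very Small features gives an average of 127.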
Tip #55: Use fuzzy logic to estimate program size in lines of code.
When assigning new functionality to size categories, it's important that the assumptions about what constitutes a Very Small, Small, Medium, Large, or Very Large feature in the estimate are the same as the assumptions that went into creating the average sizes in the first place. You can accomplish this in any of three ways:
- Have the same people who are going to create the estimate create the original numbers for the sizes.
- Train the estimators so that they classify features accurately.
- Document the specific criteria for Very Small, Small, Medium, Large, and Very Large so that estimators can apply the size categories consistently.
One interesting aspect of statistics is that statistical summaries can have more validity than any of the individual data points that make up the summary. As discussed in Chapter 10, "Decomposition and Recomposition," the Law of Large Numbers gives the rolled-up estimate an accuracy above and beyond the accuracy of the individual estimates. The whole is truly greater than the sum of its parts.
When using fuzzy logic, it's important to remember this phenomenon: the rolled-up number has a validity that the underlying numbers do not have. The reason fuzzy logic works is that we can safely assume that if 71 Small features required an average of 253 lines of code in the past, 15 Small features will probably average approximately 253 lines of code in the future. However, the fact that the average is 253 lines of code does not mean that any specific feature will actually consist of 253 lines of code. The sizes of individual Small features could range from 50 lines of code to 1,000 lines of code. So, although the rolled-up estimate produced by fuzzy logic can be surprisingly accurate, you should not overextend the technique to make estimates of sizes of specific features.
By the same token, the fuzzy logic approach works well when you have about 20 features or more. If you don't have at least 20 total features to estimate, the statistics of this approach won't work properly, and you should look for another method.
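The effect of the number of features can be illustrated with a small calculation. This sketch rests on simplifying assumptions that are not from the text: feature sizes within a category are treated as independent, with the same mean and the same spread, and the historical average itself is treated as exact. Under those assumptions the relative spread of the total shrinks with the square root of the number of features. The coefficient of variation of 0.8 is a hypothetical value chosen only for illustration.

```python
from math import sqrt


def relative_spread_of_total(per_feature_cv, feature_count):
    """Relative standard deviation of the summed size of feature_count features.

    per_feature_cv is the coefficient of variation (standard deviation divided
    by mean) of one feature's size. Assumes independent features that share
    the same mean and standard deviation.
    """
    return per_feature_cv / sqrt(feature_count)


HYPOTHETICAL_CV = 0.8  # one feature's size varies by about 80% of the mean

for feature_count in (1, 4, 16, 25, 100):
    spread = relative_spread_of_total(HYPOTHETICAL_CV, feature_count)
    print(f"{feature_count:>3} features: total varies by about {spread:.0%}")
```

With one feature the total is as uncertain as the feature itself; with 25 features the same per-feature uncertainty produces a total that varies by about a fifth as much.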
Fuzzy logic can also be used to estimate effort if you have the underlying data to support it. Table 12-3 shows an example of how that would work.
Size | Average Staff Days per Feature | Number of Features | Estimated Effort (Staff Days) |
---|---|---|---|
Very Small | 4.2 | 22 | 92.4 |
Small | 8.4 | 15 | 126 |
Medium | 17 | 10 | 170 |
Large | 34 | 30 | 1,020 |
Very Large | 67 | 27 | 1,809 |
TOTAL | - | 104 | 3,217 |
The numbers shown in the table are purely for purposes of illustration, and you would need to derive your own Average Staff Days per Feature from your organization's historical data.
The final estimate of 3,217 staff days is again too precise. You could simplify it to 3,200 staff days, 3,000 staff days, or 13 staff years (assuming 250 staff days per year). You can also always consider presenting the number as a range, such as 10 to 15 staff years, which would communicate an entirely different accuracy than would 3,217 staff days.
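The effort calculation and its simplified presentations can be sketched the same way. The variable names are hypothetical, the staff-day averages are the illustrative values from the table above, and the 250 staff days per year is the assumption stated in the text.

```python
# Average staff days per feature, by size category (illustrative table values).
AVERAGE_STAFF_DAYS = {
    "Very Small": 4.2,
    "Small": 8.4,
    "Medium": 17,
    "Large": 34,
    "Very Large": 67,
}

# Number of features classified into each category.
FEATURE_COUNTS = {
    "Very Small": 22,
    "Small": 15,
    "Medium": 10,
    "Large": 30,
    "Very Large": 27,
}

STAFF_DAYS_PER_YEAR = 250  # assumption stated in the text

total_staff_days = sum(AVERAGE_STAFF_DAYS[size] * count
                       for size, count in FEATURE_COUNTS.items())
staff_years = total_staff_days / STAFF_DAYS_PER_YEAR

print(round(total_staff_days))      # 3217
print(round(total_staff_days, -2))  # 3200.0
print(round(staff_years))           # 13
```

A range such as 10 to 15 staff years is a judgment about accuracy and does not fall out of this arithmetic; the code only produces the point estimate that the range is built around.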