4.4. What Should We Measure?

We have three metrics that measure the viability of any business:

  • Throughput: The rate at which the system generates sales

  • Inventory: The cost of anything you've built (even partway)

  • Operational expense: Money spent to turn inventory into throughput

However, these metrics are difficult to connect to the activities of software development. We can start improving them by adding some detail specific to software development organizations:

  • Throughput: The rate at which we generate income by selling the software we build

  • Inventory: The money spent in development of functionality the customer has not yet purchased

  • Operational expense: The cost to complete the development of the functionality required by the customer

We'd like throughput to go up while inventory and operational expense go down. However, these specific measures won't be particularly meaningful to the development team, and the team members won't feel a strong ability to influence them, so we need to go further.

For throughput, we need some measure of how the functionality we are developing relates to sales. Many agile teams have started estimating throughput using Ron Jeffries' Running Tested Features (RTF) metric[4]. This is a count of the number of features currently passing their acceptance tests. It is a great metric for helping your team to focus on producing customer-visible functionality instead of components.

However, RTF does not address the relative values of those features. Each feature adds one to the count regardless of the value the customer places on that feature. When applied within XP or Scrum where the planning process prioritizes customer requests, RTF might be sufficient. However, directly measuring the value of each feature would result in a more accurate estimate of throughput. In order to achieve that, each feature (story, user request, etc.) must be given a measure that reflects how much revenue it will generate. We call this metric "bells" (from the phrase "bells and whistles," meaning features that excite the customer).

Bells is a relative measure on a scale from one to five[4] reflecting the revenue the feature should bring. For a single-customer operation, bells reflects the amount the customer would pay for the feature, with one meaning that the customer would pay very little and five meaning that the customer would pay a lot (or wouldn't buy the product without that feature). For a retail product, the number of bells that a feature is given reflects the extent to which the feature will attract new customers or upgrades from existing customers. For example, a feature that distinguishes your product from your competitors' and is highly attractive to your market audience would get five bells; a feature that is useful but unlikely to motivate many people to move to your next release would get one or two.

[4] XP already has the customer ranking the priority of stories on a scale of 1 to 3, but that ranking is not usually used in this manner. The scale of 1 to 5 gives a little more granularity to what is still a relatively subjective measure.

Now that we have a measure of the value of each story, the number of bells delivered per staff day approximates throughput (the rate at which we generate revenue). That's one metric defined; two to go!
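As a sketch of the arithmetic, here is how throughput might be computed for one iteration. All of the numbers (story values, team size, iteration length) are illustrative assumptions, not figures from the text:

```python
# Hypothetical sketch: throughput as bells delivered per staff day.
completed_story_bells = [5, 3, 1, 4, 2]  # bells (1-5) for each feature delivered
team_size = 4                            # engineers on the team (assumed)
iteration_days = 10                      # working days in the iteration (assumed)

staff_days = team_size * iteration_days  # 40 staff days of total effort
throughput = sum(completed_story_bells) / staff_days

print(f"Throughput: {throughput:.3f} bells per staff day")  # 15 / 40 = 0.375
```

Because bells is a relative scale, the absolute value matters less than the trend from iteration to iteration.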

In the context of software development, operational expense is reasonably constant because it is dominated by personnel expenses that are determined almost entirely by the number of engineers on the team. In this case, the question isn't really how much we spend on the team, but how well we use those resources. We can account for operational expense if we have a measure of the efficiency with which we use our engineers. As efficiency goes up, our throughput increases while our operational expense stays the same. This would achieve the goal of an increase in ROI.

In order to plan, every feature (or story) must have an estimate for how long it will take to complete. Two units of measure are used for these estimates: ideal engineer hours and story nuts. Story nuts (which have been called many other things, including gummy bears) are a relative measure of effort, often on a scale of 1, 2, 4, and 8.[5] Ideal engineer hours is the number of hours the engineer estimates the work would take if he had no interruptions and the design of the system supported the change he plans to make. In either case, these estimates give us a measure of the ideal effort each story requires. Given either story nuts or ideal engineer hours as ideal effort, our measure of efficiency is ideal effort completed per total staff days. As an example, the time an engineer spends filling out a complex status report will count as staff days but will not produce any functionality. This will reduce our efficiency.[6] That's two metrics defined; only one more to go!

[5] The logarithmic nature of this scale reflects the fact that people are better at estimating whether one thing is twice as big as another than at estimating on a linear scale.

[6] We'd better be sure that the report is useful because it isn't helping our efficiency!
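To make the efficiency ratio concrete, here is a minimal sketch. The story-nut estimates, team size, and iteration length are all made-up illustration data:

```python
# Hypothetical sketch: efficiency as ideal effort completed per staff day.
# Estimates use the 1/2/4/8 story-nut scale described in the text.
completed_story_nuts = [8, 4, 4, 2, 1, 1]  # ideal-effort estimates of finished stories
team_size = 4                              # assumed
iteration_days = 10                        # assumed

staff_days = team_size * iteration_days    # 40 staff days
efficiency = sum(completed_story_nuts) / staff_days

print(f"Efficiency: {efficiency:.2f} story nuts per staff day")  # 20 / 40 = 0.50
```

Time lost to interruptions or process overhead (such as the status report above) raises staff days without raising completed story nuts, so the ratio drops.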

The third metric we need to worry about is inventory. In software development, inventory is reflected by the amount of time from when we start working on a feature until that feature is delivered. The longer that length of time, the more inventory we will accumulate. Consider what happens as the length of an iteration increases:

  • Our feedback loop from the customer is lengthened. This means the effect of misunderstandings with the customer is increased. We can accidentally spend more time on a feature that isn't exactly what the customer wanted before recognizing the mistake.

  • The size of features we are going to accept is larger, so our estimation errors are going to be bigger. This means that we're likely to have to spend a higher percentage of time replanning.

  • If there is a problem with a feature, the length of time since we wrote the code is proportional to the re-learning that will have to be done to fix the problem. So fixing problems is more costly.

  • If the customer is unable to wait until the next iteration, we may have to replan to incorporate new requests in the middle of an iteration. If the iteration length were small enough, we could defer all requests until the next iteration.

  • Finally, if the iteration length gets too large, the customer's needs change, and he won't want the feature by the time we can deliver it.

These points show that in some ways, efficiency is affected by inventory, but making inventory explicit will help us focus on delivery speed.

Our metric reflecting inventory will be iteration length.[7] Lowering iteration length is equivalent to lowering average inventory. Although short iterations are the ideal, every organization has a limit for the minimum iteration length they can handle. In general, the iteration lengths are not shorter than a week; the overhead of planning and closing out an iteration makes iterations shorter than a week inefficient. However, delivering every week requires skills such as defining small stories, test-driven development, and source code refactoring. When you are transitioning to agility, it is unlikely that your team will have these skills. As they are learning the skills required by agility, the process must continue to have some plan-driven activities, and these will reduce the team's ability to deliver rapidly. This means that the initial iteration length will be longer, perhaps a month or six weeks. Tracking iteration length as a metric reflecting inventory will encourage the team to look for ways to improve their ability to deliver quickly.

[7] This matches the manufacturing metric of cycle time: the time it takes to answer a customer request.

We now have three very dynamic metrics:

  • Throughput: Bells per staff day, a measure of the rate at which we produce customer value

  • Efficiency: Ideal effort completed per staff day, a measure of how efficiently we utilize our resources

  • Iteration length:[8] Elapsed time of an iteration, a measure of how quickly we can respond to a customer request
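The three metrics can be collected together for an iteration. The following sketch follows the definitions above; the `Story` structure and every input number are hypothetical:

```python
# Hypothetical sketch: computing all three metrics for one iteration.
from dataclasses import dataclass

@dataclass
class Story:
    bells: int         # relative revenue value, 1-5
    ideal_effort: int  # story nuts: 1, 2, 4, or 8

def iteration_metrics(stories, team_size, iteration_days):
    """Return the three dynamic metrics for a completed iteration."""
    staff_days = team_size * iteration_days
    return {
        "throughput": sum(s.bells for s in stories) / staff_days,
        "efficiency": sum(s.ideal_effort for s in stories) / staff_days,
        "iteration_length": iteration_days,
    }

done = [Story(bells=5, ideal_effort=8),
        Story(bells=3, ideal_effort=4),
        Story(bells=2, ideal_effort=2)]
print(iteration_metrics(done, team_size=3, iteration_days=5))
# throughput = 10/15, efficiency = 14/15, iteration_length = 5
```

Tracked iteration over iteration, these three numbers give the team a compact picture of value delivered, resource use, and responsiveness.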

All of these metrics measure the performance of the team in its current environment (process, organizational structure, equipment, etc.) but reflect things that individual engineers understand and deal with daily. In fact, engineers are likely to have lots of suggestions for how to improve these metrics and are generally motivated to do so because the metrics reflect the quality of the environment around the engineers. For example, efficiency may be low because the configuration management system creates a bottleneck and stalls the engineers. They can recognize that situation and offer suggestions for how the problem could be resolved, and those resolutions would have a positive effect on efficiency. In addition to improving efficiency, that change also has a direct and positive effect on the engineers' environment.

We must note one important thing. Throughput is a better team metric in the short term than in the long term because there is a chance that throughput will decline as the project progresses. It may be that a number of highly valuable features that are relatively easy to develop[9] will be selected early in the project. This will lead to high early throughput. As the project progresses, stories may have lower value-to-ideal-effort ratios, which will result in lower throughput, but that change does not reflect poorly on the team. In effect, as throughput declines, it may very well indicate that the project has reached a point of diminishing returns. It can be a sign that the project should be terminated. Even though it may decline over time, the purpose of tracking throughput is to strengthen the connection between the team and customers. On any given day, the team can weigh decisions against the impact they will have on short-term throughput without worrying about its long-term decline.

[9] These are known as "low-hanging fruit." Apples that are easy to pick require less effort but are as valuable as the apples at the top of the tree.