Chapter 2: Introduction to Probability and Statistics for Projects

"Lies, damned lies, and statistics!"

"Nothing in progression can rest on its original plan."

Edmund Burke

There are No Facts about the Future

"There are no facts about the future!" This is a favorite saying of noted risk authority Dr. David T. Hulett that sums it up pretty well for project managers. [1] Uncertainty is present in every project, every project being a unique assembly of resources and scope. Every project has its ending some time hence. Time displaces project outcomes from the initial estimates; time displacement introduces the opportunity for something not to go according to plan. Perhaps such an opportunity would actually have an upside advantage to the project, or perhaps not. Successful project managers are those quick to grasp the need for the means to evaluate uncertainty. Risk management is the process, but probability and statistics provide the mathematical underpinning for the quantitative analysis of project risk. Probability and statistics for project managers are the subjects of this chapter.

[1]The author has had a years-long association with Dr. Hulett, principal at Hulett Associates in Los Angeles, California. Among many accomplishments, Dr. Hulett led the project team that wrote the "Project Risk Management" chapter of the 2000 edition of A Guide to the Project Management Body of Knowledge, published by the Project Management Institute®.



Probability ... What Do We Mean by It?

Everyone is familiar with the coin toss: one side is heads, the other side is tails. Flip it often enough and the difference between the number of heads and the number of tails becomes all but imperceptible. How can we use the coin toss to run projects? Let us connect some "dots."

Coin Toss 101

As an experiment, toss a coin 100 times. Let heads represent one estimate of the duration of a project task, say 10 days to write a specification, and let tails represent an estimate of 15 days for the same task. Both estimates seem reasonable. Let us flip a coin to choose. If the coin is "fair" (that is, not biased in favor of landing on one side or the other), we could reasonably "expect" 50 heads and 50 tails.

But what of the 101st toss? What will it be? Can we predict the outcome of toss 101 or know with any certainty whether it will be heads or tails, 10 days duration or 15? No. The accumulated history of 50 heads and 50 tails provides no information specifically about the outcome of the 101st toss, except that the range of possible performance is now predictable: toss 101 will be discretely either heads or tails, and heads is no more likely than tails. So with no other information, and only two choices, it does not matter whether the project manager picks 10 or 15 days. Both are equally likely and both lie within the predicted or forecast range of performance. However, what of the next 100 tosses? We can reasonably expect a repeat of results if no circumstances change.

The phenomenon of random events is an important concept to grasp for applications to projects: the outcome of any single event, whether it be a coin toss or a project, cannot be known with certainty, but the pattern of behavior of an event repeated many times is predictable.

However, projects rarely if ever repeat. Nonrepetitiveness is at the core of their uncertainty. So how is a body of mathematics based on repetition helpful to project managers? The answer lies in how project managers estimate outcomes and forecast project behavior. We will see the many disadvantages of a "single-point" estimate, sometimes called the deterministic estimate, and appreciate the helpfulness of forecasting a reasonable range of possibilities instead. In the coin toss, there is a range of possible outcomes, heads or tails. If we toss the coin many times, or simulate tossing it many times, we can measure or observe the behavior of the outcomes. We can infer (that is, draw a conclusion from the facts presented) that the next specific toss will behave similarly and that its outcome will lie within the observed space. So it is with projects. From nondeterministic, probabilistic estimates of such elements as cost, schedule, or quality errors, we can simulate the project behavior, drawing an inference that the real project experience would likely lie within the observed behavior.
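For readers who want to try the idea, the short Python sketch below simulates the two-point duration estimate by "tossing the coin" many times. It is only an illustration of this section's point, not a prescribed method; the function name, seed, and trial count are arbitrary choices made for the example.

    import random

    # Illustrative sketch: heads = 10-day estimate, tails = 15-day estimate.
    # The seed and trial count are arbitrary choices for a repeatable example.
    random.seed(1)

    def simulate_task_duration(trials=10_000):
        """Toss a fair coin 'trials' times and tally the simulated durations."""
        durations = [10 if random.random() < 0.5 else 15 for _ in range(trials)]
        ten_day = durations.count(10)
        return ten_day, trials - ten_day

    ten_day, fifteen_day = simulate_task_duration()
    print(f"10-day outcomes: {ten_day}, 15-day outcomes: {fifteen_day}")
    # No single toss is predictable, but the tally hovers near 50/50, so the
    # task duration is forecast to lie within the 10-to-15-day range.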

Calculating Probability

Most people would say that the probability of an outcome of heads or tails in a fair coin toss is 50%; some might say it is half, 0.5, or one chance in two. They all would be correct. There is no chance (that is, 0 probability) that neither heads nor tails will come up; there is 100% probability (that is, 1) that either heads or tails will come up.

What if the outcome were the value showing on a die from a pair of dice that is rolled or tossed? Again most people would say that the probability is 1/6, or one chance in six, that a single number like a "5" would come up. If two dice are tossed, what is the chance that the sum of the showing faces will equal "7"? This is a little harder problem, but craps players know the answer. We reason this way: there are 36 combinations of faces that could come up on a roll of the two dice, such as a "1" on both faces (1,1) or a "1" on one face and a "3" on the other (1,3). Six of those combinations total "7" (1,6; 6,1; 2,5; 5,2; 3,4; 4,3) out of a possible 36, so the chances of a "7" are 6/36, or 1/6.

The probability that no combination of any numbers will show up on a roll of the dice is 0; the probability that any combination of numbers will show up is 36/36, or 1. After all, there are 36 ways that any number could show up. Any specific combination like a (1,6) or a (5,3) will show up with probability of 1/36 since a specific combination can only show up one way. Finally, any specific sum of the dice will show up with probability between 1/36 and 6/36. Table 2-1 illustrates this opportunity space.

Table 2-1: Roll of the Dice

Combination Number   Face Value #1   Face Value #2   Sum of Face Values

         1                 1               1                  2
         2                 2               1                  3
         3                 3               1                  4
         4                 4               1                  5
         5                 5               1                  6
         6                 6               1                  7
         7                 1               2                  3
         8                 2               2                  4
         9                 3               2                  5
        10                 4               2                  6
        11                 5               2                  7
        12                 6               2                  8
        13                 1               3                  4
        14                 2               3                  5
        15                 3               3                  6
        16                 4               3                  7
        17                 5               3                  8
        18                 6               3                  9
       ...               ...             ...                ...
        36                 6               6                 12

Notes: The number "7" is the most frequent of the "Sum of Face Values," repeating once for each pattern of 6 values of the first die, for a total of 6 appearances in 36 combinations.

Although not apparent from the abridged list of values, the number "5" appearing in the fourth-down position in the first pattern moves up one position each pattern repetition and thus has only 4 total appearances in 36 combinations.

Relative Frequency Definitions

The exercise of flipping coins or rolling dice illustrates the "relative frequency" view of probability. Any specific result is an "outcome," like rolling the pair (1,6). The six pairs that total seven on the faces are collectively an "event." We see that the probability of the event "7" is 1/6, whereas the probability of the outcome (1,6) is 1/36. All the 36 possible outcomes are a "set," the event being a subset. "Population" is another word for "set." "Opportunity space" or sometimes "space" is used interchangeably with population, set, subset, and event. Context is required to clarify the usage.

The relative frequency interpretation of probability leads to the following equation as the general definition of probability:

p(A) = N(A)/N(M)

where p(A) is probability of outcome (or event) A, N(A) = number of times A occurs in the population, and N(M) = number of members in the population.
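The sketch below, a hypothetical Python illustration rather than part of the original text, applies this definition to the population of Table 2-1 by enumerating all 36 outcomes and counting the members of an event.

    from itertools import product

    # Enumerate the population of Table 2-1: all (face 1, face 2) pairs.
    population = list(product(range(1, 7), repeat=2))
    n_m = len(population)                 # N(M) = 36 members

    def p_event(target_sum):
        """Relative frequency p(A) = N(A)/N(M) for the event 'faces sum to target_sum'."""
        n_a = sum(1 for f1, f2 in population if f1 + f2 == target_sum)
        return n_a, n_a / n_m

    print(p_event(7))    # (6, 0.166...) -> 6/36 = 1/6
    print(p_event(5))    # (4, 0.111...) -> 4/36 = 1/9
    print(1 / n_m)       # any specific outcome such as (1,6): 1/36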

AND and OR

Let's begin with OR. Using the relative frequency mathematics already developed, what is the probability of the event "5"? There are only four outcomes that sum to "5" (1,4; 4,1; 2,3; 3,2), so the event "5" probability is 4/36, or 1/9.

To make it more interesting, let's say that event "5" is the desired result of one of two study teams observing the same project problem by "dice rolling." One team we call study team 5; the other we call study team 7. We need to complete the study within the time it takes to make 36 rolls. Either team finishing first is okay since both are working on the same problem. The probability that study team 5 will finish its observations on time is 4/36. If any result other than "5" is obtained, more time will be needed.

Now let's expand the event to a "5" or a "7". Event "7" is our second study team, with on-time results if it observes a "7". Recall that there are six event "7" opportunities. There are ten outcomes between the two study teams that fill the bill, giving either a "5" or a "7" as the sum of the two faces. Then, by definition, the probability of the event "5 OR 7" is 10/36. The probability that one or the other team will finish on time is 10/36, which is less risky than depending solely on one team to finish on time. Note that 10/36 is the sum of the probabilities of event "7" plus event "5": 6/36 + 4/36 = 10/36. [2]

It is axiomatic in the mathematics of probability that if two events are mutually exclusive (when one occurs, the other cannot), then the probability that one or the other occurs is the sum of their individual probabilities:

p(A OR B) = p(A) + p(B)
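As a quick arithmetic check, the hypothetical snippet below restates the two study-team events with exact fractions.

    from fractions import Fraction

    # Addition rule for the mutually exclusive study-team events above.
    p_team5 = Fraction(4, 36)    # event "5"
    p_team7 = Fraction(6, 36)    # event "7"
    print(p_team5 + p_team7)     # 5/18, i.e., 10/36 = p("5" OR "7")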

Now let's discuss AND. Consider the probability of the schedule event "rolling a 6 on one face AND rolling a 1 on the other face." The probability of rolling a "6" on one die is 1/6, as is the probability of rolling a "1" on the other. For the event to be true (that is, both a "6" and a "1" occur), both outcomes have to come up. Rolling the first die gives only a one-in-six chance of a "6"; there is another one-in-six chance that the second die will come up "1", altogether a one-in-six chance times a one-in-six chance, or 1/36. Of course 1/36 is very much less than 1/6. If the outcomes above were about our study teams finishing on time, one observing for a "6" on one die and the other observing for a "1" on the other die, then the probability of both finishing on time is the product of their individual probabilities. Requiring them both to finish on time increases the risk: the joint on-time finish is far less likely than either finish alone. We will study this "merge point" event in more detail in the chapter on schedules.

Another axiom we can write down is that if two events are independent, in no way dependent on each other, then:

p(A AND B) = p(A * B) = p(A) * p(B)
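A rough simulation sketch of the same multiplication rule follows; the seed and trial count are assumptions made only to produce a repeatable illustration.

    import random

    # Estimate p(A AND B) for independent dice: team A needs a "6" on one die,
    # team B needs a "1" on the other.
    random.seed(7)
    trials = 100_000
    both_on_time = 0
    for _ in range(trials):
        die1 = random.randint(1, 6)    # watched by study team A for a "6"
        die2 = random.randint(1, 6)    # watched by study team B for a "1"
        if die1 == 6 and die2 == 1:
            both_on_time += 1

    print(both_on_time / trials)       # hovers near 1/36 = (1/6) * (1/6), about 0.028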

AND and OR with Overlap or Collisions

We can now go one step further and consider the situation where events A and B are not mutually exclusive (that is, A and B might occur together sometimes or perhaps overlap in some way). Figure 2-1 illustrates the case. As an example, let's continue to observe pairs of dice, but the experiment will be to toss three dice at once. All dice remain independent, but the occurrence of the event "7" or "5" is no longer mutually exclusive. The roll {3,4,1} might come up, providing the opportunity for the (3,4) pair of the "7" event and the (4,1) pair of the "5" event. In summing p(A) and p(B), rolls like {3,4,1} where "A and B" occur together are counted twice, once in each event, so the joint occurrences must be subtracted once to keep from overstating the chances of "A or B." From this reasoning comes a more general equation for OR:

p(A OR B) = p(A) + p(B) - p(A * B)

Figure 2-1: Overlapping Events.

Notice that if A and B are mutually exclusive, then p(A * B) = 0, which gives a result consistent with the earlier equation given for the OR situation.

Looking at the above equation, the savvy project manager recognizes that risk has increased for achieving a successful outcome of either A or B since the possibility of their joint occurrence, overlap, or collision of A and B takes away from the sum of p(A) + p(B). Such a situation could come up in the case of two resources providing inputs to the project, but if they are provided together, then they are not useful. Such collisions or race conditions (a term from system engineering referring to probabilistic interference) are common in many technology projects.
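To see the correction at work, the hypothetical sketch below enumerates all 216 outcomes of the three-dice experiment; the event definitions (some pair of faces sums to "7", some pair sums to "5") follow the example above.

    from itertools import product

    def pair_sums(roll):
        """The set of sums of each pair of faces in a three-dice roll."""
        a, b, c = roll
        return {a + b, a + c, b + c}

    outcomes = list(product(range(1, 7), repeat=3))
    p_a = sum(7 in pair_sums(r) for r in outcomes) / len(outcomes)          # event "7"
    p_b = sum(5 in pair_sums(r) for r in outcomes) / len(outcomes)          # event "5"
    p_ab = sum({5, 7} <= pair_sums(r) for r in outcomes) / len(outcomes)    # both occur

    # p(A OR B) by the formula and by direct count agree:
    print(p_a + p_b - p_ab)
    print(sum(bool(pair_sums(r) & {5, 7}) for r in outcomes) / len(outcomes))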

Conditional Probabilities

When A and B are not independent, then one becomes a condition on the outcome of the other. For example, the question might be: What is the probability of A given the condition that B has occurred? [3]

Consider the situation where there are 12 marbles in a jar, 4 black and 8 white. Marbles are drawn from the jar one at a time, without replacement. The p(black marble on first draw) = 4/12. Then, a second draw is made. Conditions have changed because of the first draw. There are only 3 black marbles left and only 11 marbles in the jar. The p(black marble on the second draw given a black marble on the first draw) = 3/11, a slightly higher probability than 4/12 on the first draw. The probability of the second draw is conditioned on the results of the first draw.

The probability of a black marble on each of the first two draws is the AND of the probabilities of the first and second draw. We write the equation:

p(B and A) = p(B) * p(A | B)

where B = event "black on the first draw," A = event "black on second draw, given black on first draw," and the notation "|" means "given." Filling in the numbers, we have:

p(B and A) = (4/12) * (3/11) = 1/11
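The short simulation below (an illustration only; the seed and trial count are arbitrary) draws two marbles without replacement many times to confirm the arithmetic.

    import random

    # Jar of 4 black and 8 white marbles; draw two without replacement.
    random.seed(3)
    trials = 100_000
    hits = 0
    for _ in range(trials):
        jar = ["black"] * 4 + ["white"] * 8
        random.shuffle(jar)
        if jar[0] == "black" and jar[1] == "black":
            hits += 1

    print(hits / trials)    # hovers near (4/12) * (3/11) = 1/11, about 0.091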

Consider the project situation given in Figure 2-2. There we see two tasks, Task 1 and Task 2, with a finish-to-start precedence between them. [4] The project team estimates that the probability of Task 1 finishing at the latest finish of "on time + 10 days" is 0.45. Task 1 finishing is event A. The project team estimates that the probability of Task 2 finishing on time is 0.8 if Task 1 finishes on time, but diminishes to 0.4 if Task 1 finishes in "on time + 10 days." Task 2 finishing is event B. If event A is late, it overlaps the slack available to B. We calculate the probability of "A * B" as follows:

p(A10 * B0) = p(B0 | A10) * p(A10) = 0.4 * 0.45 = 0.18

Figure 2-2: Conditions in Task Schedules.

where A10 = event of Task 1 finishing "on time + 10 days" and B0 = event of Task 2 finishing on time if Task 1 finishes "on time + 10 days." Conclusion: there is less than one chance in five that both events A10 and B0 will happen. The project manager's risk management focus is on avoiding the outcome of B finishing late by managing Task 1 to less than "on time + 10 days."
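The same calculation can be written out directly; the snippet below simply restates the figures from the Figure 2-2 example.

    # Probabilities estimated by the project team in the text.
    p_a10 = 0.45            # p(Task 1 finishes at "on time + 10 days")
    p_b0_given_a10 = 0.4    # p(Task 2 on time | Task 1 finishes 10 days late)

    p_joint = p_b0_given_a10 * p_a10
    print(p_joint)          # 0.18 -> less than one chance in five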

There is a lot more to know about conditional probabilities because they are very useful to the project manager in decision making and planning. We will hold further discussion until Chapter 4, where conditional probabilities will be used in decision trees and tables.

The (1-p) Space

To this point, we have been using a number of conventions adopted for probability analysis:

  • All quantitative probabilities lie in the range between the numbers 0 (absolute certainty that an outcome will not occur) and 1 (absolute certainty that an outcome will occur).

  • The lower case "p" is the notation for probability; it stands for a number between 0 and 1 inclusively. Typically, "p" is expressed as a decimal.

  • If "p" is the probability that project outcome A will happen, then "1-p" is the probability that project outcome A will not occur. We then have the following equation: p + (1-p) = 1 at all times. More rigorously, we write: p(A) + [1-p(A)] = 1, where the notation p(A) means "probability that A will occur."

  • "1-p" is the probability that something else, say project outcome B, will happen instead of A. After all, there cannot be a vacuum in the project for lack of outcome A. Sometimes outcome B is most vexing for project managers. B could well be a "known unknown." To wit: Known...A may not happen, but then unknown...what will happen, what is B?

  • Sometimes it is said: B is in the (1-p) space of A, or the project manager might ask his or her team: "What is in the (1-p) space of A?"

  • Project managers must always account for all the scope and be aware of all the outcomes. Thus: p(A) + p(B) = 1. A common mistake made by the project team is to not define or identify B, focusing exclusively on A.

Of course, there may be more possibilities than just B in the (1-p) space of A. Instead of just B, there might be B, C, D, E, and so forth. In that case, the project manager or risk analyst must be very careful to obey the following equation:

[1-p(A)] = p(B) + p(C) + p(D) + p(E) + ...

The error that often occurs is that the right side sums to more than it should. That is, we have the condition:

p(A) + [p(B) + p(C) + p(D) + ...] > 1

Either too many possibilities have been identified on the right side, their respective probabilities are too large, or the probability of A on the left side is misjudged. It is left to the project team to recognize the error and take corrective measures.
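A simple bookkeeping aid, sketched below with hypothetical names, can catch this error by checking that the probability assigned to A plus everything placed in its (1-p) space totals 1.

    def check_probability_budget(p_a, others, tolerance=1e-9):
        """Verify that p(A) plus the probabilities in A's (1-p) space total 1."""
        total = p_a + sum(others.values())
        if abs(total - 1.0) > tolerance:
            raise ValueError(f"Probabilities sum to {total:.3f}, not 1; revisit the estimates.")
        return total

    check_probability_budget(0.6, {"B": 0.25, "C": 0.15})    # passes: totals 1.0
    # check_probability_budget(0.6, {"B": 0.3, "C": 0.2})    # would raise: totals 1.1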

Subjective Probability

What about statements like "there is a 20% chance of rain or snow today in the higher elevations"? A statement of this type does not express a relative frequency probability. We do not have an idea of the population (number of times it rains or snows in a time period), so there is no way to see the proportionality of the outcome to the total population. Statements of this type, quite commonly made, express a subjective notion of probability. We will see that statements of this type express a "confidence." We arrive at "confidence" by accumulating probabilities over a range. We discuss more about "confidence" later.

[2]In the real world, events "5" and "7" could be the names given to two events in a "tree" or hierarchy of events requiring study. For example, events "5" and "7" could be error conditions that might occur. Rolling the dice is equivalent to simulating operational performance. Dr. David Hulett suggested the fault tree to the author.

[3]A might be one approach in a project and B might be another approach. Such a situation comes up often in projects in the form of "if, then, else" conditional branching.

[4]A finish-to-start precedence is a notation from precedence diagramming methodology. It denotes the situation that the finish of the first task is required before the start of the second (successor) task can ensue. It is often referred to as a "waterfall" relationship because, when drawn on paper, it gives the appearance of a cascade from one task to the other. More on critical path scheduling can be found in Chapter 6 of A Guide to the Project Management Body of Knowledge. [5]

[5]A Guide to the Project Management Body of Knowledge (PMBOK® Guide) — 2000 Edition, Project Management Institute, Newtown Square, PA, chap. 6, p. 69.