04. ANOVA

Overview

One-Way Analysis of Variance (ANOVA) is used to compare the means of two or more samples against each other to determine whether it is likely that the samples could come from populations with the same mean. This is similar to a 2-Sample t-Test except that three or more samples can be examined with ANOVA.

ANOVA can also be used to examine multiple Xs at the same time (see "Other Options" in this section), but here the focus is primarily on the One-Way ANOVA, which examines just one X. For example, a Team might need to determine if three operators take the same amount of time to perform a task. In this case there is:

  • A single X: Operator

  • With 3 levels: the 3 Operators

A data sample would be taken of, for example, 15 points (times in this case) for each operator. ANOVA is used to make the judgment whether all the operators' average (mean) task times, as work continues going forward, are the same. The level of confidence in the answer depends on how far apart the means of the samples are, how much variability there is in the sample data, and how many data points there are.

This is shown graphically in Figure 7.04.1. The upper curves represent the distributions of all three operators' times (known as the populations). The exact nature of each of these distributions is unknown to the Team, because they represent all data points for all time. What the Team can see, however, are the samples taken, one from each population, shown as the lower curves. ANOVA examines the sample data with the aim of making an inference on the location of the population means (μ) relative to each other. It does this by breaking down the variation (using variances) in all the sample data into separate pieces, hence the name Analysis Of Variance. ANOVA compares the size of the variation between the samples versus the variation within the samples.

Figure 7.04.1. Graphical representation of ANOVA.


If the variation between the samples is large relative to the variation within the samples, then the samples are spread widely (between) compared with the background noise (within), which implies that the likelihood of the means of the parent distributions being aligned is low. If the between variation is not large compared with the within variation, then it is likely that the means of the parent distributions are about the same, or more specifically, that the test cannot distinguish between them. The result of the test is a degree of confidence (a p-value) that the samples come from populations with the same mean. In practical terms, the p-value gives an indication of the probability that the mean operator times are the same going forward. If the p-value is low, then at least one of the mean operator times is distinguishable from the others; if the p-value is high, none of the mean operator times can be distinguished from the others.

Roadmap

The roadmap of the test analysis itself is shown graphically in Figure 7.04.2.

Figure 7.04.2. One-Way ANOVA Roadmap.[6]


[6] Roadmap adapted from SBTI's Process Improvement Methodology training material.

Step 1.

Identify the metric and levels to be examined (for example, three operators). Analysis of this kind should be done in the Analyze Phase at the earliest, so the metric should be well defined and understood at this point (see "KPOVs and Data" in this chapter).

Step 2.

Determine the sample size. This can be as simple as taking the suggested 15 to 20 data points per level or using a sample size calculator in a statistical package (a code sketch follows this list). These rely on an equation relating the sample size to

  • The standard deviations (the spread of the data) of each population. These would have to be approximated from historical data.

  • The required power of the test (the likelihood of the test identifying a difference between the means if there truly were one). This is usually set at 0.8 or 80%. The power is actually (1 − β), where β is the likelihood of giving a false negative, so it might need to be entered in the software as a β of 0.2 or 20%.

  • The size of the difference δ between the means that is desired to be detected, that is, the distance between the means that would lead the Team to say the two values are different.

  • The alpha level for the test (the likelihood of the test giving a false positive), usually set at 0.05 or 5%; this represents the cutoff for the p-value (remember, if p is low, H0 must go).

  • The number of levels examined (the number of Operators, and so on).
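As an illustration only, the equivalent calculation can be sketched with the statsmodels package; the effect size used here is a hypothetical value (Cohen's f, which folds δ and the standard deviation together), not something taken from real data:

  # Sketch of an ANOVA sample-size calculation using statsmodels.
  # The effect size below is hypothetical; in practice it would be
  # derived from delta and the historical standard deviation.
  import math
  from statsmodels.stats.power import FTestAnovaPower

  n_total = FTestAnovaPower().solve_power(
      effect_size=0.5,  # hypothetical Cohen's f
      alpha=0.05,       # false-positive rate (cutoff for the p-value)
      power=0.8,        # 1 - beta: chance of catching a real difference
      k_groups=3,       # number of levels (three operators)
  )
  # statsmodels returns the total N across all levels
  print(math.ceil(n_total / 3), "data points per level")

A statistical package such as Minitab performs the equivalent calculation behind its sample size dialog.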

Step 3.

Collect a sample data set, one from each level of the X, following the rules of good experimentation. If the sample size calculator determined a sample size of ten data points, then ten points need to be collected for each and every level. For example, if the X is Operator and there are three levels (three operators), then 3 × 10 = 30 data points are collected in total.

Step 4.

Examine stability of all sample data sets using a Control Chart for each, typically an Individuals and Moving Range Chart (I-MR). A Control Chart identifies whether the processes are stable, having

  • Constant mean (from the Individuals Chart)

  • Predictable variability (from the Range Chart)

This is important; if the processes are moving around, it is impossible to sensibly decide if they are the same or not.
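As a sketch of what the I-MR chart is checking, the control limits can be computed directly from the data using the standard individuals-chart constants (2.66 for the Individuals limits, 3.267 for the Moving Range limit); the sample times below are hypothetical:

  # Sketch of I-MR control limits for one operator's task times.
  import numpy as np

  times = np.array([25.1, 24.8, 25.6, 24.9, 25.3, 25.0])  # hypothetical
  mr = np.abs(np.diff(times))   # moving ranges between consecutive points
  mr_bar = mr.mean()

  # Individuals chart limits: mean +/- 2.66 x average moving range
  center = times.mean()
  ucl_i, lcl_i = center + 2.66 * mr_bar, center - 2.66 * mr_bar

  # Moving Range chart upper limit: 3.267 x average moving range
  ucl_mr = 3.267 * mr_bar

  print(f"I chart: {lcl_i:.2f} to {ucl_i:.2f}; MR UCL: {ucl_mr:.2f}")

Points outside these limits (or trends within them) would suggest an unstable process.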

Step 5.

Examine normality of the sample data sets using a Normality Test for each. This is important because the statistical tests in Steps 6 and 7 rely on it, but in simple terms, if the sample curves in Figure 7.04.1 were strange shapes, it would be difficult to determine whether the middles were aligned. In fact, if the data are skewed, then the mean is probably not the best measure of center (ANOVA, like the t-Test, is a means-based test), and a median-based test is probably better. The longer tail on the right of the example curve in Figure 7.04.3 drags the mean to the right; the median, however, tends to remain constant. Median-based tests could in theory be used for everything as more robust tests, but they are less powerful than their means-based counterparts, hence the desire to go with the mean.
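Minitab's usual choice here is the Anderson-Darling Normality Test; as a sketch of the same kind of check, scipy's Shapiro-Wilk test (an assumption, not the book's tool) returns the p-value directly. The data below are simulated stand-ins:

  # Sketch of a per-level normality check (Shapiro-Wilk, via scipy).
  import numpy as np
  from scipy import stats

  rng = np.random.default_rng(1)
  sample = rng.normal(loc=25.0, scale=1.0, size=15)  # hypothetical times

  stat, p = stats.shapiro(sample)
  # A high p-value gives no evidence against normality for this level.
  print(f"Shapiro-Wilk p-value: {p:.3f}")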

Figure 7.04.3. Measures of Center.


Step 6.

Perform a Test of Equal Variance on the sample data sets. ANOVA requires the variances of the samples to be approximately the same, and without this, a medians-based approach has to be used instead.

The Test of Equal Variance uses the sample data sets and has these hypotheses (a code sketch follows the list):

  • H0: Population (process) σ1² = σ2² = σ3² ... (all variances equal)

  • Ha: At least one of the Population (process) variances is different
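One common Test of Equal Variance is Levene's test (Minitab also offers Bartlett's); a minimal sketch with scipy, using simulated stand-in samples:

  # Sketch of a Test of Equal Variance using Levene's test.
  import numpy as np
  from scipy import stats

  rng = np.random.default_rng(2)
  bob = rng.normal(24.8, 0.9, 15)    # hypothetical operator samples
  jane = rng.normal(25.4, 1.0, 15)
  walt = rng.normal(27.1, 1.0, 15)

  stat, p = stats.levene(bob, jane, walt)
  # p > 0.05: no evidence the variances differ, so ANOVA may proceed.
  print(f"Levene p-value: {p:.3f}")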

Step 7.

Perform the ANOVA if all of the sample data sets were determined to be normal in Step 5 and the variances were equal in Step 6. The hypotheses in this case are

  • H0: Population (process) μ1 = μ2 = μ3 ... (all means equal)

  • Ha: At least one of the Population (process) means is different

If the data in any of the samples were non-normal, then as per Figure 7.04.2:

  • Continue unabated with the ANOVA if the sample size is large enough (>25)

  • Transform the data first and then perform the analysis, again using the ANOVA[7]

    [7] Transformation of data is considered beyond the scope of this book.

  • Perform the median-based equivalent test, a Kruskal-Wallis or Mood's Median Test

If the variances of the samples were not equal, then as per Figure 7.04.2, perform the median-based equivalent test, a Kruskal-Wallis or Mood's Median Test.

The last option often worries Belts, but the medians tests look identical in form to the means tests, and both return the key item, a p-value (though the p-values for a means test and a medians test on the same data are unlikely to be the same).
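Pulling Steps 5 through 7 together, the decision logic can be sketched with scipy; f_oneway performs the One-Way ANOVA and kruskal the Kruskal-Wallis fallback (the data are simulated stand-ins):

  # Sketch of Step 7: One-Way ANOVA with a median-based fallback.
  import numpy as np
  from scipy import stats

  rng = np.random.default_rng(3)
  samples = [rng.normal(m, 0.95, 30) for m in (24.8, 25.4, 27.1)]

  normal = all(stats.shapiro(s)[1] > 0.05 for s in samples)
  equal_var = stats.levene(*samples)[1] > 0.05

  if normal and equal_var:
      stat, p = stats.f_oneway(*samples)   # One-Way ANOVA
  else:
      stat, p = stats.kruskal(*samples)    # Kruskal-Wallis
  print(f"p-value: {p:.4f}")  # low p: at least one level differs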

Interpreting the Output

One-Way ANOVA[8] segments the total variation in the data into two pieces:

[8] The technical details of ANOVA are covered in most statistics textbooks; Statistics for Management and Economics by Keller and Warrack makes it understandable to non-statisticians.

  • Variation within levels of the X (basically the background noise in the process)

  • Variation between levels of the X (the signal strength due to the X)

It then calculates a ratio of the signal (variation due to the X, the "between") relative to the noise (any other variation not due to the X, the "within"). If this signal-to-noise ratio is large enough, the result is considered unlikely to have occurred purely by random chance, and the X is thus considered statistically significant. This is achieved by looking up the signal-to-noise ratio in a reference distribution (an F-Test), which returns a p-value. The p-value represents the likelihood that an effect this large could have occurred purely by random chance even if the populations were aligned.
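This decomposition can be made concrete in a few lines; a sketch with simulated stand-in data, using scipy's F distribution for the lookup:

  # Sketch of the ANOVA decomposition: between vs. within variation.
  import numpy as np
  from scipy import stats

  rng = np.random.default_rng(4)
  groups = [rng.normal(m, 0.95, 30) for m in (24.8, 25.4, 27.1)]
  k, n = len(groups), len(groups[0])
  grand_mean = np.mean(np.concatenate(groups))

  # Between: spread of the level means around the grand mean (signal)
  ss_between = sum(n * (g.mean() - grand_mean) ** 2 for g in groups)
  # Within: spread of each point around its own level mean (noise)
  ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)

  ms_between = ss_between / (k - 1)          # SS / DF for the X
  ms_within = ss_within / (k * n - k)        # SS / DF for Error
  f_ratio = ms_between / ms_within           # signal-to-noise ratio
  p = stats.f.sf(f_ratio, k - 1, k * n - k)  # F-distribution lookup
  r_sq = ss_between / (ss_between + ss_within)  # variation explained
  print(f"F = {f_ratio:.2f}, p = {p:.4f}, R-Sq = {100 * r_sq:.2f}%")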

Based on the p-values, statements can generally be formed as follows:

  • Based on the data, I can say that at least one of the means is different and there is a (p-value) chance that I am wrong

  • Or based on the data, I can say that there is an important effect due to this X and there is a (p-value) chance the result is just due to chance

Example output from an ANOVA is shown in Figure 7.04.4.

Figure 7.04.4. ANOVA results for a comparison of samples of Bob's vs Jane's vs Walt's performance (output from Minitab v14).


From the first table in the results:

  • The average variation due to Operator was 40.193 units (SS ÷ DF in the Table[9]).

    [9] From a practical standpoint, a discussion of Degrees of Freedom (DF) and Sequential Sums of Squares (SS) is usually just confusing to Belts. If you are so inclined, then refer to any standard Statistics text. The favorite Statistics for Management & Economics by Keller and Warrack will prove useful; however, I caution Belts that if they are so wrapped up in the statistics, they are probably missing the bigger practical picture.

  • The average variation due to Error (everything else not including Operator) was 0.898 units.

  • The signal-to-noise ratio is therefore 40.193 ÷ 0.898 = 44.76.

  • The likelihood of seeing a signal-to-noise ratio this large (if the populations were perfectly aligned) is 0.000 (the p-value), which is well below 0.05; thus, the conclusion is that at least one of the trio is performing significantly differently from the others.

  • The X "Operator" explains 50.72% of the variation in the data (the R-Sq value).

  • R-Sq (Adj) is close to R-Sq, so there are no redundant terms in the model (if this value drops much lower than R-Sq, which commonly occurs in a multi-way ANOVA, then it is likely that an X is having no effect; here the X clearly has a marked effect).

  • 49.28% of the variation in the data is coming from something other than Operator and thus presents a possible opportunity (100% minus R-Sq).

From the bottom table in the results:

  • A sample of 30 data points was taken for each operator.

  • Bob's sample mean is 24.848, Jane's is 25.446, and Walt's is 27.084.

  • Bob's sample standard deviation is 0.869, Jane's is 0.988, and Walt's is 0.981.

  • The text graph shows the 95% confidence intervals for the locations of the population means for each of the trio.

The p-value of 0.000 in the upper table indicates that at least one of the trio is performing differently from the other two. There is no overlap in the 95% confidence intervals in the bottom table between Walt's performance and the other two; therefore, it is clearly Walt who has a different mean.
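The confidence intervals in the text graph are built from the pooled standard deviation; a sketch of the computation (simulated stand-in data, pooled-StDev convention assumed):

  # Sketch of 95% confidence intervals for each level mean,
  # using a pooled standard deviation as in the One-Way output.
  import numpy as np
  from scipy import stats

  rng = np.random.default_rng(5)
  groups = {"Bob": rng.normal(24.8, 0.9, 30),
            "Jane": rng.normal(25.4, 1.0, 30),
            "Walt": rng.normal(27.1, 1.0, 30)}  # hypothetical

  n_total, k = sum(len(g) for g in groups.values()), len(groups)
  # Pooled StDev: square root of the within-level mean square (Error)
  ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups.values())
  s_pooled = np.sqrt(ss_within / (n_total - k))
  t_crit = stats.t.ppf(0.975, n_total - k)

  for name, g in groups.items():
      half = t_crit * s_pooled / np.sqrt(len(g))
      print(f"{name}: {g.mean():.3f} +/- {half:.3f}")

Non-overlapping intervals point to the level whose mean differs.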

Other Options

The preceding description gave the roadmap for a single X (One-Way ANOVA); however, ANOVA can also be applied to multiple Xs, breaking down the variation accordingly. An example of this is shown in Figure 7.04.5.

Figure 7.04.5. Example multi-way ANOVA for defects in Seals (output from Minitab v14).

Analysis of Variance for Seal Defects, using Adjusted SS for Tests

Source       DF   Seq SS   Adj SS  Adj MS      F      P
Shift         1   36.000   24.600  24.600  10.40  0.032
Operator      1   12.250    0.344   0.344   0.15  0.722
Cement Lot    3  242.750  151.973  50.658  21.42  0.006
Flame Temp    3    4.645    6.211   2.070   0.88  0.525
Line Speed    3    2.645    2.645   0.882   0.37  0.778
Error         4    9.460    9.460   2.365
Total        15  307.750

S = 1.53786   R-Sq = 96.93%   R-Sq(adj) = 88.47%


From the ANOVA table:

  • The p-values for Shift and Cement Lot can be seen to be significant (less than 0.05), but all other Xs are not statistically significant (p-values well above 0.05).

  • 96.93% of the variation is being explained by the Xs (R-Sq), but some Xs are meaningless (the R-Sq (Adj) value is much lower than the R-Sq value).

These two bullets would drive us to re-run the ANOVA including only the two significant Xs; see Figure 7.04.6.

Figure 7.04.6. Example multi-way ANOVA re-run including only significant Xs (output from Minitab v14).

Analysis of Variance for Seal Defects, using Adjusted SS for Tests

Source       DF   Seq SS   Adj SS  Adj MS      F      P
Shift         1   36.000   36.000  36.000  13.66  0.004
Cement Lot    3  242.750  242.750  80.917  30.69  0.000
Error        11   29.000   29.000   2.636
Total        15  307.750

S = 1.62369   R-Sq = 90.58%   R-Sq(adj) = 87.15%


From Figure 7.04.6, the amount of variation explained (R-Sq) has dropped slightly, which is to be expected because a number of terms (Xs) have just been removed. The R-Sq (adj) value is much closer to the R-Sq value, indicating that all the terms (Xs) included actually do something. The p-values have also decreased, representing an increase in significance for the remaining Xs.

No additional re-runs would be required at this point, and focus would turn to the practical implications of Shift and Cement Lot having an impact on the defect rate.
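An analysis along the lines of Figures 7.04.5 and 7.04.6 can be sketched with the statsmodels formula API; the column names and data file below are hypothetical stand-ins for the Seal Defects data:

  # Sketch of a multi-way ANOVA and its re-run with only the
  # significant Xs; column names and file are hypothetical.
  import pandas as pd
  import statsmodels.formula.api as smf
  from statsmodels.stats.anova import anova_lm

  df = pd.read_csv("seal_defects.csv")  # hypothetical data file

  # Full model: all candidate Xs as categorical factors
  full = smf.ols("Defects ~ C(Shift) + C(Operator) + C(CementLot)"
                 " + C(FlameTemp) + C(LineSpeed)", data=df).fit()
  print(anova_lm(full, typ=2))  # Type II SS, akin to Adjusted SS

  # Reduced model: keep only the Xs with p < 0.05
  reduced = smf.ols("Defects ~ C(Shift) + C(CementLot)", data=df).fit()
  print(anova_lm(reduced, typ=2))
  print(f"R-Sq = {reduced.rsquared:.2%}, "
        f"R-Sq(adj) = {reduced.rsquared_adj:.2%}")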



