UNDERLYING DISTRIBUTION ANALYSIS


One of the fundamental bases of the statistical analysis of measurements is our ability to describe the data within the context of a model, or probability distribution. These models are used primarily to describe the shape and area of a given process distribution, so that probabilities may be associated with questions concerning the occurrence of particular values in the distribution. Common probability distributions for discrete random variables include the binomial and Poisson distributions. Probability distributions employed to describe continuous random variables include the normal, exponential, Weibull, gamma, and lognormal.

What is not commonly understood, however, is that most techniques typically employed in statistical quality control and research are based on the assumption that the process(es) studied are approximated by a particular model. The selection of a specific formula or method of analysis may, in fact, be incorrect if this assumption is in error. If such an erroneous assumption is made, the decisions based on the data may be incorrect, regardless of the quality of the calculations. Some examples of this situation are as follows:

  • Many individuals have taken to describing a capable process as one that is in control and whose ±3σ spread of individual parts falls within specification. The use of ±3σ as a means of describing 99.73% of the area in the distribution, and therefore probability, is appropriate only if the process is normally distributed. If it is not, the calculation of σ from R̄/d₂ is misleading, and the results of the analysis are spurious (see the sketch following this list).

  • The development of the X and Moving R (individuals) control chart is heavily dependent on the assumption that the process is normally distributed, because each plotted point is a single observation and the Central Limit Theorem therefore offers no protection for the X chart.

  • Many statistical tests, such as the z, t, and ANOVA, depend on the assumption that the population(s) from which the data are drawn are normally distributed. Tests for comparing variances are particularly sensitive to this assumption.

  • The assumption of a particular distribution is employed when computing tolerance (confidence) intervals and predicting product/part interference and performance.

  • Life tests are often based on the assumption of an underlying exponential distribution.

  • Time-to-repair system estimation methods often assume an underlying lognormal or exponential distribution.

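As a rough numerical illustration of the first point in the list above, the following Python sketch estimates σ as R̄/d₂ from a few rational subgroups and forms the ±3σ interval. The subgroup data, the subgroup size of 5, and the tabled constant d₂ = 2.326 are illustrative assumptions, not values from the text; the 99.73% coverage attached to the interval holds only under the normal model.

```python
# Minimal sketch, assuming rational subgroups of size n = 5 and the tabled
# d2 constant for that subgroup size (2.326). Data are illustrative only.
subgroups = [
    [10.2, 9.8, 10.1, 10.4, 9.9],
    [10.0, 10.3, 9.7, 10.2, 10.1],
    [9.9, 10.1, 10.0, 10.3, 9.8],
]
d2 = 2.326  # tabled constant for subgroup size n = 5

ranges = [max(g) - min(g) for g in subgroups]
r_bar = sum(ranges) / len(ranges)
sigma_hat = r_bar / d2  # within-subgroup estimate of sigma

grand_mean = sum(sum(g) for g in subgroups) / sum(len(g) for g in subgroups)
lower, upper = grand_mean - 3 * sigma_hat, grand_mean + 3 * sigma_hat

# The claim that roughly 99.73% of individual parts fall inside these limits
# is valid only if the process output is adequately approximated by a normal
# distribution; for a skewed process the same limits may cover far less area.
print(f"R-bar = {r_bar:.3f}, sigma-hat = {sigma_hat:.3f}")
print(f"Xbar +/- 3*sigma-hat: ({lower:.3f}, {upper:.3f})")
```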
Given that the validity of the statistical analysis selected is largely dependent on the correct assumption of a specific process distribution, it is desirable, if not essential, to determine whether the assumption we have made regarding an underlying distribution is reasonable.
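One widely used formal check of this kind is the Shapiro-Wilk test for normality. The Python sketch below (using SciPy) shows the mechanics of such a check; the sample data are simulated and the α = 0.10 level is an illustrative assumption, echoing the level suggested later in this section.

```python
import numpy as np
from scipy import stats

# Illustrative sample only; in practice these would be process measurements.
rng = np.random.default_rng(1)
x = rng.normal(loc=10.0, scale=0.2, size=40)

alpha = 0.10  # a relatively high Type I error level for this kind of test

# Shapiro-Wilk test of H0: a normal model is a reasonable approximation.
w_stat, p_value = stats.shapiro(x)
if p_value < alpha:
    print(f"p = {p_value:.3f} < {alpha}: reject H0; a normal model looks questionable")
else:
    print(f"p = {p_value:.3f} >= {alpha}: no evidence against a normal approximation")
```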

Many statisticians would state this hypothesis as follows:

H₀: It is reasonable to assume that the sample data were drawn from a (for instance) normal distribution.

However, many others (Shapiro 1980, for example) believe that this is a misleading statement. R. C. Geary (1947) once suggested that in the front of all textbooks on statistics, the following statement should appear: "Normality is a myth. There never was, and will never be, a normal distribution."

Therefore, Shapiro suggests, the hypothesis tested should actually be stated as follows:

H₀: It is reasonable to approximate our process or population data with a (for example) normal distribution model and its associated analytical techniques.

Given that this is the approach with the most validity, these tests are often run at relatively higher levels of Type I error (α = .10 is frequently suggested). This is because the consequences of committing a Type I error are, in this case, relatively minor. Rejection of the null hypothesis will lead to one or more of the following actions:

  1. Tests are run to find an alternative model and procedures that may be used to assess the data.

  2. The data are transformed so that the transformed data approximate the assumed model. An example of this is the Box procedure of comparing the logarithms of variances, rather than the variances themselves, when the assumption of normality may not be accepted.

  3. Nonparametric, or so-called "distribution-free," statistical analyses may be used in place of the equivalent parametric methods: for example, the Mann-Whitney U test rather than a t test, or the Kruskal-Wallis test as a replacement for ANOVA. (A rough sketch of actions 2 and 3 follows this list.)
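As a rough illustration of actions 2 and 3, the Python sketch below contrasts a two-sample t test with its nonparametric Mann-Whitney U counterpart and computes log-transformed variances in the spirit of the Box procedure. The data and parameter choices are illustrative assumptions only, not examples from the text.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
# Illustrative skewed (lognormal) samples, where a normality assumption is doubtful.
a = rng.lognormal(mean=0.0, sigma=0.5, size=25)
b = rng.lognormal(mean=0.2, sigma=0.5, size=25)

# Action 3: a nonparametric alternative to the two-sample t test.
t_stat, t_p = stats.ttest_ind(a, b)
u_stat, u_p = stats.mannwhitneyu(a, b, alternative="two-sided")
print(f"t test p = {t_p:.3f}, Mann-Whitney U p = {u_p:.3f}")

# Action 2 (in the spirit of comparing logarithms of variances): work with
# the log-transformed sample variances rather than the raw variances.
log_var_a = np.log(np.var(a, ddof=1))
log_var_b = np.log(np.var(b, ddof=1))
print(f"log variance A = {log_var_a:.3f}, log variance B = {log_var_b:.3f}")
```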



