This section discusses the sampling frameworks and distribution assumptions for the CATMOD and FREQ procedures.
Suppose you take a simple random sample of 100 people and ask each person the following question: Of the three colors red, blue, and green, which is your favorite? You then tabulate the results in a frequency table as shown in Table 4.1.
Favorite Color | Red | Blue | Green | Total |
---|---|---|---|---|
Frequency | 52 | 31 | 17 | 100 |
Proportion | 0.52 | 0.31 | 0.17 | 1.00 |
In the population you are sampling, you assume there is an unknown probability that a population member, selected at random, would choose any given color. To estimate that probability, you use the sample proportion
$$p_j = \frac{n_j}{n}$$
where $n_j$ is the frequency of the $j$th response and $n$ is the total frequency.
Because of the random variation inherent in any random sample, the frequencies have a probability distribution representing their relative frequency of occurrence in a hypothetical series of samples. For a simple random sample, the distribution of frequencies for a frequency table with three levels is as follows. The probability that the first frequency is $n_1$, the second frequency is $n_2$, and the third is $n_3 = n - n_1 - n_2$ is given by
$$\Pr(n_1, n_2, n_3) = \frac{n!}{n_1!\, n_2!\, n_3!}\, \pi_1^{n_1} \pi_2^{n_2} \pi_3^{n_3}$$
where $\pi_j$ is the true probability of observing the $j$th response level in the population.
This distribution, called the multinomial distribution, can be generalized to any number of response levels. The special case of two response levels is called the binomial distribution.
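As a minimal sketch of the multinomial distribution described above, the following Python function evaluates the probability of an observed frequency vector. The function name `multinomial_pmf` is hypothetical, and because the true probabilities are unknown in practice, the sample proportions from Table 4.1 are used here only as stand-in values.

```python
from math import factorial, prod

def multinomial_pmf(counts, probs):
    """Probability of observing `counts` in a simple random sample
    of size sum(counts), given response probabilities `probs`."""
    n = sum(counts)
    # Multinomial coefficient n! / (n_1! n_2! ... n_k!), computed exactly.
    coef = factorial(n)
    for c in counts:
        coef //= factorial(c)
    return coef * prod(p ** c for p, c in zip(probs, counts))

# Frequencies from Table 4.1; the probabilities are hypothetical stand-ins.
counts = [52, 31, 17]
probs = [0.52, 0.31, 0.17]
prob_observed = multinomial_pmf(counts, probs)

# With two response levels the same function gives a binomial probability.
binomial_check = multinomial_pmf([1, 1], [0.5, 0.5])
```

Summing the function over every possible frequency vector for a fixed sample size returns 1, which is a quick way to check an implementation of this kind.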
Simple random sampling is the type of sampling required by PROC CATMOD when there is one population. PROC CATMOD uses the multinomial distribution to estimate a probability vector and its covariance matrix. If the sample size is sufficiently large, then the probability vector is approximately normally distributed as a result of central limit theory. PROC CATMOD uses this result to compute appropriate test statistics for the specified statistical model.
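The estimation step can be sketched in a few lines. Under the multinomial distribution, the sample proportion $p_j$ has variance $\pi_j(1-\pi_j)/n$, and two different proportions have covariance $-\pi_j\pi_k/n$; substituting the sample proportions for the unknown probabilities gives an estimated covariance matrix. The helper below is a plain-Python illustration with a hypothetical name, not the code that PROC CATMOD runs.

```python
def proportion_covariance(counts):
    """Estimate the probability vector and its covariance matrix
    from multinomial frequencies (illustrative helper, not a SAS API)."""
    n = sum(counts)
    p = [c / n for c in counts]
    k = len(p)
    # Diagonal entries: p_j (1 - p_j) / n; off-diagonal entries: -p_j p_m / n.
    cov = [[(p[j] * (1 - p[j]) if j == m else -p[j] * p[m]) / n
            for m in range(k)] for j in range(k)]
    return p, cov

p, cov = proportion_covariance([52, 31, 17])  # Table 4.1 frequencies
std_errors = [cov[j][j] ** 0.5 for j in range(len(p))]
```

Each row of the matrix sums to zero because the proportions sum to one, so the full matrix is singular.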
Suppose you take two simple random samples, 50 men and 50 women, and ask the same question as before. You are now sampling two different populations that may have different response probabilities. The data can be tabulated as shown in Table 4.2.
Sex / Favorite Color | Red | Blue | Green | Total |
---|---|---|---|---|
Male | 30 | 10 | 10 | 50 |
Female | 20 | 10 | 20 | 50 |
Total | 50 | 20 | 30 | 100 |
Note that the row marginal totals (50, 50) of the contingency table are fixed by the sampling design, but the column marginal totals (50, 20, 30) are random. There are six probabilities of interest for this table, and they are estimated by the sample proportions
$$p_{ij} = \frac{n_{ij}}{n_i}$$
where $n_{ij}$ denotes the frequency for the $i$th population and the $j$th response, and $n_i$ is the total frequency for the $i$th population. For this contingency table, the sample proportions are shown in Table 4.3.
Sex / Favorite Color | Red | Blue | Green | Total |
---|---|---|---|---|
Male | 0.60 | 0.20 | 0.20 | 1.00 |
Female | 0.40 | 0.20 | 0.40 | 1.00 |
The probability distribution of the six frequencies is the product multinomial distribution
$$\Pr(n_{11}, n_{12}, n_{13}, n_{21}, n_{22}, n_{23}) = \prod_{i=1}^{2} \frac{n_i!}{n_{i1}!\, n_{i2}!\, n_{i3}!}\, \pi_{i1}^{n_{i1}} \pi_{i2}^{n_{i2}} \pi_{i3}^{n_{i3}}$$
where $\pi_{ij}$ is the true probability of observing the $j$th response level in the $i$th population. The product multinomial distribution is simply the product of two or more individual multinomial distributions, since the populations are independent. This distribution can be generalized to any number of populations and response levels.
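Because the populations are independent, the product multinomial probability can be sketched as a product of one multinomial probability per row. The function names below are hypothetical, and the row-wise sample proportions of Table 4.3 stand in for the unknown true probabilities.

```python
from math import factorial, prod

def multinomial_pmf(counts, probs):
    """Multinomial probability of `counts` given probabilities `probs`."""
    n = sum(counts)
    coef = factorial(n)
    for c in counts:
        coef //= factorial(c)
    return coef * prod(p ** c for p, c in zip(probs, counts))

def product_multinomial_pmf(table, prob_rows):
    """Joint probability of all rows of `table`, with one independent
    multinomial distribution per population (row)."""
    return prod(multinomial_pmf(row, probs)
                for row, probs in zip(table, prob_rows))

table = [[30, 10, 10],   # Male   (Table 4.2)
         [20, 10, 20]]   # Female (Table 4.2)
# Row-wise sample proportions (Table 4.3), used as stand-in probabilities.
proportions = [[c / sum(row) for c in row] for row in table]
joint_prob = product_multinomial_pmf(table, proportions)
```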
Stratified simple random sampling is the type of sampling required by PROC CATMOD when there is more than one population. PROC CATMOD uses the product multinomial distribution to estimate a probability vector and its covariance matrix. If the sample sizes are sufficiently large, then the probability vector is approximately normally distributed as a result of central limit theory, and PROC CATMOD uses this result to compute appropriate test statistics for the specified statistical model. The statistics are known as Wald statistics, and they are approximately distributed as chi-square when the null hypothesis is true.
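To make the idea of a Wald statistic concrete, the sketch below tests whether the two populations in Table 4.2 share the same response probabilities. It is an illustration under stated assumptions: the function vector is taken to be the difference in the first two sample proportions, the function name is hypothetical, and the sketch does not attempt to reproduce PROC CATMOD output.

```python
def wald_homogeneity(table):
    """Wald chi-square for equal response probabilities in two
    populations with three response levels (illustrative only)."""
    r1, r2 = table
    n1, n2 = sum(r1), sum(r2)
    p1 = [c / n1 for c in r1]
    p2 = [c / n2 for c in r2]
    # Function vector: differences in the first two proportions
    # (the third is redundant because each row sums to one).
    f = [p1[0] - p2[0], p1[1] - p2[1]]

    def cov2(p, n):
        # 2 x 2 multinomial covariance block for the first two proportions.
        return [[p[0] * (1 - p[0]) / n, -p[0] * p[1] / n],
                [-p[0] * p[1] / n, p[1] * (1 - p[1]) / n]]

    a, b = cov2(p1, n1), cov2(p2, n2)
    # Independent samples: the covariance of the difference is the sum.
    v = [[a[i][j] + b[i][j] for j in range(2)] for i in range(2)]
    det = v[0][0] * v[1][1] - v[0][1] * v[1][0]
    # Quadratic form f' V^{-1} f using the closed-form 2 x 2 inverse.
    return (f[0] ** 2 * v[1][1] - 2 * f[0] * f[1] * v[0][1]
            + f[1] ** 2 * v[0][0]) / det

q_wald = wald_homogeneity([[30, 10, 10], [20, 10, 20]])  # Table 4.2
```

Under the null hypothesis, a statistic of this form is compared with a chi-square distribution whose degrees of freedom equal the number of functions tested, which is two here.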
Sometimes the observed data do not come from a random sample but instead represent a complete set of observations on some population. For example, suppose a class of 100 students is classified according to sex and favorite color. The results are shown in Table 4.4.
Sex / Favorite Color | Red | Blue | Green | Total |
---|---|---|---|---|
Male | 16 | 21 | 20 | 57 |
Female | 12 | 20 | 11 | 43 |
Total | 28 | 41 | 31 | 100 |
In this case, you could argue that all of the frequencies are fixed since the entire population is observed; therefore, there is no sampling error. On the other hand, you could hypothesize that the observed table has only fixed marginals and that the cell frequencies represent one realization of a conceptual process of assigning color preferences to individuals. The assignment process is open to hypothesis, which means that you can hypothesize restrictions on the joint probabilities.
The usual hypothesis (sometimes called randomness) is that the distribution of the column variable (Favorite Color) does not depend on the row variable (Sex). This implies that, for each row of the table, the assignment process corresponds to a simple random sample (without replacement) from the finite population represented by the column marginal totals (or by the column marginal subtotals that remain after sampling other rows). The hypothesis of randomness induces a probability distribution on the frequencies in the table; it is called the hypergeometric distribution.
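Under the randomness hypothesis with both sets of marginal totals fixed, the probability of a particular table has a closed form: the product of the factorials of the row and column totals, divided by the product of $n!$ and the factorials of the cell frequencies. The following sketch evaluates it exactly for Table 4.4; the function name is hypothetical.

```python
from fractions import Fraction
from math import factorial, prod

def table_probability(table):
    """Probability of an r x c table under the randomness hypothesis,
    conditional on its row and column marginal totals."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    numerator = prod(factorial(t) for t in row_totals + col_totals)
    denominator = factorial(n) * prod(factorial(c) for row in table for c in row)
    return Fraction(numerator, denominator)  # exact rational arithmetic

class_table = [[16, 21, 20],   # Male   (Table 4.4)
               [12, 20, 11]]   # Female (Table 4.4)
prob_table = float(table_probability(class_table))
```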
If the same row and column variables are observed for each of several populations, then the probability distribution of all the frequencies can be called the multiple hypergeometric distribution. Each population is called a stratum, and an analysis that draws information from each stratum and then summarizes across them is called a stratified analysis (or a blocked analysis or a matched analysis). PROC FREQ does such a stratified analysis, computing test statistics and measures of association.
In general, the populations are formed on the basis of cross-classifications of independent variables. Stratified analysis is a method of adjusting for the effect of these variables without being forced to estimate parameters for them.
The multiple hypergeometric distribution is the one used by PROC FREQ for the computation of Cochran-Mantel-Haenszel statistics. These statistics are in the class of randomization model test statistics, which require minimal assumptions for their validity. PROC FREQ uses the multiple hypergeometric distribution to compute the mean and the covariance matrix of a function vector in order to measure the deviation between the observed and expected frequencies with respect to a particular type of alternative hypothesis. If the cell frequencies are sufficiently large, then the function vector is approximately normally distributed as a result of central limit theory, and PROC FREQ uses this result to compute a quadratic form that has a chi-square distribution when the null hypothesis is true.
Consider a randomized experiment in which patients are assigned to one of two treatment groups according to a randomization process that allocates 50 patients to each group. After a specified period of time, each patient's status (cured or uncured) is recorded. Suppose the data shown in Table 4.5 give the results of the experiment. The null hypothesis is that the two treatments are equally effective. Under this hypothesis, treatment is a randomly assigned label that has no effect on the cure rate of the patients. This implies that each row of the table represents a simple random sample from the finite population whose cure rate is described by the column marginal totals. Therefore, the column marginals (58, 42) are fixed under the hypothesis. Since the row marginals (50, 50) are fixed by the allocation process, the hypergeometric distribution is induced on the cell frequencies. Randomized experiments can also be specified in a stratified framework, and Cochran-Mantel-Haenszel statistics can be computed relative to the corresponding multiple hypergeometric distribution.
Treatment / Status | Cured | Uncured | Total |
---|---|---|---|
1 | 36 | 14 | 50 |
2 | 22 | 28 | 50 |
Total | 58 | 42 | 100 |
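For a single 2 × 2 table with fixed margins, the hypergeometric distribution gives the mean and variance of the upper-left cell in closed form, and the squared standardized deviation is approximately chi-square with one degree of freedom under the null hypothesis. The sketch below applies this to Table 4.5. The function name is hypothetical, and the sketch covers only the one-stratum case, not the general stratified computation in PROC FREQ.

```python
def hypergeometric_chi_square(table):
    """Randomization chi-square for a single 2 x 2 table with fixed margins."""
    (n11, n12), (n21, n22) = table
    r1, r2 = n11 + n12, n21 + n22          # row totals
    c1, c2 = n11 + n21, n12 + n22          # column totals
    n = r1 + r2
    # Hypergeometric mean and variance of the upper-left cell n11.
    expected = r1 * c1 / n
    variance = r1 * r2 * c1 * c2 / (n ** 2 * (n - 1))
    return expected, variance, (n11 - expected) ** 2 / variance

# Treatment by status frequencies from Table 4.5.
expected, variance, q = hypergeometric_chi_square([[36, 14], [22, 28]])
```

With these data, 29 cures are expected in treatment group 1 under the null hypothesis, compared with the 36 observed.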
As indicated previously, the CATMOD procedure assumes that the data are from a stratified simple random sample, so it uses the product multinomial distribution. If the data are not from such a sample, then in many cases it is still possible to use PROC CATMOD by arguing that each row of the contingency table does represent a simple random sample from some hypothetical population. The extent to which the inferences are generalizable depends on the extent to which the hypothetical population is perceived to resemble the target population.
Similarly, the Cochran-Mantel-Haenszel statistics use the multiple hypergeometric distribution, which requires fixed row and column marginal totals in each contingency table. If the sampling process does not yield a table with fixed margins, then it is usually possible to fix the margins through conditioning arguments similar to the ones used by Fisher when he developed the exact test for $2 \times 2$ tables. In other words, if you want fixed marginal totals, you can generally make your analysis conditional on those observed totals.
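The conditioning argument can be illustrated with a one-sided version of Fisher's exact test: treat the observed margins as fixed, then sum the hypergeometric probabilities of every table whose upper-left cell is at least as large as the one observed. This is a minimal sketch with hypothetical function names, applied to the frequencies of Table 4.5.

```python
from fractions import Fraction
from math import comb

def hypergeom_pmf(x, r1, c1, n):
    """P(n11 = x) given first-row total r1, first-column total c1, and n."""
    return Fraction(comb(c1, x) * comb(n - c1, r1 - x), comb(n, r1))

def fisher_upper_tail(table):
    """One-sided p-value: probability of an upper-left cell at least as
    large as observed, conditional on the observed margins."""
    (n11, n12), (n21, _) = table
    r1, c1 = n11 + n12, n11 + n21
    n = sum(sum(row) for row in table)
    upper = min(r1, c1)  # largest value n11 can take with these margins
    return float(sum(hypergeom_pmf(x, r1, c1, n) for x in range(n11, upper + 1)))

p_upper = fisher_upper_tail([[36, 14], [22, 28]])  # Table 4.5
```

Exact rational arithmetic is used for the individual probabilities so that the sum is not affected by rounding in the large binomial coefficients.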
For more information on sampling models for categorical data, see Bishop, Fienberg, and Holland (1975, Chapter 13).