PROC SURVEYMEANS Statement | SAS/STAT 9.1 User's Guide (Vol. 6)

PROC SURVEYMEANS < options > < statistic-keywords > ;

The PROC SURVEYMEANS statement invokes the procedure. In this statement, you identify the data set to be analyzed and specify sample design information. The DATA= option names the input data set to be analyzed. If your analysis includes a finite population correction factor, you can input either the sampling rate or the population total using the RATE= or TOTAL= option. If your design is stratified, with different sampling rates or totals for different strata, then you can input these stratum rates or totals in a SAS data set containing the stratification variables.

In the PROC SURVEYMEANS statement, you also can use statistic-keywords to specify statistics for the procedure to compute. Available statistics include the population mean and population total, together with their variance estimates and confidence limits. You can also request data set summary information and sample design information.

You can specify the following options in the PROC SURVEYMEANS statement:

ALPHA=α

sets the confidence level for confidence limits. The value of the ALPHA= option must be between 0 and 1, and the default value is 0.05. A confidence level of α produces 100(1 − α)% confidence limits. The default of ALPHA=0.05 produces 95% confidence limits.
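As a minimal sketch of the ALPHA= option, the following statements request 90% confidence limits for a mean. The data set name Sample and the variable y are hypothetical names introduced only for illustration.

```sas
/* Hypothetical data: five observations of one numeric variable */
data Sample;
   input y;
   datalines;
12
5
13
23
11
;

/* ALPHA=0.1 requests 100(1 - 0.1)% = 90% confidence limits for the mean */
proc surveymeans data=Sample alpha=0.1 mean clm;
   var y;
run;
```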

DATA=SAS-data-set

specifies the SAS data set to be analyzed by PROC SURVEYMEANS. If you omit the DATA= option, the procedure uses the most recently created SAS data set.

MISSING

requests that the procedure treat missing values as a valid category for all categorical variables, which include categorical analysis variables, strata variables, cluster variables, and domain variables.

ORDER=DATA | FORMATTED | INTERNAL

specifies the order in which the values of the categorical variables are to be reported. The following shows how PROC SURVEYMEANS interprets values of the ORDER= option:

DATA	orders values according to their order in the input data set.
FORMATTED	orders values by their formatted values. This order is operating environment dependent. By default, the order is ascending.
INTERNAL	orders values by their unformatted values, which yields the same order that the SORT procedure does. This order is operating environment dependent.

By default, ORDER=FORMATTED.
The ORDER= option applies to all the categorical variables. When the default ORDER=FORMATTED is in effect for numeric variables for which you have supplied no explicit format, the levels are ordered by their internal values.

RATE=value | SAS-data-set

R=value | SAS-data-set

specifies the sampling rate as a nonnegative value, or names an input data set that contains the stratum sampling rates. The procedure uses this information to compute a finite population correction for variance estimation. If your sample design has multiple stages, you should specify the first-stage sampling rate, which is the ratio of the number of PSUs selected to the total number of PSUs in the population.
For a nonstratified sample design, or for a stratified sample design with the same sampling rate in all strata, you should specify a nonnegative value for the RATE= option. If your design is stratified with different sampling rates in the strata, then you should name a SAS data set that contains the stratification variables and the sampling rates. See the section 'Specification of Population Totals and Sampling Rates' on page 4334 for more details.
The sampling rate value must be a nonnegative number. You can specify value as a number between 0 and 1. Or you can specify value in percentage form as a number between 1 and 100, and PROC SURVEYMEANS will convert that number to a proportion. The procedure treats the value 1 as 100%, and not the percentage form 1%.
If you do not specify the TOTAL= option or the RATE= option, then the variance estimation does not include a finite population correction. You cannot specify both the TOTAL= option and the RATE= option.
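The two forms of the RATE= option can be sketched as follows. The data sets Sample and StratumRates, the stratification variable Region, and all data values are hypothetical. The variable name _RATE_ in the secondary data set is the name the procedure is understood to expect; check the section 'Specification of Population Totals and Sampling Rates' before relying on it.

```sas
/* Hypothetical stratified sample: two strata, one analysis variable */
data Sample;
   input Region $ y;
   datalines;
East 12
East 5
East 13
West 23
West 11
West 17
;

/* Same sampling rate in every stratum, given as a proportion */
proc surveymeans data=Sample rate=0.05 mean;
   strata Region;
   var y;
run;

/* Different rates per stratum: one observation per stratum, with the
   stratification variable and the rate (assumed variable name _RATE_) */
data StratumRates;
   input Region $ _rate_;
   datalines;
East 0.05
West 0.10
;

proc surveymeans data=Sample rate=StratumRates mean;
   strata Region;
   var y;
run;
```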

STACKING

requests the procedure to produce the output data sets using a stacking table structure, which was the default in releases prior to Version 9. The new default is to produce a rectangular table structure in the output data sets.
The STACKING option affects the following tables:
- Domain
- Ratio
- Statistics
- StrataInfo

When you use the ODS statement to create SAS data sets for these tables in the output, the data set structure can be either stacking or rectangular. A rectangular structure creates one observation for each analysis variable in the data set. However, if you use the STACKING option in Version 9, the procedure creates only one observation in the output data set for all analysis variables. The following example shows these two structures in output data sets.

  data new;
     input sex$ x;
     datalines;
  M 12
  F 5
  M 13
  F 23
  F 11
  ;
  proc surveymeans data=new mean;
     ods output statistics=rectangle;
  run;
  proc print data=rectangle;
  run;
  proc surveymeans data=new mean stacking;
     ods output statistics=stacking;
  run;
  proc print data=stacking;
  run;

Figure 70.6 shows the rectangular structure of the output data set for the statistics table.

  rectangle structure in the output data set

         Var     Var
  OBS    Name    Level         Mean       StdErr
    1    x                12.800000     2.905168
    2    sex     F         0.600000     0.244949
    3    sex     M         0.400000     0.244949

Figure 70.6: Rectangular Structure in the Output Data Set

Figure 70.7 shows the stacking structure of the output data set for the statistics table.

  stacking structure in the output data set

  OBS    x       x_Mean    x_StdErr    sex_F    sex_F_Mean
    1    x    12.800000    2.905168    sex=F      0.600000

  OBS    sex_F_StdErr    sex_M    sex_M_Mean    sex_M_StdErr
    1        0.244949    sex=M      0.400000        0.244949

Figure 70.7: Stacking Structure in the Output Data Set

TOTAL=value | SAS-data-set

N=value | SAS-data-set

specifies the total number of primary sampling units (PSUs) in the study population as a positive value, or names an input data set that contains the stratum population totals. The procedure uses this information to compute a finite population correction for variance estimation.
For a nonstratified sample design, or for a stratified sample design with the same population total in all strata, you should specify a positive value for the TOTAL= option. If your sample design is stratified with different population totals in the strata, then you should name a SAS data set that contains the stratification variables and the population totals. See the section 'Specification of Population Totals and Sampling Rates' on page 4334 for more details.
If you do not specify the TOTAL= option or the RATE= option, then the variance estimation does not include a finite population correction. You cannot specify both the TOTAL= option and the RATE= option.
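A minimal sketch of supplying stratum population totals through a secondary data set follows. The data sets Sample and StratumTotals, the variable Region, and the totals themselves are hypothetical. The variable name _TOTAL_ is the name the procedure is understood to expect; see the section 'Specification of Population Totals and Sampling Rates' to confirm.

```sas
/* Hypothetical stratified sample */
data Sample;
   input Region $ y;
   datalines;
East 12
East 5
East 13
West 23
West 11
West 17
;

/* One observation per stratum: the stratification variable and the
   number of PSUs in that stratum (assumed variable name _TOTAL_) */
data StratumTotals;
   input Region $ _total_;
   datalines;
East 60
West 30
;

proc surveymeans data=Sample total=StratumTotals mean;
   strata Region;
   var y;
run;
```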

statistic-keywords

specifies the statistics for the procedure to compute. If you do not specify any statistic-keywords, PROC SURVEYMEANS computes the NOBS, MEAN, STDERR, and CLM statistics by default.
The statistics produced depend on the type of the analysis variable. If you name a numeric variable in the CLASS statement, then the procedure analyzes that variable as a categorical variable. The procedure always analyzes character variables as categorical. See the section 'CLASS Statement' on page 4329 for more information.
PROC SURVEYMEANS computes MIN, MAX, and RANGE for numeric variables but not for categorical variables. For numeric variables, the keyword MEAN produces the mean, but for categorical variables it produces the proportion in each category or level. Also for categorical variables, the keyword NOBS produces the number of observations for each variable level, and the keyword NMISS produces the number of missing observations for each level. If you request the keyword NCLUSTER for a categorical variable, PROC SURVEYMEANS displays for each level the number of clusters with observations in that level. PROC SURVEYMEANS computes SUMWGT in the same way for both categorical and numeric variables, as the sum of the weights over all nonmissing observations.
PROC SURVEYMEANS performs univariate analysis, analyzing each variable separately. Thus the number of nonmissing and missing observations may not be the same for all analysis variables. See the section 'Missing Values' on page 4333 for more information.
If you use the keyword RATIO without the keyword MEAN, the keyword MEAN is implied.
Other available statistics computed for a ratio are N, NCLU, SUMWGT, RATIO, STDERR, DF, T, PROBT, and CLM, as listed below. If no statistics are requested, the procedure will compute the ratio and its standard error by default for a RATIO statement.

The valid statistic-keywords are as follows:

ALL	all statistics listed
CLM	100(1 − α)% two-sided confidence limits for the MEAN, where α is determined by the ALPHA= option described on page 4323, and the default is α = 0.05
CLSUM	100(1 − α)% two-sided confidence limits for the SUM, where α is determined by the ALPHA= option described on page 4323, and the default is α = 0.05
CV	coefficient of variation for MEAN
CVSUM	coefficient of variation for SUM
DF	degrees of freedom for the t test
LCLM	100(1 − α)% one-sided lower confidence limit of the MEAN, where α is determined by the ALPHA= option described on page 4323, and the default is α = 0.05
LCLMSUM	100(1 − α)% one-sided lower confidence limit of the SUM, where α is determined by the ALPHA= option described on page 4323, and the default is α = 0.05
MAX	maximum value
MEAN	mean for a numeric variable, or the proportion in each category for a categorical variable
MIN	minimum value
NCLUSTER	number of clusters
NMISS	number of missing observations
NOBS	number of nonmissing observations
RANGE	range, MAX-MIN
RATIO	ratio of means or proportions
STD	standard deviation of the SUM. When you request SUM, the procedure computes STD by default.
STDERR	standard error of the MEAN or RATIO. When you request MEAN or RATIO, the procedure computes STDERR by default.
SUM	weighted sum, ∑ w_i y_i, or estimated population total when the appropriate sampling weights are used
SUMWGT	sum of the weights, ∑ w_i
T	t-value and its corresponding p-value with DF degrees of freedom for H0: θ = 0, where θ is the population mean or the population ratio
UCLM	100(1 − α)% one-sided upper confidence limit of the MEAN, where α is determined by the ALPHA= option described on page 4323, and the default is α = 0.05
UCLMSUM	100(1 − α)% one-sided upper confidence limit of the SUM, where α is determined by the ALPHA= option described on page 4323, and the default is α = 0.05
VAR	variance of the MEAN or RATIO
VARSUM	variance of the SUM

See the section 'Statistical Computations' on page 4336 for details on how PROC SURVEYMEANS computes these statistics.
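As an illustration of how statistic-keywords are listed in the PROC SURVEYMEANS statement, the following sketch requests a handful of statistics for one character variable and one numeric variable. The data set Sample, the variables Gender, y, and w, and the data values are hypothetical.

```sas
/* Hypothetical sample: Gender is character, y is numeric,
   w is a sampling weight */
data Sample;
   input Gender $ y w;
   datalines;
M 12 10
F 5 10
M 13 20
F 23 20
F 11 10
;

/* For y, MEAN requests the mean; for Gender, which is analyzed as
   categorical, MEAN requests the proportion in each level */
proc surveymeans data=Sample nobs mean stderr sum std clsum;
   var Gender y;
   weight w;
run;
```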

BY Statement

BY variables ;

You can specify a BY statement with PROC SURVEYMEANS to obtain separate analyses on observations in groups defined by the BY variables.

Note that using a BY statement provides completely separate analyses of the BY groups. It does not provide a statistically valid subpopulation or domain analysis, where the total number of units in the subpopulation is not known with certainty. You should use the DOMAIN statement to obtain domain analysis.

When a BY statement appears, the procedure expects the input data sets to be sorted in order of the BY variables. The variables are one or more variables in the input data set.

If you specify more than one BY statement, the procedure uses only the latest BY statement and ignores any previous ones.

If your input data set is not sorted in ascending order, use one of the following alternatives:

- Sort the data using the SORT procedure with a similar BY statement.
- Use the BY statement options NOTSORTED or DESCENDING in the BY statement. The NOTSORTED option does not mean that the data are unsorted but rather that the data are arranged in groups (according to values of the BY variables) and that these groups are not necessarily in alphabetical or increasing numeric order.
- Create an index on the BY variables using the DATASETS procedure.

For more information on the BY statement, refer to the discussion in SAS Language Reference: Concepts. For more information on the DATASETS procedure, refer to the discussion in the SAS Procedures Guide.
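The first alternative, sorting with the SORT procedure before using a BY statement, might look like the following sketch. The data set Sample and the variables Year and y are hypothetical.

```sas
/* Hypothetical data that are not in BY-variable order */
data Sample;
   input Year y;
   datalines;
2002 12
2001 5
2002 13
2001 23
2002 11
2001 17
;

/* Sort by the same BY variable that the analysis uses */
proc sort data=Sample;
   by Year;
run;

/* Each value of Year is analyzed as a completely separate BY group */
proc surveymeans data=Sample mean;
   by Year;
   var y;
run;
```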

CLASS Statement

CLASS | CLASSES variables ;

The CLASS statement names variables to be analyzed as categorical variables. For categorical variables, PROC SURVEYMEANS estimates the proportion in each category or level, instead of the overall mean. PROC SURVEYMEANS always analyzes character variables as categorical. If you want categorical analysis for a numeric variable, you must include that variable in the CLASS statement.

The CLASS variables are one or more variables in the DATA= input data set. These variables can be either character or numeric. The formatted values of the CLASS variables determine the categorical variable levels. Thus, you can use formats to group values into levels. Refer to the discussion of the FORMAT procedure in the SAS Procedures Guide and to the discussions of the FORMAT statement and SAS formats in SAS Language Reference: Dictionary.

You can use multiple CLASS statements to specify categorical variables.

When you specify class variables, you may use the SAS system option SUMSIZE= to limit (or to specify) the amount of memory that is available for data analysis. Refer to the chapter on SAS System options in SAS Language Reference: Dictionary for a description of the SUMSIZE= option.
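The following sketch shows one way to analyze a numeric variable as categorical and to group its values into levels with a format. The format name agegrp, the data set Sample, and the variables Age and y are hypothetical.

```sas
/* Hypothetical format that groups ages into two levels */
proc format;
   value agegrp low-<40 = 'Under 40'
                40-high = '40 and over';
run;

data Sample;
   input Age y;
   datalines;
25 12
37 5
41 13
58 23
33 11
;

/* Age is numeric, so it must be named in the CLASS statement to be
   analyzed as categorical; its formatted values define the levels */
proc surveymeans data=Sample mean;
   class Age;
   var Age y;
   format Age agegrp.;
run;
```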

CLUSTER Statement

CLUSTER | CLUSTERS variables ;

The CLUSTER statement names variables that identify the clusters in a clustered sample design. The combinations of categories of CLUSTER variables define the clusters in the sample. If there is a STRATA statement, clusters are nested within strata.

If your sample design has clustering at multiple stages, you should identify only the first-stage clusters, or primary sampling units (PSUs), in the CLUSTER statement. See the section 'Primary Sampling Units (PSUs)' on page 4335 for more information.

The CLUSTER variables are one or more variables in the DATA= input data set. These variables can be either character or numeric. The formatted values of the CLUSTER variables determine the CLUSTER variable levels. Thus, you can use formats to group values into levels. Refer to the discussion of the FORMAT procedure in the SAS Procedures Guide and to the discussions of the FORMAT statement and SAS formats in SAS Language Reference: Dictionary.

You can use multiple CLUSTER statements to specify cluster variables. The procedure uses variables from all CLUSTER statements to create clusters.
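A minimal sketch of a stratified, clustered design follows, with first-stage clusters nested within strata. The data set Sample, the variables Region, School, y, and w, and the data values are hypothetical.

```sas
/* Hypothetical two-stage sample: schools (PSUs) are selected within
   regions, and students are observed within schools */
data Sample;
   input Region $ School y w;
   datalines;
East 1 12 25
East 1 5 25
East 2 13 25
East 2 23 25
West 3 11 40
West 3 17 40
West 4 9 40
West 4 14 40
;

/* Only the first-stage clusters are named in the CLUSTER statement */
proc surveymeans data=Sample ncluster mean stderr;
   strata Region;
   cluster School;
   var y;
   weight w;
run;
```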

DOMAIN Statement

DOMAIN | SUBGROUP variables < variable*variable variable*variable*variable ... > ;

The DOMAIN statement requests analysis for subpopulations, or domains, in addition to analysis for the entire study population. The DOMAIN statement names the variables that identify domains, which are called domain variables.

It is common practice to compute statistics for domains. The formation of these domains may be unrelated to the sample design. Therefore, the sample sizes for the domains are random variables. In order to incorporate this variability into the variance estimation, you should use a DOMAIN statement.

Note that a DOMAIN statement is different from a BY statement. In a BY statement, you treat the sample sizes as fixed in each subpopulation, and you perform analysis within each BY group independently. See the section 'Domain Statistics' on page 4342 for more details.

A domain variable can be either character or numeric. However, the procedure treats domain variables as categorical variables. If a variable appears by itself in a DOMAIN statement, each level of this variable determines a domain in the study population. If two or more variables are joined by asterisks (*), then every possible combination of levels of the variables determines a domain. The procedure performs a descriptive analysis within each domain defined by the domain variables.

The formatted values of the domain variables determine the categorical variable levels. Thus, you can use formats to group values into levels. Refer to the discussion of the FORMAT procedure in the SAS Procedures Guide and to the discussions of the FORMAT statement and SAS formats in SAS Language Reference: Dictionary.
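As a sketch of the DOMAIN statement, the following statements request analysis for each level of one variable and for each combination of levels of two variables. The data set Sample, the variables Gender, Region, y, and w, and the data values are hypothetical.

```sas
/* Hypothetical sample with two candidate domain variables */
data Sample;
   input Gender $ Region $ y w;
   datalines;
M East 12 10
F East 5 10
M West 13 20
F West 23 20
F East 11 10
M West 17 20
F West 9 20
M East 14 10
;

/* Gender alone defines one domain per level; Gender*Region defines
   one domain per combination of levels */
proc surveymeans data=Sample mean stderr;
   var y;
   weight w;
   domain Gender Gender*Region;
run;
```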

RATIO Statement

RATIO < 'label' > variables / variables ;

The RATIO statement requests ratio analysis for means or proportions of analysis variables. A ratio statement names the variables whose means will be used as numerators or denominators in a ratio. Variables appearing before the slash (/), called numerator variables, are used for numerators. Variables appearing after the slash (/), called denominator variables, are used for denominators. These variables can be any number of analysis variables, either continuous or categorical, in the input data set.

You can optionally specify a label for each RATIO statement to identify the ratios in the output. Labels must be enclosed in single quotes.

If a RATIO statement does not have any numerator variable or denominator variable specified, the RATIO statement will be ignored.

A numerator or denominator variable must be an analysis variable. That is, if there is a VAR statement, then a numerator or denominator variable must appear in the VAR statement. If there is no VAR statement, a numerator or denominator variable must be on the default analysis variable list (see the section 'VAR Statement' on page 4332). If a numerator or denominator variable is not an analysis variable, it is ignored.

The computation of ratios depends on whether the numerator and denominator variables are continuous or categorical.

For continuous variables, ratios are calculated with the mean of the variables. For example, for continuous variables X, Y, Z, and T, the following RATIO statement requests the procedure to analyze the ratios x/z, x/t, y/z, and y/t:

  ratio x y / z t;

If a continuous variable appears as both a numerator and a denominator variable, the ratio of this variable itself is ignored.

For categorical variables, ratios are calculated with the proportions for the categories of a categorical variable. For example, if categorical variable Gender has values 'Male' and 'Female', with proportions p_m = Pr(Gender='Male') and p_f = Pr(Gender='Female'), and Y is a continuous variable, then the following RATIO statement requests the procedure to analyze the ratios p_m/p_f, p_f/p_m, y/p_m, and y/p_f:

  ratio Gender y / Gender;

If a categorical variable appears as both a numerator and a denominator variable, then the ratios of the proportions for all categories are computed, except the ratio of each category with itself.

You may have more than one RATIO statement. Each RATIO statement produces ratios independently using its own numerator and denominator variables. Each RATIO statement also produces its own ratio analysis table.

Available statistics for a ratio are as follows:

- N, number of observations used to compute the ratio
- NCLU, number of clusters
- SUMWGT, sum of weights
- RATIO, ratio
- STDERR, standard error of ratio
- VAR, variance of ratio
- T, t-value of ratio
- PROBT, p-value of t
- DF, degrees of freedom of t
- CLM, two-sided confidence limits of ratio
- UCLM, one-sided upper confidence limit of ratio
- LCLM, one-sided lower confidence limit of ratio

The procedure calculates these statistics based on the statistic-keywords (described on page 4326) that you specify in the PROC SURVEYMEANS statement. If a statistic-keyword is not appropriate for a RATIO statement, that statistic-keyword is ignored. If no valid statistics are requested for a RATIO statement, the procedure computes the ratio and its standard error by default.

Note that ratios within a domain are currently not available.

When calculating the means or proportions for the numerator and denominator variables in a ratio, an observation is excluded if it has a missing value in either the continuous numerator variable or the denominator variable. An observation with missing values is also excluded for the categorical numerator or denominator variables, unless the MISSING option is used.
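A complete ratio analysis for two continuous variables can be sketched as follows. The data set Sample, the variables x and y, the label text, and the data values are hypothetical.

```sas
/* Hypothetical sample with two continuous analysis variables */
data Sample;
   input x y;
   datalines;
4 12
2 5
5 13
8 23
3 11
;

/* The numerator variable (y) precedes the slash and the denominator
   variable (x) follows it; both also appear in the VAR statement */
proc surveymeans data=Sample ratio stderr clm;
   var x y;
   ratio 'y per unit of x' y / x;
run;
```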

STRATA Statement

STRATA | STRATUM variables < / option > ;

The STRATA statement names variables that form the strata in a stratified sample design. The combinations of categories of STRATA variables define the strata in the sample.

If your sample design has stratification at multiple stages, you should identify only the first-stage strata in the STRATA statement. See the section 'Specification of Population Totals and Sampling Rates' on page 4334 for more information.

The STRATA variables are one or more variables in the DATA= input data set. These variables can be either character or numeric. The formatted values of the STRATA variables determine the levels. Thus, you can use formats to group values into levels. See the discussion of the FORMAT procedure in the SAS Procedures Guide.

You can specify the following option in the STRATA statement after a slash (/):

LIST

displays a 'Stratum Information' table, which includes values of the STRATA variables and sampling rates for each stratum. This table also provides the number of observations and number of clusters for each stratum and analysis variable. See the section 'Displayed Output' on page 4345 for more details.
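The LIST option can be sketched as follows. The data set Sample, the variables Region and y, and the population total of 30 PSUs per stratum are hypothetical.

```sas
/* Hypothetical stratified sample */
data Sample;
   input Region $ y;
   datalines;
East 12
East 5
East 13
West 23
West 11
West 17
;

/* TOTAL=30 applies the same population total to every stratum;
   LIST requests the 'Stratum Information' table */
proc surveymeans data=Sample total=30 mean;
   strata Region / list;
   var y;
run;
```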

VAR Statement

VAR variables ;

The VAR statement names the variables to be analyzed.

If you want a categorical analysis for a numeric variable, you must also name that variable in the CLASS statement. For categorical variables, PROC SURVEYMEANS estimates the proportion in each category or level, instead of the overall mean. Character variables are always analyzed as categorical variables. See the section 'CLASS Statement' on page 4329 for more information.

If you do not specify a VAR statement, then PROC SURVEYMEANS analyzes all variables in the DATA= input data set, except those named in the BY, CLUSTER, STRATA, and WEIGHT statements.

WEIGHT Statement

WEIGHT | WGT variable ;

The WEIGHT statement names the variable that contains the sampling weights. This variable must be numeric. If you do not specify a WEIGHT statement, PROC SURVEYMEANS assigns all observations a weight of 1. Sampling weights must be positive numbers. If an observation has a weight that is nonpositive or missing, then the procedure omits that observation from the analysis. If you specify more than one WEIGHT statement, the procedure uses only the first WEIGHT statement and ignores the rest.
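The treatment of nonpositive and missing weights can be explored with a sketch such as the following. The data set Sample, the variables y and w, and the data values are hypothetical.

```sas
/* Hypothetical sample: the last two observations have a zero weight
   and a missing weight */
data Sample;
   input y w;
   datalines;
12 10
5 10
13 20
23 0
11 .
;

/* According to the rule described above, the observations with
   nonpositive or missing weights are left out of the analysis */
proc surveymeans data=Sample nobs mean sumwgt;
   var y;
   weight w;
run;
```

Comparing the NOBS and SUMWGT values in the output with the input data is one way to confirm which observations were used.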