Orders the output according to the BY groups.
See also: Creating Titles That Contain BY-Group Information on page 20
BY <DESCENDING> variable-1
< <DESCENDING> variable-n >
<NOTSORTED>;
variable
specifies the variable that the procedure uses to form BY groups. You can specify more than one variable. If you do not use the NOTSORTED option in the BY statement, then the observations in the data set must either be sorted by all the variables that you specify, or they must be indexed appropriately. Variables in a BY statement are called BY variables.
DESCENDING
specifies that the observations are sorted in descending order by the variable that immediately follows the word DESCENDING in the BY statement.
NOTSORTED
specifies that observations are not necessarily sorted in alphabetic or numeric order. The observations are grouped in another way, for example, chronological order.
The requirement for ordering or indexing observations according to the values of BY variables is suspended for BY-group processing when you use the NOTSORTED option. In fact, the procedure does not use an index if you specify NOTSORTED. The procedure defines a BY group as a set of contiguous observations that have the same values for all BY variables. If observations with the same values for the BY variables are not contiguous, then the procedure treats each contiguous set as a separate BY group.
Note: You cannot use the NOTSORTED option in a PROC SORT step.
Note: You cannot use the GROUPFORMAT option, which is available in the BY statement in a DATA step, in a BY statement in any PROC step.
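As a minimal sketch of how NOTSORTED forms BY groups from contiguous observations, the following steps use a hypothetical data set, VISITS, whose observations are in chronological rather than alphabetic order. The data set, its variables, and its values are illustrative assumptions and do not come from the examples in this section.

```sas
/* Hypothetical data: months appear in calendar order, not alphabetic order */
data visits;
   input Month $ Count;
   datalines;
Jan 12
Jan 15
Feb 9
Mar 20
Mar 18
Jan 7
;
run;

/* NOTSORTED: each run of contiguous, equal BY values forms one BY group */
proc print data=visits noobs;
   by month notsorted;
run;
```

Under the definition above, this step should produce four BY groups (Jan, Feb, Mar, and a second Jan group), because the last Jan observation is not contiguous with the first two. Without NOTSORTED, the same step would be expected to stop with an error, because the data are not in ascending order of Month.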
Procedures create output for each BY group. For example, the elementary statistics procedures and the scoring procedures perform separate analyses for each BY group. The reporting procedures produce a report for each BY group.
Note: All base SAS procedures except PROC PRINT process BY groups independently. PROC PRINT can report the number of observations in each BY group as well as the number of observations in all BY groups. Similarly, PROC PRINT can sum numeric variables in each BY group and across all BY groups.
You can use only one BY statement in each PROC step. When you use a BY statement, the procedure expects an input data set that is sorted by the order of the BY variables or one that has an appropriate index. If your input data set does not meet these criteria, then an error occurs. Either sort it with the SORT procedure or create an appropriate index on the BY variables.
Depending on the order of your data, you may need to use the NOTSORTED or DESCENDING option in the BY statement in the PROC step. For more information on
the BY statement, see SAS Language Reference: Dictionary.
PROC SORT, see Chapter 44, The SORT Procedure, on page 1017.
creating indexes, see INDEX CREATE Statement on page 341.
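The two ways of preparing data for a BY statement described above (sorting, or creating an index) can be sketched as follows. The SCORES data set and its variables are hypothetical and are used only for illustration.

```sas
/* Hypothetical data set that is not sorted by Team */
data scores;
   input Team $ Score;
   datalines;
B 82
A 91
B 77
A 68
;
run;

/* Option 1: sort by the BY variable, then use the same BY statement */
proc sort data=scores out=scores_sorted;
   by team;
run;

proc means data=scores_sorted n mean;
   by team;
   var score;
run;

/* Option 2: create a simple index on the BY variable instead of sorting */
proc datasets library=work nolist;
   modify scores;
   index create team;
quit;

proc means data=scores n mean;
   by team;
   var score;
run;
```

Sorting to a new data set with OUT= leaves the original order intact, which is useful when that order carries meaning. The index approach avoids creating a second copy of the data, although reading through an index can be slower than reading sorted data.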
When a procedure is submitted with a BY statement, the following actions are taken with respect to processing of BY groups:
The procedure determines whether the data is sorted by the internal (unformatted) values of the BY variable(s).
The procedure determines whether a format has been applied to the BY variable(s). If the BY variable is numeric and has no user-applied format, then the BEST12. format is applied for the purpose of BY-group processing.
The procedure continues adding observations to the current BY group until both the internal and the formatted values of the BY variable(s) change.
This process can have unexpected results if, for instance, nonconsecutive internal BY values share the same formatted value. In this case, the formatted value is represented in different BY groups. Alternatively, if different consecutive internal BY values share the same formatted value, then these observations are grouped into the same BY group.
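The effect of a format on BY-group boundaries can be sketched with a hypothetical format, AGEGRP., and a hypothetical data set, PEOPLE. Neither appears elsewhere in this section, and the values are invented for illustration.

```sas
/* Hypothetical format that maps several internal values to one label */
proc format;
   value agegrp low-12  = 'Child'
                13-19   = 'Teen'
                20-high = 'Adult';
run;

/* Observations are already in ascending order of the internal Age values */
data people;
   input Age Height;
   datalines;
8 130
11 142
14 160
17 171
25 168
40 175
;
run;

/* BY groups follow the formatted values of Age, not each distinct age */
proc means data=people n mean;
   by age;
   format age agegrp.;
   var height;
run;
```

Following the rules above, consecutive ages that share a formatted value should fall into the same BY group, so this step is expected to report three BY groups (Child, Teen, and Adult) rather than six. Removing the FORMAT statement should give one BY group per distinct age.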
CALENDAR
CHART
COMPARE
CORR
FREQ
MEANS
PLOT
RANK
REPORT (nonwindowing environment only)
SORT (required)
STANDARD
SUMMARY
TABULATE
TIMEPLOT
TRANSPOSE
UNIVARIATE
Note: In the SORT procedure, the BY statement specifies how to sort the data. With the other procedures, the BY statement specifies how the data are currently sorted.
This example uses a BY statement in a PROC PRINT step. There is output for each value of the BY variable, Year. The DEBATE data set is created in Example: Temporarily Dissociating a Format from a Variable on page 29.
options nodate pageno=1 linesize=64 pagesize=40;

proc print data=debate noobs;
   by year;
   title 'Printing of Team Members';
   title2 'by Year';
run;
                    Printing of Team Members                   1
                             by Year

------------------------ Year=Freshman -------------------------

                  Name        Gender     GPA

                  Capiccio      m       3.598
                  Tucker        m       3.901

------------------------ Year=Sophomore ------------------------

                  Name        Gender     GPA

                  Bagwell       f       3.722
                  Berry         m       3.198
                  Metcalf       m       3.342

------------------------- Year=Junior --------------------------

                  Name        Gender     GPA

                  Gold          f       3.609
                  Gray          f       3.177
                  Syme          f       3.883

------------------------- Year=Senior --------------------------

                  Name        Gender     GPA

                  Baglione      f       4.000
                  Carr          m       3.750
                  Hall          m       3.574
                  Lewis         m       3.421
Treats observations as if they appear multiple times in the input data set.
Tip: You can use a WEIGHT statement and a FREQ statement in the same step of any procedure that supports both statements.
FREQ variable ;
variable
specifies a numeric variable whose value represents the frequency of the observation. If you use the FREQ statement, then the procedure assumes that each observation represents n observations, where n is the value of variable. If variable is not an integer, then SAS truncates it. If variable is less than 1 or is missing, then the procedure does not use that observation to calculate statistics. If a FREQ statement does not appear, then each observation has a default frequency of 1.
The sum of the frequency variable represents the total number of observations.
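The rules for noninteger, small, and missing frequency values can be tried out with a small sketch. The COUNTS data set and its variables X and F are hypothetical.

```sas
/* Hypothetical data: F is the frequency variable */
data counts;
   input X F;
   datalines;
10 2.9
20 0.5
30 .
40 3
;
run;

/* 2.9 is truncated to 2; 0.5 (less than 1) and missing are not used */
proc means data=counts n mean;
   var x;
   freq f;
run;
```

According to the rules above, only the first and last observations should contribute, with frequencies 2 and 3, so one would expect N to be 5 and the mean to be (10*2 + 40*3)/5 = 28.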
CORR
MEANS/SUMMARY
REPORT
STANDARD
TABULATE
UNIVARIATE
The data in this example represent a ship's course and speed (in nautical miles per hour), recorded every hour. The frequency variable, Hours, represents the number of hours that the ship maintained the same course and speed. Each of the following PROC MEANS steps calculates average course and speed. The different results demonstrate the effect of using Hours as a frequency variable.
The following PROC MEANS step does not use a frequency variable:
options nodate pageno=1 linesize=64 pagesize=40;

data track;
   input Course Speed Hours @@;
   datalines;
30  4  8 50  7 20 75 10 30 30  8 10
80  9 22 20  8 25 83 11  6 20  6 20
;

proc means data=track maxdec=2 n mean;
   var course speed;
   title 'Average Course and Speed';
run;
Without a frequency variable, each observation has a frequency of 1, and the total number of observations is 8.
                    Average Course and Speed                   1

                      The MEANS Procedure

                 Variable     N            Mean
                 -----------------------------
                 Course       8           48.50
                 Speed        8            7.88
                 -----------------------------
The second PROC MEANS step uses Hours as a frequency variable:
proc means data=track maxdec=2 n mean;
   var course speed;
   freq hours;
   title 'Average Course and Speed';
run;
When you use Hours as a frequency variable, the frequency of each observation is the value of Hours, and the total number of observations is 141 (the sum of the values of the frequency variable).
                    Average Course and Speed                   1

                      The MEANS Procedure

              Variable       N            Mean
              ----------------------------------------
              Course       141           49.28
              Speed        141            8.06
              ----------------------------------------
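The frequency-weighted means can be checked by hand from the TRACK data. Writing $f_i$ for the value of Hours and $x_i$ for the analysis variable in the $i$th observation (notation introduced here for illustration), the arithmetic works out as follows:

```latex
\[
\sum_i f_i = 8 + 20 + 30 + 10 + 22 + 25 + 6 + 20 = 141
\]
\[
\bar{x}_{\text{Course}} = \frac{\sum_i f_i x_i}{\sum_i f_i}
  = \frac{6948}{141} \approx 49.28,
\qquad
\bar{x}_{\text{Speed}} = \frac{\sum_i f_i x_i}{\sum_i f_i}
  = \frac{1136}{141} \approx 8.06
\]
```

For comparison, the unweighted means are $388/8 = 48.50$ and $63/8 \approx 7.88$, which match the output of the first PROC MEANS step.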
Executes any statements that have not executed and ends the procedure.
QUIT ;
CATALOG
DATASETS
PLOT
PMENU
SQL
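As a minimal sketch of where QUIT fits in an interactive procedure, the following PROC SQL step runs two queries and then ends the procedure. It assumes that the SASHELP.CLASS sample data set is available in your installation; the column alias N_Students is made up for this example.

```sas
proc sql;
   /* Each statement executes as soon as it is submitted */
   select name, age
      from sashelp.class
      where age > 14;

   select count(*) as N_Students
      from sashelp.class;
quit;   /* ends PROC SQL; a RUN statement does not end the procedure */
```

Without the QUIT statement, the procedure stays active until another PROC or DATA step begins.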
Specifies weights for analysis variables in the statistical calculations.
Tip: You can use a WEIGHT statement and a FREQ statement in the same step of any procedure that supports both statements.
WEIGHT variable ;
variable
specifies a numeric variable whose values weight the values of the analysis variables. The values of the variable do not have to be integers. The behavior of the procedure when it encounters a nonpositive weight variable value is as follows:
Weight value | The procedure |
---|---|
0 | counts the observation in the total number of observations |
less than 0 | converts the weight value to zero and counts the observation in the total number of observations |
missing | excludes the observation from the analysis |
Different behavior for nonpositive values is discussed in the WEIGHT statement syntax under the individual procedure.
Prior to Version 7 of SAS, no base SAS procedure excluded the observations with missing weights from the analysis. Most SAS/STAT procedures, such as PROC GLM, have always excluded not only missing weights but also negative and zero weights from the analysis. You can achieve this same behavior in a base SAS procedure that supports the WEIGHT statement by using the EXCLNPWGT option in the PROC statement.
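The difference that EXCLNPWGT makes can be sketched with a hypothetical data set, WTS, that contains a positive, a zero, a negative, and a missing weight. The data set and variable names are assumptions made for this illustration.

```sas
/* Hypothetical data: W is the weight variable */
data wts;
   input X W;
   datalines;
10 2
20 0
30 -1
40 .
50 1
;
run;

/* Default: zero and negative weights still count toward N */
proc means data=wts n mean;
   var x;
   weight w;
run;

/* EXCLNPWGT: observations with nonpositive weights are excluded */
proc means data=wts n mean exclnpwgt;
   var x;
   weight w;
run;
```

Under the rules in the table above, the first step should count four observations (only the missing weight is excluded), and the second step should count two. The weighted mean should be the same in both steps, because zero weights contribute nothing to the weighted sum.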
The procedure substitutes the value of the WEIGHT variable for $w_i$, which appears in Keywords and Formulas on page 1354.
CORR
FREQ
MEANS/SUMMARY
REPORT
STANDARD
TABULATE
UNIVARIATE
Note: In PROC FREQ, the value of the variable in the WEIGHT statement represents the frequency of occurrence for each observation. See the PROC FREQ documentation in Volume 3 of this book for more information.
The procedures that support the WEIGHT statement also support the VARDEF= option, which lets you specify a divisor to use in the calculation of the variance and standard deviation.
By using a WEIGHT statement to compute moments, you assume that the $i$th observation has a variance that is equal to $\sigma^2 / w_i$. When you specify VARDEF=DF (the default), the computed variance is a weighted least squares estimate of $\sigma^2$. Similarly, the computed standard deviation is an estimate of $\sigma$. Note that the computed variance is not an estimate of the variance of the $i$th observation, because this variance involves the observation's weight, which varies from observation to observation.
If the values of your variable are counts that represent the number of occurrences of each observation, then use this variable in the FREQ statement rather than in the WEIGHT statement. In this case, because the values are counts, they should be integers. (The FREQ statement truncates any noninteger values.) The variance that is computed with a FREQ variable is an estimate of the common variance, $\sigma^2$, of the observations.
Note: If your data come from a stratified sample where the weights represent the strata weights, then neither the WEIGHT statement nor the FREQ statement provides appropriate stratified estimates of the mean, variance, or variance of the mean. To perform the appropriate analysis, consider using PROC SURVEYMEANS, which is a SAS/STAT procedure that is documented in the SAS/STAT User's Guide.
As an example of the WEIGHT statement, suppose 20 people are asked to estimate the size of an object 30 cm wide. Each person is placed at a different distance from the object. As the distance from the object increases, the estimates should become less precise.
The SAS data set SIZE contains the estimate (ObjectSize) in centimeters at each distance (Distance) in meters and the precision (Precision) for each estimate. Notice that the largest deviation (an overestimate by 20 cm) came at the greatest distance (7.5 meters from the object). As a measure of precision, 1/Distance gives more weight to estimates that were made closer to the object and less weight to estimates that were made at greater distances.
The following statements create the data set SIZE:
options nodate pageno=1 linesize=64 pagesize=60;

data size;
   input Distance ObjectSize @@;
   Precision=1/distance;
   datalines;
1.5 30 1.5 20 1.5 30 1.5 25
3   43 3   33 3   25 3   30
4.5 25 4.5 36 4.5 48 4.5 33
6   43 6   36 6   23 6   48
7.5 30 7.5 25 7.5 50 7.5 38
;
The following PROC MEANS step computes the average estimate of the object size while ignoring the weights. Without a WEIGHT variable, PROC MEANS uses the default weight of 1 for every observation. Thus, the estimates of object size at all distances are given equal weight. The average estimate of the object size exceeds the actual size by 3.55 cm.
proc means data=size maxdec=3 n mean var stddev;
   var objectsize;
   title1 'Unweighted Analysis of the SIZE Data Set';
run;
            Unweighted Analysis of the SIZE Data Set           1

                      The MEANS Procedure

                 Analysis Variable : ObjectSize

         N            Mean        Variance         Std Dev
        -------------------------------------------------
        20          33.550          80.892           8.994
        -------------------------------------------------
The next two PROC MEANS steps use the precision measure (Precision) in the WEIGHT statement and show the effect of using different values of the VARDEF= option. The first PROC step creates an output data set that contains the variance and standard deviation. If you reduce the weighting of the estimates that are made at greater distances, the weighted average estimate of the object size is closer to the actual size.
proc means data=size maxdec=3 n mean var stddev;
   weight precision;
   var objectsize;
   output out=wtstats var=Est_SigmaSq std=Est_Sigma;
   title1 'Weighted Analysis Using Default VARDEF=DF';
run;

proc means data=size maxdec=3 n mean var std vardef=weight;
   weight precision;
   var objectsize;
   title1 'Weighted Analysis Using VARDEF=WEIGHT';
run;
In the first PROC MEANS step, the variance is an estimate of $\sigma^2$, where the variance of the $i$th observation is assumed to be $\mathrm{var}(x_i) = \sigma^2 / w_i$ and $w_i$ is the weight for the $i$th observation. In the second PROC MEANS step, the computed variance is an estimate of $\frac{n-1}{n}\,\sigma^2 / \bar{w}$, where $\bar{w}$ is the average weight. For large $n$, this is an approximate estimate of the variance of an observation with average weight.
           Weighted Analysis Using Default VARDEF=DF           1

                      The MEANS Procedure

                 Analysis Variable : ObjectSize

         N            Mean        Variance         Std Dev
        -------------------------------------------------
        20          31.088          20.678           4.547
        -------------------------------------------------
             Weighted Analysis Using VARDEF=WEIGHT             2

                      The MEANS Procedure

                 Analysis Variable : ObjectSize

         N            Mean        Variance         Std Dev
        -------------------------------------------------
        20          31.088          64.525           8.033
        -------------------------------------------------
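The relationship between the two variances can be made explicit. The following is a sketch that uses the standard weighted formulas, with $x_i$ for ObjectSize, $w_i$ for Precision, and $n = 20$. The subscripted names $s^2_{\mathrm{DF}}$ and $s^2_{\mathrm{WEIGHT}}$ are notation introduced here, not SAS keywords.

```latex
\[
\bar{x}_w = \frac{\sum_i w_i x_i}{\sum_i w_i},
\qquad
s^2_{\mathrm{DF}} = \frac{\sum_i w_i (x_i - \bar{x}_w)^2}{n - 1},
\qquad
s^2_{\mathrm{WEIGHT}} = \frac{\sum_i w_i (x_i - \bar{x}_w)^2}{\sum_i w_i}
\]
\[
s^2_{\mathrm{WEIGHT}}
  = \frac{n - 1}{\sum_i w_i}\, s^2_{\mathrm{DF}}
  = \frac{n - 1}{n} \cdot \frac{s^2_{\mathrm{DF}}}{\bar{w}},
\qquad
\bar{w} = \frac{1}{n} \sum_i w_i
\]
```

For the SIZE data, $\sum_i w_i = 4\left(\tfrac{1}{1.5} + \tfrac{1}{3} + \tfrac{1}{4.5} + \tfrac{1}{6} + \tfrac{1}{7.5}\right) \approx 6.089$, so $s^2_{\mathrm{WEIGHT}} \approx (19 / 6.089) \times 20.678 \approx 64.52$. This agrees with the printed value of 64.525 up to the rounding of the displayed variance.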
The following statements create and print a data set with the weighted variance and weighted standard deviation of each observation. The DATA step combines the output data set that contains the variance and the standard deviation from the weighted analysis with the original data set. The variance of each observation is computed by dividing Est_SigmaSq, the estimate of $\sigma^2$ from the weighted analysis when VARDEF=DF, by each observation's weight (Precision). The standard deviation of each observation is computed by dividing Est_Sigma, the estimate of $\sigma$ from the weighted analysis when VARDEF=DF, by the square root of each observation's weight (Precision).
data wtsize(drop=_freq_ _type_);
   set size;
   /* read the single observation of weighted statistics once; its values are retained */
   if _n_=1 then set wtstats;
   Est_VarObs=est_sigmasq/precision;
   Est_StdObs=est_sigma/sqrt(precision);
run;

proc print data=wtsize noobs;
   title 'Weighted Statistics';
   by distance;
   format est_varobs est_stdobs est_sigmasq est_sigma precision 6.3;
run;
                      Weighted Statistics                      4

------------------------- Distance=1.5 ------------------------

   Object                   Est_      Est_       Est_      Est_
     Size    Precision    SigmaSq     Sigma     VarObs    StdObs

       30      0.667       20.678     4.547     31.017     5.569
       20      0.667       20.678     4.547     31.017     5.569
       30      0.667       20.678     4.547     31.017     5.569
       25      0.667       20.678     4.547     31.017     5.569

-------------------------- Distance=3 -------------------------

   Object                   Est_      Est_       Est_      Est_
     Size    Precision    SigmaSq     Sigma     VarObs    StdObs

       43      0.333       20.678     4.547     62.035     7.876
       33      0.333       20.678     4.547     62.035     7.876
       25      0.333       20.678     4.547     62.035     7.876
       30      0.333       20.678     4.547     62.035     7.876

------------------------- Distance=4.5 ------------------------

   Object                   Est_      Est_       Est_      Est_
     Size    Precision    SigmaSq     Sigma     VarObs    StdObs

       25      0.222       20.678     4.547     93.052     9.646
       36      0.222       20.678     4.547     93.052     9.646
       48      0.222       20.678     4.547     93.052     9.646
       33      0.222       20.678     4.547     93.052     9.646

-------------------------- Distance=6 -------------------------

   Object                   Est_      Est_       Est_      Est_
     Size    Precision    SigmaSq     Sigma     VarObs    StdObs

       43      0.167       20.678     4.547     124.07    11.139
       36      0.167       20.678     4.547     124.07    11.139
       23      0.167       20.678     4.547     124.07    11.139
       48      0.167       20.678     4.547     124.07    11.139

------------------------- Distance=7.5 ------------------------

   Object                   Est_      Est_       Est_      Est_
     Size    Precision    SigmaSq     Sigma     VarObs    StdObs

       30      0.133       20.678     4.547     155.09    12.453
       25      0.133       20.678     4.547     155.09    12.453
       50      0.133       20.678     4.547     155.09    12.453
       38      0.133       20.678     4.547     155.09    12.453
Subsets the input data set by specifying certain conditions that each observation must meet before it is available for processing.
WHERE where-expression ;
where-expression
is a valid arithmetic or logical expression that generally consists of a sequence of operands and operators. See SAS Language Reference: Dictionary for more information on WHERE processing.
You can use the WHERE statement with any of the following base SAS procedures that read a SAS data set:
CALENDAR
CHART
COMPARE
CORR
DATASETS (APPEND statement)
FREQ
MEANS/SUMMARY
PLOT
RANK
REPORT
SORT
SQL
STANDARD
TABULATE
TIMEPLOT
TRANSPOSE
UNIVARIATE
The CALENDAR and COMPARE procedures and the APPEND statement in PROC DATASETS accept more than one input data set. See the documentation for the specific procedure for more information.
To subset the output data set, use the WHERE= data set option:
proc report data=debate nowd out=onlyfr(where=(year='1')); run;
For more information on WHERE=, see SAS Language Reference: Dictionary.
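The difference between subsetting the input with a WHERE statement and subsetting the output with the WHERE= data set option can be sketched as follows. The TEAM data set is a hypothetical stand-in created here so that the steps are self-contained; it is not the DEBATE data set used in the examples.

```sas
/* Hypothetical data set */
data team;
   input Name $ Gender $ GPA;
   datalines;
Gold f 3.609
Gray f 3.177
Carr m 3.750
Hall m 3.574
;
run;

/* WHERE statement: subsets the observations that the procedure reads */
proc means data=team n mean maxdec=3;
   where gender='f' and gpa>3.5;
   var gpa;
run;

/* WHERE= on the output data set: subsets what the procedure writes */
proc sort data=team out=high_gpa(where=(gpa>3.5));
   by name;
run;
```

In the first step, only observations that satisfy both conditions reach PROC MEANS. In the second step, PROC SORT reads every observation, and the condition is applied only as observations are written to HIGH_GPA.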
In this example, PROC PRINT prints only those observations that meet the condition of the WHERE expression. The DEBATE data set is created in Example: Temporarily Dissociating a Format from a Variable on page 29.
options nodate pageno=1 linesize=64 pagesize=40; proc print data=debate noobs; where gpa>3.5; title 'Team Members with a GPA'; title2 'Greater than 3.5'; run;
Team Members with a GPA 1 Greater than 3.5 Name Gender Year GPA Capiccio m Freshman 3.598 Tucker m Freshman 3.901 Bagwell f Sophomore 3.722 Gold f Junior 3.609 Syme f Junior 3.883 Baglione f Senior 4.000 Carr m Senior 3.750 Hall m Senior 3.574