PROC GAM < option > ;
CLASS variables ;
MODEL dependent = < PARAM(effects) >
smoothing effects < /options > ;
SCORE data=SAS-data-set out=SAS-data-set ;
OUTPUT < out=SAS-data-set > keyword < ... keyword > ;
BY variables ;
ID variables ;
FREQ variable ;
The syntax of the GAM procedure is similar to that of other regression procedures in the SAS System. The PROC GAM and MODEL statements are required. The SCORE statement can appear multiple times; all other statements appear only once.
The syntax for PROC GAM is described in the following sections in alphabetical order after the description of the PROC GAM statement.
PROC GAM < option > ;
The PROC GAM statement invokes the procedure. You can specify the following option.
DATA= SAS-data-set
specifies the SAS data set to be read by PROC GAM. The default value is the most recently created data set.
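As a minimal sketch of the two required statements, the following fits a single smoothing spline. The data set name mydata and the variables x and y are hypothetical and are simulated here only so that the example is self-contained.

```sas
/* Simulated illustration data (hypothetical names: mydata, x, y) */
data mydata;
   call streaminit(1);
   do i = 1 to 100;
      x = rand('uniform') * 10;
      y = sin(x) + rand('normal') * 0.3;
      output;
   end;
   drop i;
run;

/* PROC GAM and MODEL are the only required statements.  */
/* DATA= names the input data set explicitly instead of  */
/* relying on the most recently created data set.        */
proc gam data=mydata;
   model y = spline(x);
run;
```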
BY variables ;
You can specify a BY statement with PROC GAM to obtain separate analyses on observations in groups defined by the BY variables. When a BY statement appears, the procedure expects the input data set to be sorted in the order of the BY variables.
If your input data set is not sorted in ascending order, use one of the following alternatives:
- Sort the data using the SORT procedure with a similar BY statement.
- Specify the BY statement option NOTSORTED or DESCENDING in the BY statement for the GAM procedure. The NOTSORTED option does not mean that the data are unsorted but rather that the data are arranged in groups (according to values of the BY variables) and that these groups are not necessarily in alphabetical or increasing numeric order.
- Create an index for the BY variables using the DATASETS procedure.
For more information on the BY statement, refer to the discussion in SAS Language Reference: Concepts. For more information on the DATASETS procedure, refer to the discussion in the SAS Procedures Guide.
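The first alternative (sorting before the analysis) might look like the following sketch. The data set trial and the variables clinic, dose, and response are hypothetical; the data are simulated with the groups deliberately out of order.

```sas
/* Simulated data with BY groups in non-ascending order (B before A) */
data trial;
   call streaminit(3);
   do clinic = 'B', 'A';
      do i = 1 to 50;
         dose = rand('uniform') * 10;
         response = sqrt(dose) + rand('normal') * 0.2;
         output;
      end;
   end;
   drop i;
run;

/* Sort with the same BY variables that PROC GAM will use */
proc sort data=trial out=trial_sorted;
   by clinic;
run;

/* One separate analysis per value of clinic */
proc gam data=trial_sorted;
   by clinic;
   model response = spline(dose);
run;
```

Because the two clinic groups are already contiguous in trial, specifying `by clinic notsorted;` on the unsorted data set would be another option here.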
CLASS variables ;
The CLASS statement names the classification variables to be used in the analysis. Typical class variables are TREATMENT, SEX, RACE, GROUP, and REPLICATION. If the CLASS statement is used, it must appear before the MODEL statement.
Classification variables can be either character or numeric. Class levels are determined from the formatted values of the CLASS variables. Thus, you can use formats to group values into levels. Refer to the discussion of the FORMAT procedure in the SAS Procedures Guide, and the discussions for the FORMAT statement and SAS formats in SAS Language Reference: Dictionary.
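Grouping values into class levels with a format could be sketched as follows. The format name agegrp, the data set patients, and its variables are all hypothetical and are simulated only for illustration.

```sas
/* Simulated data: numeric age, continuous dose, response y */
data patients;
   call streaminit(4);
   do i = 1 to 100;
      age  = 20 + floor(rand('uniform') * 50);
      dose = rand('uniform') * 10;
      y    = 0.5 * (age >= 40) + log(1 + dose) + rand('normal') * 0.2;
      output;
   end;
   drop i;
run;

/* User-defined format that collapses age into two levels */
proc format;
   value agegrp low -< 40  = 'Under 40'
                40 - high  = '40 and over';
run;

/* Class levels come from the formatted values, so age has two levels. */
/* CLASS appears before MODEL, as required.                            */
proc gam data=patients;
   class age;
   format age agegrp.;
   model y = param(age) spline(dose);
run;
```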
FREQ variable ;
The FREQ statement names a variable that provides frequencies for each observation in the DATA= data set. Specifically, if n is the value of the FREQ variable for a given observation, then that observation is used n times.
The analysis produced using a FREQ statement reflects the expanded number of observations. You can produce the same analysis (without the FREQ statement) by first creating a new data set that contains the expanded number of observations. For example, if the value of the FREQ variable is 5 for the first observation, the first five observations in the new data set are identical. Each observation in the old data set is replicated n_i times in the new data set, where n_i is the value of the FREQ variable for that observation.
If the value of the FREQ variable is missing or is less than 1, the observation is not used in the analysis. If the value is not an integer, only the integer portion is used.
The FREQ statement is not available when a loess smoother is included in the model.
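The equivalence between a FREQ variable and an expanded data set can be sketched as follows. The data set counts and the variables x, y, and f are hypothetical; the DATA step that builds expanded applies the rules stated above (observations with missing or less-than-1 frequencies are dropped, and only the integer portion is used).

```sas
/* Simulated data with a frequency variable f */
data counts;
   call streaminit(5);
   do x = 1 to 30;
      y = log(x) + rand('normal') * 0.1;
      f = 1 + mod(x, 3);        /* frequencies 1, 2, or 3               */
      if x = 10 then f = 2.7;   /* non-integer: integer part 2 is used  */
      if x = 20 then f = .;     /* missing: observation is not used     */
      output;
   end;
run;

/* Analysis using the FREQ statement (spline only; FREQ is not  */
/* available when the model contains a loess smoother)          */
proc gam data=counts;
   freq f;
   model y = spline(x);
run;

/* Expanded data set: each observation is written floor(f) times. */
/* A missing f compares as less than 1, so it is skipped.         */
data expanded;
   set counts;
   if f >= 1 then do k = 1 to floor(f);
      output;
   end;
   drop k f;
run;

/* Same model on the expanded data, without the FREQ statement */
proc gam data=expanded;
   model y = spline(x);
run;
```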
ID variables ;
The variables in the ID statement are copied from the input data set to the OUT= data set. If you omit the ID statement, only the variables used in the MODEL statement and requested statistics are included in the output data set.
MODEL dependent= < PARAM(effects) >< smoothing effects >< /options > ;
MODEL events/trials= < PARAM(effects) >< smoothing effects >< /options > ;
The MODEL statement specifies the dependent variable and the independent effects you want to use to model its values. Specify the independent parametric variables inside the parentheses of PARAM( ). The parametric variables can be either CLASS variables or continuous variables. Class variables must be declared with a CLASS statement. Interactions between variables can also be included as parametric effects. The syntax for the specification of effects is the same as for the GLM procedure.
Any number of smoothing effects can be specified, as follows:
Smoothing Effect | Meaning |
---|---|
SPLINE(variable < , df=number > ) | fit smoothing spline with the variable and with DF=number |
LOESS(variable < , df=number > ) | fit local regression with the variable and with DF=number |
SPLINE2(variable, variable < ,df=number > ) | fit bivariate thin-plate smoothing spline with DF=number |
If you do not specify the DF=number option with a smoothing effect, DF=4 is used by default, unless you specify the METHOD=GCV model option. Note that for univariate spline components, a degree of freedom is removed by default to account for the linear portion of the model, so the value displayed in the Fit Summary and Analysis of Deviance tables will be one less than the value you specify.
Both parametric effects and smoothing effects are optional, but at least one of them must be present.
If only parametric variables are present, PROC GAM fits a parametric linear model using the terms inside the parentheses of PARAM( ). If only smoothing effects are present, PROC GAM fits a nonparametric additive model. If both types of effect are present, PROC GAM fits a semiparametric model using the parametric effects as the linear part of the model.
The following table shows how to specify various models for a dependent variable y and independent variables x, x1, and x2.
Type of Model | Syntax | Mathematical Form |
---|---|---|
Parametric | model y = param(x); | $E(y) = \beta_0 + \beta_1 x$ |
Nonparametric | model y = spline(x); | $E(y) = \beta_0 + s(x)$ |
Nonparametric | model y = loess(x); | $E(y) = \beta_0 + s(x)$ |
Semiparametric | model y = param(x1) spline(x2); | $E(y) = \beta_0 + \beta_1 x_1 + s(x_2)$ |
Additive | model y = spline(x1) spline(x2); | $E(y) = \beta_0 + s_1(x_1) + s_2(x_2)$ |
Thin-plate spline | model y = spline2(x1,x2); | $E(y) = \beta_0 + s(x_1, x_2)$ |
You can specify the following options in the MODEL statement.
ALPHA= number
specifies the significance level α of the confidence limits on the final nonparametric component estimates when you request confidence limits to be included in the output data set. Specify number as a value between 0 and 1. The default value is 0.05. See the OUTPUT Statement section for more information on the OUTPUT statement.
DIST= distribution-id
specifies the distribution family used in the model. The distribution-id can be one of GAUSSIAN, BINOMIAL, BINARY, GAMMA, IGAUSSIAN, or POISSON. Only the canonical link for each distribution family is implemented in PROC GAM: although alternative links are theoretically possible, the final fit of a nonparametric model is relatively insensitive to the precise choice of link function. The loess smoother is not available for DIST=BINOMIAL when the number of trials is greater than 1.
EPSILON= number
specifies the convergence criterion for the backfitting algorithm. The default value is 1E-8.
EPSSCORE= number
specifies the convergence criterion for the local score algorithm. The default value is 1E-8.
ITPRINT
produces an iteration summary table for the smoothing effects.
MAXITER= number
specifies the maximum number of iterations for the backfitting algorithm. The default value is 50.
MAXITSCORE= number
specifies the maximum number of iterations for the local score algorithm. The default value is 100.
METHOD=GCV
specifies that the value of the smoothing parameter should be selected by generalized cross validation. If you specify both METHOD=GCV and the DF= option for the smoothing effects, the user-specified DF= is used, and the METHOD=GCV option is ignored. See the Selection of Smoothing Parameters section for more details on the GCV method.
NOTEST
requests that the procedure not produce the Analysis of Deviance table. This option reduces the running time of the procedure.
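The following sketch combines several of the MODEL options described above. The data set doses and its variables are hypothetical and simulated; the specific option values (DF=3, MAXITSCORE=200) are arbitrary choices for illustration, not recommendations.

```sas
/* Simulated data: a continuous response and an events/trials response */
data doses;
   call streaminit(2);
   do dose = 1 to 30;
      trials = 20;
      p      = 1 / (1 + exp(-(dose - 15) / 4));
      events = rand('binomial', p, trials);
      resp   = log(dose) + rand('normal') * 0.1;
      output;
   end;
   drop p;
run;

/* Binomial response in events/trials form. A spline is used because */
/* the loess smoother is not available when trials exceed 1.         */
proc gam data=doses;
   model events/trials = spline(dose, df=3)
         / dist=binomial maxitscore=200 itprint;
run;

/* No DF= is given, so METHOD=GCV selects the smoothing parameter. */
/* NOTEST skips the Analysis of Deviance table.                    */
proc gam data=doses;
   model resp = spline(dose) / method=gcv notest;
run;

/* DF= is given, so it takes precedence and METHOD=GCV is ignored. */
proc gam data=doses;
   model resp = spline(dose, df=3) / method=gcv;
run;
```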
OUTPUT OUT= SAS-data-set < keyword ... keyword > ;
The OUTPUT statement creates a new SAS data set containing diagnostic measures calculated after fitting the model.
You can request a variety of diagnostic measures that are calculated for each observation in the data set. The new data set contains the variables specified in the MODEL statement in addition to the requested variables. If no keyword is present, the data set contains only the predicted values.
Details on the specifications in the OUTPUT statement are as follows.
OUT= SAS-data-set
specifies the name of the new data set to contain the diagnostic measures. This specification is required.
keyword
specifies the statistics to include in the output data set. The keywords and the statistics they represent are as follows:
Keyword | Statistic |
---|---|
PREDICTED | predicted values for each smoothing component and overall predicted values at design points |
UCLM | upper confidence limits for each predicted smoothing component |
LCLM | lower confidence limits for each predicted smoothing component |
ADIAG | diagonal element of the hat matrix associated with the observation for each smoothing spline component |
RESIDUAL | residual standardized by its weights |
STD | standard deviation of the prediction for each smoothing component |
ALL | implies all preceding keywords |
The names of the new variables that contain the statistics are formed by using a prefix of one or more characters that identify the statistic, followed by an underscore (_), followed by the variable name.
The prefixes of the new variables are as follows:
Keywords | Prefix |
---|---|
PRED | P_ |
UCLM | UCLM_ |
LCLM | LCLM_ |
ADIAG | ADIAG_ |
RESID | R_ |
STD | STD_ for spline, STDP_ for loess |
For example, suppose that you have a dependent variable y and an independent smoothing variable x, and you specify the keywords PRED and ADIAG. In this case, the output SAS data set will contain the variables P_y, P_x, and ADIAG_x.
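The naming convention above can be checked with a sketch like the following. The data set mydata, the ID variable obs_id, and the output data set name diag are hypothetical. The keyword is spelled PREDICTED as in the keyword table; the text above refers to the same statistic as PRED.

```sas
/* Simulated data with an identifier that is not part of the model */
data mydata;
   call streaminit(1);
   do obs_id = 1 to 100;
      x = rand('uniform') * 10;
      y = sin(x) + rand('normal') * 0.3;
      output;
   end;
run;

proc gam data=mydata;
   id obs_id;                          /* copied to the OUT= data set */
   model y = spline(x);
   output out=diag predicted adiag;    /* OUT= is required            */
run;

/* List the variables created; per the text above these should */
/* include P_y, P_x, and ADIAG_x                                */
proc contents data=diag;
run;
```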
SCORE DATA= SAS-data-set OUT= SAS-data-set ;
The SCORE statement calculates predicted values for a new data set. The variables generated by the SCORE statement use the same naming conventions with prefixes as the OUTPUT statement. If you have multiple data sets to predict, you can specify multiple SCORE statements. You must use a SCORE statement for each data set.
The following options must be specified in the SCORE statement.
DATA= SAS-data-set
specifies an input SAS data set containing all the variables included in independent effects in the MODEL statement. The predicted response is computed for each observation in the SCORE DATA= data set.
OUT= SAS-data-set
specifies the name of the SAS data set to contain the predictions.
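Scoring more than one new data set in a single run might be sketched as follows, with one SCORE statement per data set. All data set and variable names (mydata, grid_coarse, grid_fine, pred_coarse, pred_fine) are hypothetical.

```sas
/* Simulated training data */
data mydata;
   call streaminit(1);
   do i = 1 to 100;
      x = rand('uniform') * 10;
      y = sin(x) + rand('normal') * 0.3;
      output;
   end;
   drop i;
run;

/* Two data sets to score; each contains every independent */
/* variable used in the MODEL statement (here only x)      */
data grid_coarse;
   do x = 0 to 10 by 1;
      output;
   end;
run;

data grid_fine;
   do x = 0 to 10 by 0.1;
      output;
   end;
run;

/* Both DATA= and OUT= must be specified in each SCORE statement */
proc gam data=mydata;
   model y = spline(x);
   score data=grid_coarse out=pred_coarse;
   score data=grid_fine   out=pred_fine;
run;
```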