Details


This section contains details about the underlying theory and computations of PROC NLMIXED.

Modeling Assumptions and Notation

PROC NLMIXED operates under the following general framework for nonlinear mixed models. Assume that you have an observed data vector y_i for each of i subjects, i = 1,...,s. The y_i are assumed to be independent across i, but within-subject covariance is likely to exist because each of the elements of y_i is measured on the same subject. As a statistical mechanism for modeling this within-subject covariance, assume that there exist latent random-effect vectors u_i of small dimension (typically one or two) that are also independent across i. Assume also that an appropriate model linking y_i and u_i exists, leading to the joint probability density function

p(y_i \mid X_i, \phi, u_i)\, q(u_i \mid \xi)

where X_i is a matrix of observed explanatory variables and φ and ξ are vectors of unknown parameters.

Let θ = (φ, ξ) and assume that it is of dimension n. Then inferences about θ are based on the marginal likelihood function

m(\theta) = \prod_{i=1}^{s} \int p(y_i \mid X_i, \phi, u_i)\, q(u_i \mid \xi)\, du_i

In particular, the function

f(\theta) = -\log\left[m(\theta)\right] = -\sum_{i=1}^{s} \log \int p(y_i \mid X_i, \phi, u_i)\, q(u_i \mid \xi)\, du_i

is minimized over θ numerically in order to estimate θ, and the inverse Hessian (second derivative) matrix at the estimates provides an approximate variance-covariance matrix for the estimate of θ. The function f(θ) is referred to both as the negative log likelihood function and as the objective function for optimization.

As an example of the preceding general framework, consider the nonlinear growth curve example in the Getting Started section. Here, the conditional distribution p(y_i | X_i, φ, u_i) is normal with mean

\frac{b_1 + u_{i1}}{1 + \exp\left[-(d_{ij} - b_2)/b_3\right]}

and variance σ_e²; thus φ = (b_1, b_2, b_3, σ_e²). Also, u_i is a scalar and q(u_i | ξ) is normal with mean 0 and variance σ_u²; thus ξ = σ_u².
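
For concreteness, the growth curve example can be coded with statements like the following. This is only a sketch: the data set name TREE, the variable names DAY and Y, and the starting values are taken to match the Getting Started example and may need to be changed for your data.

   proc nlmixed data=tree;
      parms b1=190 b2=700 b3=350 s2u1=1000 s2e=60;   /* starting values            */
      num = b1 + u1;                                 /* subject-specific asymptote  */
      den = 1 + exp(-(day - b2)/b3);
      model y ~ normal(num/den, s2e);                /* p(y_i | X_i, phi, u_i)      */
      random u1 ~ normal(0, s2u1) subject=tree;      /* q(u_i | xi)                 */
   run;

Here b1, b2, b3, and s2e play the role of φ, and s2u1 plays the role of ξ.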

The following additional notation is also found in this chapter. The quantity θ^(k) refers to the parameter vector at the kth iteration, the function g(θ) refers to the gradient vector ∇f(θ), and the matrix H(θ) refers to the Hessian ∇²f(θ). Other symbols are used to denote various constants or option values.

Integral Approximations

An important part of the marginal maximum likelihood method described previously is the computation of the integral over the random effects. The default method in PROC NLMIXED for computing this integral is adaptive Gaussian quadrature as described in Pinheiro and Bates (1995). Another approximation method is the first-order method of Beal and Sheiner (1982, 1988). A description of these two methods follows.

Adaptive Gaussian Quadrature

A quadrature method approximates a given integral by a weighted sum over predefined abscissas for the random effects. A good approximation can usually be obtained with an adequate number of quadrature points as well as appropriate centering and scaling of the abscissas. Adaptive Gaussian quadrature for the integral over u_i centers the integral at the empirical Bayes estimate of u_i, defined as the vector û_i that minimizes

-\log\left[ p(y_i \mid X_i, \phi, u_i)\, q(u_i \mid \xi) \right]

with φ and ξ set equal to their current estimates. The final Hessian matrix from this optimization can be used to scale the quadrature abscissas.

Suppose (z_j, w_j; j = 1,...,p) denote the standard Gauss-Hermite abscissas and weights (Golub and Welsch 1969, or Table 25.10 of Abramowitz and Stegun 1972). The adaptive Gaussian quadrature integral approximation is as follows:

\int p(y_i \mid X_i, \phi, u_i)\, q(u_i \mid \xi)\, du_i \approx 2^{r/2}\, \left|\Gamma(X_i,\theta)\right|^{-1/2} \sum_{j_1=1}^{p} \cdots \sum_{j_r=1}^{p} \left[ p\!\left(y_i \mid X_i, \phi, a_{j_1,\ldots,j_r}\right) q\!\left(a_{j_1,\ldots,j_r} \mid \xi\right) \prod_{k=1}^{r} w_{j_k} \exp\!\left(z_{j_k}^2\right) \right]

where r is the dimension of u_i, Γ(X_i, θ) is the Hessian matrix from the empirical Bayes minimization, z_{j_1,...,j_r} is a vector with elements (z_{j_1},...,z_{j_r}), and

a_{j_1,\ldots,j_r} = \hat{u}_i + 2^{1/2}\, \Gamma(X_i,\theta)^{-1/2}\, z_{j_1,\ldots,j_r}

PROC NLMIXED selects the number of quadrature points adaptively by evaluating the log likelihood function at the starting values of the parameters until two successive evaluations have a relative difference less than the value of the QTOL= option. The specific search sequence is described under the QFAC= option. Using the QPOINTS= option, you can adjust the number of quadrature points p to obtain different levels of accuracy. Setting p = 1 results in the Laplacian approximation as described in Beal and Sheiner (1992), Wolfinger (1993), Vonesh (1992, 1996), Vonesh and Chinchilli (1997), and Wolfinger and Lin (1997).

The NOAD option in the PROC NLMIXED statement requests nonadaptive Gaussian quadrature. Here all û_i are set equal to zero, and the Cholesky root of the estimated variance matrix of the random effects is substituted for Γ(X_i, θ)^{-1/2} in the preceding expression for a_{j_1,...,j_r}. In this case derivatives are computed using the algorithm of Smith (1995). The NOADSCALE option requests the same scaling substitution but with the empirical Bayes û_i.

PROC NLMIXED computes the derivatives of the adaptive Gaussian quadrature approximation when carrying out the default dual quasi-Newton optimization.
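
For example, the number of quadrature points for the growth curve sketch shown earlier can be controlled directly on the PROC NLMIXED statement. The following is illustrative only; the value 5 is arbitrary.

   proc nlmixed data=tree qpoints=5;                 /* fixed number of abscissas per dimension */
      parms b1=190 b2=700 b3=350 s2u1=1000 s2e=60;
      model y ~ normal((b1+u1)/(1+exp(-(day-b2)/b3)), s2e);
      random u1 ~ normal(0, s2u1) subject=tree;
   run;

Specifying QPOINTS=1 instead yields the Laplacian approximation, and adding the NOAD option requests nonadaptive quadrature as described above.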

First-Order Method

Another integral approximation available in PROC NLMIXED is the first-order method of Beal and Sheiner (1982, 1988) and Sheiner and Beal (1985). This approximation is used only in the case where p(y_i | X_i, φ, u_i) is normal, that is,

p(y_i \mid X_i, \phi, u_i) = (2\pi)^{-n_i/2}\, \left|R_i(X_i,\phi)\right|^{-1/2} \exp\!\left\{ -\tfrac{1}{2}\left[y_i - m_i(X_i,\phi,u_i)\right]^{T} R_i(X_i,\phi)^{-1} \left[y_i - m_i(X_i,\phi,u_i)\right] \right\}

where n_i is the dimension of y_i, R_i is a diagonal variance matrix, and m_i is the conditional mean vector of y_i.

The first-order approximation is obtained by expanding m_i(X_i, φ, u_i) with a one-term Taylor series expansion about u_i = 0, resulting in the approximation

p(y_i \mid X_i, \phi, u_i) \approx (2\pi)^{-n_i/2}\, \left|R_i(X_i,\phi)\right|^{-1/2} \exp\!\left\{ -\tfrac{1}{2}\left[y_i - m_i(X_i,\phi,0) - Z_i(X_i,\phi)\,u_i\right]^{T} R_i(X_i,\phi)^{-1} \left[y_i - m_i(X_i,\phi,0) - Z_i(X_i,\phi)\,u_i\right] \right\}

where Z_i(X_i, φ) is the Jacobian matrix ∂m_i(X_i, φ, u_i)/∂u_i evaluated at u_i = 0.

Assuming that q(u_i | ξ) is normal with mean 0 and variance matrix G(ξ), the first-order integral approximation is computable in closed form after completing the square:

\int p(y_i \mid X_i, \phi, u_i)\, q(u_i \mid \xi)\, du_i \approx (2\pi)^{-n_i/2}\, \left|V_i(X_i,\theta)\right|^{-1/2} \exp\!\left\{ -\tfrac{1}{2}\left[y_i - m_i(X_i,\phi,0)\right]^{T} V_i(X_i,\theta)^{-1} \left[y_i - m_i(X_i,\phi,0)\right] \right\}

where V_i(X_i, θ) = Z_i(X_i, φ) G(ξ) Z_i(X_i, φ)^T + R_i(X_i, φ). The resulting approximation for f(θ) is then minimized over θ = (φ, ξ) to obtain the first-order estimates. PROC NLMIXED uses finite-difference derivatives of the first-order integral approximation when carrying out the default dual quasi-Newton optimization.
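
Because the growth curve example has a normal conditional distribution and a normal random effect, the first-order method can be requested for it with the METHOD=FIRO option. The following sketch reuses the data set and parameter names assumed earlier.

   proc nlmixed data=tree method=firo;               /* first-order integral approximation */
      parms b1=190 b2=700 b3=350 s2u1=1000 s2e=60;
      model y ~ normal((b1+u1)/(1+exp(-(day-b2)/b3)), s2e);
      random u1 ~ normal(0, s2u1) subject=tree;
   run;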

Optimization Algorithms

There are several optimization techniques available in PROC NLMIXED. You can choose a particular optimizer with the TECH= name option in the PROC NLMIXED statement.

Algorithm                                        TECH=
trust region method                              TRUREG
Newton-Raphson method with line search           NEWRAP
Newton-Raphson method with ridging               NRRIDG
quasi-Newton methods (DBFGS, DDFP, BFGS, DFP)    QUANEW
double-dogleg method (DBFGS, DDFP)               DBLDOG
conjugate gradient methods (PB, FR, PR, CD)      CONGRA
Nelder-Mead simplex method                       NMSIMP

No algorithm for optimizing general nonlinear functions exists that always finds the global optimum for a general nonlinear minimization problem in a reasonable amount of time. Since no single optimization technique is invariably superior to others, PROC NLMIXED provides a variety of optimization techniques that work well in various circumstances. However, you can devise problems for which none of the techniques in PROC NLMIXED can find the correct solution. Moreover, nonlinear optimization can be computationally expensive in terms of time and memory, so you must be careful when matching an algorithm to a problem.

All optimization techniques in PROC NLMIXED use O(n²) memory except the conjugate gradient methods, which use only O(n) memory and are designed to optimize problems with many parameters. Since the techniques are iterative, they require the repeated computation of

  • the function value (optimization criterion)

  • the gradient vector (first-order partial derivatives)

  • for some techniques, the (approximate) Hessian matrix (second-order partial derivatives)

However, since each of the optimizers requires different derivatives, some computational efficiencies can be gained. The following table shows, for each optimization technique, which derivatives are required (FOD: first-order derivatives; SOD: second-order derivatives).

Algorithm   FOD   SOD
TRUREG      x     x
NEWRAP      x     x
NRRIDG      x     x
QUANEW      x     -
DBLDOG      x     -
CONGRA      x     -
NMSIMP      -     -

Each optimization method employs one or more convergence criteria that determine when it has converged. The various termination criteria are listed and described in the PROC NLMIXED Statement section. An algorithm is considered to have converged when any one of the convergence criteria is satisfied. For example, under the default settings, the QUANEW algorithm converges if ABSGCONV < 1E−5, FCONV < 10^(−FDIGITS), or GCONV < 1E−8.
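
For example, you can override the default criteria directly on the PROC NLMIXED statement. The values below are purely illustrative, not recommendations, and the model statements again follow the earlier growth curve sketch.

   proc nlmixed data=tree gconv=1e-10 absgconv=1e-6 maxiter=500;
      parms b1=190 b2=700 b3=350 s2u1=1000 s2e=60;
      model y ~ normal((b1+u1)/(1+exp(-(day-b2)/b3)), s2e);
      random u1 ~ normal(0, s2u1) subject=tree;
   run;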

Choosing an Optimization Algorithm

The factors that go into choosing a particular optimization technique for a particular problem are complex and may involve trial and error.

For many optimization problems, computing the gradient takes more computer time than computing the function value, and computing the Hessian sometimes takes much more computer time and memory than computing the gradient, especially when there are many decision variables. Unfortunately, optimization techniques that do not use some kind of Hessian approximation usually require many more iterations than techniques that do use a Hessian matrix, and as a result the total run time of these techniques is often longer. Techniques that do not use the Hessian also tend to be less reliable. For example, they can more easily terminate at stationary points rather than at global optima.

A few general remarks about the various optimization techniques are as follows.

  • The second-derivative methods TRUREG, NEWRAP, and NRRIDG are best for small problems where the Hessian matrix is not expensive to compute. Sometimes the NRRIDG algorithm can be faster than the TRUREG algorithm, but TRUREG can be more stable. The NRRIDG algorithm requires only one matrix with n(n+1)/2 double words; TRUREG and NEWRAP require two such matrices.

  • The first-derivative methods QUANEW and DBLDOG are best for medium-sized problems where the objective function and the gradient are much faster to evaluate than the Hessian. The QUANEW and DBLDOG algorithms, in general, require more iterations than TRUREG, NRRIDG, and NEWRAP, but each iteration can be much faster. The QUANEW and DBLDOG algorithms require only the gradient to update an approximate Hessian, and they require slightly less memory than TRUREG or NEWRAP (essentially one matrix with n(n+1)/2 double words). QUANEW is the default optimization method.

  • The first-derivative method CONGRA is best for large problems where the objective function and the gradient can be computed much faster than the Hessian and where too much memory is required to store the (approximate) Hessian. The CONGRA algorithm, in general, requires more iterations than QUANEW or DBLDOG, but each iteration can be much faster. Since CONGRA requires only a factor of n double-word memory, many large applications of PROC NLMIXED can be solved only by CONGRA.

  • The no-derivative method NMSIMP is best for small problems where derivatives are not continuous or are very difficult to compute.
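
In code, the choice reduces to the TECH= option on the PROC NLMIXED statement. The following sketch uses the small growth curve model from earlier as a stand-in for a problem with an inexpensive Hessian; the alternatives in the comment are equally easy to request.

   proc nlmixed data=tree tech=nrridg;               /* small problem, cheap Hessian */
      parms b1=190 b2=700 b3=350 s2u1=1000 s2e=60;
      model y ~ normal((b1+u1)/(1+exp(-(day-b2)/b3)), s2e);
      random u1 ~ normal(0, s2u1) subject=tree;
   run;
   /* tech=congra suits problems with many parameters;
      tech=nmsimp avoids derivatives altogether       */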

Algorithm Descriptions

Some details about the optimization techniques are as follows.

Trust Region Optimization (TRUREG)

The trust region method uses the gradient g(θ^(k)) and the Hessian matrix H(θ^(k)); thus, it requires that the objective function f(θ) have continuous first- and second-order derivatives inside the feasible region.

The trust region method iteratively optimizes a quadratic approximation to the nonlinear objective function within a hyperelliptic trust region with radius Δ that constrains the step size corresponding to the quality of the quadratic approximation. The trust region method is implemented using Dennis, Gay, and Welsch (1981), Gay (1983), and Moré and Sorensen (1983).

The trust region method performs well for small- to medium-sized problems, and it does not need many function, gradient, and Hessian calls. However, if the computation of the Hessian matrix is computationally expensive, one of the (dual) quasi-Newton or conjugate gradient algorithms may be more efficient.

Newton-Raphson Optimization with Line Search (NEWRAP)

The NEWRAP technique uses the gradient g(θ^(k)) and the Hessian matrix H(θ^(k)); thus, it requires that the objective function have continuous first- and second-order derivatives inside the feasible region. If second-order derivatives are computed efficiently and precisely, the NEWRAP method may perform well for medium-sized to large problems, and it does not need many function, gradient, and Hessian calls.

This algorithm uses a pure Newton step when the Hessian is positive definite and when the Newton step reduces the value of the objective function successfully. Otherwise, a combination of ridging and line search is performed to compute successful steps. If the Hessian is not positive definite, a multiple of the identity matrix is added to the Hessian matrix to make it positive definite (Eskow and Schnabel 1991).

In each iteration, a line search is performed along the search direction to find an approximate optimum of the objective function. The default line-search method uses quadratic interpolation and cubic extrapolation (LIS=2).

Newton-Raphson Ridge Optimization (NRRIDG)

The NRRIDG technique uses the gradient g(θ^(k)) and the Hessian matrix H(θ^(k)); thus, it requires that the objective function have continuous first- and second-order derivatives inside the feasible region.

This algorithm uses a pure Newton step when the Hessian is positive definite and when the Newton step reduces the value of the objective function successfully. If at least one of these two conditions is not satisfied, a multiple of the identity matrix is added to the Hessian matrix.

The NRRIDG method performs well for small- to medium-sized problems, and it does not require many function, gradient, and Hessian calls. However, if the computation of the Hessian matrix is computationally expensive, one of the (dual) quasi-Newton or conjugate gradient algorithms may be more efficient.

Since the NRRIDG technique uses an orthogonal decomposition of the approximate Hessian, each iteration of NRRIDG can be slower than that of the NEWRAP technique, which works with Cholesky decomposition. Usually, however, NRRIDG requires fewer iterations than NEWRAP.

Quasi-Newton Optimization (QUANEW)

The (dual) quasi-Newton method uses the gradient g(θ^(k)), and it does not need to compute second-order derivatives since they are approximated. It works well for medium to moderately large optimization problems where the objective function and the gradient are much faster to compute than the Hessian; but, in general, it requires more iterations than the TRUREG, NEWRAP, and NRRIDG techniques, which compute second-order derivatives. QUANEW is the default optimization algorithm because it provides an appropriate balance between the speed and stability required for most nonlinear mixed model applications.

The QUANEW technique is one of the following, depending upon the value of the UPDATE= option.

  • the original quasi-Newton algorithm, which updates an approximation of the inverse Hessian

  • the dual quasi-Newton algorithm, which updates the Cholesky factor of an approximate Hessian (default)

You can specify four update formulas with the UPDATE= option:

  • DBFGS performs the dual Broyden, Fletcher, Goldfarb, and Shanno (BFGS) update of the Cholesky factor of the Hessian matrix. This is the default.

  • DDFP performs the dual Davidon, Fletcher, and Powell (DFP) update of the Cholesky factor of the Hessian matrix.

  • BFGS performs the original BFGS update of the inverse Hessian matrix.

  • DFP performs the original DFP update of the inverse Hessian matrix.

In each iteration, a line search is performed along the search direction to find an approximate optimum. The default line-search method uses quadratic interpolation and cubic extrapolation to obtain a step size α satisfying the Goldstein conditions. One of the Goldstein conditions can be violated if the feasible region defines an upper limit of the step size. Violating the left-side Goldstein condition can affect the positive definiteness of the quasi-Newton update. In that case, either the update is skipped or the iterations are restarted with an identity matrix, resulting in the steepest descent or ascent search direction. You can specify line-search algorithms other than the default with the LIS= option.

The QUANEW algorithm performs its own line-search technique. All options and parameters (except the INSTEP= option) controlling the line search in the other algorithms do not apply here. In several applications, large steps in the first iterations are troublesome. You can use the INSTEP= option to impose an upper bound for the step size α during the first five iterations. You can also use the INHESSIAN[=r] option to specify a different starting approximation for the Hessian. If you specify only the INHESSIAN option, the Cholesky factor of a (possibly ridged) finite difference approximation of the Hessian is used to initialize the quasi-Newton update process. The values of the LCSINGULAR=, LCEPSILON=, and LCDEACT= options, which control the processing of linear and boundary constraints, are valid only for the quadratic programming subroutine used in each iteration of the QUANEW algorithm.
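
All of the QUANEW controls mentioned above are specified on the PROC NLMIXED statement. The option values in the following sketch are arbitrary illustrations rather than recommended settings.

   proc nlmixed data=tree tech=quanew update=ddfp lis=3 instep=0.5;
      parms b1=190 b2=700 b3=350 s2u1=1000 s2e=60;
      model y ~ normal((b1+u1)/(1+exp(-(day-b2)/b3)), s2e);
      random u1 ~ normal(0, s2u1) subject=tree;
   run;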

Double Dogleg Optimization (DBLDOG)

The double dogleg optimization method combines the ideas of the quasi-Newton and trust region methods. In each iteration, the double dogleg algorithm computes the step s^(k) as the linear combination of the steepest descent or ascent search direction s_1^(k) and a quasi-Newton search direction s_2^(k):

s^{(k)} = \alpha_1\, s_1^{(k)} + \alpha_2\, s_2^{(k)}

The step is requested to remain within a prespecified trust region radius; refer to Fletcher (1987, p. 107). Thus, the DBLDOG subroutine uses the dual quasi-Newton update but does not perform a line search. You can specify two update formulas with the UPDATE= option:

  • DBFGS performs the dual Broyden, Fletcher, Goldfarb, and Shanno update of the Cholesky factor of the Hessian matrix. This is the default.

  • DDFP performs the dual Davidon, Fletcher, and Powell update of the Cholesky factor of the Hessian matrix.

The double dogleg optimization technique works well for medium to moderately large optimization problems where the objective function and the gradient are much faster to compute than the Hessian. The implementation is based on Dennis and Mei (1979) and Gay (1983), but it is extended for dealing with boundary and linear constraints. The DBLDOG technique generally requires more iterations than the TRUREG, NEWRAP, or NRRIDG technique, which requires second-order derivatives; however, each of the DBLDOG iterations is computationally cheap. Furthermore, the DBLDOG technique requires only gradient calls for the update of the Cholesky factor of an approximate Hessian.

Conjugate Gradient Optimization (CONGRA)

Second-order derivatives are not required by the CONGRA algorithm and are not even approximated. The CONGRA algorithm can be expensive in function and gradient calls, but it requires only O ( n ) memory for unconstrained optimization. In general, many iterations are required to obtain a precise solution, but each of the CONGRA iterations is computationally cheap. You can specify four different update formulas for generating the conjugate directions by using the UPDATE= option:

  • PB performs the automatic restart update method of Powell (1977) and Beale (1972). This is the default.

  • FR performs the Fletcher-Reeves update (Fletcher 1987).

  • PR performs the Polak-Ribiere update (Fletcher 1987).

  • CD performs a conjugate-descent update of Fletcher (1987).

The default, UPDATE=PB, behaved best in most test examples. You are advised to avoid the option UPDATE=CD, which behaved worst in most test examples.

The CONGRA subroutine should be used for optimization problems with large n. For the unconstrained or boundary-constrained case, CONGRA requires only O(n) bytes of working memory, whereas all other optimization methods require O(n²) bytes of working memory. During n successive iterations, uninterrupted by restarts or changes in the working set, the conjugate gradient algorithm computes a cycle of n conjugate search directions. In each iteration, a line search is performed along the search direction to find an approximate optimum of the objective function. The default line-search method uses quadratic interpolation and cubic extrapolation to obtain a step size α satisfying the Goldstein conditions. One of the Goldstein conditions can be violated if the feasible region defines an upper limit for the step size. Other line-search algorithms can be specified with the LIS= option.

Nelder-Mead Simplex Optimization (NMSIMP)

The Nelder-Mead simplex method does not use any derivatives and does not assume that the objective function has continuous derivatives. The objective function itself needs to be continuous. This technique is quite expensive in the number of function calls, and it may be unable to generate precise results for n ≫ 40.

The original Nelder-Mead simplex algorithm is implemented and extended to boundary constraints. This algorithm does not compute the objective for infeasible points, but it changes the shape of the simplex by adapting to the nonlinearities of the objective function, which contributes to an increased speed of convergence. It uses special termination criteria.

Finite Difference Approximations of Derivatives

The FD= and FDHESSIAN= options specify the use of finite difference approximations of the derivatives. The FD= option specifies that all derivatives are approximated using function evaluations, and the FDHESSIAN= option specifies that second-order derivatives are approximated using gradient evaluations.

Computing derivatives by finite difference approximations can be very time consuming, especially for second-order derivatives based only on values of the objective function (FD= option). If analytical derivatives are difficult to obtain (for example, if a function is computed by an iterative process), you might consider one of the optimization techniques that uses first-order derivatives only (QUANEW, DBLDOG, or CONGRA).
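
For instance, if analytic derivatives of your programming statements are unreliable or unavailable, you can request finite differences and declare how many digits of the objective function are accurate. The FDIGITS= value below is illustrative.

   proc nlmixed data=tree fd fdigits=6;              /* all derivatives from function values */
      parms b1=190 b2=700 b3=350 s2u1=1000 s2e=60;
      model y ~ normal((b1+u1)/(1+exp(-(day-b2)/b3)), s2e);
      random u1 ~ normal(0, s2u1) subject=tree;
   run;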

Forward Difference Approximations

The forward difference derivative approximations consume less computer time, but they are usually not as precise as approximations that use central difference formulas.

  • For first-order derivatives, n additional function calls are required:

    g_i(\theta) = \frac{\partial f}{\partial \theta_i} \approx \frac{f(\theta + h_i e_i) - f(\theta)}{h_i} \quad \text{(where } e_i \text{ denotes the } i\text{th unit vector)}
  • For second-order derivatives based on function calls only (Dennis and Schnabel 1983, p. 80), n + n²/2 additional function calls are required for a dense Hessian:

    \frac{\partial^2 f}{\partial \theta_i\, \partial \theta_j} \approx \frac{f(\theta + h_i e_i + h_j e_j) - f(\theta + h_i e_i) - f(\theta + h_j e_j) + f(\theta)}{h_i h_j}
  • For second-order derivatives based on gradient calls (Dennis and Schnabel 1983, p. 103), n additional gradient calls are required:

    \frac{\partial^2 f}{\partial \theta_i\, \partial \theta_j} \approx \frac{g_i(\theta + h_j e_j) - g_i(\theta)}{2 h_j} + \frac{g_j(\theta + h_i e_i) - g_j(\theta)}{2 h_i}
Central Difference Approximations

Central difference approximations are usually more precise, but they consume more computer time than approximations that use forward difference derivative formulas.

  • For first-order derivatives, 2n additional function calls are required:

    g_i(\theta) \approx \frac{f(\theta + h_i e_i) - f(\theta - h_i e_i)}{2 h_i}
  • For second-order derivatives based on function calls only (Abramowitz and Stegun 1972, p. 884), 2n + 2n² additional function calls are required:

    \frac{\partial^2 f}{\partial \theta_i^2} \approx \frac{-f(\theta + 2h_i e_i) + 16 f(\theta + h_i e_i) - 30 f(\theta) + 16 f(\theta - h_i e_i) - f(\theta - 2h_i e_i)}{12 h_i^2}
    \frac{\partial^2 f}{\partial \theta_i\, \partial \theta_j} \approx \frac{f(\theta + h_i e_i + h_j e_j) - f(\theta + h_i e_i - h_j e_j) - f(\theta - h_i e_i + h_j e_j) + f(\theta - h_i e_i - h_j e_j)}{4 h_i h_j}
  • For second-order derivatives based on gradient calls, 2n additional gradient calls are required:

    \frac{\partial^2 f}{\partial \theta_i\, \partial \theta_j} \approx \frac{g_i(\theta + h_j e_j) - g_i(\theta - h_j e_j)}{4 h_j} + \frac{g_j(\theta + h_i e_i) - g_j(\theta - h_i e_i)}{4 h_i}

You can use the FDIGITS= option to specify the number of accurate digits in the evaluation of the objective function. This specification is helpful in determining an appropriate interval size h to be used in the finite difference formulas.

The step sizes h_j, j = 1,...,n, are defined as follows.

  • For the forward difference approximation of first-order derivatives using function calls and second-order derivatives using gradient calls, h_j = \sqrt{\eta}\,(1 + |\theta_j|).

  • For the forward difference approximation of second-order derivatives using only function calls and all central difference formulas, h_j = \eta^{1/3}\,(1 + |\theta_j|).

The value of η is defined by the FDIGITS= option:

  • If you specify the number of accurate digits using FDIGITS=r, η is set to 10^(−r).

  • If you do not specify the FDIGITS= option, η is set to the machine precision ε.

Hessian Scaling

The rows and columns of the Hessian matrix can be scaled when you use the trust region, Newton-Raphson, and double dogleg optimization techniques. Each element H_{i,j}, i, j = 1,...,n, is divided by the scaling factor d_i d_j, where the scaling vector d = (d_1,...,d_n) is iteratively updated in a way specified by the HESCAL=i option, as follows.

i = 0: No scaling is done (equivalent to d_i = 1).

i ≠ 0: The first iteration and each restart iteration set:

d_i^{(0)} = \sqrt{\max\left(|H_{i,i}^{(0)}|,\ \epsilon\right)}

i = 1: Refer to Moré (1978):

d_i^{(k+1)} = \max\left( d_i^{(k)},\ \sqrt{\max\left(|H_{i,i}^{(k)}|,\ \epsilon\right)} \right)

i = 2: Refer to Dennis, Gay, and Welsch (1981):

d_i^{(k+1)} = \max\left( 0.6\, d_i^{(k)},\ \sqrt{\max\left(|H_{i,i}^{(k)}|,\ \epsilon\right)} \right)

i = 3: d_i is reset in each iteration:

d_i^{(k+1)} = \sqrt{\max\left(|H_{i,i}^{(k)}|,\ \epsilon\right)}

In the preceding equations, ε is the relative machine precision or, equivalently, the largest double precision value that, when added to 1, results in 1.

Active Set Methods

The parameter vector θ ∈ R^n can be subject to a set of m linear equality and inequality constraints:

\sum_{j=1}^{n} a_{ij}\,\theta_j = b_i \quad (i = 1,\ldots,m_e), \qquad \sum_{j=1}^{n} a_{ij}\,\theta_j \ge b_i \quad (i = m_e + 1,\ldots,m)

The coefficients a_{ij} and right-hand sides b_i of the equality and inequality constraints are collected in the m × n matrix A and the m-vector b.

The m linear constraints define a feasible region G in R^n that must contain the point θ* that minimizes the problem. If the feasible region G is empty, no solution to the optimization problem exists.

In PROC NLMIXED, all optimization techniques use active set methods. The iteration starts with a feasible point θ^(0), which you can provide or which can be computed by the Schittkowski and Stoer (1979) algorithm implemented in PROC NLMIXED. The algorithm then moves from one feasible point θ^(k−1) to a better feasible point θ^(k) along a feasible search direction s^(k),

\theta^{(k)} = \theta^{(k-1)} + \alpha^{(k)}\, s^{(k)}, \qquad \alpha^{(k)} > 0

Theoretically, the path of points θ^(k) never leaves the feasible region G of the optimization problem, but it can reach its boundaries. The active set A^(k) of point θ^(k) is defined as the index set of all linear equality constraints and those inequality constraints that are satisfied at θ^(k). If no constraint is active at θ^(k), the point is located in the interior of G, and the active set A^(k) = ∅ is empty. If the point θ^(k) in iteration k hits the boundary of inequality constraint i, this constraint i becomes active and is added to A^(k). Each equality constraint and each active inequality constraint reduce the dimension (degrees of freedom) of the optimization problem.

In practice, the active constraints can be satisfied only with finite precision. The LCEPSILON=r option specifies the range for active and violated linear constraints. If the point θ^(k) satisfies the condition

\left| \sum_{j=1}^{n} a_{ij}\,\theta_j^{(k)} - b_i \right| \le t

where t = r(|b_i| + 1), the constraint i is recognized as an active constraint. Otherwise, the constraint i is either an inactive inequality or a violated inequality or equality constraint. Due to rounding errors in computing the projected search direction, errors can accumulate so that an iterate θ^(k) steps out of the feasible region.

In those cases, PROC NLMIXED may try to pull the iterate θ^(k) back into the feasible region. However, in some cases the algorithm needs to increase the feasible region by increasing the LCEPSILON=r value. If this happens, a message is displayed in the log output.

If the algorithm cannot improve the value of the objective function by moving from an active constraint back into the interior of the feasible region, it makes this inequality constraint an equality constraint in the next iteration. This means that the active set A^(k+1) still contains the constraint i. Otherwise, it releases the active inequality constraint and increases the dimension of the optimization problem in the next iteration.

A serious numerical problem can arise when some of the active constraints become (nearly) linearly dependent. PROC NLMIXED removes linearly dependent equality constraints before starting optimization. You can use the LCSINGULAR= option to specify a criterion r used in the update of the QR decomposition that determines whether an active constraint is linearly dependent relative to a set of other active constraints.

If the solution θ* is subjected to n_act linear equality or active inequality constraints, the QR decomposition of the n × n_act matrix Â^T of the linear constraints is computed by Â^T = QR, where Q is an n × n orthogonal matrix and R is an n × n_act upper triangular matrix. The n columns of matrix Q can be separated into two matrices, Q = [Y, Z], where Y contains the first n_act orthogonal columns of Q and Z contains the last n − n_act orthogonal columns of Q. The n × (n − n_act) column-orthogonal matrix Z is also called the nullspace matrix of the active linear constraints Â. The n − n_act columns of the n × (n − n_act) matrix Z form a basis orthogonal to the rows of the n_act × n matrix Â.

At the end of the iterating, PROC NLMIXED computes the projected gradient g_Z,

g_Z = Z^{T} g(\theta^{(k)})

In the case of boundary-constrained optimization, the elements of the projected gradient correspond to the gradient elements of the free parameters. A necessary condition for θ* to be a local minimum of the optimization problem is

g_Z(\theta^*) = Z^{T} g(\theta^*) = 0

The symmetric (n − n_act) × (n − n_act) matrix G_Z,

G_Z = Z^{T} H(\theta^{(k)})\, Z

is called a projected Hessian matrix. A second-order necessary condition for θ* to be a local minimizer requires that the projected Hessian matrix be positive semidefinite.

Those elements of the n_act vector λ of first-order estimates of Lagrange multipliers,

\lambda = \left(\hat{A}\hat{A}^{T}\right)^{-1} \hat{A}\, g(\theta^{(k)})

that correspond to active inequality constraints indicate whether an improvement of the objective function can be obtained by releasing this active constraint. For minimization, a significant negative Lagrange multiplier indicates that a possible reduction of the objective function can be achieved by releasing this active linear constraint. The LCDEACT= r option specifies a threshold r for the Lagrange multiplier that determines whether an active inequality constraint remains active or can be deactivated. (In the case of boundary-constrained optimization, the Lagrange multipliers for active lower (upper) constraints are the negative (positive) gradient elements corresponding to the active parameters.)

Line-Search Methods

In each iteration k, the (dual) quasi-Newton, conjugate gradient, and Newton-Raphson minimization techniques use iterative line-search algorithms that try to optimize a linear, quadratic, or cubic approximation of f along a feasible descent search direction s^(k),

\theta^{(k+1)} = \theta^{(k)} + \alpha^{(k)}\, s^{(k)}, \qquad \alpha^{(k)} > 0

by computing an approximately optimal scalar α^(k).

Therefore, a line-search algorithm is an iterative process that optimizes a nonlinear function f(α) of one parameter (α) within each iteration k of the optimization technique. Since the outside iteration process is based only on the approximation of the objective function, the inside iteration of the line-search algorithm does not have to be perfect. Usually, it is satisfactory that the choice of α significantly reduces (in a minimization) the objective function. Criteria often used for termination of line-search algorithms are the Goldstein conditions (refer to Fletcher 1987).

You can select various line-search algorithms by specifying the LIS= option. The line-search method LIS=2 seems to be superior when function evaluation consumes significantly less computation time than gradient evaluation. Therefore, LIS=2 is the default method for Newton-Raphson, (dual) quasi-Newton, and conjugate gradient optimizations.

You can modify the line-search methods LIS=2 and LIS=3 to be exact line searches by using the LSPRECISION= option and specifying the σ parameter described in Fletcher (1987). The line-search methods LIS=1, LIS=2, and LIS=3 satisfy the left-hand side and right-hand side Goldstein conditions (refer to Fletcher 1987). When derivatives are available, the line-search methods LIS=6, LIS=7, and LIS=8 try to satisfy the right-hand side Goldstein condition; if derivatives are not available, these line-search algorithms use only function calls.

Restricting the Step Length

Almost all line-search algorithms use iterative extrapolation techniques that can easily lead them to (feasible) points where the objective function f is no longer defined or difficult to compute. Therefore, PROC NLMIXED provides options restricting the step length α or trust region radius Δ, especially during the first main iterations.

The inner product g^T s of the gradient g and the search direction s is the slope of f(α) = f(θ + αs) along the search direction s. The default starting value α^(0) = α^(k,0) in each line-search algorithm, min_{α>0} f(θ + αs), during the main iteration k is computed in three steps:

  1. The first step uses either the difference df = f^(k) − f^(k−1) of the function values during the last two consecutive iterations or the final step-size value α^− of the last iteration k − 1 to compute a first value of α_1^(0).

  • If the DAMPSTEP option is not used,

    \alpha_1^{(0)} = \begin{cases} \text{step} & \text{if } 0.1 \le \text{step} \le 10 \\ 10 & \text{if } \text{step} > 10 \\ 0.1 & \text{if } \text{step} < 0.1 \end{cases}

    with

    \text{step} = \begin{cases} df / |g^{T} s| & \text{if } |df| \ge \epsilon \text{ and } |g^{T} s| \ge \epsilon \\ 1 & \text{otherwise} \end{cases}

    This value of α_1^(0) can be too large and can lead to a difficult or impossible function evaluation, especially for highly nonlinear functions such as the EXP function.

  • If the DAMPSTEP[=r] option is used,

    \alpha_1^{(0)} = \min\left(1,\ r\,\alpha^{-}\right)

    The initial value for the new step length can be no larger than r times the final step length α^− of the former iteration. The default value is r = 2.

  2. During the first five iterations, the second step enables you to reduce α_1^(0) to a smaller starting value α_2^(0) using the INSTEP=r option:

    \alpha_2^{(0)} = \min\left(\alpha_1^{(0)},\ r\right)

    After more than five iterations, α_2^(0) is set to α_1^(0).

  3. The third step can further reduce the step length by

    \alpha_3^{(0)} = \min\left(\alpha_2^{(0)},\ \min(10,\ u)\right)

    where u is the maximum length of a step inside the feasible region.

The INSTEP=r option enables you to specify a smaller or larger radius Δ of the trust region used in the first iteration of the trust region and double dogleg algorithms. The default initial trust region radius Δ^(0) is the length of the scaled gradient (Moré 1978). This step corresponds to the default radius factor of r = 1. In most practical applications of the TRUREG and DBLDOG algorithms, this choice is successful. However, for bad initial values and highly nonlinear objective functions (such as the EXP function), the default start radius can result in arithmetic overflows. If this happens, you can try decreasing values of INSTEP=r, 0 < r < 1, until the iteration starts successfully. A small factor r also affects the trust region radius Δ^(k+1) of the next steps because the radius is changed in each iteration by a factor 0 < c ≤ 4, depending on the ratio expressing the goodness of the quadratic function approximation. Reducing the radius corresponds to increasing the ridge parameter λ, producing smaller steps directed more closely toward the (negative) gradient direction.

Computational Problems

Floating Point Errors and Overflows

Numerical optimization of a numerically integrated function is a difficult task, and the computation of the objective function and its derivatives can lead to arithmetic exceptions and overflows. A typical cause of these problems is parameters with widely varying scales. If the scaling of your parameters varies by more than a few orders of magnitude, the numerical stability of the optimization problem can be seriously reduced and result in computational difficulties. A simple remedy is to rescale each parameter so that its final estimated value has a magnitude near 1.

If parameter rescaling does not help, consider the following actions:

  • Specify the ITDETAILS option in the PROC NLMIXED statement to obtain more detailed information about when and where the problem is occurring.

  • Provide different initial values or try a grid search of values.

  • Use boundary constraints to avoid the region where overflows may happen.

  • Delete outlying observations or subjects from the input data, if this is reasonable.

  • Change the algorithm (specified in programming statements) that computes the objective function.

The line-search algorithms that work with cubic extrapolation are especially sensitive to arithmetic overflows. If an overflow occurs during a line search, you can use the INSTEP= option to reduce the length of the first trial step during the first five iterations, or you can use the DAMPSTEP or MAXSTEP option to restrict the step length of the initial α in subsequent iterations. If an arithmetic overflow occurs in the first iteration of the trust region or double dogleg algorithms, you can use the INSTEP= option to reduce the default trust region radius of the first iteration. You can also change the optimization technique or the line-search method.
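
In code, these safeguards amount to bounding the parameters and damping the early steps. The bounds and the INSTEP= value in the following sketch are illustrative.

   proc nlmixed data=tree instep=0.1;                /* shorter first trial steps    */
      parms b1=190 b2=700 b3=350 s2u1=1000 s2e=60;
      bounds s2u1 >= 0, s2e >= 0;                    /* keep variances nonnegative   */
      model y ~ normal((b1+u1)/(1+exp(-(day-b2)/b3)), s2e);
      random u1 ~ normal(0, s2u1) subject=tree;
   run;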

Long Run Times

PROC NLMIXED can take a long time to run for problems with complex models, many parameters, or large input data sets. Although the optimization techniques used by PROC NLMIXED are some of the best ones available, they are not guaranteed to converge quickly for all problems. Ill-posed or misspecified models can cause the algorithms to use more extensive calculations designed to achieve convergence, and this can result in longer run times. So first make sure that your model is specified correctly, that your parameters are scaled to be of the same order of magnitude, and that your data reasonably match the model you are contemplating.

If you are using the default adaptive Gaussian quadrature algorithm and no iteration history is printing at all, then PROC NLMIXED may be bogged down trying to determine the number of quadrature points at the first set of starting values. Specifying the QPOINTS= option will bypass this stage and proceed directly to iterations; however, be aware that the likelihood approximation may not be accurate if there are too few quadrature points.

PROC NLMIXED may also have difficulty determining the number of quadrature points if the initial starting values are far from the optimum values. To obtain more accurate starting values for the model parameters, one easy method is to fit a model with no RANDOM statement. You can then use these estimates as starting values, although you will still need to specify values for the random-effects distribution. For normal-normal models, another strategy is to use METHOD=FIRO. If you can obtain estimates using this approximate method, then they can be used as starting values for more accurate likelihood approximations.

If you are running PROC NLMIXED multiple times, you will probably want to include a statement like the following in your program:

  ods output ParameterEstimates=pe;  

This statement creates a SAS data set named PE upon completion of the run. In your next invocation of PROC NLMIXED, you can then specify

  parms / data=pe;  

to read in the previous estimates as starting values.
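
As a sketch of the starting-value strategy described above (with the data set and variable names of the earlier growth curve example), a first run without the RANDOM statement saves its estimates to a data set:

   proc nlmixed data=tree;                           /* fixed-effects-only fit       */
      parms b1=190 b2=700 b3=350 s2e=60;
      model y ~ normal(b1/(1+exp(-(day-b2)/b3)), s2e);
      ods output ParameterEstimates=pe;
   run;

A subsequent run of the full model can then read PE with the PARMS statement shown above; as noted earlier, you still need to supply a starting value for the random-effects variance.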

To speed general computations, you should check over your programming statements to minimize the number of floating point operations. Using auxiliary variables and factoring amenable expressions can be useful changes in this regard.

Problems Evaluating Code for Objective Function

The starting point θ^(0) must be a point for which the programming statements can be evaluated. However, during optimization, the optimizer may iterate to a point θ^(k) where the objective function or its derivatives cannot be evaluated. In some cases, the specification of boundary constraints for parameters can avoid such situations. In many other cases, you can indicate that such a point is a bad point simply by returning an extremely large value for the objective function. In these cases, the optimization algorithm reduces the step length and stays closer to the point that has been evaluated successfully in the former iteration.

No Convergence

There are a number of things to try if the optimizer fails to converge.

  • Change the initial values by using a grid search specification to obtain a set of good feasible starting values.

  • Change or modify the update technique or the line-search algorithm.

    This method applies only to TECH=QUANEW and TECH=CONGRA. For example, if you use the default update formula and the default line-search algorithm, you can

    • change the update formula with the UPDATE= option

    • change the line-search algorithm with the LIS= option

    • specify a more precise line search with the LSPRECISION= option, if you use LIS=2 or LIS=3

  • Change the optimization technique.

    For example, if you use the default option, TECH=QUANEW, you can try one of the second-derivative methods if your problem is small or the conjugate gradient method if it is large.

  • Adjust finite difference derivatives.

    The forward difference derivatives specified with the FD[=] or FDHESSIAN[=] option may not be precise enough to satisfy strong gradient termination criteria. You may need to specify the more expensive central difference formulas. The finite difference intervals may be too small or too big, and the finite difference derivatives may be erroneous.

  • Double-check the data entry and program specification.

Convergence to Stationary Point

The gradient at a stationary point is the null vector, which always leads to a zero search direction. This point satisfies the first-order termination criterion. Search directions that are based on the gradient are zero, so the algorithm terminates. There are two ways to avoid this situation:

  • Use the PARMS statement to specify a grid of feasible initial points (see the sketch following this list).

  • Use the OPTCHECK[=r] option to avoid terminating at the stationary point.
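
A grid of candidate starting values is created by listing several values for a parameter in the PARMS statement; PROC NLMIXED evaluates the objective function over the grid before the optimization begins. The grid values in this sketch are arbitrary, and the model statements follow the earlier growth curve example.

   proc nlmixed data=tree;
      parms b1=150 190 230  b2=600 700 800  b3=350  s2u1=1000  s2e=60;   /* 9 grid points over b1 and b2 */
      model y ~ normal((b1+u1)/(1+exp(-(day-b2)/b3)), s2e);
      random u1 ~ normal(0, s2u1) subject=tree;
   run;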

The signs of the eigenvalues of the (reduced) Hessian matrix contain information regarding a stationary point.

  • If all of the eigenvalues are positive, the Hessian matrix is positive definite, and the point is a minimum point.

  • If some of the eigenvalues are positive and all remaining eigenvalues are zero, the Hessian matrix is positive semidefinite, and the point is a minimum or saddle point.

  • If all of the eigenvalues are negative, the Hessian matrix is negative definite, and the point is a maximum point.

  • If some of the eigenvalues are negative and all of the remaining eigenvalues are zero, the Hessian matrix is negative semidefinite, and the point is a maximum or saddle point.

  • If all of the eigenvalues are zero, the point can be a minimum, maximum, or saddle point.

Precision of Solution

In some applications, PROC NLMIXED may result in parameter values that are not precise enough. Usually, this means that the procedure terminated at a point too far from the optimal point. The termination criteria define the size of the termination region around the optimal point. Any point inside this region can be accepted for terminating the optimization process. The default values of the termination criteria are set to satisfy a reasonable compromise between the computational effort (computer time) and the precision of the computed estimates for the most common applications. However, there are a number of circumstances in which the default values of the termination criteria specify a region that is either too large or too small.

If the termination region is too large, then it can contain points with low precision. In such cases, you should determine which termination criterion stopped the optimization process. In many applications, you can obtain a solution with higher precision simply by using the old parameter estimates as starting values in a subsequent run in which you specify a smaller value for the termination criterion that was satisfied at the former run.

If the termination region is too small, the optimization process may take longer to find a point inside such a region, or it may not even find such a point due to rounding errors in function values and derivatives. This can easily happen in applications in which finite difference approximations of derivatives are used and the GCONV and ABSGCONV termination criteria are too small to respect rounding errors in the gradient values.

Covariance Matrix

The estimated covariance matrix of the parameter estimates is computed as the inverse Hessian matrix, and for unconstrained problems it should be positive definite. If the final parameter estimates are subjected to n_act > 0 active linear inequality constraints, the formulas for the covariance matrices are modified similarly to Gallant (1987) and Cramer (1986, p. 38) and additionally generalized for applications with singular matrices.

There are several steps available that enable you to tune the rank calculations of the covariance matrix.

  1. You can use the ASINGULAR=, MSINGULAR=, and VSINGULAR= options to set three singularity criteria for the inversion of the Hessian matrix H . The singularity criterion used for the inversion is

    |d_{j,j}| \le \max\left( \text{ASING},\ \text{VSING}\cdot|H_{j,j}|,\ \text{MSING}\cdot\max\left(|H_{1,1}|,\ldots,|H_{n,n}|\right) \right)

    where d_{j,j} is the diagonal pivot of the matrix H, and ASING, VSING, and MSING are the specified values of the ASINGULAR=, VSINGULAR=, and MSINGULAR= options. The default values are

    • ASING: the square root of the smallest positive double precision value

    • MSING: 1E−12 if you do not specify the SINGHESS= option and max(10ε, 1E−4 × SINGHESS) otherwise, where ε is the machine precision

    • VSING: 1E−8 if you do not specify the SINGHESS= option and the value of SINGHESS otherwise

    Note that, in many cases, a normalized matrix D^(−1)AD^(−1) is decomposed, and the singularity criteria are modified correspondingly.

  2. If the matrix H is found to be singular in the first step, a generalized inverse is computed. Depending on the G4= option, either a generalized inverse satisfying all four Moore-Penrose conditions is computed or a generalized inverse satisfying only two Moore-Penrose conditions is computed. If the number of parameters n of the application is less than or equal to G4=i, a G4 inverse is computed; otherwise, only a G2 inverse is computed. The G4 inverse is computed by the (computationally very expensive but numerically stable) eigenvalue decomposition, and the G2 inverse is computed by Gauss transformation. The G4 inverse is computed using the eigenvalue decomposition A = ZΛZ^T, where Z is the orthogonal matrix of eigenvectors and Λ is the diagonal matrix of eigenvalues, Λ = diag(λ_1,...,λ_n). The G4 inverse of H is set to

    A^{-} = Z\, \Lambda^{-} Z^{T}

    where the diagonal matrix Λ^− = diag(λ_1^−,...,λ_n^−) is defined using the COVSING= option:

    \lambda_i^{-} = \begin{cases} 1/\lambda_i & \text{if } |\lambda_i| > \text{COVSING} \\ 0 & \text{if } |\lambda_i| \le \text{COVSING} \end{cases}

    If you do not specify the COVSING= option, the nr smallest eigenvalues are set to zero, where nr is the number of rank deficiencies found in the first step.

For optimization techniques that do not use second-order derivatives, the covariance matrix is computed using finite difference approximations of the derivatives.

Prediction

The nonlinear mixed model is a useful tool for statistical prediction. Assuming a prediction is to be made regarding the ith subject, suppose that f(θ, u_i) is a differentiable function predicting some quantity of interest. Recall that θ denotes the vector of unknown parameters and u_i denotes the vector of random effects for the ith subject. A natural point prediction is f(θ̂, û_i), where θ̂ is the maximum likelihood estimate of θ and û_i is the empirical Bayes estimate of u_i described previously in the section Integral Approximations.

An approximate prediction variance matrix for (θ̂, û_i) is

P = \begin{bmatrix} \hat{H}^{-1} & \hat{H}^{-1}\left(\partial \hat{u}_i / \partial \theta\right)^{T} \\ \left(\partial \hat{u}_i / \partial \theta\right)\hat{H}^{-1} & \hat{\Gamma}^{-1} + \left(\partial \hat{u}_i / \partial \theta\right)\hat{H}^{-1}\left(\partial \hat{u}_i / \partial \theta\right)^{T} \end{bmatrix}

where Ĥ is the approximate Hessian matrix from the optimization for θ̂, Γ̂ is the approximate Hessian matrix from the optimization for û_i, and (∂û_i/∂θ) is the derivative of û_i with respect to θ, evaluated at (θ̂, û_i). The approximate variance matrix for θ̂ is the standard one discussed in the previous section, and that for û_i is an approximation to the conditional mean squared error of prediction described by Booth and Hobert (1998).

The prediction variance for a general scalar function f(θ, u_i) is defined as the expected squared difference E[f(θ̂, û_i) − f(θ, u_i)]². PROC NLMIXED computes an approximation to it as follows. The derivative of f(θ, u_i) is computed with respect to each element of (θ, u_i) and evaluated at (θ̂, û_i). If a_i is the resulting vector, then the approximate prediction variance is a_i^T P a_i. This approximation is known as the delta method (Billingsley 1986; Cox 1998).
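
In practice, these predictions and their standard errors are requested with the PREDICT statement. In the following sketch of the growth curve example, the OUT= data set name is arbitrary.

   proc nlmixed data=tree;
      parms b1=190 b2=700 b3=350 s2u1=1000 s2e=60;
      pred = (b1 + u1)/(1 + exp(-(day - b2)/b3));    /* f(theta, u_i)                     */
      model y ~ normal(pred, s2e);
      random u1 ~ normal(0, s2u1) subject=tree;
      predict pred out=curve;                        /* evaluated at (theta-hat, u-hat_i) */
   run;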

Computational Resources

Since nonlinear optimization is an iterative process that depends on many factors, it is difficult to estimate how much computer time is necessary to find an optimal solution satisfying one of the termination criteria. You can use the MAXTIME=, MAXITER=, and MAXFUNC= options to restrict the amount of CPU time, the number of iterations, and the number of function calls in a single run of PROC NLMIXED.

In each iteration k, the NRRIDG technique uses a symmetric Householder transformation to decompose the n × n Hessian matrix H,

H = V^{\prime}\, T\, V \qquad (V \text{ orthogonal},\ T \text{ tridiagonal})

to compute the (Newton) search direction s

s^{(k)} = -\,H(\theta^{(k)})^{-1}\, g(\theta^{(k)})

The TRUREG and NEWRAP techniques use the Cholesky decomposition to solve the same linear system while computing the search direction. The QUANEW, DBLDOG, CONGRA, and NMSIMP techniques do not need to invert or decompose a Hessian matrix; thus, they require less computational resources than the other techniques.

The larger the problem, the more time is needed to compute function values and derivatives. Therefore, you may want to compare optimization techniques by counting and comparing the respective numbers of function, gradient, and Hessian evaluations.

Finite difference approximations of the derivatives are expensive because they require additional function or gradient calls:

  • forward difference formulas

    • For first-order derivatives, n additional function calls are required.

    • For second-order derivatives based on function calls only, for a dense Hessian, n + n²/2 additional function calls are required.

    • For second-order derivatives based on gradient calls, n additional gradient calls are required.

  • central difference formulas

    • For first-order derivatives, 2 n additional function calls are required.

    • For second-order derivatives based on function calls only, for a dense Hessian, 2n + 2n² additional function calls are required.

    • For second-order derivatives based on gradient calls, 2 n additional gradient calls are required.

Many applications need considerably more time for computing second-order derivatives (Hessian matrix) than for computing first-order derivatives (gradient). In such cases, a dual quasi-Newton technique is recommended, which does not require second-order derivatives.

Displayed Output

This section describes the displayed output from PROC NLMIXED. See the section ODS Table Names on page 3107 for details about how this output interfaces with the Output Delivery System.

Starting Gradient and Hessian

The START option in the PROC NLMIXED statement displays the gradient of the negative log likelihood function at the starting values of the parameters. If you also specify the HESS option, then the starting Hessian is displayed as well.

Iterations

The iteration history consists of one line of output for each iteration in the optimization process. The iteration history is displayed by default because it is important that you check for possible convergence problems. The default iteration history includes the following variables:

  • Iter, the iteration number

  • Calls, the number of function calls

  • NegLogLike, the value of the objective function

  • Diff, the difference between adjacent function values

  • MaxGrad, the maximum of the absolute (projected) gradient components (except NMSIMP)

  • Slope, the slope g^T s of the search direction s at the current parameter iterate θ^(k) (QUANEW only)

  • Rho, the ratio between the achieved and predicted value of Diff (NRRIDG only)

  • Radius, the radius of the trust region (TRUREG only)

  • StdDev, the standard deviation of the simplex values (NMSIMP only)

  • Delta, the vertex length of the simplex (NMSIMP only)

  • Size, the size of the simplex (NMSIMP only)

For the QUANEW method, the value of Slope should be significantly negative. Otherwise, the line-search algorithm has difficulty reducing the function value sufficiently. If this difficulty is encountered, an asterisk (*) appears after the iteration number. If there is a tilde (~) after the iteration number, the BFGS update is skipped, and very high values of the Lagrange function are produced. A backslash (\) after the iteration number indicates that Powell's correction for the BFGS update is used.

For methods using second derivatives, an asterisk (*) after the iteration number means that the computed Hessian approximation was singular and had to be ridged with a positive value.

For the NMSIMP method, only one line is displayed for several internal iterations. This technique skips the output for some iterations because some of the termination tests (StdDev and Size) are rather time consuming compared to the simplex operations, and they are performed only every five simplex operations.

The ITDETAILS option in the PROC NLMIXED statement provides a more detailed iteration history. Besides listing the current values of the parameters and their gradients, the following values are provided in addition to the default output:

  • Restart, the number of iteration restarts

  • Active, the number of active constraints

  • Lambda, the value of the Lagrange multiplier (TRUREG and DBLDOG only)

  • Ridge, the ridge value (NRRIDG only)

  • Alpha, the line-search step size (QUANEW only)

An apostrophe (') trailing the number of active constraints indicates that at least one of the active constraints was released from the active set due to a significant Lagrange multiplier.

Fitting Information

The Fitting Information table lists the final minimized value of −2 times the log likelihood, the information criteria of Akaike (AIC) and Schwarz (BIC), and a finite-sample corrected version of AIC (AICC). The criteria are computed as follows:

-2\,\text{Log Likelihood} = 2 f(\hat{\theta})
\text{AIC} = 2 f(\hat{\theta}) + 2p
\text{AICC} = 2 f(\hat{\theta}) + \frac{2pn}{n - p - 1}
\text{BIC} = 2 f(\hat{\theta}) + p \log(s)

where f() is the negative of the marginal log likelihood function, θ̂ is the vector of parameter estimates, p is the number of parameters, n is the number of observations, and s is the number of subjects. Refer to Hurvich and Tsai (1989) and Burnham and Anderson (1998) for additional details.

Covariance and Correlation Matrices

Following standard maximum likelihood theory (for example, Serfling 1980), the asymptotic variance-covariance matrix of the parameter estimates equals the inverse of the Hessian matrix. You can display this matrix with the COV option in the PROC NLMIXED statement. The corresponding correlation form is available with the CORR option.
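
All of the display options described in this section go on the PROC NLMIXED statement. The following sketch requests several of them at once for the growth curve example; in practice you would ask only for what you need.

   proc nlmixed data=tree start hess itdetails cov corr;
      parms b1=190 b2=700 b3=350 s2u1=1000 s2e=60;
      model y ~ normal((b1+u1)/(1+exp(-(day-b2)/b3)), s2e);
      random u1 ~ normal(0, s2u1) subject=tree;
   run;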

ODS Table Names

PROC NLMIXED assigns a name to each table it creates. You can use these names to reference the table when using the Output Delivery System (ODS) to select tables and create output data sets. These names are listed in the following table. For more information on ODS, see Chapter 14, Using the Output Delivery System.

Table 51.2: ODS Tables Produced in PROC NLMIXED

ODS Table Name        Description                                   Statement or Option
AdditionalEstimates   Results from ESTIMATE statements              ESTIMATE
ConvergenceStatus     Convergence status                            default
CorrMatAddEst         Correlation matrix of additional estimates    ECORR
CorrMatParmEst        Correlation matrix of parameter estimates     CORR
CovMatAddEst          Covariance matrix of additional estimates     ECOV
CovMatParmEst         Covariance matrix of parameter estimates      COV
DerAddEst             Derivatives of additional estimates           EDER
Dimensions            Dimensions of the problem                     default
FitStatistics         Fit statistics                                default
Hessian               Second derivative matrix                      HESS
IterHistory           Iteration history                             default
Parameters            Parameters                                    default
ParameterEstimates    Parameter estimates                           default
Specifications        Model specifications                          default
StartingHessian       Starting Hessian matrix                       START HESS
StartingValues        Starting values and gradient                  START



