Details | SAS/STAT 9.1 Users Guide, Volumes 1-7

Theoretical Semivariogram Models

The VARIOGRAM procedure computes the sample, or experimental semivariogram. Prediction of the spatial process at unsampled locations by techniques such as ordinary kriging requires a theoretical semivariogram or covariance.

It is necessary, then, to decide on a theoretical variogram based on the sample variogram. While there are methods of fitting semivariogram models, such as least squares, maximum likelihood , and robust methods (Cressie 1993, section 2.6), these techniques are not appropriate for data sets resulting in a small number of variogram points. Instead, a visual fit of the variogram points to a few standard models is often satisfactory. Even when there are sufficient variogram points, a visual check against a fitted theoretical model is appropriate (Hohn 1988, p. 25ff).

In some cases, a plot of the experimental semivariogram suggests that a single theoretical model is inadequate. Nested models, anisotropic models, and the nugget effect increase the scope of theoretical models available. All of these concepts are discussed in this section. The specification of the final theoretical model is provided by the syntax of PROC KRIGE2D.

Note the general flow of investigation. After a suitable choice is made of the LAGDIST= and MAXLAGS= options and, possibly, the NDIR= option (or a DIRECTIONS statement), the experimental semivariogram is computed. Potential theoretical models, possibly incorporating nesting, anisotropy, and the nugget effect, are computed by a DATA step, then they are plotted against the experimental semivariogram and evaluated. A suitable theoretical model is thus found visually, and the specification of the model is used in PROC KRIGE2D. This flow is illustrated in Figure 80.10; also see the 'Getting Started' section on page 4852 for an illustration in a simple case.

Figure 80.10: Flowchart for Variogram Selection

Theoretical and Computational Details of the Semivariogram

The basic starting point in computing the semivariogram is the enumeration of pairs of points for the spatial data. Figure 80.11 shows a spatial domain in which a set of measurements are made at the indicated locations. Two points P ₁ and P ₂ , with coordinates ( x ₁ , y ₁ ) , ( x ₂ , y ₂ ), are selected for illustration. A vector, or directed line segment, is drawn between these points. This pair is then categorized first by orientation of this directed line segment and then by its length. That is, the pair P ₁ P ₂ is placed into an angle and distance class.

Figure 80.11: Selection of Points P ₁ and P ₂ in Spatial Domain

Angle Classification

Suppose you specify NDIR=3 in the COMPUTE statement in PROC VARIOGRAM. This results in three angle classes defined by midpoint angles between 0 ^o and 180 ^o : 0 ^o ± , 60 ^o ± , and 120 ^o ± , where is the angle tolerance. If you do not specify an angle tolerance using the ATOL= option in the COMPUTE statement, the following default value is used.

For three classes, =30 ^o . When the example directed line segment P ₁ P ₂ is superimposed on the coordinate system showing the angle classes, its angle, measured clockwise from north, is approximately 45 ^o . In particular, it falls within [60 ^o ˆ’ , 60 ^o + ) = [30 ^o , 90 ^o ), the second angle class. See Figure 80.12.

Figure 80.12: Selected Pair P ₁ P ₂ Falls within the Second Angle Class

Note that if the designated points P ₁ and P ₂ are labeled in the opposite order, the orientation is in a reciprocal direction, that is, approximately 225 ^o for the point pair instead of approximately 45 ^o . This does not affect angle class selection; the angle classes [60 ^o ˆ’ , 60 ^o + ) and [240 ^o ˆ’ , 240 ^o + ) are the same.

If you specify an angle tolerance less than the default, for example, ATOL =15 ^o , some point pairs might be excluded. For example, the selected point pair P ₁ P ₂ in Figure 80.12, while closest to the 60 ^o axis, might lie outside [60 ˆ’ , 60 + )= [45 ^o , 75 ^o ). In this case, the point pair P ₁ P ₂ would be excluded from the variogram computation.

On the other hand, you can specify an angle tolerance greater than the default. This can result in a point pair being counted in more than one angle class. This has a smoothing effect on the variogram and is useful when there is a small amount of data available.

An alternative way to specify angle classes and angle tolerances is with the DIRECTIONS statement. The DIRECTIONS statement is useful when angle classes are not equally spaced . When you specify the DIRECTIONS statement, you should also specify the angle tolerance. The default value of the angle tolerance is 45 ^o when a DIRECTIONS statement is used instead of the NDIRECTIONS= option in the COMPUTE statement. This may not be appropriate for a particular set of angle classes. See the 'DIRECTIONS Statement' section on page 4870 for more details on the DIRECTIONS statement.

Distance Classification

Next, the distance class for the point pair P ₁ P ₂ is determined. The directed line segment P ₁ P ₂ is superimposed on the coordinate system showing the distance or lag classes. These classes are determined by the LAGD= specification in the COMPUTE statement. Denoting the length of the line segment by P ₁ P ₂ and the LAGD value by ” , the lag class L is determined by

where [ x ] denotes the largest integer ‰ x .

When the directed line segment P ₁ P ₂ is superimposed on the coordinate system showing the distance classes, it is seen to fall in the first lag class; see Figure 80.13 for an illustration for ” =1.

Figure 80.13: Selected Pair P ₁ P ₂ Falls within the First Lag Class

Because pairwise distances are positive, lag class zero is smaller than lag classes 1 , , MAXLAG ˆ’ 1. For example, if you specify LAGD=1.0 and MAXLAG=10, and you do not specify a LAGTOL= value in the COMPUTE statement in PROC VARIOGRAM, the ten lag classes generated by the preceding equation are

This is because the default lag tolerance is one-half the LAGD= value, resulting in no gaps between the distance class intervals. This is shown in Figure 80.14.

Figure 80.14: Lag Distance Axis Showing Lag Classes

On the other hand, if you do specify a distance tolerance with the DTOL= option in the COMPUTE statement, a further check is performed to see if the point pair falls within this tolerance of the nearest lag. In the preceding example, if you specify LAGD=1.0 and MAXLAG=10 (as before) and also specify LAGTOL=0.25, the intervals become

Note that this specification results in gaps in the lag classes; a point pair P ₁ P ₂ might fall, for example, in the interval

and hence be excluded from the semivariogram calculation. The maximum LAGTOL= value allowed is half the LAGD= value; no overlap of the distance classes is allowed.

Bandwidth Restriction

Because the areal segments generated from the angle and distance classes increase in area as the lag distance increases , it is sometimes desirable to restrict this area (Duetsch and Journel 1992, p. 45). If you specify the BANDW= option in the COMPUTE statement, the lateral, or perpendicular , distance from the axis defining the angle classes is fixed.

For example, suppose two points P ₃ , P ₄ are picked from the domain in Figure 80.11 and are superimposed on the grid defining distance and angle classes, as shown in Figure 80.15.

The endpoint of vector P ₃ P ₄ falls within the angle class around 60 ^o and the 5th lag class; however, it falls outside the restricted area defined by the bandwidth. Hence, it is excluded from the semivariogram calculation.

Finally, a pair P _i P _j that falls in a lag class larger than the value of the MAXLAG= option is excluded from the semivariogram calculation.

From this description, it is clear that the number of pairs within each angle/distance class is strongly affected by the angle and lag tolerances. Since it is desirable to have the maximum number of point pairs within each class, the angle tolerance and the distance tolerance should usually be the default values.

Semivariogram Computation

With the classification of a point pair P _i P _j into an angle/distance class, as shown in the preceding section, the semivariogram computation proceeds as follows .

Denote all pairs P _i P _j belonging to angle class [ _k ˆ’ _k , _k + _k ) and distance class L = L ( P _i P _j ) by N ( _k , L ). For example, in the preceding illustration, P ₁ P ₂ belongs to N (60 ^o , 1).

Let N ( _k , L ) denote the number of such pairs. Let V _i , V _j be the measured values at points P _i , P _j . The component of the standard (or method of moments) semivariogram

corresponding to angle/distance class N ( _k , L ) is given by

where h _k is the average distance in class N ( _k , L );that is,

The robust version of the semivariogram, as suggested by Cressie (1993), is given by

where

This robust version of the semivariogram is computed when you specify the ROBUST option in the COMPUTE statement in PROC VARIOGRAM.

PROC VARIOGRAM computes and writes to the OUTVAR= data set the quantities h _k , _k , L, N( _k ,L), ³ (h) , and ³ ( h ).

Output Data Sets

The VARIOGRAM procedure produces three data sets: the OUTVAR= SAS-data-set , the OUTPAIR= SAS-data-set , and the OUTDIST= SAS-data-set . These data sets are described in the following sections.

OUTVAR=SAS-data-set

The OUTVAR= data set contains the standard and robust versions of the sample semivariogram, the covariance, and other information at each lag class.

The details of the computation of the variogram, the robust variogram, and the covariance is described in the section 'Theoretical and Computational Details of the Semivariogram' on page 4872.

The OUTVAR= data set contains the following variables :

ANGLE , which is the angle class value (clockwise from N_S)
ATOL , which is the angle tolerance for the lag/angle class
AVERAGE , which is the average variable value for the lag/angle class
BANDW , which is the band width for the lag/angle class
COUNT , which is the number of pairs in the lag/angle class
COVAR , which is the covariance value for the lag/angle class
DISTANCE , which is the average lag distance for the lag/angle class
LAG , which is lag class value (in LAGDISTANCE= units)
RVARIO , which is the sample robust variogram value for the lag/angle class
VARIOG , which is the sample variogram value for the lag/angle class
VARNAME , which is the name of the current VAR= variable

The bandwidth variable, BANDW , is not included in the data set if no bandwidth specification is given in the COMPUTE statement or in a DIRECTIONS statement.

OUTDIST=SAS-data-set

The OUTDIST= data set contains counts for a modified histogram showing the distribution of pairwise distances. The purpose of this data set is to enable you to make choices for the value of the LAGDISTANCE= option in the COMPUTE statement in subsequent runs of PROC VARIOGRAM.

For plotting and estimation purposes, it is desirable to have as many points as possible for a variogram plot. However, a rule of thumb used in computing sample semivariograms is to use at least 30 points in each interval whenever possible. Hence, there is a lower limit to the value of the LAGDISTANCE= option.

Since the distribution of pairwise distances is seldom known in advance, the information contained in the OUTDIST= data set enables you to choose, in an iterative fashion, a value for the LAGDISTANCE= parameter. The value you choose is a compromise between the number of pairs making up each variogram point and the number of variogram points.

In some cases, the pattern of measured points may result in some lag or distance classes having a small number of pairs, while the remaining classes have a large number of pairs. By adjusting the value of the LAGDISTANCE= option to honor the rule of thumb (at least 30 pairs), you are 'wasting' pairs in the other distance classes.

One strategy for solving this problem is to use less than 30 pairs for these distance classes. Then, either delete the corresponding variogram points or use them and accept the increased uncertainty. Unfortunately, the deficient distance classes are usually those close to the origin ( h =0). This is the crucial portion of the experimental variogram curve for determining the form of the theoretical variogram and for detecting the presence of a nugget effect.

Another alternative is to force distance classes to contain approximately the same number of pairs. This results in distance classes of unequal widths.

While PROC VARIOGRAM does not produce such distance classes directly, the OUTPAIR= data set, described in the section 'OUTPAIR= SAS-data-set ' on page 4881, contains information on all distinct pairs of points. You can use this data set, along with the RANK procedure, to produce experimental variogram-based equal numbers of pairs in each distance class.

To request an OUTDIST= data set, you specify the OUTDIST= data set in the PROC VARIOGRAM statement and the NOVARIOGRAM option in the COMPUTE statement. The NOVARIOGRAM option prevents any variogram or covariance computation from being performed.

Computation of the Distribution Distance Classes

The simplest way of determining the distribution of pairwise distances is to determine the maximum distance h _max between pairs and divide this distance by some number N of intervals to produce distance classes of length . The distance between each pair of points P ₁ ,P ₂ , denoted P ₁ P ₂ , is computed, and the pair P ₁ P ₂ is counted in the k th distance class if P ₁ P ₂ ˆˆ [( k ˆ’ 1) , k ) for k = 1 , , N .

The actual computation is a slight variation of this. A bound, rather than the actual maximum distance, is computed. This bound is the length of the diagonal of a bounding rectangle for the data points. This bounding rectangle is found by using the maximum and minimum x and y coordinates, x _max , x _min , y _max , y _min , and forming the rectangle determined by the points

See Figure 80.16 for an illustration of the bounding rectangle.

Figure 80.16: Bounding Rectangle to Determine Maximum Pairwise Distance The pairwise distance bound, denoted by h _b , is given by

Using h _b , the interval (0 ,h _b ] is divided into N +1subintervals, where N is the value of the NHCLASSES= option specified in the COMPUTE statement, or N =10 if the NHCLASSES= option is not specified. The basic distance unit is ; the distance intervals are centered on h , 2 h , , N h , with a distance tolerance of . The extra subinterval is (0, h /2), corresponding to the 0th lag. It is half the length of the remaining subintervals, and it often contains the smallest number of pairs.

This method of partitioning the interval (0, h _b ] is identical to what is done when you actually compute the sample variogram.

The lag classes corresponding to h =1 are shown in Figure 80.17.

Figure 80.17: Lag Classes Corresponding to h =1

By increasing or decreasing the value of the NHCLASSES= option, you can adjust the lag or distance class with the smallest count so that this count is around 30 or some other value that you judge appropriate.

Once you determine an appropriate value for the NHCLASSES= option, you can use the width of the lag classes as a candidate value for the LAGDIST= option in the COMPUTE statement. The width of the lag classes is determined by the upper bound (UB) and lower bound (LB) variables.

For example, read the observation from the OUTDIST= data set corresponding to lag 1 and compute the quantity UB-LB. Use this value for the LAGDIST= option in the COMPUTE statement.

Note: Do not use the 0th lag class; it is half the length of the other intervals. Use lag 1 instead.

Variables in the OUTDIST= data set

The following variables are written to the OUTDIST= data set:

COUNT , which is the number of pairs falling into this lag class
LAG , which is the lag class value
LB , which is the lower bound of the lag class interval
UB , which is the upper bound of the lag class interval
PER , which is the percent of all pairs falling in this lag class
VARNAME , which is the name of the current VAR= variable

OUTPAIR=SAS-data-set

The OUTPAIR= data set contains one observation for each distinct pair of points P ₁ ,P ₂ in the original data set, unless you specify the OUTPDISTANCE= option in the COMPUTE statement.

If you specify OUTPDISTANCE= D _max in the COMPUTE statement, all pairs P ₁ ,P ₂ in the original data set that satisfy the relation P ₁ P ₂ ‰ D _max are written to the OUTPAIR= data set.

Note that the OUTPAIR= data set can be very large even for a moderately sized DATA= data set. For example, if the DATA= data set has NOBS=500,the OUTPAIR= data set has NOBS(NOBS ˆ’ 1) / 2= 124,750 if no OUTPDISTANCE= restriction is given in the COMPUTE statement.

The OUTPAIR= data set contains information on the distance and orientation for each point pair, and you can use it for specialized continuity measure calculations.

The OUTPAIR= data set contains the following variables:

AC, which is the angle class value
COS, which is the cosine of the angle between pairs
DC, which is the distance (lag) class
DISTANCE, which is the distance between pairs
V1, which is the variable value for the first point in the pair
V2, which is the variable value for the second point in the pair
VARNAME, which is the variable name for the current VAR variable
X1, which is the x coordinate of the first point in the pair
X2, which is the x coordinate of the second point in the pair
Y1, which is the y coordinate of the first point in the pair
Y2, which is the y coordinate of the second point in the pair

Computational Resources

The computations of the VARIOGRAM procedure are basically binning : for each pair of observations in the input data set, a distance and angle class is determined and recorded. Let N _d denote the number of distance classes, N _a denote the number of angle classes, and N _v denote the number of VAR variables. The memory requirements for these operations are proportional to N _d — N _a — N _v . This is typically small.

The CPU time required for the computations is proportional to the number of pairs of observations, or to N ² — N _v , where N is the number of observations in the input data set.