In activities such as reservoir estimation in mining, petroleum exploration, and environmental modeling of air and water pollution, it often happens that data on one or more quantities are available at given spatial locations, and the goal is to predict the measured quantities at unsampled locations. Often, these unsampled locations are on a regular grid, and the predictions are used to produce surface plots or contour maps.
A popular method of spatial prediction is ordinary kriging, which produces both predicted values and associated standard errors. Ordinary kriging requires the complete specification (the form and parameter values) of the spatial dependence of the spatial process in terms of a covariance or semivariogram model.
Typically the semivariogram model is not known in advance and must be estimated, either visually or by some estimation method.
PROC VARIOGRAM computes the sample semivariogram, from which you can find a suitable theoretical semivariogram by visual methods .
The following example goes through a typical problem to show how you can compute a sample variogram and determine an appropriate theoretical model.
The simulated data consist of coal seam thickness measurements (in feet) taken over an approximately square area. The coordinates are offsets from a point in the south-west corner of the measurement area, with the north and east distances in units of thousands of feet.
First, the data are input.
data thick; input east north thick @@; datalines; 0.7 59.6 34.1 2.1 82.7 42.2 4.7 75.1 39.5 4.8 52.8 34.3 5.9 67.1 37.0 6.0 35.7 35.9 6.4 33.7 36.4 7.0 46.7 34.6 8.2 40.1 35.4 13.3 0.6 44.7 13.3 68.2 37.8 13.4 31.3 37.8 17.8 6.9 43.9 20.1 66.3 37.7 22.7 87.6 42.8 23.0 93.9 43.6 24.3 73.0 39.3 24.8 15.1 42.3 24.8 26.3 39.7 26.4 58.0 36.9 26.9 65.0 37.8 27.7 83.3 41.8 27.9 90.8 43.3 29.1 47.9 36.7 29.5 89.4 43.0 30.1 6.1 43.6 30.8 12.1 42.8 32.7 40.2 37.5 34.8 8.1 43.3 35.3 32.0 38.8 37.0 70.3 39.2 38.2 77.9 40.7 38.9 23.3 40.5 39.4 82.5 41.4 43.0 4.7 43.3 43.7 7.6 43.1 46.4 84.1 41.5 46.7 10.6 42.6 49.9 22.1 40.7 51.0 88.8 42.0 52.8 68.9 39.3 52.9 32.7 39.2 55.5 92.9 42.2 56.0 1.6 42.7 60.6 75.2 40.1 62.1 26.6 40.1 63.0 12.7 41.8 69.0 75.6 40.1 70.5 83.7 40.9 70.9 11.0 41.7 71.5 29.5 39.8 78.1 45.5 38.7 78.2 9.1 41.7 78.4 20.0 40.8 80.5 55.9 38.7 81.1 51.0 38.6 83.8 7.9 41.6 84.5 11.0 41.5 85.2 67.3 39.4 85.5 73.0 39.8 86.7 70.4 39.6 87.2 55.7 38.8 88.1 0.0 41.6 88.4 12.1 41.3 88.4 99.6 41.2 88.8 82.9 40.5 88.9 6.2 41.5 90.6 7.0 41.5 90.7 49.6 38.9 91.5 55.4 39.0 92.9 46.8 39.1 93.4 70.9 39.7 94.8 71.5 39.7 96.2 84.3 40.3 98.2 58.2 39.5 ;
It is instructive to see the locations of the measured points in the area where you want to perform spatial prediction. It is desirable to have these locations scattered evenly around the prediction area. If this is not the case, the prediction error might be unacceptably large where measurements are sparse. The following GPLOT procedure is useful in determining potential problems:
proc gplot data=thick; title 'Scatter Plot of Measurement Locations'; plot north*east / frame cframe=ligr haxis=axis1 vaxis=axis2; symbol1 v=dot color=blue; axis1 minor=none; axis2 minor=none label=(angle=90 rotate=0); label east = 'East' north = 'North' ; run;
As Figure 80.1 indicates, while the locations are not ideally spread around the prediction area, there are not any large areas lacking measurements. You now can look at a surface plot of the measured variable, the thickness of coal seam, using the G3D procedure. This is a crucial step. Any obvious surface trend has to be removed before you compute and estimate the model of spatial dependence (the semivariogram model).
proc g3d data=thick; title 'Surface Plot of Coal Seam Thickness'; scatter east*north=thick / xticknum=5 yticknum=5 grid zmin=20 zmax=65; label east = 'East' north = 'North' thick = 'Thickness' ; run;
Figure 80.2 shows the small-scale variation typical of spatial data, but there does not appear to be any surface trend. Hence, you can work with the original thickness data rather than residuals from a trend surface fit.
Recall that the goal of this example is spatial prediction. In particular, you would like to produce a contour map or surface plot on a regular grid of predicted values based on ordinary kriging. Ordinary kriging requires the complete specification of the spatial covariance or semivariogram.
You can use PROC VARIOGRAM, along with a DATA step and PROC GPLOT, to estimate visually a reasonable semivariogram model (both the form and associated parameters) for the thickness data.
Before proceeding with this estimation, consider the formula for the empirical or experimental semivariogram ³ z ( h ). Denote the coal seam thickness process by { Z ( r ) , r ˆˆ D ‚ R 2 }. You have measurements ( Z ( r i ) ,i =1 ,..., 75). The standard formula for ³ z ( h ) (isotropic case) is
where N ( h ) is given by
and N ( h ) is the number of such pairs ( i, j ).
For actual data, it is unlikely that any pair ( i, j ) would exactly satisfy r i ˆ’ r j = h , so typically a range of pairwise distances, r i ˆ’ r j ˆˆ [ h ˆ’ h,h + h ), isused to group pairs ( r i , r j ) for a single term in the expression for ³ z ( h ). Using this range, N ( h ) is modified by
PROC VARIOGRAM performs this grouping with two required options for variogram computation: the LAGDISTANCE= and MAXLAGS= options.
The meaning of the required LAGDISTANCE= option is as follows . Classify all pairs of points into intervals according to their pairwise distance. The width of the distance interval is the LAGDISTANCE= value. The meaning of the required MAXLAGS= option is simply the number of intervals.
The problem is that a surface plot of the original data, or the scatter plot of the measurement locations, is not very helpful in determining the distribution of these pairwise distances; it is not clear what values to give to the LAGDISTANCE= and MAXLAGS= options.
You use PROC VARIOGRAM with the OUTDISTANCE= option to produce a modified histogram of the pairwise distances in order to find reasonable values for the LAGDISTANCE= and MAXLAGS= options. In the following analysis, you use the NOVARIOGRAM option in the COMPUTE statement and the OUTDISTANCE= option in the PROC VARIOGRAM statement. You need the NOVARIOGRAM option to keep an error message from being issued due to the absence of the LAGDISTANCE= and MAXLAGS= options.
The DATA step after the PROC VARIOGRAM statement computes the midpoint of each distance interval. This midpoint is then used in the GCHART procedure. Since the number of distance intervals is not specified by using the NHCLASSES= option in the COMPUTE statement, the default of 10 is used.
proc variogram data=thick outdistance=outd; compute novariogram; coordinates xc=east yc=north; var thick; run; title 'OUTDISTANCE= Data Set Showing Distance Intervals'; proc print data=outd; run; data outd; set outd; mdpt=round((lb+ub)/2,.1); label mdpt = 'Midpoint of Interval'; run; axis1 minor=none; axis2 minor=none label=(angle=90 rotate=0); title 'Distribution of Pairwise Distances'; proc gchart data=outd; vbar mdpt / type=sum sumvar=count discrete frame cframe=ligr gaxis=axis1 raxis=axis2 nolegend; run;
OUTDISTANCE= Data Set Showing Distance Intervals Obs VARNAME LAG LB UB COUNT PER 1 thick 0 0.000 6.969 45 0.01622 2 thick 1 6.969 20.907 263 0.09477 3 thick 2 20.907 34.845 383 0.13802 4 thick 3 34.845 48.783 436 0.15712 5 thick 4 48.783 62.720 495 0.17838 6 thick 5 62.720 76.658 525 0.18919 7 thick 6 76.658 90.596 412 0.14847 8 thick 7 90.596 104.534 179 0.06450 9 thick 8 104.534 118.472 35 0.01261 10 thick 9 118.472 132.410 2 0.00072 11 thick 10 132.410 146.348 0 0.00000
For plotting and estimations purposes, it is desirable to have as many points as possible for the plot of ³ z ( h ) against h . This corresponds to having as many distance intervals as possible, that is, having a small value for the LAGDISTANCE= option.
However, a rule of thumb used in computing sample semivariograms is to use at least 30 point pairs in computing a single value of the empirical or experimental semivariogram.
If the LAGDISTANCE= value is set too small, there may be too few points in one or more of the intervals. On the other hand, if the LAGDISTANCE= value is set to a large value, the number of point pairs in the distance intervals may be much greater than that needed for estimation precision, thereby 'wasting' point pairs at the expense of variogram points.
Hence, there is a tradeoff between the number of distance intervals and the number of point pairs within each interval.
As discussed in the section 'OUTDIST =SAS-data-set ' on page 4878 the first few distance intervals, corresponding to lag 0 and lag 1, are typically the limiting intervals. This is particularly true for lag 0 since it is half the width of the remaining intervals. For the default of NHCLASSES=10, the lag 0 class contains 45 points, which is reasonably close to 30, but the lag 1 class contains 263 points.
If you rerun PROC VARIOGRAM with NHCLASSES=20, these numbers become 8 and 83 for lags 0 and 1, respectively. Because of the asymmetrical nature of lag 0, you are willing to violate the rule of thumb for the 0th lag. You will, however, have sufficient numbers in lag 1 and above.
proc variogram data=thick outdistance=outd; compute nhc=20 novariogram; coordinates xc=east yc=north; var thick; run; title 'OUTDISTANCE= Data Set Showing Distance Intervals'; proc print data=outd; run; data outd; set outd; mdpt=round((lb+ub)/2,.1); label mdpt = 'Midpoint of Interval'; run; axis1 minor=none; axis2 minor=none label=(angle=90 rotate=0); title 'Distribution of Pairwise Distances'; proc gchart data=outd; vbar mdpt / type=sum sumvar=count discrete frame cframe=ligr gaxis=axis1 raxis=axis2 nolegend; run;
The length of the lag1 class (3 . 484 , 10 . 453) is 6.969. You round off and use LAGDISTANCE=7.0 in the COMPUTE statement.
The use of the MAXLAGS= option is more difficult. From Figure 80.5, note that up to a pairwise distance of 101, you have a sufficient number of pairs. With your choice of LAGDISTANCE=7.0, this yields a maximum number of lags .
OUTDISTANCE= Data Set Showing Distance Intervals Obs VARNAME LAG LB UB COUNT PER 1 thick 0 0.000 3.484 8 0.00288 2 thick 1 3.484 10.453 83 0.02991 3 thick 2 10.453 17.422 143 0.05153 4 thick 3 17.422 24.391 167 0.06018 5 thick 4 24.391 31.360 198 0.07135 6 thick 5 31.360 38.329 197 0.07099 7 thick 6 38.329 45.298 203 0.07315 8 thick 7 45.298 52.267 235 0.08468 9 thick 8 52.267 59.236 234 0.08432 10 thick 9 59.236 66.205 284 0.10234 11 thick 10 66.205 73.174 264 0.09514 12 thick 11 73.174 80.143 236 0.08505 13 thick 12 80.143 87.112 221 0.07964 14 thick 13 87.112 94.081 165 0.05946 15 thick 14 94.081 101.050 75 0.02703 16 thick 15 101.050 108.018 41 0.01477 17 thick 16 108.018 114.987 15 0.00541 18 thick 17 114.987 121.956 5 0.00180 19 thick 18 121.956 128.925 1 0.00036 20 thick 19 128.925 135.894 0 0.00000 21 thick 20 135.894 142.863 0 0.00000
The problem with using the maximum lag value is that it includes pairs of points so far apart that they are likely to be independent. Using pairs of points that are independent adds nothing to the empirical semivariogram plot; they are essentially added noise.
If there is an estimate of correlation length, perhaps from a prior geologic study of a similar site, you can specify the MAXLAGS= value so that the maximum pairwise distance does not exceed two or three correlation lengths. If there is no estimate of correlation length, you can use the following rule of thumb: use 1/2 to of the 'diameter' of the region containing the data. A MAXLAGS= value of 10 is within this range.
You now rerun PROC VARIOGRAM with these values.
Using the values of LAGDISTANCE=7.0 and MAXLAGS=10 computed previously, rerun PROC VARIOGRAM without the NOVARIOGRAM option. Also, request a robust version of the semivariogram; then, plot both results against the pairwise distance of each class.
proc variogram data=thick outv=outv; compute lagd=7 maxlag=10 robust; coordinates xc=east yc=north; var thick; run; title 'OUTVAR= Data Set Showing Sample Variogram Results'; proc print data=outv label; var lag count distance variog rvario; run; data outv2; set outv; vari=variog; type = 'regular'; output; vari=rvario; type = 'robust'; output; run; title 'Standard and Robust Semivariogram for Coal Seam Thickness Data'; proc gplot data=outv2; plot vari*distance=type / frame cframe=ligr vaxis=axis2 haxis=axis1; symbol1 i=join l=1 c=blue /* v=star */; symbol2 i=join l=1 c=yellow /* v=square */; axis1 minor=none label=(c=black 'Lag Distance') /* offset=(3,3) */; axis2 order=(0 to 9 by 1) minor=none label=(angle=90 rotate=0 c=black 'Variogram') /* offset=(3,3) */; run;
OUTVAR= Data Set Showing Sample Variogram Results Lag Class Value (in Number of Average Lag LAGDIST= Pairs in Distance Robust Obs units) Class for Class Variogram Variogram 1 1 75 . . . 2 0 8 2.5045 0.02937 0.01694 3 1 85 7.3625 0.38047 0.19807 4 2 142 14.1547 1.15158 0.98029 5 3 169 21.0913 2.79719 3.01412 6 4 199 27.9691 4.68769 4.86998 7 5 199 35.1591 6.16018 6.15639 8 6 205 42.2547 7.58912 8.05072 9 7 232 48.7775 7.12506 7.07155 10 8 244 56.1824 7.04832 7.62851 11 9 285 62.9121 6.66298 8.02993 12 10 262 69.8925 6.18775 7.92206
Figure 80.8 shows first a slow, then a rapid rise from the origin, suggesting a Gaussian type form:
See the section 'Theoretical and Computational Details of the Semivariogram' on page 4872 for graphs of the standard semivariogram forms.
By experimentation, you find that a scale of c =7 . 5 and a range of a = 30 fits reasonably well for both the robust and standard semivariogram
The following statements plot the sample and theoretical variograms:
data outv3; set outv; c0=7.5; a0=30; vari = c0*(1-exp(-distance*distance/(a0*a0))); type = 'Gaussian'; output; vari = variog; type = 'regular'; output; vari = rvario; type = 'robust'; output; run; title 'Theoretical and Sample Semivariogram for Coal Seam Thickness Data'; proc gplot data=outv3; plot vari*distance=type / frame cframe=ligr vaxis=axis2 haxis=axis1; symbol1 i=join l=1 c=blue /* v=star */; symbol2 i=join l=1 c=yellow /* v=square */; symbol3 i=join l=1 c=cyan /* v=diamond */; axis1 minor=none label=(c=black 'Lag Distance') /* offset=(3,3) */; axis2 order=(0 to 9 by 1) minor=none label=(angle=90 rotate=0 c=black 'Variogram') /* offset=(3,3) */; run;
Figure 80.9 shows that the choice of a semivariogram model is adequate. You can use this Gaussian form and these particular parameters in PROC KRIGE2D to produce a contour plot of the kriging estimates and the associated standard errors.