Getting Started


In activities such as reservoir estimation in mining, petroleum exploration, and environmental modeling of air and water pollution, it often happens that data on one or more quantities are available at given spatial locations, and the goal is to predict the measured quantities at unsampled locations. Often, these unsampled locations are on a regular grid, and the predictions are used to produce surface plots or contour maps.

A popular method of spatial prediction is ordinary kriging, which produces both predicted values and associated standard errors. Ordinary kriging requires the complete specification (the form and parameter values) of the spatial dependence of the spatial process in terms of a covariance or semivariogram model.

Typically the semivariogram model is not known in advance and must be estimated, either visually or by some estimation method.

PROC VARIOGRAM computes the sample semivariogram, from which you can find a suitable theoretical semivariogram by visual methods .

The following example goes through a typical problem to show how you can compute a sample variogram and determine an appropriate theoretical model.

Preliminary Spatial Data Analysis

The simulated data consist of coal seam thickness measurements (in feet) taken over an approximately square area. The coordinates are offsets from a point in the south-west corner of the measurement area, with the north and east distances in units of thousands of feet.

First, the data are input.

  data thick;   input east north thick @@;   datalines;   0.7  59.6  34.1   2.1  82.7  42.2   4.7  75.1  39.5   4.8  52.8  34.3   5.9  67.1  37.0   6.0  35.7  35.9   6.4  33.7  36.4   7.0  46.7  34.6   8.2  40.1  35.4   13.3   0.6  44.7  13.3  68.2  37.8  13.4  31.3  37.8   17.8   6.9  43.9  20.1  66.3  37.7  22.7  87.6  42.8   23.0  93.9  43.6  24.3  73.0  39.3  24.8  15.1  42.3   24.8  26.3  39.7  26.4  58.0  36.9  26.9  65.0  37.8   27.7  83.3  41.8  27.9  90.8  43.3  29.1  47.9  36.7   29.5  89.4  43.0  30.1   6.1  43.6  30.8  12.1  42.8   32.7  40.2  37.5  34.8   8.1  43.3  35.3  32.0  38.8   37.0  70.3  39.2  38.2  77.9  40.7  38.9  23.3  40.5   39.4  82.5  41.4  43.0   4.7  43.3  43.7   7.6  43.1   46.4  84.1  41.5  46.7  10.6  42.6  49.9  22.1  40.7   51.0  88.8  42.0  52.8  68.9  39.3  52.9  32.7  39.2   55.5  92.9  42.2  56.0   1.6  42.7  60.6  75.2  40.1   62.1  26.6  40.1  63.0  12.7  41.8  69.0  75.6  40.1   70.5  83.7  40.9  70.9  11.0  41.7  71.5  29.5  39.8   78.1  45.5  38.7  78.2   9.1  41.7  78.4  20.0  40.8   80.5  55.9  38.7  81.1  51.0  38.6  83.8   7.9  41.6   84.5  11.0  41.5  85.2  67.3  39.4  85.5  73.0  39.8   86.7  70.4  39.6  87.2  55.7  38.8  88.1   0.0  41.6   88.4  12.1  41.3  88.4  99.6  41.2  88.8  82.9  40.5   88.9   6.2  41.5  90.6   7.0  41.5  90.7  49.6  38.9   91.5  55.4  39.0  92.9  46.8  39.1  93.4  70.9  39.7   94.8  71.5  39.7  96.2  84.3  40.3  98.2  58.2  39.5   ;  

It is instructive to see the locations of the measured points in the area where you want to perform spatial prediction. It is desirable to have these locations scattered evenly around the prediction area. If this is not the case, the prediction error might be unacceptably large where measurements are sparse. The following GPLOT procedure is useful in determining potential problems:

  proc gplot data=thick;   title 'Scatter Plot of Measurement Locations';   plot north*east / frame cframe=ligr haxis=axis1   vaxis=axis2;   symbol1 v=dot color=blue;   axis1 minor=none;   axis2 minor=none label=(angle=90 rotate=0);   label east   = 'East'   north  = 'North'   ;   run;  

As Figure 80.1 indicates, while the locations are not ideally spread around the prediction area, there are not any large areas lacking measurements. You now can look at a surface plot of the measured variable, the thickness of coal seam, using the G3D procedure. This is a crucial step. Any obvious surface trend has to be removed before you compute and estimate the model of spatial dependence (the semivariogram model).

click to expand
Figure 80.1: Scatter Plot of Measurement Locations
  proc g3d data=thick;   title 'Surface Plot of Coal Seam Thickness';   scatter east*north=thick / xticknum=5 yticknum=5   grid zmin=20 zmax=65;   label east  = 'East'   north = 'North'   thick = 'Thickness'   ;   run;  

Figure 80.2 shows the small-scale variation typical of spatial data, but there does not appear to be any surface trend. Hence, you can work with the original thickness data rather than residuals from a trend surface fit.

click to expand
Figure 80.2: Surface Plot of Coal Seam Thickness

Preliminary Variogram Analysis

Recall that the goal of this example is spatial prediction. In particular, you would like to produce a contour map or surface plot on a regular grid of predicted values based on ordinary kriging. Ordinary kriging requires the complete specification of the spatial covariance or semivariogram.

You can use PROC VARIOGRAM, along with a DATA step and PROC GPLOT, to estimate visually a reasonable semivariogram model (both the form and associated parameters) for the thickness data.

Before proceeding with this estimation, consider the formula for the empirical or experimental semivariogram ³ z ( h ). Denote the coal seam thickness process by { Z ( r ) , r ˆˆ D R 2 }. You have measurements ( Z ( r i ) ,i =1 ,..., 75). The standard formula for ³ z ( h ) (isotropic case) is

click to expand

where N ( h ) is given by

click to expand

and N ( h ) is the number of such pairs ( i, j ).

For actual data, it is unlikely that any pair ( i, j ) would exactly satisfy r i ˆ’ r j = h , so typically a range of pairwise distances, r i ˆ’ r j ˆˆ [ h ˆ’ h,h + h ), isused to group pairs ( r i , r j ) for a single term in the expression for ³ z ( h ). Using this range, N ( h ) is modified by

click to expand

PROC VARIOGRAM performs this grouping with two required options for variogram computation: the LAGDISTANCE= and MAXLAGS= options.

The meaning of the required LAGDISTANCE= option is as follows . Classify all pairs of points into intervals according to their pairwise distance. The width of the distance interval is the LAGDISTANCE= value. The meaning of the required MAXLAGS= option is simply the number of intervals.

The problem is that a surface plot of the original data, or the scatter plot of the measurement locations, is not very helpful in determining the distribution of these pairwise distances; it is not clear what values to give to the LAGDISTANCE= and MAXLAGS= options.

You use PROC VARIOGRAM with the OUTDISTANCE= option to produce a modified histogram of the pairwise distances in order to find reasonable values for the LAGDISTANCE= and MAXLAGS= options. In the following analysis, you use the NOVARIOGRAM option in the COMPUTE statement and the OUTDISTANCE= option in the PROC VARIOGRAM statement. You need the NOVARIOGRAM option to keep an error message from being issued due to the absence of the LAGDISTANCE= and MAXLAGS= options.

The DATA step after the PROC VARIOGRAM statement computes the midpoint of each distance interval. This midpoint is then used in the GCHART procedure. Since the number of distance intervals is not specified by using the NHCLASSES= option in the COMPUTE statement, the default of 10 is used.

  proc variogram data=thick outdistance=outd;   compute novariogram;   coordinates xc=east yc=north;   var thick;   run;   title 'OUTDISTANCE= Data Set Showing Distance Intervals';   proc print data=outd;   run;   data outd; set outd;   mdpt=round((lb+ub)/2,.1);   label mdpt = 'Midpoint of Interval';   run;   axis1 minor=none;   axis2 minor=none label=(angle=90 rotate=0);   title 'Distribution of Pairwise Distances';   proc gchart data=outd;   vbar mdpt / type=sum sumvar=count discrete frame   cframe=ligr gaxis=axis1 raxis=axis2 nolegend;   run;  
start figure
  OUTDISTANCE= Data Set Showing Distance Intervals   Obs VARNAME    LAG         LB       UB      COUNT     PER   1  thick       0      0.000      6.969      45    0.01622   2  thick       1      6.969     20.907     263    0.09477   3  thick       2     20.907     34.845     383    0.13802   4  thick       3     34.845     48.783     436    0.15712   5  thick       4     48.783     62.720     495    0.17838   6  thick       5     62.720     76.658     525    0.18919   7  thick       6     76.658     90.596     412    0.14847   8  thick       7     90.596    104.534     179    0.06450   9  thick       8    104.534    118.472      35    0.01261   10  thick       9    118.472    132.410       2    0.00072   11  thick      10    132.410    146.348       0    0.00000  
end figure

Figure 80.3: OUTDISTANCE= Data Set Showing Distance Intervals
click to expand
Figure 80.4: Distribution of Pairwise Distances

For plotting and estimations purposes, it is desirable to have as many points as possible for the plot of ³ z ( h ) against h . This corresponds to having as many distance intervals as possible, that is, having a small value for the LAGDISTANCE= option.

However, a rule of thumb used in computing sample semivariograms is to use at least 30 point pairs in computing a single value of the empirical or experimental semivariogram.

If the LAGDISTANCE= value is set too small, there may be too few points in one or more of the intervals. On the other hand, if the LAGDISTANCE= value is set to a large value, the number of point pairs in the distance intervals may be much greater than that needed for estimation precision, thereby 'wasting' point pairs at the expense of variogram points.

Hence, there is a tradeoff between the number of distance intervals and the number of point pairs within each interval.

As discussed in the section 'OUTDIST =SAS-data-set ' on page 4878 the first few distance intervals, corresponding to lag 0 and lag 1, are typically the limiting intervals. This is particularly true for lag 0 since it is half the width of the remaining intervals. For the default of NHCLASSES=10, the lag 0 class contains 45 points, which is reasonably close to 30, but the lag 1 class contains 263 points.

If you rerun PROC VARIOGRAM with NHCLASSES=20, these numbers become 8 and 83 for lags 0 and 1, respectively. Because of the asymmetrical nature of lag 0, you are willing to violate the rule of thumb for the 0th lag. You will, however, have sufficient numbers in lag 1 and above.

  proc variogram data=thick outdistance=outd;   compute nhc=20 novariogram;   coordinates xc=east yc=north;   var thick;   run;   title 'OUTDISTANCE= Data Set Showing Distance Intervals';   proc print data=outd;   run;   data outd; set outd;   mdpt=round((lb+ub)/2,.1);   label mdpt = 'Midpoint of Interval';   run;   axis1 minor=none;   axis2 minor=none label=(angle=90 rotate=0);   title 'Distribution of Pairwise Distances';   proc gchart data=outd;   vbar mdpt / type=sum sumvar=count discrete frame   cframe=ligr gaxis=axis1 raxis=axis2 nolegend;   run;  
click to expand
Figure 80.6: Distribution of Pairwise Distances

The length of the lag1 class (3 . 484 , 10 . 453) is 6.969. You round off and use LAGDISTANCE=7.0 in the COMPUTE statement.

The use of the MAXLAGS= option is more difficult. From Figure 80.5, note that up to a pairwise distance of 101, you have a sufficient number of pairs. With your choice of LAGDISTANCE=7.0, this yields a maximum number of lags .

start figure
  OUTDISTANCE= Data Set Showing Distance Intervals   Obs   VARNAME    LAG         LB       UB      COUNT      PER   1     thick       0      0.000      3.484       8     0.00288   2     thick       1      3.484     10.453      83     0.02991   3     thick       2     10.453     17.422     143     0.05153   4     thick       3     17.422     24.391     167     0.06018   5     thick       4     24.391     31.360     198     0.07135   6     thick       5     31.360     38.329     197     0.07099   7     thick       6     38.329     45.298     203     0.07315   8     thick       7     45.298     52.267     235     0.08468   9     thick       8     52.267     59.236     234     0.08432   10     thick       9     59.236     66.205     284     0.10234   11     thick      10     66.205     73.174     264     0.09514   12     thick      11     73.174     80.143     236     0.08505   13     thick      12     80.143     87.112     221     0.07964   14     thick      13     87.112     94.081     165     0.05946   15     thick      14     94.081    101.050      75     0.02703   16     thick      15    101.050    108.018      41     0.01477   17     thick      16    108.018    114.987      15     0.00541   18     thick      17    114.987    121.956       5     0.00180   19     thick      18    121.956    128.925       1     0.00036   20     thick      19    128.925    135.894       0     0.00000   21     thick      20    135.894    142.863       0     0.00000  
end figure

Figure 80.5: OUTDISTANCE= Data Set Showing Distance Intervals

The problem with using the maximum lag value is that it includes pairs of points so far apart that they are likely to be independent. Using pairs of points that are independent adds nothing to the empirical semivariogram plot; they are essentially added noise.

If there is an estimate of correlation length, perhaps from a prior geologic study of a similar site, you can specify the MAXLAGS= value so that the maximum pairwise distance does not exceed two or three correlation lengths. If there is no estimate of correlation length, you can use the following rule of thumb: use 1/2 to of the 'diameter' of the region containing the data. A MAXLAGS= value of 10 is within this range.

You now rerun PROC VARIOGRAM with these values.

Sample Variogram Computation and Plots

Using the values of LAGDISTANCE=7.0 and MAXLAGS=10 computed previously, rerun PROC VARIOGRAM without the NOVARIOGRAM option. Also, request a robust version of the semivariogram; then, plot both results against the pairwise distance of each class.

  proc variogram data=thick outv=outv;   compute lagd=7 maxlag=10 robust;   coordinates xc=east yc=north;   var thick;   run;   title 'OUTVAR= Data Set Showing Sample Variogram Results';   proc print data=outv label;   var lag count distance variog rvario;   run;   data outv2; set outv;   vari=variog; type = 'regular'; output;   vari=rvario; type = 'robust'; output;   run;   title 'Standard and Robust Semivariogram for Coal Seam   Thickness Data';   proc gplot data=outv2;   plot vari*distance=type / frame cframe=ligr vaxis=axis2   haxis=axis1;   symbol1 i=join l=1 c=blue   /* v=star   */;   symbol2 i=join l=1 c=yellow /* v=square */;   axis1 minor=none   label=(c=black 'Lag Distance') /* offset=(3,3) */;   axis2 order=(0 to 9 by 1) minor=none   label=(angle=90 rotate=0 c=black 'Variogram')   /* offset=(3,3) */;   run;  
start figure
  OUTVAR= Data Set Showing Sample Variogram Results   Lag Class   Value (in    Number of    Average Lag   LAGDIST=     Pairs in      Distance                    Robust   Obs     units)       Class       for Class     Variogram    Variogram   1   1           75           .            .            .   2         0            8          2.5045       0.02937      0.01694   3         1           85          7.3625       0.38047      0.19807   4         2          142         14.1547       1.15158      0.98029   5         3          169         21.0913       2.79719      3.01412   6         4          199         27.9691       4.68769      4.86998   7         5          199         35.1591       6.16018      6.15639   8         6          205         42.2547       7.58912      8.05072   9         7          232         48.7775       7.12506      7.07155   10         8          244         56.1824       7.04832      7.62851   11         9          285         62.9121       6.66298      8.02993   12        10          262         69.8925       6.18775      7.92206  
end figure

Figure 80.7: OUTVAR= Data Set Showing Sample Variogram Results

Figure 80.8 shows first a slow, then a rapid rise from the origin, suggesting a Gaussian type form:

click to expand
Figure 80.8: Standard and Robust Semivariogram for Coal Seam Thickness Data
click to expand

See the section 'Theoretical and Computational Details of the Semivariogram' on page 4872 for graphs of the standard semivariogram forms.

By experimentation, you find that a scale of c =7 . 5 and a range of a = 30 fits reasonably well for both the robust and standard semivariogram

The following statements plot the sample and theoretical variograms:

  data outv3; set outv;   c0=7.5; a0=30;   vari = c0*(1-exp(-distance*distance/(a0*a0)));   type = 'Gaussian'; output;   vari = variog; type = 'regular'; output;   vari = rvario; type = 'robust'; output;   run;   title 'Theoretical and Sample Semivariogram for Coal Seam   Thickness Data';   proc gplot data=outv3;   plot vari*distance=type / frame cframe=ligr vaxis=axis2   haxis=axis1;   symbol1 i=join l=1 c=blue    /* v=star    */;   symbol2 i=join l=1 c=yellow  /* v=square  */;   symbol3 i=join l=1 c=cyan    /* v=diamond */;   axis1 minor=none   label=(c=black 'Lag Distance') /* offset=(3,3) */;   axis2 order=(0 to 9 by 1) minor=none   label=(angle=90 rotate=0 c=black 'Variogram')   /* offset=(3,3) */;   run;  

Figure 80.9 shows that the choice of a semivariogram model is adequate. You can use this Gaussian form and these particular parameters in PROC KRIGE2D to produce a contour plot of the kriging estimates and the associated standard errors.

click to expand
Figure 80.9: Theoretical and Sample Semivariogram for Coal Seam Thickness Data



SAS.STAT 9.1 Users Guide (Vol. 7)
SAS/STAT 9.1 Users Guide, Volumes 1-7
ISBN: 1590472438
EAN: 2147483647
Year: 2004
Pages: 132

flylib.com © 2008-2017.
If you may any questions please contact us: flylib@qtcs.net