Getting Started


This section illustrates how PROC MODECLUS can be used to examine the clusters of data in the following artificial data set.

  data example;
     input x y @@;
     datalines;
  18 18 20 22 21 20 12 23 17 12 23 25 25 20 16 27
  20 13 28 22 80 20 75 19 77 23 81 26 55 21 64 24
  72 26 70 35 75 30 78 42 18 52 27 57 41 61 48 64
  59 72 69 72 80 80 31 53 51 69 72 81
  ;

It is good practice to plot the data before the analysis to check for obvious clusters or pathologies. The interactive graphics of the SAS/INSIGHT product are effective for visualizing clusters. In this example, with only two variables and a small sample size, the GPLOT procedure is adequate. The following statements produce Figure 47.1:

  axis1 label=(angle=90 rotate=0) minor=none
        order=(0 to 80 by 20);
  axis2 minor=none;
  proc gplot;
     plot y*x / frame cframe=ligr vaxis=axis1 haxis=axis2;
  run;

The plot suggests three clusters. Of these clusters, the one in the lower left corner is the most compact, while the lower right cluster is more dispersed.

The upper cluster is elongated and would be difficult for most clustering algorithms to identify as a single cluster. The plot also suggests that a Euclidean distance of 10 or 20 is a good initial guess for the neighborhood size in density estimation and clustering.

Figure 47.1: Scatter Plot of Data
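
One way to get a feel for a candidate neighborhood size before running the cluster analysis is to count how many observations lie within that Euclidean distance of each point. The following is a minimal sketch rather than part of the original example: it assumes the EXAMPLE data set created above, a trial radius of 10, and the illustrative names NBR_COUNT and N_WITHIN, which do not come from PROC MODECLUS.

```sas
/* Sketch: count the observations within distance 10 of each point.  */
/* NBR_COUNT and N_WITHIN are hypothetical names for illustration.   */
proc sql;
   create table nbr_count as
   select a.x, a.y,
          count(*) as n_within      /* includes the point itself     */
   from example as a, example as b
   where sqrt((a.x - b.x)**2 + (a.y - b.y)**2) <= 10
   group by a.x, a.y
   order by n_within desc;
quit;

proc print data=nbr_count;
run;
```

Grouping by X and Y identifies each point uniquely here only because the example data contain no duplicate coordinates. A radius that leaves most points with a count of 1 is probably too small to be useful, and a radius that gives every point nearly the full sample is probably too large.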

To obtain a cluster analysis, you must specify the METHOD= option; for most purposes, METHOD=1 is recommended. The cluster analysis can be performed with a list of radii (R=10 15 35), as illustrated in the following PROC MODECLUS step. An output data set containing the cluster membership is created with the OUT= option and then used by PROC GPLOT to display the membership. The following statements produce Figure 47.2 through Figure 47.5:

  proc modeclus data=example method=1 r=10 15 35 out=out;
  run;

For each cluster solution, PROC MODECLUS produces a table of cluster statistics including the cluster number, the number of observations in the cluster, the maximum estimated density within the cluster, the number of observations in the cluster having a neighbor that belongs to a different cluster, and the estimated saddle density of the cluster. The results are displayed in Figure 47.2, Figure 47.3, and Figure 47.4 for the three radii. The smallest radius (R=10) yields the largest number of clusters (6), as displayed in Figure 47.2; the largest radius (R=35) includes all observations in a single cluster, as displayed in Figure 47.4. Note that all clusters in these three figures are isolated, since their corresponding boundary frequencies are all 0. Therefore, all the estimated saddle densities are missing.

                       The MODECLUS Procedure
                           R=10  METHOD=1

                         Cluster Statistics

                               Maximum                   Estimated
                             Estimated      Boundary        Saddle
   Cluster     Frequency       Density     Frequency       Density
  ----------------------------------------------------------------
         1            10    0.00106103             0             .
         2             9    0.00084883             0             .
         3             7    0.00031831             0             .
         4             2    0.00021221             0             .
         5             1     0.0001061             0             .
         6             1     0.0001061             0             .

Figure 47.2: Results from PROC MODECLUS for METHOD=1 and R=10

                       The MODECLUS Procedure
                           R=15  METHOD=1

                         Cluster Statistics

                               Maximum                   Estimated
                             Estimated      Boundary        Saddle
   Cluster     Frequency       Density     Frequency       Density
  ----------------------------------------------------------------
         1            10    0.00047157             0             .
         2            10    0.00042441             0             .
         3            10    0.00023579             0             .

Figure 47.3: Results from PROC MODECLUS for METHOD=1 and R=15

                       The MODECLUS Procedure
                           R=35  METHOD=1

                         Cluster Statistics

                               Maximum                   Estimated
                             Estimated      Boundary        Saddle
   Cluster     Frequency       Density     Frequency       Density
  ----------------------------------------------------------------
         1            30    0.00012126             0             .

Figure 47.4: Results from PROC MODECLUS for METHOD=1 and R=35
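
The maximum estimated densities in these tables can be checked by hand. The worked equations below are a sketch that assumes the fixed-radius uniform-kernel estimate: the density at an observation is the number of observations within the closed sphere of radius R centered on it (counting the observation itself), divided by the sample size times the sphere's volume. The symbols $n_i$, $n$, and $v$ are notation introduced here, not taken from this section.

```latex
\begin{align*}
\hat{f}_i &= \frac{n_i}{n\,v}, \qquad v = \pi R^2 \quad \text{(two variables)} \\[4pt]
R = 10,\ n = 30,\ n_i = 10: \quad
\hat{f}_i &= \frac{10}{30 \times 100\pi} \approx 0.00106103 \\[4pt]
R = 10,\ n = 30,\ n_i = 1: \quad
\hat{f}_i &= \frac{1}{30 \times 100\pi} \approx 0.0001061 \\[4pt]
R = 35,\ n = 30,\ n_i = 14: \quad
\hat{f}_i &= \frac{14}{30 \times 1225\pi} \approx 0.00012126
\end{align*}
```

Under this reading, the densest point of cluster 1 at R=10 has 10 observations within distance 10, while the single observations in clusters 5 and 6 have only themselves in their neighborhoods.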

A table summarizing each cluster solution is then produced, as displayed in Figure 47.5.

           The MODECLUS Procedure

              Cluster Summary

                            Frequency of
               Number of    Unclassified
       R        Clusters         Objects
  ------------------------------------
      10               6               0
      15               3               0
      35               1               0

Figure 47.5: Summary Table

The OUT= data set contains a complete copy of the input data set for each cluster solution. Using a BY statement in the following PROC GPLOT step, you can examine the differences in cluster memberships for each radius. The following statements produce Figure 47.6 through Figure 47.8:

  symbol1 v=1 font=swiss c=white;  symbol2 v=2 font=swiss c=yellow;
  symbol3 v=3 font=swiss c=cyan;   symbol4 v=4 font=swiss c=green;
  symbol5 v=5 font=swiss c=orange; symbol6 v=6 font=swiss c=blue;
  symbol7 v=7 font=swiss c=black;
  proc gplot data=out;
     plot y*x=cluster / frame cframe=ligr nolegend vaxis=axis1
          haxis=axis2;
     by _r_;
  run;

Figure 47.6: Scatter Plots of Cluster Memberships with _R_=10
Figure 47.7: Scatter Plots of Cluster Memberships with _R_=15
Figure 47.8: Scatter Plots of Cluster Memberships with _R_=35
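
Cluster membership can also be summarized numerically from the OUT= data set. The following is a minimal sketch, not part of the original example. It assumes the OUT data set created by the PROC MODECLUS step above and uses only the _R_ and CLUSTER variables that the PROC GPLOT step already references.

```sas
/* Sketch: cross-tabulate radius by cluster number from the OUT=     */
/* data set. Each cell is the number of observations assigned to     */
/* that cluster for that radius.                                     */
proc freq data=out;
   tables _r_*cluster / norow nocol nopercent;
run;
```

The cell counts should agree with the Frequency columns reported in the cluster statistics tables for each radius.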



SAS/STAT 9.1 User's Guide (Vol. 4)
Year: 2004