
Example 47.1. Cluster Analysis of Samples from Univariate Distributions

This example uses pseudo-random samples from a uniform distribution, an exponential distribution, and a bimodal mixture of two normal distributions. Results are presented in Output 47.1.1 through Output 47.1.3 as plots displaying both the true density and the estimated density, as well as cluster membership.

The following statements produce Output 47.1.1:

  options noovp ps=28 ls=95;
  title 'Modeclus Example with Univariate Distributions';
  title2 'Uniform Distribution';

  data uniform;
     drop n;
     true=1;
     do n=1 to 100;
        x=ranuni(123);
        output;
     end;

  axis1 label=(angle=90 rotate=0) minor=none
        order=(0 to 3 by 0.5);
  axis2 minor=none;
  symbol9 v=none i=splines;

  proc modeclus data=uniform m=1 k=10 20 40 60 out=out short;
     var x;

  proc gplot data=out;
     plot density*x=cluster /frame cframe=ligr
                             vzero nolegend
                             vaxis=axis1 haxis=axis2;
     plot2 true*x=9;
     by _K_;
  run;

  proc modeclus data=uniform m=1 r=.05 .10 .20 .30
                out=out short;
     var x;

  axis1 label=(angle=90 rotate=0)
        minor=none order=(0 to 2 by 0.5);

  proc gplot data=out;
     plot density*x=cluster /frame cframe=ligr
                             vzero nolegend
                             vaxis=axis1 haxis=axis2;
     plot2 true*x=9;
     by _R_;
  run;
Output 47.1.1: Cluster Analysis of Sample from a Uniform Distribution
  Modeclus Example with Univariate Distributions
  Uniform Distribution

  The MODECLUS Procedure

  Cluster Summary

                       Frequency of
          Number of    Unclassified
      K    Clusters         Objects
  ---------------------------------
     10           6               0
     20           3               0
     40           2               0
     60           1               0

  Modeclus Example with Univariate Distributions
  Uniform Distribution

  The MODECLUS Procedure

  Cluster Summary

                        Frequency of
           Number of    Unclassified
       R    Clusters         Objects
  ----------------------------------
    0.05           4               0
     0.1           2               0
     0.2           2               0
     0.3           1               0
The following statements produce Output 47.1.2:

  title2 'Exponential Distribution';

  data expon;
     drop n;
     do n=1 to 100;
        x=ranexp(123);
        true=exp(-x);
        output;
     end;

  axis1 label=(angle=90 rotate=0) minor=none
        order=(0 to 2 by 0.5);
  axis2 minor=none;

  proc modeclus data=expon m=1 k=10 20 40 out=out short;
     var x;

  proc gplot;
     plot density*x=cluster /frame cframe=ligr
                             vzero nolegend
                             vaxis=axis1 haxis=axis2;
     plot2 true*x=9;
     by _K_;
  run;

  /*********************************************/
  proc modeclus data=expon m=1 r=.20 .40 .80 out=out short;
     var x;

  axis1 label=(angle=90 rotate=0)
        minor=none order=(0 to 1 by 0.5);

  proc gplot;
     plot density*x=cluster /frame cframe=ligr
                             vzero nolegend
                             vaxis=axis1 haxis=axis2;
     plot2 true*x=9;
     by _R_;
  run;

  /*********************************************/
  title3 'Different Density-Estimation and Clustering Windows';
  proc modeclus data=expon m=1 r=.20 ck=10 20 40
                out=out short;
     var x;

  proc gplot;
     plot density*x=cluster /frame cframe=ligr
                             vzero nolegend
                             vaxis=axis1 haxis=axis2;
     plot2 true*x=9;
     by _CK_;
  run;

  /*********************************************/
  title3 'Cascaded Density Estimates Using Arithmetic Means';
  proc modeclus data=expon m=1 r=.20 cascade=1 2 4 am out=out short;
     var x;

  proc gplot;
     plot density*x=cluster /frame cframe=ligr
                             vzero nolegend
                             vaxis=axis1 haxis=axis2;
     plot2 true*x=9;
     by _R_ _CASCAD_;
  run;
Output 47.1.2: Cluster Analysis of Sample from an Exponential Distribution
  Modeclus Example with Univariate Distributions
  Exponential Distribution

  The MODECLUS Procedure

  Cluster Summary

                       Frequency of
          Number of    Unclassified
      K    Clusters         Objects
  ---------------------------------
     10           5               0
     20           3               0
     40           1               0

  Modeclus Example with Univariate Distributions
  Exponential Distribution

  The MODECLUS Procedure

  Cluster Summary

                        Frequency of
           Number of    Unclassified
       R    Clusters         Objects
  ----------------------------------
     0.2           8               0
     0.4           6               0
     0.8           1               0

  Modeclus Example with
  Different Density-Estimation and Clustering Windows

  The MODECLUS Procedure

  Cluster Summary

                                Frequency of
                   Number of    Unclassified
       R      CK    Clusters         Objects
  ------------------------------------------
     0.2      10           3               0
     0.2      20           2               0
     0.2      40           1               0

  Modeclus Example with
  Cascaded Density Estimates Using Arithmetic Means

  The MODECLUS Procedure

  Cluster Summary

                                  Frequency of
                     Number of    Unclassified
       R   Cascade    Clusters         Objects
  --------------------------------------------
     0.2         1           8               0
     0.2         2           8               0
     0.2         4           7               0
The following statements produce Output 47.1.3:

  title2 'Normal Mixture Distribution';

  data normix;
     drop n sigma;
     sigma=.125;
     do n=1 to 100;
        x=rannor(456)*sigma+mod(n,2)/2;
        true=exp(-.5*(x/sigma)**2)+exp(-.5*((x-.5)/sigma)**2);
        true=.5*true/(sigma*sqrt(2*3.1415926536));
        output;
     end;

  axis1 label=(angle=90 rotate=0) minor=none order=(0 to 3 by 0.5);
  axis2 minor=none;

  proc modeclus data=normix m=1 k=10 20 40 60 out=out short;
     var x;

  proc gplot;
     plot density*x=cluster /frame cframe=ligr
                             vzero nolegend
                             vaxis=axis1 haxis=axis2;
     plot2 true*x=9;
     by _K_;
  run;

  /*********************************************/
  proc modeclus data=normix m=1 r=.05 .10 .20 .30 out=out short;
     var x;

  proc gplot;
     plot density*x=cluster /frame cframe=ligr
                             vzero nolegend
                             vaxis=axis1 haxis=axis2 ;
     plot2 true*x=9;
     by _R_;
  run;

  /*********************************************/
  title3 'Cascaded Density Estimates Using Arithmetic Means';
  proc modeclus data=normix m=1 r=.05 cascade=1 2 4 am out=out short;
     var x;

  axis1 label=(angle=90 rotate=0)
        minor=none order=(0 to 2 by 0.5);

  proc gplot;
     plot density*x=cluster /frame cframe=ligr
                             vzero nolegend
                             vaxis=axis1 haxis=axis2 ;
     plot2 true*x=9;
     by _R_ _CASCAD_;
  run;
Output 47.1.3: Cluster Analysis of Sample from a Bimodal Mixture of Two Normal Distributions
  Modeclus Example with
  Normal Mixture Distribution

  The MODECLUS Procedure

  Cluster Summary

                       Frequency of
          Number of    Unclassified
      K    Clusters         Objects
  ---------------------------------
     10           7               0
     20           2               0
     40           2               0
     60           1               0

  Modeclus Example with
  Normal Mixture Distribution

  The MODECLUS Procedure

  Cluster Summary

                        Frequency of
           Number of    Unclassified
       R    Clusters         Objects
  ----------------------------------
    0.05           5               0
     0.1           2               0
     0.2           2               0
     0.3           1               0

  Modeclus Example with
  Normal Mixture Distribution
  Cascaded Density Estimates Using Arithmetic Means

  The MODECLUS Procedure

  Cluster Summary

                                  Frequency of
                     Number of    Unclassified
       R   Cascade    Clusters         Objects
  --------------------------------------------
    0.05         1           5               0
    0.05         2           4               0
    0.05         4           4               0
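
As a cross-check on the TRUE variable computed in the NORMIX DATA step, the formula is an equal-weight mixture of two normal densities centered at 0 and 0.5 with common standard deviation 0.125. The following Python sketch is not part of the SAS example; the function name `true_density` and the integration grid are illustrative assumptions. It evaluates the same expression and checks numerically that it integrates to approximately 1 and dips between the two modes.

```python
import math

SIGMA = 0.125  # same value as sigma in the NORMIX DATA step


def true_density(x, sigma=SIGMA):
    """Equal-weight mixture of N(0, sigma^2) and N(0.5, sigma^2)."""
    kernel = (math.exp(-0.5 * (x / sigma) ** 2)
              + math.exp(-0.5 * ((x - 0.5) / sigma) ** 2))
    return 0.5 * kernel / (sigma * math.sqrt(2 * math.pi))


# Riemann sum on a grid from -1.0 to 1.5, wide enough to cover both components.
step = 0.001
grid = [-1.0 + i * step for i in range(2501)]
area = sum(true_density(x) for x in grid) * step

print("area under the curve:", round(area, 4))
for x in (0.0, 0.25, 0.5):
    print(x, round(true_density(x), 4))
```

The density at the midpoint 0.25 is well below the density at either mode, which is the bimodal shape that the smoother settings in Output 47.1.3 recover as two clusters.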
Example 47.2. Cluster Analysis of Flying Mileages between Ten American Cities

This example uses distance data and illustrates the use of the TRANSPOSE procedure and the DATA step to fill in the upper triangle of the distance matrix. The results are displayed in Output 47.2.1 through Output 47.2.3.
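
The fill-in step can be illustrated outside SAS. The following Python sketch is an assumption-laden illustration, not the SAS method: it copies each below-diagonal entry to its mirror position directly, whereas the SAS statements for this example transpose the data set and sum the missing and nonmissing halves. The helper names `fill_symmetric` and `nearest_neighbors` are hypothetical; the mileages are the lower triangle entered in the DATA step for this example.

```python
CITIES = ["ATLANTA", "CHICAGO", "DENVER", "HOUSTON", "LOS ANGELES",
          "MIAMI", "NEW YORK", "SAN FRANCISCO", "SEATTLE", "WASHINGTON D.C."]

# Lower triangle (with the zero diagonal) as entered in the MILEAGES DATA step.
LOWER = [
    [0],
    [587, 0],
    [1212, 920, 0],
    [701, 940, 879, 0],
    [1936, 1745, 831, 1374, 0],
    [604, 1188, 1726, 968, 2339, 0],
    [748, 713, 1631, 1420, 2451, 1092, 0],
    [2139, 1858, 949, 1645, 347, 2594, 2571, 0],
    [2182, 1737, 1021, 1891, 959, 2734, 2408, 678, 0],
    [543, 597, 1494, 1220, 2300, 923, 205, 2442, 2329, 0],
]


def fill_symmetric(lower):
    """Return the full square matrix; entry (i, j) above the diagonal copies (j, i)."""
    n = len(lower)
    return [[lower[i][j] if j <= i else lower[j][i] for j in range(n)]
            for i in range(n)]


def nearest_neighbors(full, i, count):
    """Names and distances of the `count` closest other cities to city i."""
    others = sorted((d, CITIES[j]) for j, d in enumerate(full[i]) if j != i)
    return [(name, d) for d, name in others[:count]]


full = fill_symmetric(LOWER)
for city in ("ATLANTA", "DENVER"):
    print(city, nearest_neighbors(full, CITIES.index(city), 2))
```

With the matrix filled in, the two closest cities to Atlanta and to Denver agree with the first entries of the Nearest Neighbor List in Output 47.2.1.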

Output 47.2.1: Clustering with K- Nearest -Neighbor Density Estimates
  Modeclus Analysis of 10 American Cities
  Based on Flying Mileages

  The MODECLUS Procedure

  Nearest Neighbor List

  CITY               Neighbor               Distance
  ---------------------------------------------------
  ATLANTA            WASHINGTON D.C.     543.0000000
                     CHICAGO             587.0000000
  ---------------------------------------------------
  CHICAGO            ATLANTA             587.0000000
                     WASHINGTON D.C.     597.0000000
  ---------------------------------------------------
  DENVER             LOS ANGELES         831.0000000
                     HOUSTON             879.0000000
  ---------------------------------------------------
  HOUSTON            ATLANTA             701.0000000
                     DENVER              879.0000000
  ---------------------------------------------------
  LOS ANGELES        SAN FRANCISCO       347.0000000
                     DENVER              831.0000000
  ---------------------------------------------------
  MIAMI              ATLANTA             604.0000000
                     WASHINGTON D.C.     923.0000000
  ---------------------------------------------------
  NEW YORK           WASHINGTON D.C.     205.0000000
                     CHICAGO             713.0000000
  ---------------------------------------------------
  SAN FRANCISCO      LOS ANGELES         347.0000000
                     SEATTLE             678.0000000
  ---------------------------------------------------
  SEATTLE            SAN FRANCISCO       678.0000000
                     LOS ANGELES         959.0000000
  ---------------------------------------------------
  WASHINGTON D.C.    NEW YORK            205.0000000
                     ATLANTA             543.0000000
Output 47.2.2: Clustering with Uniform Kernel Density Estimates
  Modeclus Analysis of 10 American Cities
  Based on Flying Mileages

  The MODECLUS Procedure

  Nearest Neighbor List

  CITY               Neighbor               Distance
  ---------------------------------------------------
  ATLANTA            WASHINGTON D.C.     543.0000000
                     CHICAGO             587.0000000
                     MIAMI               604.0000000
                     HOUSTON             701.0000000
                     NEW YORK            748.0000000
  ---------------------------------------------------
  CHICAGO            ATLANTA             587.0000000
                     WASHINGTON D.C.     597.0000000
                     NEW YORK            713.0000000
  ---------------------------------------------------
  HOUSTON            ATLANTA             701.0000000
  ---------------------------------------------------
  LOS ANGELES        SAN FRANCISCO       347.0000000
  ---------------------------------------------------
  MIAMI              ATLANTA             604.0000000
  ---------------------------------------------------
  NEW YORK           WASHINGTON D.C.     205.0000000
                     CHICAGO             713.0000000
                     ATLANTA             748.0000000
  ---------------------------------------------------
  SAN FRANCISCO      LOS ANGELES         347.0000000
                     SEATTLE             678.0000000
  ---------------------------------------------------
  SEATTLE            SAN FRANCISCO       678.0000000
  ---------------------------------------------------
  WASHINGTON D.C.    NEW YORK            205.0000000
                     ATLANTA             543.0000000
                     CHICAGO             597.0000000
The following statements produce Output 47.2.1:

  title 'Modeclus Analysis of 10 American Cities';
  title2 'Based on Flying Mileages';
  options ls=90;

  data mileages(type=distance);
     input (ATLANTA CHICAGO DENVER HOUSTON LOSANGELES
            MIAMI NEWYORK SANFRAN SEATTLE WASHDC) (5.)
            @53 CITY $15.;
     datalines;
      0                                               ATLANTA
    587    0                                          CHICAGO
   1212  920    0                                     DENVER
    701  940  879    0                                HOUSTON
   1936 1745  831 1374    0                           LOS ANGELES
    604 1188 1726  968 2339    0                      MIAMI
    748  713 1631 1420 2451 1092    0                 NEW YORK
   2139 1858  949 1645  347 2594 2571    0            SAN FRANCISCO
   2182 1737 1021 1891  959 2734 2408  678    0       SEATTLE
    543  597 1494 1220 2300  923  205 2442 2329    0  WASHINGTON D.C.
  ;

  *-----Fill in Upper Triangle of Distance Matrix---------------;
  proc transpose out=tran;
     copy CITY;

  data mileages(type=distance);
     merge mileages tran;
     array var ATLANTA--WASHDC;
     array col col1-col10;
     drop col1-col10 _name_;
     do over var;
        var=sum(var,col);
     end;

  *-----Clustering with K-Nearest-Neighbor Density Estimates-----;
  proc modeclus data=mileages all m=1 k=3;
     id CITY;
  run;
  Modeclus Analysis of 10 American Cities
  Based on Flying Mileages

  The MODECLUS Procedure

  K=3 METHOD=1

  Boundary Objects

                                              -Cluster Proportions-
  CITY                    Density    Cluster         1         2
  DENVER             0.0001706485          2     0.486     0.514
  HOUSTON            0.0001706485          1     0.600     0.400

  Cluster Statistics

                          Maximum               Estimated
                        Estimated    Boundary      Saddle
  Cluster  Frequency      Density   Frequency     Density
  --------------------------------------------------------
        1          6   0.00027624           1  0.00017065
        2          4   0.00022124           1  0.00017065

  Modeclus Analysis of 10 American Cities
  Based on Flying Mileages

  The MODECLUS Procedure

  Cluster Summary

                       Frequency of
          Number of    Unclassified
      K    Clusters         Objects
  ---------------------------------
      3           2               0
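
The densities reported for K=3 can be reproduced by hand. The following Python sketch assumes that, for distance data, the density is estimated in one dimension as k / (n * 2r), where r is the distance to the k-th nearest point counting the city itself; this assumption is inferred from the reported values rather than stated in this example. The function name `knn_density` is hypothetical, and the distances are read from the second neighbor listed for each city in Output 47.2.1.

```python
def knn_density(kth_distance, k=3, n=10):
    """k points inside an interval of length 2r, out of n observations."""
    return k / (n * 2.0 * kth_distance)


# Distance to the second listed neighbor, which is the third-nearest
# point when the city itself is counted (K=3).
second_neighbor = {
    "WASHINGTON D.C.": 543,   # ATLANTA
    "SAN FRANCISCO": 678,     # SEATTLE
    "DENVER": 879,            # HOUSTON
    "HOUSTON": 879,           # DENVER
}

for city, r in second_neighbor.items():
    print(f"{city:<16} {knn_density(r):.10f}")
```

Under that assumption, Washington D.C. and San Francisco give the maximum estimated densities of clusters 1 and 2, and Denver and Houston give the boundary-object density shown in the output.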

The following statements produce Output 47.2.2:

  *------Clustering with Uniform Kernel Density Estimates--------;
  proc modeclus data=mileages all m=1 r=600 800;
     id CITY;
  run;
  Modeclus Analysis of 10 American Cities
  Based on Flying Mileages

  The MODECLUS Procedure

  R=600 METHOD=1

  No Boundary Objects

  Cluster Statistics

                          Maximum               Estimated
                        Estimated    Boundary      Saddle
  Cluster  Frequency      Density   Frequency     Density
  --------------------------------------------------------
        1          4   0.00033333           0           .
        2          2   0.00016667           0           .
        3          1   0.00008333           0           .
        4          1   0.00008333           0           .
        5          1   0.00008333           0           .
        6          1   0.00008333           0           .

  Modeclus Analysis of 10 American Cities
  Based on Flying Mileages

  The MODECLUS Procedure

  R=800 METHOD=1

  No Boundary Objects

  Cluster Statistics

                          Maximum               Estimated
                        Estimated    Boundary      Saddle
  Cluster  Frequency      Density   Frequency     Density
  --------------------------------------------------------
        1          6     0.000375           0           .
        2          3    0.0001875           0           .
        3          1    0.0000625           0           .

  Modeclus Analysis of 10 American Cities
  Based on Flying Mileages

  The MODECLUS Procedure

  Cluster Summary

                        Frequency of
           Number of    Unclassified
       R    Clusters         Objects
  ----------------------------------
     600           6               0
     800           3               0

The following statements produce Output 47.2.3:

  *------Uniform Kernel Density Estimates, Clustering
         Neighborhoods extended to nearest neighbor--------------;
  proc modeclus data=mileages list m=1 ck=2 r=600 800;
     id CITY;
  run;
Output 47.2.3: Uniform Kernel Density Estimates, Clustering Neighborhoods Extended to Nearest Neighbor
  Modeclus Analysis of 10 American Cities
  Based on Flying Mileages

  The MODECLUS Procedure

  CK=2 R=600 METHOD=1

  Cluster Statistics

                          Maximum               Estimated
                        Estimated    Boundary      Saddle
  Cluster  Frequency      Density   Frequency     Density
  --------------------------------------------------------
        1          6   0.00033333           0           .
        2          4   0.00016667           0           .

  Modeclus Analysis of 10 American Cities
  Based on Flying Mileages

  The MODECLUS Procedure

  CK=2 R=800 METHOD=1

  Cluster Statistics

                          Maximum               Estimated
                        Estimated    Boundary      Saddle
  Cluster  Frequency      Density   Frequency     Density
  --------------------------------------------------------
        1          6     0.000375           0           .
        2          4    0.0001875           0           .

  Modeclus Analysis of 10 American Cities
  Based on Flying Mileages

  The MODECLUS Procedure

  Cluster Summary

                                Frequency of
                   Number of    Unclassified
       R      CK    Clusters         Objects
  ------------------------------------------
     600       2           2               0
     800       2           2               0

Example 47.3. Cluster Analysis with Significance Tests

This example uses artificial data containing two clusters. One cluster is from a circular bivariate normal distribution. The other is a ring-shaped cluster that completely surrounds the first cluster. Without significance tests, the ring is divided into several sample clusters for any degree of smoothing that yields reasonable density estimates. The JOIN= option puts the ring back together. Output 47.3.1 displays a short summary generated from the first PROC MODECLUS statement. Output 47.3.2 contains a series of tables produced from the second PROC MODECLUS statement. Because no p-value is specified in the JOIN= option of the second statement, joining continues until only one cluster remains (see the description of the JOIN= option on page 2866). The cluster memberships are then plotted as displayed in Output 47.3.3.

  title 'Modeclus Analysis with the JOIN= option';
  title2 'A Normal Cluster Surrounded by a Ring Cluster';
  options ls=120 ps=38;

  data circle; keep x y;
     c=1;
     do n=1 to 30;
        x=rannor(5);
        y=rannor(5);
        output;
     end;

     c=2;
     do n=1 to 300;
        x=rannor(5);
        y=rannor(5);
        z=rannor(5)+8;
        l=z/sqrt(x**2+y**2);
        x=x*l;
        y=y*l;
        output;
     end;

  axis1 label=(angle=90 rotate=0) minor=none
        order=(-10 to 10 by 5);
  axis2 minor=none order=(-15 to 15 by 5);

  proc modeclus data=circle m=1 r=1 to 3.5 by .25 join=20 short;

  proc modeclus data=circle m=1 r=2.5 join out=out;

  proc gplot data=out;
     plot y*x=cluster/frame cframe=ligr
                      vzero nolegend
                      vaxis=axis1 haxis=axis2 ;
     by _NJOIN_;
  run;
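
The construction of the ring in the CIRCLE data set may be easier to follow in a short Python sketch. This is an illustration under stated assumptions: it uses Python's `random.Random` generator, so it does not reproduce the RANNOR(5) values, and the function name `make_circle_data` is hypothetical. Each ring point takes a random direction from a bivariate normal draw and is rescaled to a random radius that is normally distributed around 8.

```python
import math
import random


def make_circle_data(seed=5, n_core=30, n_ring=300, ring_radius=8.0):
    """Return (core, ring) point lists built the same way as the CIRCLE data set."""
    rng = random.Random(seed)  # not the SAS stream; values differ from RANNOR(5)
    core = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(n_core)]
    ring = []
    for _ in range(n_ring):
        x, y = rng.gauss(0, 1), rng.gauss(0, 1)   # random direction
        z = rng.gauss(0, 1) + ring_radius          # random radius centered at 8
        scale = z / math.hypot(x, y)               # rescale (x, y) to length z
        ring.append((x * scale, y * scale))
    return core, ring


core, ring = make_circle_data()
core_radii = [math.hypot(x, y) for x, y in core]
ring_radii = [math.hypot(x, y) for x, y in ring]
print(len(core), "core points, mean radius", round(sum(core_radii) / len(core_radii), 2))
print(len(ring), "ring points, mean radius", round(sum(ring_radii) / len(ring_radii), 2))
```

The 30 core points stay near the origin while the 300 ring points sit at a distance of about 8, which is why a smoothing radius small enough to separate the two groups also tends to break the ring into several sample clusters.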
Output 47.3.1: Significance Tests with the JOIN=20 and SHORT Options
  Modeclus Analysis with the JOIN= option
  A Normal Cluster Surrounded by a Ring Cluster

  The MODECLUS Procedure

  Cluster Summary

           Number of                           Frequency of
            Clusters    Maximum   Number of    Unclassified
       R      Joined    P-value    Clusters         Objects
  ---------------------------------------------------------
       1          36     0.9339           1             301
    1.25          20     0.7131           1             301
     1.5          10     0.3296           1             300
    1.75           5     0.1990           2               0
       2           5     0.0683           2               0
    2.25           3     0.0504           2               0
     2.5           4     0.0301           2               0
    2.75           3     0.0585           2               0
       3           5     0.0003           1               0
    3.25           4     0.1923           2               0
     3.5           4     0.0000           1               0
Output 47.3.2: Significance Tests with the JOIN Option
  Modeclus Analysis with the JOIN= option
  A Normal Cluster Surrounded by a Ring Cluster

  The MODECLUS Procedure

  R=2.5 METHOD=1

  Cluster Statistics

                         Maximum              Estimated  ---------Saddle Test: Version 92.7---------
                       Estimated   Boundary      Saddle    Mode  Saddle  Overlap              Approx
  Cluster  Frequency     Density  Frequency     Density   Count   Count    Count        Z    P-value
  --------------------------------------------------------------------------------------------------
        1        103  0.00617328         22  0.00308664      39      19        0    2.495     0.5055
        2         71  0.00571029         20   0.0043213      36      27        9    1.193      0.999
        3         53  0.00509296         18  0.00401263      32      25       10    0.986     0.9999
        4         45  0.00478429         19  0.00354964      30      22       14    1.429     0.9924
        5         30  0.00462996          0           .      29       0        .    3.611     0.0301
        6         28  0.00370397         17  0.00354964      23      22        9    0.000          1

  Modeclus Analysis with the JOIN= option
  A Normal Cluster Surrounded by a Ring Cluster

  The MODECLUS Procedure

  R=2.5 METHOD=1

  Cluster Statistics

                         Maximum              Estimated  ---------Saddle Test: Version 92.7---------
                       Estimated   Boundary      Saddle    Mode  Saddle  Overlap              Approx
  Cluster  Frequency     Density  Frequency     Density   Count   Count    Count        Z    P-value
  --------------------------------------------------------------------------------------------------
        1        103  0.00617328         22  0.00308664      39      19        0    2.495     0.5055
        2         71  0.00571029         20   0.0043213      36      27        9    1.193      0.999
        3         53  0.00509296         18  0.00401263      32      25       10    0.986     0.9999
        4         73  0.00478429         13  0.00293231      30      18        0    1.588     0.9778
        5         30  0.00462996          0           .      29       0        .    3.611     0.0301

  Modeclus Analysis with the JOIN= option
  A Normal Cluster Surrounded by a Ring Cluster

  The MODECLUS Procedure

  R=2.5 METHOD=1

  Cluster Statistics

                         Maximum              Estimated  ---------Saddle Test: Version 92.7---------
                       Estimated   Boundary      Saddle    Mode  Saddle  Overlap              Approx
  Cluster  Frequency     Density  Frequency     Density   Count   Count    Count        Z    P-value
  --------------------------------------------------------------------------------------------------
        1        156  0.00617328         17  0.00246931      39      15        0    3.130     0.1318
        2         71  0.00571029         20   0.0043213      36      27        9    1.193      0.999
        3         73  0.00478429         13  0.00293231      30      18        0    1.588     0.9778
        4         30  0.00462996          0           .      29       0        .    3.611     0.0301

  Modeclus Analysis with the JOIN= option
  A Normal Cluster Surrounded by a Ring Cluster

  The MODECLUS Procedure

  R=2.5 METHOD=1

  Cluster Statistics

                         Maximum              Estimated  ---------Saddle Test: Version 92.7---------
                       Estimated   Boundary      Saddle    Mode  Saddle  Overlap              Approx
  Cluster  Frequency     Density  Frequency     Density   Count   Count    Count        Z    P-value
  --------------------------------------------------------------------------------------------------
        1        156  0.00617328         17  0.00246931      39      15        0    3.130     0.1318
        2        144  0.00571029         14  0.00293231      36      18        0    2.313     0.6447
        3         30  0.00462996          0           .      29       0        .    3.611     0.0301

  Modeclus Analysis with the JOIN= option
  A Normal Cluster Surrounded by a Ring Cluster

  The MODECLUS Procedure

  R=2.5 METHOD=1

  Cluster Statistics

                         Maximum              Estimated  ---------Saddle Test: Version 92.7---------
                       Estimated   Boundary      Saddle    Mode  Saddle  Overlap              Approx
  Cluster  Frequency     Density  Frequency     Density   Count   Count    Count        Z    P-value
  --------------------------------------------------------------------------------------------------
        1        300  0.00617328          0           .      39       0        .    4.246     0.0026
        2         30  0.00462996          0           .      29       0        .    3.611     0.0301

  Modeclus Analysis with the JOIN= option
  A Normal Cluster Surrounded by a Ring Cluster

  The MODECLUS Procedure

  R=2.5 METHOD=1

  Cluster Statistics

                         Maximum              Estimated  ---------Saddle Test: Version 92.7---------
                       Estimated   Boundary      Saddle    Mode  Saddle  Overlap              Approx
  Cluster  Frequency     Density  Frequency     Density   Count   Count    Count        Z    P-value
  --------------------------------------------------------------------------------------------------
        1        300  0.00617328          0           .      39       0        .    4.246     0.0026

  Modeclus Analysis with the JOIN= option
  A Normal Cluster Surrounded by a Ring Cluster

  The MODECLUS Procedure

  Cluster Summary

           Number of                           Frequency of
            Clusters    Maximum   Number of    Unclassified
       R      Joined    P-value    Clusters         Objects
  ---------------------------------------------------------
     2.5           0     1.0000           6               0
     2.5           1     0.9999           5               0
     2.5           2     0.9990           4               0
     2.5           3     0.6447           3               0
     2.5           4     0.0301           2               0
     2.5           5     0.0026           1              30
Output 47.3.3: Scatter Plots of Cluster Memberships by _NJOIN_
Example 47.4. Cluster Analysis: Hertzsprung-Russell Plot

This example uses computer-generated data to mimic a Hertzsprung-Russell plot (Struve and Zebergs 1962, p. 259) of the temperature and luminosity of stars. The data are plotted and displayed in Output 47.4.1; see Example 4 from PROC MODECLUS in the SAS/STAT Sample Program Library for the complete data set. It appears that there are two main groups of stars and a collection of isolated stars. The long straggling group of points appearing diagonally across the figure represents the main group of stars; the more compact group in the top right-hand corner contains giant stars. The JOIN= option is specified at a 0.05 significance level with various smoothing parameters. The CK=5 option is specified in order to prevent the numerous outliers from forming separate clusters. The results from PROC MODECLUS are displayed in Output 47.4.2. The cluster memberships are then plotted by PROC GPLOT, as displayed in Output 47.4.3.

Output 47.4.1: Scatter Plot of Data
Output 47.4.2: Results from PROC MODECLUS
  Hertzsprung-Russell Plot of Visible Stars
  Computer-Generated Fake Data

  The MODECLUS Procedure

  Cluster Summary

                   Number of                           Frequency of
                    Clusters    Maximum   Number of    Unclassified
       R      CK      Joined    P-value    Clusters         Objects
  -----------------------------------------------------------------
       1       5          14     0.0001           2               0
     1.5       5           6     0.0000           3               0
       2       5           4     0.0000           2               0
     2.5       5           2     0.0000           1               0
Output 47.4.3: Scatter Plots of Cluster Memberships by _R_
Notice in Output 47.4.3 that the graphic output from PROC GPLOT when _R_ = 2.5 is not available because only one cluster remains after joining at a 5% significance level, and the results are not written to the OUT= data set. See the description of the JOIN= option on page 2866 for more information.

  title 'Hertzsprung-Russell Plot of Visible Stars';
  title2 'Computer-Generated Fake Data';

  data hr;
     input x y @@;
     label x='-Temperature'
           y='-Luminosity';
     datalines;
   1.0 12.8   0.9 13.7   0.9 12.9   1.0 12.3   1.0 12.2
   2.6 10.9   2.4 10.9   2.5 11.2   2.3 11.5   2.6 12.0
   2.4 12.1   2.3 10.9   2.6 11.5   2.5 11.9   2.4 11.0
   3.4 11.1   3.3 11.2   3.4 11.1   3.4  9.9   3.2 10.4
   ... 150 lines omitted ...
  18.5 12.6  14.2 16.1  23.2  6.6  11.4 12.4  20.4 11.7
  20.9  8.1  18.9 13.7  16.9  9.7  15.5  9.9  18.3 14.2
  19.3 13.7  17.0 12.9  10.1 11.6  17.9 13.5  14.3  1.4
  13.1 -0.8   8.1 -0.9  20.0  7.0  21.0  8.5  15.6 13.2
  ;

  symbol1 value=circle c=white;
  symbol2 value=plus c=yellow;
  symbol3 value=triangle c=cyan;
  legend1 frame cframe=ligr cborder=black
          position=center value=(justify=center);
  axis1 label=(angle=90 rotate=0) minor=none;
  axis2 minor=none;

  proc gplot;
     plot y*x/legend=legend1 frame cframe=ligr vzero
              vaxis=axis1 haxis=axis2 ;

  proc modeclus data=hr m=1 r=1 1.5 2 2.5 ck=5
                join=.05 short out=out;
  run;

  title2 'MODECLUS Analysis';
  proc gplot;
     plot y*x=cluster/frame cframe=ligr
                      vzero legend=legend1
                      vaxis=axis1 haxis=axis2;
     by _R_;
  run;

Example 47.5. Using the TRACE Option when METHOD=6

To illustrate how the TRACE option can help you to understand the clustering process when METHOD=6 is specified, the following data set is created with 12 observations.

  data test;
     input x@@;
     datalines;
  1 2 3 4 5 7.5 9 11.5 13 14.5 15 16
  ;

The first five observations seem to be close to each other, and the last five observations seem to be close to each other. Observation 6 is separated from the first five observations with a (Euclidean) distance of 2.5, and the same distance separates observation 7 from the last five observations. Observations 6 and 7 differ by 1.5.

Suppose METHOD=6 with a radius=2.5 is chosen for the cluster analysis. You can specify the TRACE option to understand how each observation is assigned.
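
Before looking at the trace, it can help to see where the density values come from. The following Python sketch is not SAS code, and the function name `uniform_kernel_density` is hypothetical. It assumes the one-dimensional uniform-kernel estimate: the number of observations within the radius of a point (the point itself included), divided by the sample size times the interval length 2r. With the 12 values of the TEST data set and a radius of 2.5, this gives densities of 3/60, 4/60, or 5/60.

```python
X = [1, 2, 3, 4, 5, 7.5, 9, 11.5, 13, 14.5, 15, 16]  # the TEST data set
R = 2.5                                                # radius used with METHOD=6


def uniform_kernel_density(data, radius):
    """Count of points within `radius` (self included) over n times the window length 2r."""
    n = len(data)
    return [sum(abs(xi - xj) <= radius for xj in data) / (n * 2 * radius)
            for xi in data]


densities = uniform_kernel_density(X, R)
for obs, (x, d) in enumerate(zip(X, densities), start=1):
    print(f"{obs:>3} {x:>5} {d:.7f}")
```

Observation 3 has the largest neighborhood (five points) and therefore the highest density, while observations 6 and 7 each have only three points within the radius.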

The following statements produce Output 47.5.1 and Output 47.5.2:

  /*-- METHOD=6 with TRACE and THRESHOLD=0.5 (default) --*/
  proc modeclus method=6 r=2.5 trace short out=out;
     var x;
  run;

  data markobs;
     drop _r_ _method_ _obs_ density cluster;
     length function style $8 text $ 2;
     retain xsys '2' ysys '2' hsys '1' when 'a';
     set out;
     /* create the text for obs */
     function='label'; size=4;
     style='swiss';
     text=left(put(_obs_,2.));
     position='3';
     x=x; y=density;
     output;
  run;

  legend1 frame cframe=ligr cborder=black
          position=center value=(justify=center);
  axis1 label=(angle=90 rotate=0) minor=none;
  axis2 minor=none;

  title 'Plot of DENSITY*X=CLUSTER';
  proc gplot data=out;
     plot density*x=cluster/ annotate=markobs
                             frame cframe=ligr
                             legend=legend1
                             vaxis=axis1 haxis=axis2;
  run;
Output 47.5.1: Partial Output of METHOD=6 with TRACE and Default THRESHOLD=
  The MODECLUS Procedure

  R=2.5 METHOD=6

  Trace of Clustering Algorithm

                         Cluster
  Obs        Density    Old    New      Ratio
  -------------------------------------------
    3      0.0833333     -1      1        M
    2      0.0666667      0      1        N
    4      0.0666667      0      1        N
    5      0.0666667      0      1        N
    1      0.0500000      0      1        N
    6      0.0500000      0      1      0.571
    7      0.0500000     -1      1      0.500
    9      0.0666667     -1      2        M
    8      0.0500000      0      2        N
   10      0.0666667     -1      2        S
   12      0.0500000      0      2        N
   11      0.0666667     -1      2        S
Output 47.5.2: Density Plot
Notice that in Output 47.5.1, observation 7 is originally a seed (indicated by a value of -1 in the Old column) and then assigned to cluster 1. This is because the ratio of observation 7 to cluster 1 is 0.5 and is not less than the default value of THRESHOLD= (0.5).

If the value of the THRESHOLD= option is increased to 0.55, the ratio of 0.5 falls below the threshold, so observation 7 is excluded from cluster 1 and its cluster membership changes.
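
The two ratios in the trace can be reproduced with a few lines of Python. This sketch rests on an assumption about how the ratio is defined: the sum, over the neighbors of an observation that already belong to the cluster, of their estimated density raised to the power p - 1, divided by the same sum over all of its neighbors, with p = 2 taken as the default of the POWER= option. That definition is not stated in this example, but it does reproduce the values 0.571 and 0.500 shown in Output 47.5.1. Neighborhood counts are used in place of densities because the common factor cancels, and the helper names are hypothetical.

```python
from fractions import Fraction

X = [1, 2, 3, 4, 5, 7.5, 9, 11.5, 13, 14.5, 15, 16]  # the TEST data set
R = 2.5
POWER = 2  # assumed default of the POWER= option


def neighbors(i):
    """Indices (0-based) of the other observations within R of observation i."""
    return [j for j in range(len(X)) if j != i and abs(X[i] - X[j]) <= R]


def count_within(i):
    """Neighborhood count including the point itself; proportional to its density."""
    return len(neighbors(i)) + 1


def ratio(i, members):
    """Share of neighbor density (to the power POWER - 1) already in the cluster."""
    weights = {j: Fraction(count_within(j)) ** (POWER - 1) for j in neighbors(i)}
    in_cluster = sum(w for j, w in weights.items() if j in members)
    return in_cluster / sum(weights.values())


# Cluster 1 membership at the moment each observation is considered,
# following the order of the rows in the trace.
r6 = ratio(5, {0, 1, 2, 3, 4})       # observation 6
r7 = ratio(6, {0, 1, 2, 3, 4, 5})    # observation 7
print("observation 6:", round(float(r6), 3))
for threshold in (Fraction(1, 2), Fraction(11, 20)):
    decision = "joins cluster 1" if r7 >= threshold else "excluded from cluster 1"
    print("observation 7:", float(r7), "threshold", float(threshold), decision)
```

Exact fractions are used so that the comparison of 0.5 against a threshold of 0.5 does not depend on floating-point rounding.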

The following statements produce Output 47.5.3 and Output 47.5.4:

  /*-- METHOD=6 with TRACE and THRESHOLD=0.55 --*/
  proc modeclus method=6 r=2.5 trace threshold=0.55 short
                out=out;
     var x;
  run;
  . . .
  (the DATA step and the PROC GPLOT statement
   are omitted because they are the same as the
   previous job)
Output 47.5.3: Partial Output of METHOD=6 with TRACE and THRESHOLD=.55
  The MODECLUS Procedure

  R=2.5 METHOD=6

  Trace of Clustering Algorithm

                         Cluster
  Obs        Density    Old    New      Ratio
  -------------------------------------------
    3      0.0833333     -1      1        M
    2      0.0666667      0      1        N
    4      0.0666667      0      1        N
    5      0.0666667      0      1        N
    1      0.0500000      0      1        N
    6      0.0500000      0      1      0.571
    9      0.0666667     -1      2        M
    8      0.0500000      0      2        N
   10      0.0666667     -1      2        S
   12      0.0500000      0      2        N
   11      0.0666667     -1      2        S
    7      0.0500000     -1      2        S
Output 47.5.4: Density Plot
In Output 47.5.3, observation 7 is a seed that is excluded by cluster 1 because its ratio to cluster 1 is less than 0.55. Being a neighbor of a member (observation 8) of cluster 2, observation 7 eventually joins cluster 2 even though it remains a SEED. (See Step 2.2 in the section METHOD=6 on page 2875.)

