Examples


Example 58.1. Temperatures

This example analyzes mean daily temperatures in selected cities in January and July. Both the raw data and the principal components are plotted to illustrate how principal components are orthogonal rotations of the original variables.

The following statements create the Temperature data set:

   data Temperature;
      title 'Mean Temperature in January and July for Selected Cities ';
      input City $1-15 January July;
      cards;
   Mobile          51.2 81.6
   Phoenix         51.2 91.2
   Little Rock     39.5 81.4
   Sacramento      45.1 75.2
   Denver          29.9 73.0
   Hartford        24.8 72.7
   Wilmington      32.0 75.8
   Washington DC   35.6 78.7
   Jacksonville    54.6 81.0
   Miami           67.2 82.3
   Atlanta         42.4 78.0
   Boise           29.0 74.5
   Chicago         22.9 71.9
   Peoria          23.8 75.1
   Indianapolis    27.9 75.0
   Des Moines      19.4 75.1
   Wichita         31.3 80.7
   Louisville      33.3 76.9
   New Orleans     52.9 81.9
   Portland, ME    21.5 68.0
   Baltimore       33.4 76.6
   Boston          29.2 73.3
   Detroit         25.5 73.3
   Sault Ste Marie 14.2 63.8
   Duluth           8.5 65.6
   Minneapolis     12.2 71.9
   Jackson         47.1 81.7
   Kansas City     27.8 78.8
   St Louis        31.3 78.6
   Great Falls     20.5 69.3
   Omaha           22.6 77.2
   Reno            31.9 69.3
   Concord         20.6 69.7
   Atlantic City   32.7 75.1
   Albuquerque     35.2 78.7
   Albany          21.5 72.0
   Buffalo         23.7 70.1
   New York        32.2 76.6
   Charlotte       42.1 78.5
   Raleigh         40.5 77.5
   Bismarck         8.2 70.8
   Cincinnati      31.1 75.6
   Cleveland       26.9 71.4
   Columbus        28.4 73.6
   Oklahoma City   36.8 81.5
   Portland, OR    38.1 67.1
   Philadelphia    32.3 76.8
   Pittsburgh      28.1 71.9
   Providence      28.4 72.1
   Columbia        45.4 81.2
   Sioux Falls     14.2 73.3
   Memphis         40.5 79.6
   Nashville       38.3 79.6
   Dallas          44.8 84.8
   El Paso         43.6 82.3
   Houston         52.1 83.3
   Salt Lake City  28.0 76.7
   Burlington      16.8 69.8
   Norfolk         40.5 78.3
   Richmond        37.5 77.9
   Spokane         25.4 69.7
   Charleston, WV  34.5 75.0
   Milwaukee       19.4 69.9
   Cheyenne        26.6 69.1
   ;

The following statements plot the Temperature data set. For information on the %PLOTIT macro, see Appendix B, Using the %PLOTIT Macro.

   title2 'Plot of Raw Data';
   %plotit(data=Temperature, labelvar=City,
           plotvars=July January, color=black, colors=black);
   run;

The results are displayed in Output 58.1.1, which shows a scatter diagram of the 64 pairs of data points with July temperatures plotted against January temperatures.

Output 58.1.1: Plot of Raw Data
 

The following statements request a principal component analysis on the Temperature data set and output the scores to the Prin data set (OUT=Prin):

   proc princomp data=Temperature cov out=Prin;
      var July January;
   run;

Output 58.1.2 displays the PROC PRINCOMP output. The standard deviation of January (11.712) is higher than the standard deviation of July (5.128). The COV option in the PROC PRINCOMP statement requests the principal components to be computed from the covariance matrix. The total variance is 163.474. The first principal component explains about 94% of the total variance, and the second principal component explains only about 6%. The eigenvalues sum to the total variance.

Output 58.1.2: Results of Principal Component Analysis
   Mean Temperature in January and July for Selected Cities

   The PRINCOMP Procedure

   Observations          64
   Variables              2

   Simple Statistics

                        July           January
   Mean          75.60781250       32.09531250
   StD            5.12761910       11.71243309

   Covariance Matrix

                        July           January
   July           26.2924777        46.8282912
   January        46.8282912       137.1810888

   Total Variance    163.47356647

   Eigenvalues of the Covariance Matrix

        Eigenvalue    Difference    Proportion    Cumulative
   1    154.310607    145.147647        0.9439        0.9439
   2      9.162960                      0.0561        1.0000

   Eigenvectors

                     Prin1         Prin2
   July           0.343532      0.939141
   January        0.939141      -.343532
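As a quick arithmetic check on these figures, the following worked sketch uses only the numbers printed in Output 58.1.2, rounded to four decimal places. The symbols λ1 and λ2 are notation introduced here for the two eigenvalues; they do not appear in the procedure output.

```latex
\begin{align*}
\text{Total variance} &= 26.2925 + 137.1811 = 163.4736 \\
\lambda_1 + \lambda_2 &= 154.3106 + 9.1630 = 163.4736 \\
\lambda_1 / 163.4736  &= 154.3106 / 163.4736 \approx 0.9439 \\
\lambda_2 / 163.4736  &= 9.1630 / 163.4736 \approx 0.0561
\end{align*}
```

The total variance is the sum of the diagonal of the covariance matrix, and each proportion is an eigenvalue divided by that total.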
 

Note that January receives a higher loading on Prin1 because it has a higher standard deviation than July, and the PRINCOMP procedure calculates the scores using the centered variables rather than the standardized variables.
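To see what "centered rather than standardized" means in practice, the following sketch rebuilds the scores by hand from the means and eigenvector elements printed in Output 58.1.2. It assumes that the January loading on Prin2 is negative (-.343532), as the orthogonality of the two eigenvectors requires. The data set name CheckScores and the variables JulyC, JanuaryC, MyPrin1, MyPrin2, Diff1, and Diff2 are hypothetical names used only for this illustration.

```sas
/* Sketch: recompute the component scores from centered variables. */
/* Means and loadings are copied from Output 58.1.2 (assumed).     */
data CheckScores;
   set Prin;
   JulyC    = July    - 75.60781250;   /* center, do not standardize */
   JanuaryC = January - 32.09531250;
   MyPrin1  = 0.343532*JulyC + 0.939141*JanuaryC;
   MyPrin2  = 0.939141*JulyC - 0.343532*JanuaryC;
   Diff1    = MyPrin1 - Prin1;         /* hand-built minus PRINCOMP  */
   Diff2    = MyPrin2 - Prin2;
run;

/* The differences should be close to zero, up to rounding of the loadings. */
proc means data=CheckScores min max;
   var Diff1 Diff2;
run;
```

Because the printed loadings are rounded to six decimal places, small nonzero differences are to be expected.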

The following statement plots the Prin data set created from the previous PROC PRINCOMP statement:

   title2 'Plot of Principal Components';
   %plotit(data=Prin, labelvar=City,
           plotvars=Prin2 Prin1, color=black, colors=black);
   run;

Output 58.1.3 displays a plot of the second principal component Prin2 against the first principal component Prin1. It is clear from this plot that the principal components are orthogonal rotations of the original variables and that the first principal component has a larger variance than the second principal component. In fact, Prin1 has a larger variance than either of the original variables July and January.

Output 58.1.3: Plot of Principal Components
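The variance claims can also be checked numerically. The following sketch assumes the Prin data set created by the earlier PROC PRINCOMP step and uses the COV option of PROC CORR to print the variances and covariance of the two score variables:

```sas
/* Sketch: variances and covariance of the component scores. */
proc corr data=Prin cov;
   var Prin1 Prin2;
run;
```

If the scores were computed as described, the variances should match the eigenvalues in Output 58.1.2 (about 154.31 and 9.16) and the covariance between Prin1 and Prin2 should be essentially zero.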
 

Example 58.2. Crime Rates

The following data provide crime rates per 100,000 people in seven categories for each of the fifty states in 1977. Since there are seven numeric variables, it is impossible to plot all the variables simultaneously. Principal components can be used to summarize the data in two or three dimensions, and they help to visualize the data. The following statements produce Output 58.2.1:

   data Crime;
      title 'Crime Rates per 100,000 Population by State';
      input State $1-15 Murder Rape Robbery Assault
            Burglary Larceny Auto_Theft;
      cards;
   Alabama        14.2 25.2  96.8 278.3 1135.5 1881.9 280.7
   Alaska         10.8 51.6  96.8 284.0 1331.7 3369.8 753.3
   Arizona         9.5 34.2 138.2 312.3 2346.1 4467.4 439.5
   Arkansas        8.8 27.6  83.2 203.4  972.6 1862.1 183.4
   California     11.5 49.4 287.0 358.0 2139.4 3499.8 663.5
   Colorado        6.3 42.0 170.7 292.9 1935.2 3903.2 477.1
   Connecticut     4.2 16.8 129.5 131.8 1346.0 2620.7 593.2
   Delaware        6.0 24.9 157.0 194.2 1682.6 3678.4 467.0
   Florida        10.2 39.6 187.9 449.1 1859.9 3840.5 351.4
   Georgia        11.7 31.1 140.5 256.5 1351.1 2170.2 297.9
   Hawaii          7.2 25.5 128.0  64.1 1911.5 3920.4 489.4
   Idaho           5.5 19.4  39.6 172.5 1050.8 2599.6 237.6
   Illinois        9.9 21.8 211.3 209.0 1085.0 2828.5 528.6
   Indiana         7.4 26.5 123.2 153.5 1086.2 2498.7 377.4
   Iowa            2.3 10.6  41.2  89.8  812.5 2685.1 219.9
   Kansas          6.6 22.0 100.7 180.5 1270.4 2739.3 244.3
   Kentucky       10.1 19.1  81.1 123.3  872.2 1662.1 245.4
   Louisiana      15.5 30.9 142.9 335.5 1165.5 2469.9 337.7
   Maine           2.4 13.5  38.7 170.0 1253.1 2350.7 246.9
   Maryland        8.0 34.8 292.1 358.9 1400.0 3177.7 428.5
   Massachusetts   3.1 20.8 169.1 231.6 1532.2 2311.3 1140.1
   Michigan        9.3 38.9 261.9 274.6 1522.7 3159.0 545.5
   Minnesota       2.7 19.5  85.9  85.8 1134.7 2559.3 343.1
   Mississippi    14.3 19.6  65.7 189.1  915.6 1239.9 144.4
   Missouri        9.6 28.3 189.0 233.5 1318.3 2424.2 378.4
   Montana         5.4 16.7  39.2 156.8  804.9 2773.2 309.2
   Nebraska        3.9 18.1  64.7 112.7  760.0 2316.1 249.1
   Nevada         15.8 49.1 323.1 355.0 2453.1 4212.6 559.2
   New Hampshire   3.2 10.7  23.2  76.0 1041.7 2343.9 293.4
   New Jersey      5.6 21.0 180.4 185.1 1435.8 2774.5 511.5
   New Mexico      8.8 39.1 109.6 343.4 1418.7 3008.6 259.5
   New York       10.7 29.4 472.6 319.1 1728.0 2782.0 745.8
   North Carolina 10.6 17.0  61.3 318.3 1154.1 2037.8 192.1
   North Dakota    0.9  9.0  13.3  43.8  446.1 1843.0 144.7
   Ohio            7.8 27.3 190.5 181.1 1216.0 2696.8 400.4
   Oklahoma        8.6 29.2  73.8 205.0 1288.2 2228.1 326.8
   Oregon          4.9 39.9 124.1 286.9 1636.4 3506.1 388.9
   Pennsylvania    5.6 19.0 130.3 128.0  877.5 1624.1 333.2
   Rhode Island    3.6 10.5  86.5 201.0 1489.5 2844.1 791.4
   South Carolina 11.9 33.0 105.9 485.3 1613.6 2342.4 245.1
   South Dakota    2.0 13.5  17.9 155.7  570.5 1704.4 147.5
   Tennessee      10.1 29.7 145.8 203.9 1259.7 1776.5 314.0
   Texas          13.3 33.8 152.4 208.2 1603.1 2988.7 397.6
   Utah            3.5 20.3  68.8 147.3 1171.6 3004.6 334.5
   Vermont         1.4 15.9  30.8 101.2 1348.2 2201.0 265.2
   Virginia        9.0 23.3  92.1 165.7  986.2 2521.2 226.7
   Washington      4.3 39.6 106.2 224.8 1605.6 3386.9 360.3
   West Virginia   6.0 13.2  42.2  90.9  597.4 1341.7 163.3
   Wisconsin       2.8 12.9  52.2  63.7  846.9 2614.2 220.7
   Wyoming         5.4 21.9  39.7 173.9  811.6 2772.2 282.0
   ;

   proc princomp out=Crime_Components;
   run;

Output 58.2.1: Results of Principal Component Analysis: PROC PRINCOMP

   Crime Rates per 100,000 Population by State

   The PRINCOMP Procedure

   Observations          50
   Variables              7

   Simple Statistics

                 Murder           Rape        Robbery        Assault
   Mean     7.444000000    25.73400000    124.0920000    211.3000000
   StD      3.866768941    10.75962995     88.3485672    100.2530492

   Simple Statistics

               Burglary        Larceny     Auto_Theft
   Mean     1291.904000    2671.288000    377.5260000
   StD       432.455711     725.908707    193.3944175

   Correlation Matrix

                Murder    Rape  Robbery  Assault  Burglary  Larceny  Auto_Theft
   Murder       1.0000  0.6012   0.4837   0.6486    0.3858   0.1019      0.0688
   Rape         0.6012  1.0000   0.5919   0.7403    0.7121   0.6140      0.3489
   Robbery      0.4837  0.5919   1.0000   0.5571    0.6372   0.4467      0.5907
   Assault      0.6486  0.7403   0.5571   1.0000    0.6229   0.4044      0.2758
   Burglary     0.3858  0.7121   0.6372   0.6229    1.0000   0.7921      0.5580
   Larceny      0.1019  0.6140   0.4467   0.4044    0.7921   1.0000      0.4442
   Auto_Theft   0.0688  0.3489   0.5907   0.2758    0.5580   0.4442      1.0000

   Eigenvalues of the Correlation Matrix

        Eigenvalue    Difference    Proportion    Cumulative
   1    4.11495951    2.87623768        0.5879        0.5879
   2    1.23872183    0.51290521        0.1770        0.7648
   3    0.72581663    0.40938458        0.1037        0.8685
   4    0.31643205    0.05845759        0.0452        0.9137
   5    0.25797446    0.03593499        0.0369        0.9506
   6    0.22203947    0.09798342        0.0317        0.9823
   7    0.12405606                      0.0177        1.0000

   Eigenvectors

                   Prin1     Prin2     Prin3     Prin4     Prin5     Prin6     Prin7
   Murder       0.300279  -.629174  0.178245  -.232114  0.538123  0.259117  0.267593
   Rape         0.431759  -.169435  -.244198  0.062216  0.188471  -.773271  -.296485
   Robbery      0.396875  0.042247  0.495861  -.557989  -.519977  -.114385  -.003903
   Assault      0.396652  -.343528  -.069510  0.629804  -.506651  0.172363  0.191745
   Burglary     0.440157  0.203341  -.209895  -.057555  0.101033  0.535987  -.648117
   Larceny      0.357360  0.402319  -.539231  -.234890  0.030099  0.039406  0.601690
   Auto_Theft   0.295177  0.502421  0.568384  0.419238  0.369753  -.057298  0.147046
 

The eigenvalues indicate that two or three components provide a good summary of the data, two components accounting for 76% of the total variance and three components explaining 87%. Subsequent components contribute less than 5% each.
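The percentages quoted above follow directly from the eigenvalues. As a worked sketch, note that the analysis is based on the correlation matrix of seven variables, so the total variance is 7. The symbol λ is notation introduced here for an eigenvalue, and the values are rounded to four decimal places.

```latex
\begin{align*}
\lambda_1 / 7 &= 4.1150 / 7 \approx 0.5879 \\
(\lambda_1 + \lambda_2) / 7 &= (4.1150 + 1.2387) / 7 \approx 0.7648 \\
(\lambda_1 + \lambda_2 + \lambda_3) / 7 &= (4.1150 + 1.2387 + 0.7258) / 7 \approx 0.8685 \\
\lambda_4 / 7 &= 0.3164 / 7 \approx 0.0452
\end{align*}
```

The fourth component is the largest of the remaining ones, so every component after the third contributes less than 5% of the total variance.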

The first component is a measure of overall crime rate since the first eigenvector shows approximately equal loadings on all variables. The second eigenvector has high positive loadings on variables Auto_Theft and Larceny and high negative loadings on variables Murder and Assault. There is also a small positive loading on Burglary and a small negative loading on Rape. This component seems to measure the preponderance of property crime over violent crime. The interpretation of the third component is not obvious.

A simple way to examine the principal components in more detail is to display the output data set sorted by each of the large components. The following statements produce Output 58.2.2 through Output 58.2.3:

   proc sort data=Crime_Components;
      by Prin1;
   run;

   proc print;
      id State;
      var Prin1 Prin2 Murder Rape Robbery
          Assault Burglary Larceny Auto_Theft;
      title2 'States Listed in Order of Overall Crime Rate';
      title3 'As Determined by the First Principal Component';
   run;

   proc sort data=Crime_Components;
      by Prin2;
   run;

   proc print;
      id State;
      var Prin1 Prin2 Murder Rape Robbery
          Assault Burglary Larceny Auto_Theft;
      title2 'States Listed in Order of Property Vs. Violent Crime';
      title3 'As Determined by the Second Principal Component';
   run;

Output 58.2.2: OUT= Data Set Sorted by First Principal Component

   Crime Rates per 100,000 Population by State
   States Listed in Order of Overall Crime Rate
   As Determined by the First Principal Component

   State              Prin1     Prin2 Murder  Rape Robbery Assault Burglary Larceny Auto_Theft
   North Dakota    -3.96408   0.38767    0.9   9.0    13.3    43.8    446.1  1843.0      144.7
   South Dakota    -3.17203  -0.25446    2.0  13.5    17.9   155.7    570.5  1704.4      147.5
   West Virginia   -3.14772  -0.81425    6.0  13.2    42.2    90.9    597.4  1341.7      163.3
   Iowa            -2.58156   0.82475    2.3  10.6    41.2    89.8    812.5  2685.1      219.9
   Wisconsin       -2.50296   0.78083    2.8  12.9    52.2    63.7    846.9  2614.2      220.7
   New Hampshire   -2.46562   0.82503    3.2  10.7    23.2    76.0   1041.7  2343.9      293.4
   Nebraska        -2.15071   0.22574    3.9  18.1    64.7   112.7    760.0  2316.1      249.1
   Vermont         -2.06433   0.94497    1.4  15.9    30.8   101.2   1348.2  2201.0      265.2
   Maine           -1.82631   0.57878    2.4  13.5    38.7   170.0   1253.1  2350.7      246.9
   Kentucky        -1.72691  -1.14663   10.1  19.1    81.1   123.3    872.2  1662.1      245.4
   Pennsylvania    -1.72007  -0.19590    5.6  19.0   130.3   128.0    877.5  1624.1      333.2
   Montana         -1.66801   0.27099    5.4  16.7    39.2   156.8    804.9  2773.2      309.2
   Minnesota       -1.55434   1.05644    2.7  19.5    85.9    85.8   1134.7  2559.3      343.1
   Mississippi     -1.50736  -2.54671   14.3  19.6    65.7   189.1    915.6  1239.9      144.4
   Idaho           -1.43245  -0.00801    5.5  19.4    39.6   172.5   1050.8  2599.6      237.6
   Wyoming         -1.42463   0.06268    5.4  21.9    39.7   173.9    811.6  2772.2      282.0
   Arkansas        -1.05441  -1.34544    8.8  27.6    83.2   203.4    972.6  1862.1      183.4
   Utah            -1.04996   0.93656    3.5  20.3    68.8   147.3   1171.6  3004.6      334.5
   Virginia        -0.91621  -0.69265    9.0  23.3    92.1   165.7    986.2  2521.2      226.7
   North Carolina  -0.69925  -1.67027   10.6  17.0    61.3   318.3   1154.1  2037.8      192.1
   Kansas          -0.63407  -0.02804    6.6  22.0   100.7   180.5   1270.4  2739.3      244.3
   Connecticut     -0.54133   1.50123    4.2  16.8   129.5   131.8   1346.0  2620.7      593.2
   Indiana         -0.49990   0.00003    7.4  26.5   123.2   153.5   1086.2  2498.7      377.4
   Oklahoma        -0.32136  -0.62429    8.6  29.2    73.8   205.0   1288.2  2228.1      326.8
   Rhode Island    -0.20156   2.14658    3.6  10.5    86.5   201.0   1489.5  2844.1      791.4
   Tennessee       -0.13660  -1.13498   10.1  29.7   145.8   203.9   1259.7  1776.5      314.0
   Alabama         -0.04988  -2.09610   14.2  25.2    96.8   278.3   1135.5  1881.9      280.7
   New Jersey       0.21787   0.96421    5.6  21.0   180.4   185.1   1435.8  2774.5      511.5
   Ohio             0.23953   0.09053    7.8  27.3   190.5   181.1   1216.0  2696.8      400.4
   Georgia          0.49041  -1.38079   11.7  31.1   140.5   256.5   1351.1  2170.2      297.9
   Illinois         0.51290   0.09423    9.9  21.8   211.3   209.0   1085.0  2828.5      528.6
   Missouri         0.55637  -0.55851    9.6  28.3   189.0   233.5   1318.3  2424.2      378.4
   Hawaii           0.82313   1.82392    7.2  25.5   128.0    64.1   1911.5  3920.4      489.4
   Washington       0.93058   0.73776    4.3  39.6   106.2   224.8   1605.6  3386.9      360.3
   Delaware         0.96458   1.29674    6.0  24.9   157.0   194.2   1682.6  3678.4      467.0
   Massachusetts    0.97844   2.63105    3.1  20.8   169.1   231.6   1532.2  2311.3     1140.1
   Louisiana        1.12020  -2.08327   15.5  30.9   142.9   335.5   1165.5  2469.9      337.7
   New Mexico       1.21417  -0.95076    8.8  39.1   109.6   343.4   1418.7  3008.6      259.5
   Texas            1.39696  -0.68131   13.3  33.8   152.4   208.2   1603.1  2988.7      397.6
   Oregon           1.44900   0.58603    4.9  39.9   124.1   286.9   1636.4  3506.1      388.9
   South Carolina   1.60336  -2.16211   11.9  33.0   105.9   485.3   1613.6  2342.4      245.1
   Maryland         2.18280  -0.19474    8.0  34.8   292.1   358.9   1400.0  3177.7      428.5
   Michigan         2.27333   0.15487    9.3  38.9   261.9   274.6   1522.7  3159.0      545.5
   Alaska           2.42151   0.16652   10.8  51.6    96.8   284.0   1331.7  3369.8      753.3
   Colorado         2.50929   0.91660    6.3  42.0   170.7   292.9   1935.2  3903.2      477.1
   Arizona          3.01414   0.84495    9.5  34.2   138.2   312.3   2346.1  4467.4      439.5
   Florida          3.11175  -0.60392   10.2  39.6   187.9   449.1   1859.9  3840.5      351.4
   New York         3.45248   0.43289   10.7  29.4   472.6   319.1   1728.0  2782.0      745.8
   California       4.28380   0.14319   11.5  49.4   287.0   358.0   2139.4  3499.8      663.5
   Nevada           5.26699  -0.25262   15.8  49.1   323.1   355.0   2453.1  4212.6      559.2
 
Output 58.2.3: OUT= Data Set Sorted by Second Principal Component

   Crime Rates per 100,000 Population by State
   States Listed in Order of Property Vs. Violent Crime
   As Determined by the Second Principal Component

   State              Prin1     Prin2 Murder  Rape Robbery Assault Burglary Larceny Auto_Theft
   Mississippi     -1.50736  -2.54671   14.3  19.6    65.7   189.1    915.6  1239.9      144.4
   South Carolina   1.60336  -2.16211   11.9  33.0   105.9   485.3   1613.6  2342.4      245.1
   Alabama         -0.04988  -2.09610   14.2  25.2    96.8   278.3   1135.5  1881.9      280.7
   Louisiana        1.12020  -2.08327   15.5  30.9   142.9   335.5   1165.5  2469.9      337.7
   North Carolina  -0.69925  -1.67027   10.6  17.0    61.3   318.3   1154.1  2037.8      192.1
   Georgia          0.49041  -1.38079   11.7  31.1   140.5   256.5   1351.1  2170.2      297.9
   Arkansas        -1.05441  -1.34544    8.8  27.6    83.2   203.4    972.6  1862.1      183.4
   Kentucky        -1.72691  -1.14663   10.1  19.1    81.1   123.3    872.2  1662.1      245.4
   Tennessee       -0.13660  -1.13498   10.1  29.7   145.8   203.9   1259.7  1776.5      314.0
   New Mexico       1.21417  -0.95076    8.8  39.1   109.6   343.4   1418.7  3008.6      259.5
   West Virginia   -3.14772  -0.81425    6.0  13.2    42.2    90.9    597.4  1341.7      163.3
   Virginia        -0.91621  -0.69265    9.0  23.3    92.1   165.7    986.2  2521.2      226.7
   Texas            1.39696  -0.68131   13.3  33.8   152.4   208.2   1603.1  2988.7      397.6
   Oklahoma        -0.32136  -0.62429    8.6  29.2    73.8   205.0   1288.2  2228.1      326.8
   Florida          3.11175  -0.60392   10.2  39.6   187.9   449.1   1859.9  3840.5      351.4
   Missouri         0.55637  -0.55851    9.6  28.3   189.0   233.5   1318.3  2424.2      378.4
   South Dakota    -3.17203  -0.25446    2.0  13.5    17.9   155.7    570.5  1704.4      147.5
   Nevada           5.26699  -0.25262   15.8  49.1   323.1   355.0   2453.1  4212.6      559.2
   Pennsylvania    -1.72007  -0.19590    5.6  19.0   130.3   128.0    877.5  1624.1      333.2
   Maryland         2.18280  -0.19474    8.0  34.8   292.1   358.9   1400.0  3177.7      428.5
   Kansas          -0.63407  -0.02804    6.6  22.0   100.7   180.5   1270.4  2739.3      244.3
   Idaho           -1.43245  -0.00801    5.5  19.4    39.6   172.5   1050.8  2599.6      237.6
   Indiana         -0.49990   0.00003    7.4  26.5   123.2   153.5   1086.2  2498.7      377.4
   Wyoming         -1.42463   0.06268    5.4  21.9    39.7   173.9    811.6  2772.2      282.0
   Ohio             0.23953   0.09053    7.8  27.3   190.5   181.1   1216.0  2696.8      400.4
   Illinois         0.51290   0.09423    9.9  21.8   211.3   209.0   1085.0  2828.5      528.6
   California       4.28380   0.14319   11.5  49.4   287.0   358.0   2139.4  3499.8      663.5
   Michigan         2.27333   0.15487    9.3  38.9   261.9   274.6   1522.7  3159.0      545.5
   Alaska           2.42151   0.16652   10.8  51.6    96.8   284.0   1331.7  3369.8      753.3
   Nebraska        -2.15071   0.22574    3.9  18.1    64.7   112.7    760.0  2316.1      249.1
   Montana         -1.66801   0.27099    5.4  16.7    39.2   156.8    804.9  2773.2      309.2
   North Dakota    -3.96408   0.38767    0.9   9.0    13.3    43.8    446.1  1843.0      144.7
   New York         3.45248   0.43289   10.7  29.4   472.6   319.1   1728.0  2782.0      745.8
   Maine           -1.82631   0.57878    2.4  13.5    38.7   170.0   1253.1  2350.7      246.9
   Oregon           1.44900   0.58603    4.9  39.9   124.1   286.9   1636.4  3506.1      388.9
   Washington       0.93058   0.73776    4.3  39.6   106.2   224.8   1605.6  3386.9      360.3
   Wisconsin       -2.50296   0.78083    2.8  12.9    52.2    63.7    846.9  2614.2      220.7
   Iowa            -2.58156   0.82475    2.3  10.6    41.2    89.8    812.5  2685.1      219.9
   New Hampshire   -2.46562   0.82503    3.2  10.7    23.2    76.0   1041.7  2343.9      293.4
   Arizona          3.01414   0.84495    9.5  34.2   138.2   312.3   2346.1  4467.4      439.5
   Colorado         2.50929   0.91660    6.3  42.0   170.7   292.9   1935.2  3903.2      477.1
   Utah            -1.04996   0.93656    3.5  20.3    68.8   147.3   1171.6  3004.6      334.5
   Vermont         -2.06433   0.94497    1.4  15.9    30.8   101.2   1348.2  2201.0      265.2
   New Jersey       0.21787   0.96421    5.6  21.0   180.4   185.1   1435.8  2774.5      511.5
   Minnesota       -1.55434   1.05644    2.7  19.5    85.9    85.8   1134.7  2559.3      343.1
   Delaware         0.96458   1.29674    6.0  24.9   157.0   194.2   1682.6  3678.4      467.0
   Connecticut     -0.54133   1.50123    4.2  16.8   129.5   131.8   1346.0  2620.7      593.2
   Hawaii           0.82313   1.82392    7.2  25.5   128.0    64.1   1911.5  3920.4      489.4
   Rhode Island    -0.20156   2.14658    3.6  10.5    86.5   201.0   1489.5  2844.1      791.4
   Massachusetts    0.97844   2.63105    3.1  20.8   169.1   231.6   1532.2  2311.3     1140.1
 

Another recommended procedure is to make scatter plots of the first few components. The sorted listings help to identify observations on the plots. The following statements produce Output 58.2.4 through Output 58.2.5:

   title2 'Plot of the First Two Principal Components';
   %plotit(data=Crime_Components, labelvar=State,
           plotvars=Prin2 Prin1, color=black, colors=black);
   run;

   title2 'Plot of the First and Third Principal Components';
   %plotit(data=Crime_Components, labelvar=State,
           plotvars=Prin3 Prin1, color=black, colors=black);
   run;

Output 58.2.4: Plot of the First Two Principal Components
 
Output 58.2.5: Plot of the First and Third Principal Components
 

It is possible to identify regional trends on the plot of the first two components. Nevada and California are at the extreme right, with high overall crime rates but an average ratio of property crime to violent crime. North and South Dakota are on the extreme left with low overall crime rates. Southeastern states tend to be in the bottom of the plot, with a higher-than-average ratio of violent crime to property crime. New England states tend to be in the upper part of the plot, with a greater-than-average ratio of property crime to violent crime.

The most striking feature of the plot of the first and third principal components is that Massachusetts and New York are outliers on the third component.
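One way to confirm which states lie at the ends of the third component without reading them off the plot is to list the extreme observations. The following is a minimal sketch, assuming the Crime_Components data set created earlier; the ID statement labels the extreme observations with the state name.

```sas
/* Sketch: list the lowest and highest values of Prin3 by state. */
proc univariate data=Crime_Components;
   var Prin3;
   id State;
run;
```

The Extreme Observations table in the PROC UNIVARIATE output shows the five lowest and five highest values of Prin3, so outlying states at either end of the component should appear there.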

Example 58.3. Basketball Data

The data in this example are rankings of 35 college basketball teams. The rankings were made before the start of the 1985-86 season by 10 news services.

The purpose of the principal component analysis is to compute a single variable that best summarizes all 10 of the preseason rankings.

Note that the various news services rank different numbers of teams, varying from 20 through 30 (there is a missing rank in one of the variables, WashPost). The services also do not all rank the same teams, so there are missing values in these data. Each of the 35 teams is ranked by at least one news service.

The PRINCOMP procedure omits observations with missing values. To obtain principal component scores for all of the teams, it is necessary to replace the missing values. Since it is the best teams that are ranked, it is not appropriate to replace missing values with the mean of the nonmissing values. Instead, an ad hoc method is used that replaces missing values by the mean of the unassigned ranks. For example, if 20 teams are ranked by a news service, then ranks 21 through 35 are unassigned. The mean of ranks 21 through 35 is 28, so missing values for that variable are replaced by the value 28. To prevent the method of missing-value replacement from having an undue effect on the analysis, each observation is weighted according to the number of nonmissing values it has. See Example 59.3 in Chapter 59, The PRINQUAL Procedure, for an alternative analysis of these data.
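The replacement value can be written as a general formula. In the following worked sketch, m is notation introduced here for the highest rank that a news service assigns; the unassigned ranks m+1 through 35 form an arithmetic sequence, so their mean is the average of the first and last terms.

```latex
\begin{align*}
\text{mean}(m+1, \ldots, 35) &= \frac{(m+1) + 35}{2} = \frac{m + 36}{2} \\
m = 20 &: \quad (20 + 36)/2 = 28 \\
m = 25 &: \quad (25 + 36)/2 = 30.5 \\
m = 30 &: \quad (30 + 36)/2 = 33
\end{align*}
```

This is the same quantity as the expression (MaxRanks{i}+36)/2 used in the DATA step that builds the Basketball data set.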

Since the first principal component accounts for 78% of the variance, there is substantial agreement among the rankings. The eigenvector shows that all the news services are about equally weighted, so a simple average would work almost as well as the first principal component. The following statements produce Output 58.3.1 through Output 58.3.3:

   /*-----------------------------------------------------------*/
   /*                                                           */
   /* Preseason 1985 College Basketball Rankings                */
   /* (rankings of 35 teams by 10 news services)                */
   /*                                                           */
   /* Note: (a) news services rank varying numbers of teams;    */
   /*       (b) not all teams are ranked by all news services;  */
   /*       (c) each team is ranked by at least one service;    */
   /*       (d) rank 20 is missing for UPI.                     */
   /*                                                           */
   /*-----------------------------------------------------------*/

   data HoopsRanks;
      input School $13. CSN DurSun DurHer WashPost USAToday
            Sport InSports UPI AP SI;
      label CSN      = 'Community Sports News (Chapel Hill, NC)'
            DurSun   = 'Durham Sun'
            DurHer   = 'Durham Morning Herald'
            WashPost = 'Washington Post'
            USAToday = 'USA Today'
            Sport    = 'Sport Magazine'
            InSports = 'Inside Sports'
            UPI      = 'United Press International'
            AP       = 'Associated Press'
            SI       = 'Sports Illustrated'
            ;
      format CSN--SI 5.1;
      cards;
   Louisville     1  8  1  9  8  9  6 10  9  9
   Georgia Tech   2  2  4  3  1  1  1  2  1  1
   Kansas         3  4  5  1  5 11  8  4  5  7
   Michigan       4  5  9  4  2  5  3  1  3  2
   Duke           5  6  7  5  4 10  4  5  6  5
   UNC            6  1  2  2  3  4  2  3  2  3
   Syracuse       7 10  6 11  6  6  5  6  4 10
   Notre Dame     8 14 15 13 11 20 18 13 12  .
   Kentucky       9 15 16 14 14 19 11 12 11 13
   LSU           10  9 13  . 13 15 16  9 14  8
   DePaul        11  . 21 15 20  . 19  .  . 19
   Georgetown    12  7  8  6  9  2  9  8  8  4
   Navy          13 20 23 10 18 13 15  . 20  .
   Illinois      14  3  3  7  7  3 10  7  7  6
   Iowa          15 16  .  . 23  .  . 14  . 20
   Arkansas      16  .  .  . 25  .  .  .  . 16
   Memphis State 17  . 11  . 16  8 20  . 15 12
   Washington    18  .  .  .  .  .  . 17  .  .
   UAB           19 13 10  . 12 17  . 16 16 15
   UNLV          20 18 18 19 22  . 14 18 18  .
   NC State      21 17 14 16 15  . 12 15 17 18
   Maryland      22  .  .  . 19  .  .  . 19 14
   Pittsburgh    23  .  .  .  .  .  .  .  .  .
   Oklahoma      24 19 17 17 17 12 17  . 13 17
   Indiana       25 12 20 18 21  .  .  .  .  .
   Virginia      26  . 22  .  . 18  .  .  .  .
   Old Dominion  27  .  .  .  .  .  .  .  .  .
   Auburn        28 11 12  8 10  7  7 11 10 11
   St. Johns     29  .  .  .  . 14  .  .  .  .
   UCLA          30  .  .  .  .  .  . 19  .  .
   St. Josephs    .  . 19  .  .  .  .  .  .  .
   Tennessee      .  . 24  .  . 16  .  .  .  .
   Montana        .  .  . 20  .  .  .  .  .  .
   Houston        .  .  .  . 24  .  .  .  .  .
   Virginia Tech  .  .  .  .  .  . 13  .  .  .
   ;

   /* PROC MEANS is used to output a data set containing the  */
   /* maximum value of each of the newspaper and magazine     */
   /* rankings. The output data set, maxrank, is then used    */
   /* to set the missing values to the next highest rank plus */
   /* thirty-six, divided by two (that is, the mean of the    */
   /* missing ranks). This ad hoc method of replacing missing */
   /* values is based more on intuition than on rigorous      */
   /* statistical theory. Observations are weighted by the    */
   /* number of nonmissing values.                            */
   /*                                                         */

   title 'Pre-Season 1985 College Basketball Rankings';
   proc means data=HoopsRanks;
      output out=MaxRank
             max=CSNMax DurSunMax DurHerMax
                 WashPostMax USATodayMax SportMax
                 InSportsMax UPIMax APMax SIMax;
   run;

Output 58.3.1: Summary Statistics for Basketball Rankings Using PROC MEANS

   Pre-Season 1985 College Basketball Rankings

   The MEANS Procedure

   Variable   Label                                      N            Mean
   -----------------------------------------------------------------------
   CSN        Community Sports News (Chapel Hill, NC)   30      15.5000000
   DurSun     Durham Sun                                20      10.5000000
   DurHer     Durham Morning Herald                     24      12.5000000
   WashPost   Washington Post                           19      10.4210526
   USAToday   USA Today                                 25      13.0000000
   Sport      Sport Magazine                            20      10.5000000
   InSports   Inside Sports                             20      10.5000000
   UPI        United Press International                19      10.0000000
   AP         Associated Press                          20      10.5000000
   SI         Sports Illustrated                        20      10.5000000
   -----------------------------------------------------------------------

   Variable   Label                                          Std Dev        Minimum
   --------------------------------------------------------------------------------
   CSN        Community Sports News (Chapel Hill, NC)      8.8034084      1.0000000
   DurSun     Durham Sun                                   5.9160798      1.0000000
   DurHer     Durham Morning Herald                        7.0710678      1.0000000
   WashPost   Washington Post                              6.0673607      1.0000000
   USAToday   USA Today                                    7.3598007      1.0000000
   Sport      Sport Magazine                               5.9160798      1.0000000
   InSports   Inside Sports                                5.9160798      1.0000000
   UPI        United Press International                   5.6273143      1.0000000
   AP         Associated Press                             5.9160798      1.0000000
   SI         Sports Illustrated                           5.9160798      1.0000000
   --------------------------------------------------------------------------------

   Variable   Label                                          Maximum
   -----------------------------------------------------------------
   CSN        Community Sports News (Chapel Hill, NC)     30.0000000
   DurSun     Durham Sun                                  20.0000000
   DurHer     Durham Morning Herald                       24.0000000
   WashPost   Washington Post                             20.0000000
   USAToday   USA Today                                   25.0000000
   Sport      Sport Magazine                              20.0000000
   InSports   Inside Sports                               20.0000000
   UPI        United Press International                  19.0000000
   AP         Associated Press                            20.0000000
   SI         Sports Illustrated                          20.0000000
   -----------------------------------------------------------------
 
   data Basketball;
      set HoopsRanks;
      if _n_=1 then set MaxRank;
      array Services{10} CSN--SI;
      array MaxRanks{10} CSNMax--SIMax;
      keep School CSN--SI Weight;
      Weight=0;
      do i=1 to 10;
         if Services{i}=. then Services{i}=(MaxRanks{i}+36)/2;
         else Weight=Weight+1;
      end;
   run;

   proc princomp data=Basketball n=1 out=PCBasketball standard;
      var CSN--SI;
      weight Weight;
   run;

Output 58.3.2: Principal Components Analysis of Basketball Rankings Using PROC PRINCOMP
start example
  The PRINCOMP Procedure

  Observations          35
  Variables             10

  Simple Statistics

                 CSN        DurSun        DurHer      WashPost      USAToday
  Mean   13.33640553   13.06451613   12.88018433   13.83410138   12.55760369
  StD    22.08036285   21.66394183   21.38091837   23.47841791   20.48207965

  Simple Statistics

               Sport      InSports           UPI            AP            SI
  Mean   13.83870968   13.24423963   13.59216590   12.83410138   13.52534562
  StD    23.37756267   22.20231526   23.25602811   21.40782406   22.93219584

  Correlation Matrix

                                                         CSN   DurSun   DurHer
  CSN       Community Sports News (Chapel Hill, NC)   1.0000   0.6505   0.6415
  DurSun    Durham Sun                                0.6505   1.0000   0.8341
  DurHer    Durham Morning Herald                     0.6415   0.8341   1.0000
  WashPost  Washington Post                           0.6121   0.7667   0.7035
  USAToday  USA Today                                 0.7456   0.8860   0.8877
  Sport     Sport Magazine                            0.4806   0.6940   0.7788
  InSports  Inside Sports                             0.6558   0.7702   0.7900
  UPI       United Press International                0.7007   0.9015   0.7676
  AP        Associated Press                          0.6779   0.8437   0.8788
  SI        Sports Illustrated                        0.6135   0.7518   0.7761

  Correlation Matrix

            WashPost  USAToday     Sport  InSports       UPI        AP        SI
  CSN         0.6121    0.7456    0.4806    0.6558    0.7007    0.6779    0.6135
  DurSun      0.7667    0.8860    0.6940    0.7702    0.9015    0.8437    0.7518
  DurHer      0.7035    0.8877    0.7788    0.7900    0.7676    0.8788    0.7761
  WashPost    1.0000    0.7984    0.6598    0.8717    0.6953    0.7809    0.5952
  USAToday    0.7984    1.0000    0.7716    0.8475    0.8539    0.9479    0.8426
  Sport       0.6598    0.7716    1.0000    0.7176    0.6220    0.8217    0.7701
  InSports    0.8717    0.8475    0.7176    1.0000    0.7920    0.8830    0.7332
  UPI         0.6953    0.8539    0.6220    0.7920    1.0000    0.8436    0.7738
  AP          0.7809    0.9479    0.8217    0.8830    0.8436    1.0000    0.8212
  SI          0.5952    0.8426    0.7701    0.7332    0.7738    0.8212    1.0000

  Eigenvalues of the Correlation Matrix

       Eigenvalue    Difference    Proportion    Cumulative
  1    7.88601647                      0.7886        0.7886

  Eigenvectors

                                                              Prin1
  CSN         Community Sports News (Chapel Hill, NC)      0.270205
  DurSun      Durham Sun                                   0.326048
  DurHer      Durham Morning Herald                        0.324392
  WashPost    Washington Post                              0.300449
  USAToday    USA Today                                    0.345200
  Sport       Sport Magazine                               0.293881
  InSports    Inside Sports                                0.324088
  UPI         United Press International                   0.319902
  AP          Associated Press                             0.342151
  SI          Sports Illustrated                           0.308570
end example
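The Proportion value in Output 58.3.2 can be checked by hand. Because the analysis is based on a correlation matrix of 10 variables, the eigenvalues sum to 10 (the trace of the matrix), so the share of the total variance attributed to the first component is its eigenvalue divided by 10. The symbols below (lambda for eigenvalues, p for the number of variables) are notation introduced here for illustration.

```latex
\[
  \text{Proportion}_1
  \;=\; \frac{\lambda_1}{\sum_{k=1}^{p}\lambda_k}
  \;=\; \frac{\lambda_1}{p}
  \;=\; \frac{7.88601647}{10}
  \;\approx\; 0.7886 .
\]
```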
 
  proc sort data=PCBasketball;
     by Prin1;
  run;

  proc print;
     var School Prin1;
     title 'Pre-Season 1985 College Basketball Rankings';
     title2 'College Teams as Ordered by PROC PRINCOMP';
  run;
Output 58.3.3: Basketball Rankings Using PROC PRINCOMP
start example
  Pre-Season 1985 College Basketball Rankings
  College Teams as Ordered by PROC PRINCOMP

  OBS    School             Prin1
    1    Georgia Tech     -0.58068
    2    UNC              -0.53317
    3    Michigan         -0.47874
    4    Kansas           -0.40285
    5    Duke             -0.38464
    6    Illinois         -0.33586
    7    Syracuse         -0.31578
    8    Louisville       -0.31489
    9    Georgetown       -0.29735
   10    Auburn           -0.09785
   11    Kentucky          0.00843
   12    LSU               0.00872
   13    Notre Dame        0.09407
   14    NC State          0.19404
   15    UAB               0.19771
   16    Oklahoma          0.23864
   17    Memphis State     0.25319
   18    Navy              0.28921
   19    UNLV              0.35103
   20    DePaul            0.43770
   21    Iowa              0.50213
   22    Indiana           0.51713
   23    Maryland          0.55910
   24    Arkansas          0.62977
   25    Virginia          0.67586
   26    Washington        0.67756
   27    Tennessee         0.70822
   28    St. Johns         0.71425
   29    Virginia Tech     0.71638
   30    St. Joseph's      0.73492
   31    UCLA              0.73965
   32    Pittsburgh        0.75078
   33    Houston           0.75534
   34    Montana           0.75790
   35    Old Dominion      0.76821
end example
 

Example 58.4. PRINCOMP Graphics (Experimental)

This example illustrates the experimental ODS graphics in PROC PRINCOMP, using the example from the Getting Started section on page 3596.

The following statements request plots in PROC PRINCOMP.

  ods html;
  ods graphics on;

  proc princomp data=Jobratings(drop='Overall Rating'n) n=5;
  run;

  ods graphics off;
  ods html close;

These graphical displays are requested by specifying the experimental ODS GRAPHICS statement. For general information about ODS graphics, see Chapter 15, Statistical Graphics Using ODS. For specific information about the graphics available in the PRINCOMP procedure, see the ODS Graphics section on page 3613.

The N=5 option in the PROC PRINCOMP statement sets the number of principal components to 5.

Output 58.4.1 shows the eigenvalue plots. Each point in the plot on the left shows an eigenvalue; each point in the plot on the right shows the (cumulative) proportion of variance explained by each component.
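The numbers behind the eigenvalue plots can also be inspected as a table. The following is a minimal sketch, assuming the Jobratings data set from the Getting Started section is available and that the eigenvalue table is exposed under the ODS table name Eigenvalues; the data set name EigenTable is hypothetical and chosen only for this illustration.

```sas
/* Sketch: capture the eigenvalue table as a data set.          */
/* EigenTable is a hypothetical name chosen for illustration.   */
ods output Eigenvalues=EigenTable;

proc princomp data=Jobratings(drop='Overall Rating'n) n=5;
run;

/* List the captured eigenvalues and proportions of variance */
proc print data=EigenTable;
run;
```

Capturing the table this way does not require the experimental ODS GRAPHICS statement, so it can be run with or without the plots.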

Output 58.4.1: Eigenvalue Scatter Plot (Experimental)
start example
end example
 

Output 58.4.2 shows a scatter plot matrix of the first five components. The histogram of each component is displayed in the corresponding diagonal cell of the matrix.

Output 58.4.2: Component Scores Matrix Plot (Experimental)
start example
end example
 

Output 58.4.3 shows a component pattern profile. The Y-axis shows the correlation between a component and a variable. There is one profile for each component, and line patterns distinguish the profiles of the different components.

The nearly horizontal profile of the first component indicates that it is correlated about equally with all of the variables. The second component is positively correlated with the variables Observational Skills and Willingness to Confront Problems and is negatively correlated with the variables Interest in People and Interpersonal Sensitivity.
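The correlations plotted in a component pattern profile are related to the eigenvectors and eigenvalues by a standard identity for an analysis based on the correlation matrix. The notation below is introduced here for illustration and is not taken from the procedure's output: x_j is the j-th standardized variable, v_jk is the j-th element of the k-th unit-length eigenvector, and lambda_k is the k-th eigenvalue.

```latex
\[
  \operatorname{corr}\bigl(x_j,\ \mathrm{Prin}_k\bigr) \;=\; v_{jk}\,\sqrt{\lambda_k}
\]
```

Under this identity, a nearly horizontal profile corresponds to an eigenvector whose elements are all of similar size and of the same sign.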

Output 58.4.3: Component Pattern Plot (Experimental)
start example
end example
 

Output 58.4.4 shows a scatter plot of the first and second components. Observation numbers are used as the plotting symbol.

Output 58.4.4: Component Scores Plot - 1st versus 2nd (Experimental)
start example
end example
 

Output 58.4.5 shows a scatter plot of the first and third components.

Output 58.4.5: Component Scores Plot - 1st versus 3rd (Experimental)
start example
end example
 

Output 58.4.6 shows a scatter plot of the second and third components, displaying density with color. Color interpolation is based on the first component, going from blue (or light gray) at minimum density, through magenta (or dark gray) at median density, to red (or black) at maximum density.

Output 58.4.6: Painted Components Scores Plot - 2nd versus 3rd, Painted by 1st (Experimental)
start example
end example
 



SAS/STAT 9.1 User's Guide (Vol. 5)
Year: 2004
Pages: 98
