This example analyzes mean daily temperatures in selected cities in January and July. Both the raw data and the principal components are plotted to illustrate how principal components are orthogonal rotations of the original variables.
The following statements create the Temperature data set:
data Temperature;
   title 'Mean Temperature in January and July for Selected Cities';
   input City $1-15 January July;
   cards;
Mobile          51.2 81.6
Phoenix         51.2 91.2
Little Rock     39.5 81.4
Sacramento      45.1 75.2
Denver          29.9 73.0
Hartford        24.8 72.7
Wilmington      32.0 75.8
Washington DC   35.6 78.7
Jacksonville    54.6 81.0
Miami           67.2 82.3
Atlanta         42.4 78.0
Boise           29.0 74.5
Chicago         22.9 71.9
Peoria          23.8 75.1
Indianapolis    27.9 75.0
Des Moines      19.4 75.1
Wichita         31.3 80.7
Louisville      33.3 76.9
New Orleans     52.9 81.9
Portland, ME    21.5 68.0
Baltimore       33.4 76.6
Boston          29.2 73.3
Detroit         25.5 73.3
Sault Ste Marie 14.2 63.8
Duluth           8.5 65.6
Minneapolis     12.2 71.9
Jackson         47.1 81.7
Kansas City     27.8 78.8
St Louis        31.3 78.6
Great Falls     20.5 69.3
Omaha           22.6 77.2
Reno            31.9 69.3
Concord         20.6 69.7
Atlantic City   32.7 75.1
Albuquerque     35.2 78.7
Albany          21.5 72.0
Buffalo         23.7 70.1
New York        32.2 76.6
Charlotte       42.1 78.5
Raleigh         40.5 77.5
Bismarck         8.2 70.8
Cincinnati      31.1 75.6
Cleveland       26.9 71.4
Columbus        28.4 73.6
Oklahoma City   36.8 81.5
Portland, OR    38.1 67.1
Philadelphia    32.3 76.8
Pittsburgh      28.1 71.9
Providence      28.4 72.1
Columbia        45.4 81.2
Sioux Falls     14.2 73.3
Memphis         40.5 79.6
Nashville       38.3 79.6
Dallas          44.8 84.8
El Paso         43.6 82.3
Houston         52.1 83.3
Salt Lake City  28.0 76.7
Burlington      16.8 69.8
Norfolk         40.5 78.3
Richmond        37.5 77.9
Spokane         25.4 69.7
Charleston, WV  34.5 75.0
Milwaukee       19.4 69.9
Cheyenne        26.6 69.1
;
The following statements plot the Temperature data set. For information about the %PLOTIT macro, see Appendix B, Using the %PLOTIT Macro.
title2 'Plot of Raw Data';
%plotit(data=Temperature, labelvar=City,
        plotvars=July January, color=black, colors=black);
run;
The results are displayed in Output 58.1.1, which shows a scatter diagram of the 64 pairs of data points with July temperatures plotted against January temperatures.
The following statements request a principal component analysis of the Temperature data set and write the scores to the Prin data set (OUT=Prin):
proc princomp data=Temperature cov out=Prin;
   var July January;
run;
Mean Temperature in January and July for Selected Cities

The PRINCOMP Procedure

Observations    64
Variables        2

Simple Statistics

                July        January
Mean     75.60781250    32.09531250
StD       5.12761910    11.71243309

Covariance Matrix

                 July        January
July       26.2924777     46.8282912
January    46.8282912    137.1810888

Total Variance    163.47356647

Eigenvalues of the Covariance Matrix

     Eigenvalue    Difference    Proportion    Cumulative
1    154.310607    145.147647        0.9439        0.9439
2      9.162960                      0.0561        1.0000

Eigenvectors

              Prin1       Prin2
July       0.343532    0.939141
January    0.939141    -.343532
Note that January receives a higher loading on Prin1 because it has a higher standard deviation than July, and because the COV option is specified, the PRINCOMP procedure calculates the scores from the centered variables rather than the standardized variables.
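As a worked check of the output above, the scores are the centered variables multiplied by the eigenvectors, and each proportion is an eigenvalue divided by the total variance. This sketch assumes that the January loading on Prin2 is negative, which is what orthogonality of the two eigenvectors requires; the angle \theta is introduced here only for illustration.

```latex
\begin{aligned}
\mathrm{Prin1} &= 0.343532\,(\mathrm{July} - 75.6078) + 0.939141\,(\mathrm{January} - 32.0953) \\
\mathrm{Prin2} &= 0.939141\,(\mathrm{July} - 75.6078) - 0.343532\,(\mathrm{January} - 32.0953) \\[4pt]
\text{Proportion}_1 &= \frac{154.310607}{163.473566} \approx 0.9439, \qquad
\text{Proportion}_2 = \frac{9.162960}{163.473566} \approx 0.0561 \\[4pt]
\theta &= \arccos(0.343532) \approx 69.9^\circ
\end{aligned}
```

Here \theta is the angle between the July axis and the Prin1 axis, so the component axes are the original axes rotated by roughly 70 degrees about the point of means.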
The following statements plot the Prin data set created by the previous PROC PRINCOMP step:
title2 'Plot of Principal Components';
%plotit(data=Prin, labelvar=City,
        plotvars=Prin2 Prin1, color=black, colors=black);
run;
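The rotation interpretation can be checked numerically. The following is a minimal sketch, not part of the original example: it recomputes the scores from the means and eigenvectors printed above and compares them with the scores in the Prin data set. The data set name Check and the variables Prin1_Check, Prin2_Check, Diff1, and Diff2 are hypothetical names introduced here. Because the sign of an eigenvector is arbitrary, a different release could in principle return components with reversed signs, in which case the differences would not be near zero.

```sas
/* Sketch: recompute the scores by hand from the centered variables.  */
/* The constants are the means and eigenvectors printed by PROC       */
/* PRINCOMP above; the January loading on Prin2 is assumed negative.  */
data Check;
   set Prin;
   Prin1_Check = 0.343532*(July - 75.6078125) + 0.939141*(January - 32.0953125);
   Prin2_Check = 0.939141*(July - 75.6078125) - 0.343532*(January - 32.0953125);
   Diff1 = Prin1 - Prin1_Check;
   Diff2 = Prin2 - Prin2_Check;
run;

/* Differences should be near zero, up to rounding of the printed values */
proc means data=Check min max;
   var Diff1 Diff2;
run;

/* The variances of the components should match the eigenvalues, and  */
/* the covariance between Prin1 and Prin2 should be essentially zero  */
proc corr data=Prin cov nosimple;
   var Prin1 Prin2;
run;
```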
The following data provide crime rates per 100,000 people in seven categories for each of the fifty states in 1977. Since there are seven numeric variables, it is impossible to plot all the variables simultaneously. Principal components can be used to summarize the data in two or three dimensions, and they help to visualize the data. The following statements produce Output 58.2.1:
data Crime;
   title 'Crime Rates per 100,000 Population by State';
   input State $1-15 Murder Rape Robbery Assault
         Burglary Larceny Auto_Theft;
   cards;
Alabama         14.2 25.2  96.8 278.3 1135.5 1881.9  280.7
Alaska          10.8 51.6  96.8 284.0 1331.7 3369.8  753.3
Arizona          9.5 34.2 138.2 312.3 2346.1 4467.4  439.5
Arkansas         8.8 27.6  83.2 203.4  972.6 1862.1  183.4
California      11.5 49.4 287.0 358.0 2139.4 3499.8  663.5
Colorado         6.3 42.0 170.7 292.9 1935.2 3903.2  477.1
Connecticut      4.2 16.8 129.5 131.8 1346.0 2620.7  593.2
Delaware         6.0 24.9 157.0 194.2 1682.6 3678.4  467.0
Florida         10.2 39.6 187.9 449.1 1859.9 3840.5  351.4
Georgia         11.7 31.1 140.5 256.5 1351.1 2170.2  297.9
Hawaii           7.2 25.5 128.0  64.1 1911.5 3920.4  489.4
Idaho            5.5 19.4  39.6 172.5 1050.8 2599.6  237.6
Illinois         9.9 21.8 211.3 209.0 1085.0 2828.5  528.6
Indiana          7.4 26.5 123.2 153.5 1086.2 2498.7  377.4
Iowa             2.3 10.6  41.2  89.8  812.5 2685.1  219.9
Kansas           6.6 22.0 100.7 180.5 1270.4 2739.3  244.3
Kentucky        10.1 19.1  81.1 123.3  872.2 1662.1  245.4
Louisiana       15.5 30.9 142.9 335.5 1165.5 2469.9  337.7
Maine            2.4 13.5  38.7 170.0 1253.1 2350.7  246.9
Maryland         8.0 34.8 292.1 358.9 1400.0 3177.7  428.5
Massachusetts    3.1 20.8 169.1 231.6 1532.2 2311.3 1140.1
Michigan         9.3 38.9 261.9 274.6 1522.7 3159.0  545.5
Minnesota        2.7 19.5  85.9  85.8 1134.7 2559.3  343.1
Mississippi     14.3 19.6  65.7 189.1  915.6 1239.9  144.4
Missouri         9.6 28.3 189.0 233.5 1318.3 2424.2  378.4
Montana          5.4 16.7  39.2 156.8  804.9 2773.2  309.2
Nebraska         3.9 18.1  64.7 112.7  760.0 2316.1  249.1
Nevada          15.8 49.1 323.1 355.0 2453.1 4212.6  559.2
New Hampshire    3.2 10.7  23.2  76.0 1041.7 2343.9  293.4
New Jersey       5.6 21.0 180.4 185.1 1435.8 2774.5  511.5
New Mexico       8.8 39.1 109.6 343.4 1418.7 3008.6  259.5
New York        10.7 29.4 472.6 319.1 1728.0 2782.0  745.8
North Carolina  10.6 17.0  61.3 318.3 1154.1 2037.8  192.1
North Dakota     0.9  9.0  13.3  43.8  446.1 1843.0  144.7
Ohio             7.8 27.3 190.5 181.1 1216.0 2696.8  400.4
Oklahoma         8.6 29.2  73.8 205.0 1288.2 2228.1  326.8
Oregon           4.9 39.9 124.1 286.9 1636.4 3506.1  388.9
Pennsylvania     5.6 19.0 130.3 128.0  877.5 1624.1  333.2
Rhode Island     3.6 10.5  86.5 201.0 1489.5 2844.1  791.4
South Carolina  11.9 33.0 105.9 485.3 1613.6 2342.4  245.1
South Dakota     2.0 13.5  17.9 155.7  570.5 1704.4  147.5
Tennessee       10.1 29.7 145.8 203.9 1259.7 1776.5  314.0
Texas           13.3 33.8 152.4 208.2 1603.1 2988.7  397.6
Utah             3.5 20.3  68.8 147.3 1171.6 3004.6  334.5
Vermont          1.4 15.9  30.8 101.2 1348.2 2201.0  265.2
Virginia         9.0 23.3  92.1 165.7  986.2 2521.2  226.7
Washington       4.3 39.6 106.2 224.8 1605.6 3386.9  360.3
West Virginia    6.0 13.2  42.2  90.9  597.4 1341.7  163.3
Wisconsin        2.8 12.9  52.2  63.7  846.9 2614.2  220.7
Wyoming          5.4 21.9  39.7 173.9  811.6 2772.2  282.0
;

proc princomp out=Crime_Components;
run;
Crime Rates per 100,000 Population by State

The PRINCOMP Procedure

Observations    50
Variables        7

Simple Statistics

             Murder          Rape       Robbery       Assault
Mean    7.444000000   25.73400000   124.0920000   211.3000000
StD     3.866768941   10.75962995    88.3485672   100.2530492

           Burglary       Larceny    Auto_Theft
Mean    1291.904000   2671.288000   377.5260000
StD      432.455711    725.908707   193.3944175

Correlation Matrix

              Murder    Rape  Robbery  Assault  Burglary  Larceny  Auto_Theft
Murder        1.0000  0.6012   0.4837   0.6486    0.3858   0.1019      0.0688
Rape          0.6012  1.0000   0.5919   0.7403    0.7121   0.6140      0.3489
Robbery       0.4837  0.5919   1.0000   0.5571    0.6372   0.4467      0.5907
Assault       0.6486  0.7403   0.5571   1.0000    0.6229   0.4044      0.2758
Burglary      0.3858  0.7121   0.6372   0.6229    1.0000   0.7921      0.5580
Larceny       0.1019  0.6140   0.4467   0.4044    0.7921   1.0000      0.4442
Auto_Theft    0.0688  0.3489   0.5907   0.2758    0.5580   0.4442      1.0000

Eigenvalues of the Correlation Matrix

     Eigenvalue    Difference    Proportion    Cumulative
1    4.11495951    2.87623768        0.5879        0.5879
2    1.23872183    0.51290521        0.1770        0.7648
3    0.72581663    0.40938458        0.1037        0.8685
4    0.31643205    0.05845759        0.0452        0.9137
5    0.25797446    0.03593499        0.0369        0.9506
6    0.22203947    0.09798342        0.0317        0.9823
7    0.12405606                      0.0177        1.0000

Eigenvectors

                 Prin1     Prin2     Prin3     Prin4     Prin5     Prin6     Prin7
Murder        0.300279  -.629174  0.178245  -.232114  0.538123  0.259117  0.267593
Rape          0.431759  -.169435  -.244198  0.062216  0.188471  -.773271  -.296485
Robbery       0.396875  0.042247  0.495861  -.557989  -.519977  -.114385  -.003903
Assault       0.396652  -.343528  -.069510  0.629804  -.506651  0.172363  0.191745
Burglary      0.440157  0.203341  -.209895  -.057555  0.101033  0.535987  -.648117
Larceny       0.357360  0.402319  -.539231  -.234890  0.030099  0.039406  0.601690
Auto_Theft    0.295177  0.502421  0.568384  0.419238  0.369753  -.057298  0.147046
The eigenvalues indicate that two or three components provide a good summary of the data, two components accounting for 76% of the total variance and three components explaining 87%. Subsequent components contribute less than 5% each.
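The percentages quoted above can be reproduced from the eigenvalue table. Because this analysis uses the correlation matrix, the eigenvalues sum to the number of variables (7), so each proportion is an eigenvalue divided by 7. The symbol \lambda_k for the k-th eigenvalue is notation introduced here for the worked arithmetic.

```latex
\begin{aligned}
\text{Proportion}_k &= \frac{\lambda_k}{\sum_{j=1}^{7}\lambda_j} = \frac{\lambda_k}{7} \\[4pt]
\text{Proportion}_1 &= \frac{4.11495951}{7} \approx 0.5879 \\[4pt]
\text{Cumulative}_2 &= \frac{4.11495951 + 1.23872183}{7} \approx 0.7648 \\[4pt]
\text{Cumulative}_3 &= \frac{4.11495951 + 1.23872183 + 0.72581663}{7} \approx 0.8685
\end{aligned}
```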
The first component is a measure of overall crime rate since the first eigenvector shows approximately equal loadings on all variables. The second eigenvector has high positive loadings on variables Auto_Theft and Larceny and high negative loadings on variables Murder and Assault. There is also a small positive loading on Burglary and a small negative loading on Rape. This component seems to measure the preponderance of property crime over violent crime. The interpretation of the third component is not obvious.
A simple way to examine the principal components in more detail is to display the output data set sorted by each of the large components. The following statements produce Output 58.2.2 and Output 58.2.3:
proc sort data=Crime_Components;
   by Prin1;
run;

proc print;
   id State;
   var Prin1 Prin2 Murder Rape Robbery Assault
       Burglary Larceny Auto_Theft;
   title2 'States Listed in Order of Overall Crime Rate';
   title3 'As Determined by the First Principal Component';
run;

proc sort data=Crime_Components;
   by Prin2;
run;

proc print;
   id State;
   var Prin1 Prin2 Murder Rape Robbery Assault
       Burglary Larceny Auto_Theft;
   title2 'States Listed in Order of Property Vs. Violent Crime';
   title3 'As Determined by the Second Principal Component';
run;
Crime Rates per 100,000 Population by State
States Listed in Order of Overall Crime Rate
As Determined by the First Principal Component

State              Prin1    Prin2 Murder  Rape Robbery Assault Burglary Larceny Auto_Theft
North Dakota    -3.96408  0.38767    0.9   9.0    13.3    43.8    446.1  1843.0      144.7
South Dakota    -3.17203 -0.25446    2.0  13.5    17.9   155.7    570.5  1704.4      147.5
West Virginia   -3.14772 -0.81425    6.0  13.2    42.2    90.9    597.4  1341.7      163.3
Iowa            -2.58156  0.82475    2.3  10.6    41.2    89.8    812.5  2685.1      219.9
Wisconsin       -2.50296  0.78083    2.8  12.9    52.2    63.7    846.9  2614.2      220.7
New Hampshire   -2.46562  0.82503    3.2  10.7    23.2    76.0   1041.7  2343.9      293.4
Nebraska        -2.15071  0.22574    3.9  18.1    64.7   112.7    760.0  2316.1      249.1
Vermont         -2.06433  0.94497    1.4  15.9    30.8   101.2   1348.2  2201.0      265.2
Maine           -1.82631  0.57878    2.4  13.5    38.7   170.0   1253.1  2350.7      246.9
Kentucky        -1.72691 -1.14663   10.1  19.1    81.1   123.3    872.2  1662.1      245.4
Pennsylvania    -1.72007 -0.19590    5.6  19.0   130.3   128.0    877.5  1624.1      333.2
Montana         -1.66801  0.27099    5.4  16.7    39.2   156.8    804.9  2773.2      309.2
Minnesota       -1.55434  1.05644    2.7  19.5    85.9    85.8   1134.7  2559.3      343.1
Mississippi     -1.50736 -2.54671   14.3  19.6    65.7   189.1    915.6  1239.9      144.4
Idaho           -1.43245 -0.00801    5.5  19.4    39.6   172.5   1050.8  2599.6      237.6
Wyoming         -1.42463  0.06268    5.4  21.9    39.7   173.9    811.6  2772.2      282.0
Arkansas        -1.05441 -1.34544    8.8  27.6    83.2   203.4    972.6  1862.1      183.4
Utah            -1.04996  0.93656    3.5  20.3    68.8   147.3   1171.6  3004.6      334.5
Virginia        -0.91621 -0.69265    9.0  23.3    92.1   165.7    986.2  2521.2      226.7
North Carolina  -0.69925 -1.67027   10.6  17.0    61.3   318.3   1154.1  2037.8      192.1
Kansas          -0.63407 -0.02804    6.6  22.0   100.7   180.5   1270.4  2739.3      244.3
Connecticut     -0.54133  1.50123    4.2  16.8   129.5   131.8   1346.0  2620.7      593.2
Indiana         -0.49990  0.00003    7.4  26.5   123.2   153.5   1086.2  2498.7      377.4
Oklahoma        -0.32136 -0.62429    8.6  29.2    73.8   205.0   1288.2  2228.1      326.8
Rhode Island    -0.20156  2.14658    3.6  10.5    86.5   201.0   1489.5  2844.1      791.4
Tennessee       -0.13660 -1.13498   10.1  29.7   145.8   203.9   1259.7  1776.5      314.0
Alabama         -0.04988 -2.09610   14.2  25.2    96.8   278.3   1135.5  1881.9      280.7
New Jersey       0.21787  0.96421    5.6  21.0   180.4   185.1   1435.8  2774.5      511.5
Ohio             0.23953  0.09053    7.8  27.3   190.5   181.1   1216.0  2696.8      400.4
Georgia          0.49041 -1.38079   11.7  31.1   140.5   256.5   1351.1  2170.2      297.9
Illinois         0.51290  0.09423    9.9  21.8   211.3   209.0   1085.0  2828.5      528.6
Missouri         0.55637 -0.55851    9.6  28.3   189.0   233.5   1318.3  2424.2      378.4
Hawaii           0.82313  1.82392    7.2  25.5   128.0    64.1   1911.5  3920.4      489.4
Washington       0.93058  0.73776    4.3  39.6   106.2   224.8   1605.6  3386.9      360.3
Delaware         0.96458  1.29674    6.0  24.9   157.0   194.2   1682.6  3678.4      467.0
Massachusetts    0.97844  2.63105    3.1  20.8   169.1   231.6   1532.2  2311.3     1140.1
Louisiana        1.12020 -2.08327   15.5  30.9   142.9   335.5   1165.5  2469.9      337.7
New Mexico       1.21417 -0.95076    8.8  39.1   109.6   343.4   1418.7  3008.6      259.5
Texas            1.39696 -0.68131   13.3  33.8   152.4   208.2   1603.1  2988.7      397.6
Oregon           1.44900  0.58603    4.9  39.9   124.1   286.9   1636.4  3506.1      388.9
South Carolina   1.60336 -2.16211   11.9  33.0   105.9   485.3   1613.6  2342.4      245.1
Maryland         2.18280 -0.19474    8.0  34.8   292.1   358.9   1400.0  3177.7      428.5
Michigan         2.27333  0.15487    9.3  38.9   261.9   274.6   1522.7  3159.0      545.5
Alaska           2.42151  0.16652   10.8  51.6    96.8   284.0   1331.7  3369.8      753.3
Colorado         2.50929  0.91660    6.3  42.0   170.7   292.9   1935.2  3903.2      477.1
Arizona          3.01414  0.84495    9.5  34.2   138.2   312.3   2346.1  4467.4      439.5
Florida          3.11175 -0.60392   10.2  39.6   187.9   449.1   1859.9  3840.5      351.4
New York         3.45248  0.43289   10.7  29.4   472.6   319.1   1728.0  2782.0      745.8
California       4.28380  0.14319   11.5  49.4   287.0   358.0   2139.4  3499.8      663.5
Nevada           5.26699 -0.25262   15.8  49.1   323.1   355.0   2453.1  4212.6      559.2
Crime Rates per 100,000 Population by State
States Listed in Order of Property Vs. Violent Crime
As Determined by the Second Principal Component

State              Prin1    Prin2 Murder  Rape Robbery Assault Burglary Larceny Auto_Theft
Mississippi     -1.50736 -2.54671   14.3  19.6    65.7   189.1    915.6  1239.9      144.4
South Carolina   1.60336 -2.16211   11.9  33.0   105.9   485.3   1613.6  2342.4      245.1
Alabama         -0.04988 -2.09610   14.2  25.2    96.8   278.3   1135.5  1881.9      280.7
Louisiana        1.12020 -2.08327   15.5  30.9   142.9   335.5   1165.5  2469.9      337.7
North Carolina  -0.69925 -1.67027   10.6  17.0    61.3   318.3   1154.1  2037.8      192.1
Georgia          0.49041 -1.38079   11.7  31.1   140.5   256.5   1351.1  2170.2      297.9
Arkansas        -1.05441 -1.34544    8.8  27.6    83.2   203.4    972.6  1862.1      183.4
Kentucky        -1.72691 -1.14663   10.1  19.1    81.1   123.3    872.2  1662.1      245.4
Tennessee       -0.13660 -1.13498   10.1  29.7   145.8   203.9   1259.7  1776.5      314.0
New Mexico       1.21417 -0.95076    8.8  39.1   109.6   343.4   1418.7  3008.6      259.5
West Virginia   -3.14772 -0.81425    6.0  13.2    42.2    90.9    597.4  1341.7      163.3
Virginia        -0.91621 -0.69265    9.0  23.3    92.1   165.7    986.2  2521.2      226.7
Texas            1.39696 -0.68131   13.3  33.8   152.4   208.2   1603.1  2988.7      397.6
Oklahoma        -0.32136 -0.62429    8.6  29.2    73.8   205.0   1288.2  2228.1      326.8
Florida          3.11175 -0.60392   10.2  39.6   187.9   449.1   1859.9  3840.5      351.4
Missouri         0.55637 -0.55851    9.6  28.3   189.0   233.5   1318.3  2424.2      378.4
South Dakota    -3.17203 -0.25446    2.0  13.5    17.9   155.7    570.5  1704.4      147.5
Nevada           5.26699 -0.25262   15.8  49.1   323.1   355.0   2453.1  4212.6      559.2
Pennsylvania    -1.72007 -0.19590    5.6  19.0   130.3   128.0    877.5  1624.1      333.2
Maryland         2.18280 -0.19474    8.0  34.8   292.1   358.9   1400.0  3177.7      428.5
Kansas          -0.63407 -0.02804    6.6  22.0   100.7   180.5   1270.4  2739.3      244.3
Idaho           -1.43245 -0.00801    5.5  19.4    39.6   172.5   1050.8  2599.6      237.6
Indiana         -0.49990  0.00003    7.4  26.5   123.2   153.5   1086.2  2498.7      377.4
Wyoming         -1.42463  0.06268    5.4  21.9    39.7   173.9    811.6  2772.2      282.0
Ohio             0.23953  0.09053    7.8  27.3   190.5   181.1   1216.0  2696.8      400.4
Illinois         0.51290  0.09423    9.9  21.8   211.3   209.0   1085.0  2828.5      528.6
California       4.28380  0.14319   11.5  49.4   287.0   358.0   2139.4  3499.8      663.5
Michigan         2.27333  0.15487    9.3  38.9   261.9   274.6   1522.7  3159.0      545.5
Alaska           2.42151  0.16652   10.8  51.6    96.8   284.0   1331.7  3369.8      753.3
Nebraska        -2.15071  0.22574    3.9  18.1    64.7   112.7    760.0  2316.1      249.1
Montana         -1.66801  0.27099    5.4  16.7    39.2   156.8    804.9  2773.2      309.2
North Dakota    -3.96408  0.38767    0.9   9.0    13.3    43.8    446.1  1843.0      144.7
New York         3.45248  0.43289   10.7  29.4   472.6   319.1   1728.0  2782.0      745.8
Maine           -1.82631  0.57878    2.4  13.5    38.7   170.0   1253.1  2350.7      246.9
Oregon           1.44900  0.58603    4.9  39.9   124.1   286.9   1636.4  3506.1      388.9
Washington       0.93058  0.73776    4.3  39.6   106.2   224.8   1605.6  3386.9      360.3
Wisconsin       -2.50296  0.78083    2.8  12.9    52.2    63.7    846.9  2614.2      220.7
Iowa            -2.58156  0.82475    2.3  10.6    41.2    89.8    812.5  2685.1      219.9
New Hampshire   -2.46562  0.82503    3.2  10.7    23.2    76.0   1041.7  2343.9      293.4
Arizona          3.01414  0.84495    9.5  34.2   138.2   312.3   2346.1  4467.4      439.5
Colorado         2.50929  0.91660    6.3  42.0   170.7   292.9   1935.2  3903.2      477.1
Utah            -1.04996  0.93656    3.5  20.3    68.8   147.3   1171.6  3004.6      334.5
Vermont         -2.06433  0.94497    1.4  15.9    30.8   101.2   1348.2  2201.0      265.2
New Jersey       0.21787  0.96421    5.6  21.0   180.4   185.1   1435.8  2774.5      511.5
Minnesota       -1.55434  1.05644    2.7  19.5    85.9    85.8   1134.7  2559.3      343.1
Delaware         0.96458  1.29674    6.0  24.9   157.0   194.2   1682.6  3678.4      467.0
Connecticut     -0.54133  1.50123    4.2  16.8   129.5   131.8   1346.0  2620.7      593.2
Hawaii           0.82313  1.82392    7.2  25.5   128.0    64.1   1911.5  3920.4      489.4
Rhode Island    -0.20156  2.14658    3.6  10.5    86.5   201.0   1489.5  2844.1      791.4
Massachusetts    0.97844  2.63105    3.1  20.8   169.1   231.6   1532.2  2311.3     1140.1
Another recommended procedure is to make scatter plots of the first few components. The sorted listings help to identify observations on the plots. The following statements produce Output 58.2.4 and Output 58.2.5:
title2 'Plot of the First Two Principal Components';
%plotit(data=Crime_Components, labelvar=State,
        plotvars=Prin2 Prin1, color=black, colors=black);
run;

title2 'Plot of the First and Third Principal Components';
%plotit(data=Crime_Components, labelvar=State,
        plotvars=Prin3 Prin1, color=black, colors=black);
run;
It is possible to identify regional trends on the plot of the first two components. Nevada and California are at the extreme right, with high overall crime rates but an average ratio of property crime to violent crime. North and South Dakota are on the extreme left with low overall crime rates. Southeastern states tend to be in the bottom of the plot, with a higher-than-average ratio of violent crime to property crime. New England states tend to be in the upper part of the plot, with a greater-than-average ratio of property crime to violent crime.
The most striking feature of the plot of the first and third principal components is that Massachusetts and New York are outliers on the third component.
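Outliers on the third component can also be listed rather than read from the plot. The following is a minimal sketch under the assumption that Crime_Components still contains Prin3 (it does when the N= option is not specified, as in the PROC PRINCOMP step above). The data set name Prin3_Extremes and the variable AbsPrin3 are hypothetical names introduced for this illustration. The variables shown alongside Prin3 are the three with the largest absolute loadings in the third eigenvector.

```sas
/* Sketch: rank the states by the absolute value of their Prin3 score */
data Prin3_Extremes;
   set Crime_Components;
   AbsPrin3 = abs(Prin3);
run;

proc sort data=Prin3_Extremes;
   by descending AbsPrin3;
run;

/* List the five states that are farthest from zero on Prin3 */
proc print data=Prin3_Extremes(obs=5);
   id State;
   var Prin3 Robbery Auto_Theft Larceny;
   title2 'States with the Largest Absolute Scores on the Third Component';
run;
```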
The data in this example are rankings of 35 college basketball teams. The rankings were made before the start of the 1985-86 season by 10 news services.
The purpose of the principal component analysis is to compute a single variable that best summarizes all 10 of the preseason rankings.
Note that the various news services rank different numbers of teams, varying from 20 through 30 (there is a missing rank in one of the variables, WashPost). And, of course, the services do not all rank the same teams, so there are missing values in these data. Each of the 35 teams is ranked by at least one news service.
The PRINCOMP procedure omits observations with missing values. To obtain principal component scores for all of the teams, it is necessary to replace the missing values. Since it is the best teams that are ranked, it is not appropriate to replace missing values with the mean of the nonmissing values. Instead, an ad hoc method is used that replaces missing values by the mean of the unassigned ranks. For example, if 20 teams are ranked by a news service, then ranks 21 through 35 are unassigned. The mean of ranks 21 through 35 is 28, so missing values for that variable are replaced by the value 28. To prevent the method of missing-value replacement from having an undue effect on the analysis, each observation is weighted according to the number of nonmissing values it has. See Example 59.3 in Chapter 59, The PRINQUAL Procedure, for an alternative analysis of these data.
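The replacement value can be written in closed form. In this sketch, m is notation introduced here for the highest rank that a news service assigns; the unassigned ranks then run from m + 1 through 35, and their mean is the midpoint of that range. This is the same quantity as the expression (MaxRanks{i}+36)/2 used in the DATA step for this example.

```latex
\begin{aligned}
\text{replacement}(m) &= \frac{(m+1) + 35}{2} = \frac{m + 36}{2} \\[4pt]
\text{replacement}(20) &= \frac{20 + 36}{2} = 28, \qquad
\text{replacement}(30) = \frac{30 + 36}{2} = 33
\end{aligned}
```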
Since the first principal component accounts for 79% of the variance, there is substantial agreement among the rankings. The eigenvector shows that all the news services are about equally weighted, so a simple average would work almost as well as the first principal component. The following statements produce Output 58.3.1 through Output 58.3.3:
/*-----------------------------------------------------------*/
/*                                                           */
/* Preseason 1985 College Basketball Rankings                */
/* (rankings of 35 teams by 10 news services)                */
/*                                                           */
/* Note: (a) news services rank varying numbers of teams;    */
/*       (b) not all teams are ranked by all news services;  */
/*       (c) each team is ranked by at least one service;    */
/*       (d) rank 20 is missing for UPI.                     */
/*                                                           */
/*-----------------------------------------------------------*/

data HoopsRanks;
   input School $13. CSN DurSun DurHer WashPost USAToday
         Sport InSports UPI AP SI;
   label CSN      = 'Community Sports News (Chapel Hill, NC)'
         DurSun   = 'Durham Sun'
         DurHer   = 'Durham Morning Herald'
         WashPost = 'Washington Post'
         USAToday = 'USA Today'
         Sport    = 'Sport Magazine'
         InSports = 'Inside Sports'
         UPI      = 'United Press International'
         AP       = 'Associated Press'
         SI       = 'Sports Illustrated'
         ;
   format CSN--SI 5.1;
   cards;
Louisville      1  8  1  9  8  9  6 10  9  9
Georgia Tech    2  2  4  3  1  1  1  2  1  1
Kansas          3  4  5  1  5 11  8  4  5  7
Michigan        4  5  9  4  2  5  3  1  3  2
Duke            5  6  7  5  4 10  4  5  6  5
UNC             6  1  2  2  3  4  2  3  2  3
Syracuse        7 10  6 11  6  6  5  6  4 10
Notre Dame      8 14 15 13 11 20 18 13 12  .
Kentucky        9 15 16 14 14 19 11 12 11 13
LSU            10  9 13  . 13 15 16  9 14  8
DePaul         11  . 21 15 20  . 19  .  . 19
Georgetown     12  7  8  6  9  2  9  8  8  4
Navy           13 20 23 10 18 13 15  . 20  .
Illinois       14  3  3  7  7  3 10  7  7  6
Iowa           15 16  .  . 23  .  . 14  . 20
Arkansas       16  .  .  . 25  .  .  .  . 16
Memphis State  17  . 11  . 16  8 20  . 15 12
Washington     18  .  .  .  .  .  . 17  .  .
UAB            19 13 10  . 12 17  . 16 16 15
UNLV           20 18 18 19 22  . 14 18 18  .
NC State       21 17 14 16 15  . 12 15 17 18
Maryland       22  .  .  . 19  .  .  . 19 14
Pittsburgh     23  .  .  .  .  .  .  .  .  .
Oklahoma       24 19 17 17 17 12 17  . 13 17
Indiana        25 12 20 18 21  .  .  .  .  .
Virginia       26  . 22  .  . 18  .  .  .  .
Old Dominion   27  .  .  .  .  .  .  .  .  .
Auburn         28 11 12  8 10  7  7 11 10 11
St. Johns      29  .  .  .  . 14  .  .  .  .
UCLA           30  .  .  .  .  .  . 19  .  .
St. Josephs     .  . 19  .  .  .  .  .  .  .
Tennessee       .  . 24  .  . 16  .  .  .  .
Montana         .  .  . 20  .  .  .  .  .  .
Houston         .  .  .  . 24  .  .  .  .  .
Virginia Tech   .  .  .  .  .  . 13  .  .  .
;

/* PROC MEANS is used to output a data set containing the    */
/* maximum value of each of the newspaper and magazine       */
/* rankings. The output data set, maxrank, is then used      */
/* to set the missing values to the next highest rank plus   */
/* thirty-six, divided by two (that is, the mean of the      */
/* missing ranks). This ad hoc method of replacing missing   */
/* values is based more on intuition than on rigorous        */
/* statistical theory. Observations are weighted by the      */
/* number of nonmissing values.                              */
/*                                                           */

title 'Pre-Season 1985 College Basketball Rankings';
proc means data=HoopsRanks;
   output out=MaxRank
          max=CSNMax DurSunMax DurHerMax WashPostMax USATodayMax
              SportMax InSportsMax UPIMax APMax SIMax;
run;
Pre-Season 1985 College Basketball Rankings

The MEANS Procedure

Variable  Label                                      N          Mean
--------------------------------------------------------------------
CSN       Community Sports News (Chapel Hill, NC)   30    15.5000000
DurSun    Durham Sun                                20    10.5000000
DurHer    Durham Morning Herald                     24    12.5000000
WashPost  Washington Post                           19    10.4210526
USAToday  USA Today                                 25    13.0000000
Sport     Sport Magazine                            20    10.5000000
InSports  Inside Sports                             20    10.5000000
UPI       United Press International                19    10.0000000
AP        Associated Press                          20    10.5000000
SI        Sports Illustrated                        20    10.5000000
--------------------------------------------------------------------

Variable  Label                                        Std Dev       Minimum
----------------------------------------------------------------------------
CSN       Community Sports News (Chapel Hill, NC)    8.8034084     1.0000000
DurSun    Durham Sun                                 5.9160798     1.0000000
DurHer    Durham Morning Herald                      7.0710678     1.0000000
WashPost  Washington Post                            6.0673607     1.0000000
USAToday  USA Today                                  7.3598007     1.0000000
Sport     Sport Magazine                             5.9160798     1.0000000
InSports  Inside Sports                              5.9160798     1.0000000
UPI       United Press International                 5.6273143     1.0000000
AP        Associated Press                           5.9160798     1.0000000
SI        Sports Illustrated                         5.9160798     1.0000000
----------------------------------------------------------------------------

Variable  Label                                        Maximum
--------------------------------------------------------------
CSN       Community Sports News (Chapel Hill, NC)   30.0000000
DurSun    Durham Sun                                20.0000000
DurHer    Durham Morning Herald                     24.0000000
WashPost  Washington Post                           20.0000000
USAToday  USA Today                                 25.0000000
Sport     Sport Magazine                            20.0000000
InSports  Inside Sports                             20.0000000
UPI       United Press International                19.0000000
AP        Associated Press                          20.0000000
SI        Sports Illustrated                        20.0000000
--------------------------------------------------------------
data Basketball;
   set HoopsRanks;
   if _n_=1 then set MaxRank;
   array Services{10} CSN--SI;
   array MaxRanks{10} CSNMax--SIMax;
   keep School CSN--SI Weight;
   Weight=0;
   do i=1 to 10;
      if Services{i}=. then Services{i}=(MaxRanks{i}+36)/2;
      else Weight=Weight+1;
   end;
run;

proc princomp data=Basketball n=1 out=PCBasketball standard;
   var CSN--SI;
   weight Weight;
run;
The PRINCOMP Procedure

Observations    35
Variables       10

Simple Statistics

                CSN        DurSun        DurHer      WashPost      USAToday
Mean    13.33640553   13.06451613   12.88018433   13.83410138   12.55760369
StD     22.08036285   21.66394183   21.38091837   23.47841791   20.48207965

              Sport      InSports           UPI            AP            SI
Mean    13.83870968   13.24423963   13.59216590   12.83410138   13.52534562
StD     23.37756267   22.20231526   23.25602811   21.40782406   22.93219584

Correlation Matrix

                                                        CSN   DurSun   DurHer
CSN       Community Sports News (Chapel Hill, NC)    1.0000   0.6505   0.6415
DurSun    Durham Sun                                 0.6505   1.0000   0.8341
DurHer    Durham Morning Herald                      0.6415   0.8341   1.0000
WashPost  Washington Post                            0.6121   0.7667   0.7035
USAToday  USA Today                                  0.7456   0.8860   0.8877
Sport     Sport Magazine                             0.4806   0.6940   0.7788
InSports  Inside Sports                              0.6558   0.7702   0.7900
UPI       United Press International                 0.7007   0.9015   0.7676
AP        Associated Press                           0.6779   0.8437   0.8788
SI        Sports Illustrated                         0.6135   0.7518   0.7761

Correlation Matrix

          WashPost  USAToday   Sport  InSports     UPI      AP      SI
CSN         0.6121    0.7456  0.4806    0.6558  0.7007  0.6779  0.6135
DurSun      0.7667    0.8860  0.6940    0.7702  0.9015  0.8437  0.7518
DurHer      0.7035    0.8877  0.7788    0.7900  0.7676  0.8788  0.7761
WashPost    1.0000    0.7984  0.6598    0.8717  0.6953  0.7809  0.5952
USAToday    0.7984    1.0000  0.7716    0.8475  0.8539  0.9479  0.8426
Sport       0.6598    0.7716  1.0000    0.7176  0.6220  0.8217  0.7701
InSports    0.8717    0.8475  0.7176    1.0000  0.7920  0.8830  0.7332
UPI         0.6953    0.8539  0.6220    0.7920  1.0000  0.8436  0.7738
AP          0.7809    0.9479  0.8217    0.8830  0.8436  1.0000  0.8212
SI          0.5952    0.8426  0.7701    0.7332  0.7738  0.8212  1.0000

Eigenvalues of the Correlation Matrix

     Eigenvalue    Difference    Proportion    Cumulative
1    7.88601647                      0.7886        0.7886

Eigenvectors

                                                        Prin1
CSN       Community Sports News (Chapel Hill, NC)    0.270205
DurSun    Durham Sun                                 0.326048
DurHer    Durham Morning Herald                      0.324392
WashPost  Washington Post                            0.300449
USAToday  USA Today                                  0.345200
Sport     Sport Magazine                             0.293881
InSports  Inside Sports                              0.324088
UPI       United Press International                 0.319902
AP        Associated Press                           0.342151
SI        Sports Illustrated                         0.308570
proc sort data=PCBasketball;
   by Prin1;
run;

proc print;
   var School Prin1;
   title 'Pre-Season 1985 College Basketball Rankings';
   title2 'College Teams as Ordered by PROC PRINCOMP';
run;
Pre-Season 1985 College Basketball Rankings
College Teams as Ordered by PROC PRINCOMP

OBS    School              Prin1
  1    Georgia Tech     -0.58068
  2    UNC              -0.53317
  3    Michigan         -0.47874
  4    Kansas           -0.40285
  5    Duke             -0.38464
  6    Illinois         -0.33586
  7    Syracuse         -0.31578
  8    Louisville       -0.31489
  9    Georgetown       -0.29735
 10    Auburn           -0.09785
 11    Kentucky          0.00843
 12    LSU               0.00872
 13    Notre Dame        0.09407
 14    NC State          0.19404
 15    UAB               0.19771
 16    Oklahoma          0.23864
 17    Memphis State     0.25319
 18    Navy              0.28921
 19    UNLV              0.35103
 20    DePaul            0.43770
 21    Iowa              0.50213
 22    Indiana           0.51713
 23    Maryland          0.55910
 24    Arkansas          0.62977
 25    Virginia          0.67586
 26    Washington        0.67756
 27    Tennessee         0.70822
 28    St. Johns         0.71425
 29    Virginia Tech     0.71638
 30    St. Joseph's      0.73492
 31    UCLA              0.73965
 32    Pittsburgh        0.75078
 33    Houston           0.75534
 34    Montana           0.75790
 35    Old Dominion      0.76821
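The remark that a simple average would work almost as well as the first principal component can be examined directly. The following is a minimal sketch, not part of the original example: it averages the ten rankings for each team and correlates that average with Prin1. The data set name RankCompare and the variable AvgRank are hypothetical names introduced here. The average is taken after missing-value replacement, so it shares the ad hoc imputation described earlier, and the same observation weights are assumed to be appropriate for the correlation.

```sas
/* Sketch: compare a simple average of the rankings with Prin1.      */
/* PCBasketball contains School, CSN--SI, Weight, and Prin1.         */
data RankCompare;
   set PCBasketball;
   AvgRank = mean(of CSN--SI);   /* average of the ten (imputed) rankings */
run;

/* A weighted correlation close to 1 would support the remark */
proc corr data=RankCompare nosimple;
   var Prin1 AvgRank;
   weight Weight;
run;
```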
This example illustrates the experimental ODS graphics in PROC PRINCOMP, using the example in the Getting Started section on page 3596.
The following statements request plots in PROC PRINCOMP.
ods html;
ods graphics on;

proc princomp data=Jobratings(drop='Overall Rating'n) n=5;
run;

ods graphics off;
ods html close;
These graphical displays are requested by specifying the experimental ODS GRAPHICS statement. For general information about ODS graphics, see Chapter 15, Statistical Graphics Using ODS. For specific information about the graphics available in the PRINCOMP procedure, see the ODS Graphics section on page 3613.
The N=5 option in the PROC PRINCOMP statement sets the number of principal components to 5.
The nearly horizontal profile of the first component indicates that the first component is correlated about equally with all variables. The second component is positively correlated with the variables Observational Skills and Willingness to Confront Problems and is negatively correlated with the variables Interest in People and Interpersonal Sensitivity.
Output 58.4.5 shows a scatter plot of the first and third components.