Math 475 Homework 3 - Fall 2010

Click here for Prof. Sawyer's home page

HOMEWORK #3 due Tuesday 11-2

Six problems.

Text references are to the textbook: Cody & Smith, ``Applied Statistics and the SAS Programming Language''.

NOTE: See the main Math475 Web page for how to organize a homework assignment using SAS. In particular,
ALWAYS INCLUDE YOUR NAME in a title statement in your SAS programs, so that your name will appear at the top of each output page.
ALL HOMEWORKS MUST BE ORGANIZED in the following order:
(Part 1) First, your answers to all the problems in the homework, whether you use SAS for that problem or not. If the problem asks you to generate a graph or table, refer to the graph or table by page number in the SAS output (see below). (Xeroxing a page or two from the SAS output or cutting and pasting into a Word file or TeX source file is also OK.)
(Part 2) Second, all SAS programs that you used to obtain the output for any of the problems. If possible, similar problems should be done with the same SAS program. (In other words, write one SAS program for several problems if that makes things easier, using SAS title or title2 statements to separate the problems in your output.)
(Part 3) Third, all output for all the SAS programs in the previous step.
If an answer in Part 1 requires a table or a scatterplot that you need to refer to, make sure that your SAS output has overall increasing (unique) page numbers and make references to Part 3 by page number, such as ``The scatterplot for Problem 2 part (b) is on page #X in the SAS output below.'' DO NOT say, ``see Page 3 in the SAS output'' if Part 3 has output from several SAS runs, each of which has its own Page 3. In that case, either write your own (increasing) page numbers on the SAS output, or else (for example) refer to ``Page 2-7 in the SAS output'' (for page 7 in the second set of SAS output) and write page numbers in the format ``2-7'' at the top of pages in your output.

1. The Women's Track dataset (WomensTrack.dat) on the Math475 ``SAS programs covered'' Web page has best-performance scores for the 1984 Olympics for women's track teams from 55 countries for 7 races. The races are abbreviated

m100, m200, m400, m800, m1500, m3000

and Marathon. (Warning: The first seven lines of WomensTrack.dat are comments.)

Run a regression of the Marathon scores on the six other race scores. What is the P-value for the Model test (for a regression on all six variables)? How many of the six variables are significant (P<=0.05) as individual covariates using Type III criteria? How many are significant using Type I criteria, using the order of the races above? What are the P-values of the significant variables?

Do a model-selection analysis on this regression using the (a) Adjusted R-Square, (b) Forward, (c) Stepwise, and (d) Backward variable-selection methods. (Stepwise selection is Forward selection with a modification to remove variables that later become redundant. See the SAS help documentation for the details.)

What subset of variables (races) does each method choose, using SAS's default settings? Which appear to be the most reasonable to you?
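One possible way to set up Problem 1 is sketched below. It is only a starting point: the column layout of WomensTrack.dat is an assumption (country name first, then the seven race times), and the dataset name track and the title text are placeholders. Check the comment lines at the top of the file and adjust the input statement to match.

```sas
/* Sketch only. ASSUMPTION: each data line holds a one-word country
   name followed by the seven race times. Adjust the input statement
   to the layout described in the comment lines of the file. */
title 'Your Name - Math 475 HW3 - Problem 1';

data track;                                  /* hypothetical dataset name */
    infile 'WomensTrack.dat' firstobs=8;     /* skip the 7 comment lines  */
    input Country $ m100 m200 m400 m800 m1500 m3000 Marathon;
run;

/* Full model: Model test plus Type I and Type III sums of squares */
title2 'Regression of Marathon on the six other races';
proc glm data=track;
    model Marathon = m100 m200 m400 m800 m1500 m3000 / ss1 ss3;
run;
quit;

/* Variable selection with SAS's default entry and stay levels */
title2 'Model selection';
proc reg data=track;
    model Marathon = m100 m200 m400 m800 m1500 m3000 / selection=adjrsq;
    model Marathon = m100 m200 m400 m800 m1500 m3000 / selection=forward;
    model Marathon = m100 m200 m400 m800 m1500 m3000 / selection=stepwise;
    model Marathon = m100 m200 m400 m800 m1500 m3000 / selection=backward;
run;
quit;
```

The covariates are listed in the order given above because the Type I tests depend on that order.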

2. Annual reports for the 10 largest US corporations in 1990 are given in Table 1 below.

     Table 1 - Data for the 10 largest US Corporations in 1990
# Source: Fortune Magazine (April 23, 1990), pp. 346-367. © 1990 Time Inc.
# All numbers are in millions of dollars.
                      Sales   Profits  Assets
    General_Motors   126974    4224    173297
    Ford              96933    3835    160893
    Exxon             86656    3510     83219
    IBM               63438    3758     77734
    General_Electric  55264    3939    128344
    Mobil             50976    1809     39080
    Philip_Morris     39069    2946     38528
    Chrysler          36156     359     51038
    Du_Pont           35209    2480     34715
    Texaco            32416    2413     25636

(i) Do a Principal Components Analysis for these 10 corporations to explain the variability of the financial data in Table 1. How many principal components are required to explain at least 90% of the variation in the data?

(ii) Display a scree plot for the eigenvalues. Do any of the eigenvalues look distinctive?

(iii) As one might have expected in advance, the first Principal Component (Prin1) is a measure of the overall size of the corporation, since larger (or smaller) corporations are likely to have larger (or smaller) amounts of sales, profits, and assets. Note that Sales varies by nearly a factor of four in Table 1 and Assets by more than a factor of six. Often the first principal component (Prin1) in a PCA analysis is a relatively uninteresting measure of overall size and one is primarily interested in the variation that is explained by Prin2 and lower-order principal components.

In this case, what does the second principal component measure? To help understand what Prin2 says about the financial condition of these 10 corporations during this year, sort and display the list of companies by descending values of Prin2. Include Profits, Sales, and Assets in the display, in that order. Which companies are at the top of the list? At the bottom of the list? What can you say about what caused them to be at the top or bottom of the sorted list?
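A minimal sketch of how parts (i) to (iii) might be set up in SAS follows. The dataset names corp and pcs and the title text are placeholders chosen here, not names from the course Web site. Proc princomp analyzes the correlation matrix by default.

```sas
title 'Your Name - Math 475 HW3 - Problem 2';

data corp;                       /* hypothetical dataset name */
    length Company $16;          /* long enough for General_Electric */
    input Company $ Sales Profits Assets;
    datalines;
General_Motors   126974    4224    173297
Ford              96933    3835    160893
Exxon             86656    3510     83219
IBM               63438    3758     77734
General_Electric  55264    3939    128344
Mobil             50976    1809     39080
Philip_Morris     39069    2946     38528
Chrysler          36156     359     51038
Du_Pont           35209    2480     34715
Texaco            32416    2413     25636
;
run;

/* (i) Eigenvalues, proportions of variance, and component scores */
proc princomp data=corp out=pcs;
    var Sales Profits Assets;
run;

/* (ii) Scree plot of the eigenvalues */
proc factor data=corp method=principal scree;
    var Sales Profits Assets;
run;

/* (iii) Companies listed by descending Prin2 */
proc sort data=pcs;
    by descending Prin2;
run;

proc print data=pcs;
    var Company Profits Sales Assets Prin2;
run;
```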

(iv) In general, log transforming the variables will not correct for a ``size'' effect in Prin1, since large values of the variables will be converted to summands that may be just as distinctive to PCA. However, log transforming the data might help to isolate the size effect better, and will also convert differences to ratios that might be more meaningful.

Repeat the analysis with Sales, Profits, and Assets replaced by their logarithms in an attempt to get a clearer picture. Use base-10 logarithms so that the numbers in the displays are more intuitive. (That is, in SAS, use logvar=log10(var) instead of logvar=log(var). Note that log10(10000)=4 but log(10000)=9.21.)

Do you obtain similar results to what you obtained in part (iii)? Are the top five companies the same?

(v) Construct a Prin2*Prin1 plot for the log-transformed data with the first letter of the company name as the plotting symbol in order to illustrate the data. What can you conclude about the relationship between Prin2 and Prin1 for the log-transformed data? Which company seems to be a counterexample for this relationship? (Hint for part (v): In SAS,

proc plot; plot Y*X=Var;
run;

for a text variable Var uses the first letter of Var as the plotting symbol.)

(Hints: Proc factor in SAS makes it easy to generate scree plots and has a slightly nicer factor-loading table, but Proc princomp is easier for other tasks. See the PCA examples on the Math475 Web site.)
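For parts (iv) and (v), the log transformation and the plot could be sketched as follows. The dataset names logcorp and logpcs and the variable names logSales, logProfits, and logAssets are placeholders introduced here for illustration.

```sas
title 'Your Name - Math 475 HW3 - Problem 2 (log10 data)';

data logcorp;                    /* hypothetical dataset name */
    length Company $16;
    input Company $ Sales Profits Assets;
    logSales   = log10(Sales);   /* base-10 logs, as the problem asks */
    logProfits = log10(Profits);
    logAssets  = log10(Assets);
    datalines;
General_Motors   126974    4224    173297
Ford              96933    3835    160893
Exxon             86656    3510     83219
IBM               63438    3758     77734
General_Electric  55264    3939    128344
Mobil             50976    1809     39080
Philip_Morris     39069    2946     38528
Chrysler          36156     359     51038
Du_Pont           35209    2480     34715
Texaco            32416    2413     25636
;
run;

/* (iv) PCA on the log-transformed variables */
proc princomp data=logcorp out=logpcs;
    var logSales logProfits logAssets;
run;

/* (v) Prin2*Prin1 plot; the first letter of Company is the symbol */
proc plot data=logpcs;
    plot Prin2*Prin1=Company;
run;
```

The sorted listing from part (iii) can be repeated on logpcs in the same way to compare the top five companies.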

3. A biologist is interested in the population structure of the lizard Cophosaurus texanus. Data with three different measurements from 25 lizards in this species are given in Table 2 below. The biologist would like to use these data to show that the lizards in this species are highly variable in shape, perhaps in response to specialization to different subhabitats of their home range.

     Table 2 - Dimensions of a sample of 25 lizards
# From Johnson&Wichern, ``Applied Multivariate Statistical Analysis'',
#   5th ed, Table 1.3, p17, 2002
# Source: J&W say, data courtesy of Kevin E. Bonine
# Mass is in grams. SVL (snout-vent length) and HLS (hind-limb span)
#   are in millimeters.
    Obs     Mass     SVL     HLS
     1      5.526    59.0    113.5
     2     10.401    75.0    142.0
     3      9.213    69.0    124.0
     4      8.953    67.5    125.0
     5      7.063    62.0    129.5
     6      6.610    62.0    123.0
     7     11.273    74.0    140.0
     8      2.447    47.0     97.0
     9     15.493    86.5    162.0
    10      9.004    69.0    126.5
    11      8.199    70.5    136.0
    12      6.601    64.5    116.0
    13      7.622    67.5    135.0
    14     10.067    73.0    136.5
    15     10.091    73.0    135.5
    16     10.888    77.0    139.0
    17      7.610    61.5    118.0
    18      7.733    66.5    133.5
    19     12.015    79.5    150.0
    20     10.049    74.0    137.0
    21      5.149    59.5    116.0
    22      9.158    68.0    123.0
    23     12.132    75.0    141.0
    24      6.978    66.5    117.0
    25      6.890    63.0    117.0

(i) Do a Principal Components Analysis for this sample of lizards to explain the variability of the data in Table 2. How many principal components are required to explain at least 90% of the variation in the data? What percentage of the variation is explained by the first principal component?

Does your analysis support the biologist's conjecture that there is considerable variation in shape among these lizards, other than trivial variation in overall size, which might be due to age or sex?

(ii) Construct a Prin2*Prin1 plot of the data in Table 2 with the observation number next to each point in order to illustrate the data. Note that the scale of Prin2 is more compressed than that of Prin1. What are the Observation numbers of the largest and smallest lizards in this plot, as measured by Prin1?

(Hint: The command plot Y*X='*' $ Obs; in proc plot will put the value of Obs next to each point in the scatterplot, provided that Obs is a variable in the dataset.)
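A sketch of one way to set up Problem 3, using the plot statement from the hint. The dataset names lizards and lizpcs are placeholders chosen here.

```sas
title 'Your Name - Math 475 HW3 - Problem 3';

data lizards;                    /* hypothetical dataset name */
    input Obs Mass SVL HLS;
    datalines;
 1      5.526    59.0    113.5
 2     10.401    75.0    142.0
 3      9.213    69.0    124.0
 4      8.953    67.5    125.0
 5      7.063    62.0    129.5
 6      6.610    62.0    123.0
 7     11.273    74.0    140.0
 8      2.447    47.0     97.0
 9     15.493    86.5    162.0
10      9.004    69.0    126.5
11      8.199    70.5    136.0
12      6.601    64.5    116.0
13      7.622    67.5    135.0
14     10.067    73.0    136.5
15     10.091    73.0    135.5
16     10.888    77.0    139.0
17      7.610    61.5    118.0
18      7.733    66.5    133.5
19     12.015    79.5    150.0
20     10.049    74.0    137.0
21      5.149    59.5    116.0
22      9.158    68.0    123.0
23     12.132    75.0    141.0
24      6.978    66.5    117.0
25      6.890    63.0    117.0
;
run;

/* (i) PCA on the three measurements; scores are saved in lizpcs */
proc princomp data=lizards out=lizpcs;
    var Mass SVL HLS;
run;

/* (ii) Prin2*Prin1 plot with the value of Obs next to each point */
proc plot data=lizpcs;
    plot Prin2*Prin1='*' $ Obs;
run;
```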

4. A chemical manufacturer tests the output quality of a fermentation process as a function of the amount of 5 different additives that are given the names AA, BB, CC, DD, and EE. Unfortunately, the people who generated the data did not vary the amounts of the different additives in a regular way, which would have made their effects on Quality easier to interpret. The output quality and the amounts of the additives for 30 batches are given in Table 3.

 Table 3: Fermentation Quality as a Function of Five Additives
 -------------------------------------------------------------
  Batch   Quality     AA     BB    CC        DD     EE
  ------------------------------------------------------
    1      2190      1.464    4    48.853     99    1.90  
    2      2475      1.623    3    37.295    108    3.05  
    3      2185      3.106    2    30.980    176    3.75  
    4      1964      0.113    4    43.325     21    2.61  
    5      2115      2.145    3    52.934    146    1.99  
    6      1721      1.717    2    14.751    124    5.36  
    7      2217      2.479    3    35.755    130    3.22  
    8      2879      3.124    2    41.335    184    3.11  
    9      2523      2.144    3    45.212    112    2.65  
   10      2003      1.308    3    30.076     83    3.93  
   11      2733      3.357    3    52.237    193    2.13  
   12      2866      4.778    1    39.955    262    3.15  
   13      2295      0.477    4    41.858     52    3.43  
   14      1994      1.956    3    42.677    127    2.49  
   15      2092      1.587    3    41.965    122    2.81  
   16      2345      2.897    2    43.303    191    2.92  
   17      2788      3.958    2    44.117    227    2.69  
   18      2595      3.284    1    32.195    210    4.16  
   19      2268      3.906    2    32.471    215    3.94  
   20      3032      3.739    1    29.220    238    3.93  
   21      2875      4.777    1    39.695    267    2.82  
   22      2765      2.243    4    63.574    140    1.23  
   23      1900      0.580    3    25.423     48    4.40  
   24      1874      2.270    2    33.537    150    3.95  
   25      2132      1.074    3    37.588     94    3.18  
   26      2125      1.479    2    30.615    116    3.96  
   27      2145      2.093    2    33.474    122    3.81  
   28      2775      3.094    2    36.825    214    3.19  
   29      1979      1.513    2    23.603     96    4.70  
   30      2292      2.280    3    44.234    159    2.70

(i) Is there a significant regression of Quality on the 5 additives? What is the Model P-value? What is the Model R²? What additives are significant in the Parameter Estimate or Type III SS tables? Which are significant in the Type I SS table (for example in SAS's proc glm output), using the order of the covariates in Table 3? Which are highly significant in each case?

(ii) Run a ridge regression on the data in Table 3 as a way of finding stable estimates of the effect of the covariates in Table 3. Create and display a ridge trace graphic for estimates of the five additive coefficients as a function of the ridge parameter k (or _RIDGE_ in some SAS output). For definiteness, use k=0 to 0.30 with steps of 0.02 for the ridge parameters.

(iii) Find a value of k (or _RIDGE_) such that the variance-inflation factors (VIFs) of all five covariates are 2.0 or lower and such that the parameter estimates in the ridge trace output appear to have approximately stabilized.

What value of k did you choose? How do the ridge-regression estimates of the coefficients for that value of k compare with the coefficients that you found in part (i)?

(Hint: See AppleRidge.sas on the Math475 Web site.)
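The following is a rough sketch of how Problem 4 might be organized. It is not taken from AppleRidge.sas, and the dataset names ferm and ridgeout are placeholders chosen here. The ridge= and outvif options on the proc reg statement write the ridge estimates and their VIFs to the outest= dataset, one set of rows per value of _RIDGE_.

```sas
title 'Your Name - Math 475 HW3 - Problem 4';

data ferm;                       /* hypothetical dataset name */
    input Batch Quality AA BB CC DD EE;
    datalines;
 1      2190      1.464    4    48.853     99    1.90
 2      2475      1.623    3    37.295    108    3.05
 3      2185      3.106    2    30.980    176    3.75
 4      1964      0.113    4    43.325     21    2.61
 5      2115      2.145    3    52.934    146    1.99
 6      1721      1.717    2    14.751    124    5.36
 7      2217      2.479    3    35.755    130    3.22
 8      2879      3.124    2    41.335    184    3.11
 9      2523      2.144    3    45.212    112    2.65
10      2003      1.308    3    30.076     83    3.93
11      2733      3.357    3    52.237    193    2.13
12      2866      4.778    1    39.955    262    3.15
13      2295      0.477    4    41.858     52    3.43
14      1994      1.956    3    42.677    127    2.49
15      2092      1.587    3    41.965    122    2.81
16      2345      2.897    2    43.303    191    2.92
17      2788      3.958    2    44.117    227    2.69
18      2595      3.284    1    32.195    210    4.16
19      2268      3.906    2    32.471    215    3.94
20      3032      3.739    1    29.220    238    3.93
21      2875      4.777    1    39.695    267    2.82
22      2765      2.243    4    63.574    140    1.23
23      1900      0.580    3    25.423     48    4.40
24      1874      2.270    2    33.537    150    3.95
25      2132      1.074    3    37.588     94    3.18
26      2125      1.479    2    30.615    116    3.96
27      2145      2.093    2    33.474    122    3.81
28      2775      3.094    2    36.825    214    3.19
29      1979      1.513    2    23.603     96    4.70
30      2292      2.280    3    44.234    159    2.70
;
run;

/* (i) Ordinary regression with Type I and Type III sums of squares */
proc glm data=ferm;
    model Quality = AA BB CC DD EE / ss1 ss3;
run;
quit;

/* (ii) Ridge regression for k = 0 to 0.30 in steps of 0.02 */
proc reg data=ferm outest=ridgeout outvif ridge=0 to 0.30 by 0.02;
    model Quality = AA BB CC DD EE;
    plot / ridgeplot nomodel nostat;     /* ridge trace */
run;
quit;

/* (iii) Ridge estimates and VIFs for each value of _RIDGE_ */
proc print data=ridgeout;
run;
```

In the printed dataset, the _TYPE_ column separates the ridge coefficient rows from the VIF rows, which makes it straightforward to scan for the smallest k with all five VIFs at or below 2.0.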
