HOMEWORK #3 due Tuesday 11-2
Six problems.
Text references are to the textbook, Cody & Smith, ``Applied statistics and the SAS programming language''.
NOTE: See the main Math475 Web page for how to organize a homework
assignment using SAS. In particular,
ALWAYS INCLUDE YOUR NAME in a title statement in your SAS
programs, so that your name will appear at the top of each output page.
ALL HOMEWORKS MUST BE ORGANIZED in the following order:
(Part 1) First, your answers to all the problems in the homework,
whether you use SAS for that problem or not. If the problem asks you to
generate a graph or table, refer to the graph or table by page number in
the SAS output (see below). (Xeroxing a page or two from the SAS output or
cutting and pasting into a Word file or TeX source file is also OK.)
(Part 2) Second, all SAS programs that you used to obtain the output for
any of the problems. If possible, similar problems should be done with the
same SAS program. (In other words, write one SAS program for several
problems if that makes things easier, using title or title2 statements
to separate the problems in your output.)
(Part 3) Third, all output for all the SAS programs in the previous
step.
If an answer in Part 1 requires a table or a scatterplot that you need to
refer to, make sure that your SAS output has overall increasing (unique)
page numbers and make references to Part 3 by page number, such as
``The scatterplot for Problem 2 part (b) is on page #X in
the SAS output below.'' DO NOT say, ``see Page 3 in the SAS output''
if Part 3 has output from several SAS runs, each of which has its own
Page 3. In that case, either write your own (increasing) page numbers
on the SAS output, or else (for example) refer to ``Page 2-7 in the
SAS output'' (for page 7 in the second set of SAS output) and write
page numbers in the format ``2-7'' at the top of pages in your output.
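As an illustration of the title conventions above, here is a minimal sketch.
The student name, the problem labels, and the use of the sample dataset
sashelp.class as a stand-in for a real analysis are all placeholders, not
part of the assignment.

```sas
/* Hypothetical example: substitute your own name and problem labels. */
title 'Jane Doe - Math475 Homework 3';   /* line 1 at the top of every output page */

title2 'Problem 2 - PCA for Table 1';    /* line 2 marks the current problem */
proc print data=sashelp.class;           /* stand-in for the real analysis */
run;

title2 'Problem 3 - PCA for Table 2';    /* replaces the previous title2 only */
proc means data=sashelp.class;           /* stand-in for the real analysis */
run;
```

Because both procedures run in one program, the output pages should be
numbered consecutively, which makes page references in Part 1 unambiguous.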
WomensTrack.dat on the Math475 ``SAS programs covered'' Web page has
best-performance scores for the 1984 Olympics for women's track teams
from 55 countries for 7 races. The races are abbreviated m100, m200,
m400, m800, m1500, m3000, and Marathon.
(Warning: The first seven lines of WomensTrack.dat are comments.)
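One possible way to skip the seven comment lines is the firstobs= option of
the infile statement. The sketch below rests on assumptions that the
assignment does not state: the file path, the order of the columns, and that
each line begins with a country name containing no blanks. Check the comment
lines in the file itself before relying on it.

```sas
/* Sketch only. ASSUMED (not given in the assignment): the file path,   */
/* the column order, and a blank-free country name in the first column. */
data track;
  length Country $ 20;                    /* avoid truncating names at 8 characters */
  infile 'WomensTrack.dat' firstobs=8;    /* start reading at line 8 */
  input Country $ m100 m200 m400 m800 m1500 m3000 Marathon;
run;

proc print data=track(obs=5);             /* quick check of the first few rows */
run;
```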
Table 1 - Data for the 10 largest US Corporations in 1990
# Source: Fortune Magazine (April 23, 1990) p346-367 Co 1990 Time Inc.
# All numbers are in millions of dollars.
                     Sales  Profits   Assets
General_Motors      126974     4224   173297
Ford                 96933     3835   160893
Exxon                86656     3510    83219
IBM                  63438     3758    77734
General_Electric     55264     3939   128344
Mobil                50976     1809    39080
Philip_Morris        39069     2946    38528
Chrysler             36156      359    51038
Du_Pont              35209     2480    34715
Texaco               32416     2413    25636
Prin1 is a measure of the overall size of the corporation, since larger
(or smaller) corporations are likely to have larger (or smaller) amounts
of sales, profits, and assets. Note that Sales varies by nearly a factor
of four in Table 1 and Assets by more than a factor of six. Often the
first principal component (Prin1) in a PCA is a relatively uninteresting
measure of overall size and one is primarily interested in the variation
that is explained by Prin2 and lower-order principal components.
To see what Prin2 says about the financial condition of these 10
corporations during this year, sort and display the list of companies by
descending values of Prin2. Include Profits, Sales, and Assets in the
display, in that order. Which companies are at the top of the list? At
the bottom of the list? What can you say about what caused them to be at
the top or bottom of the sorted list?
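The mechanics of sorting and displaying by a principal component can be
sketched as follows. The dataset names corp and pcs and the variable name
Company are choices made for this illustration. Note that proc princomp
analyzes the correlation matrix by default; whether that is the intended
choice for this problem is not stated here.

```sas
/* Sketch: dataset and variable names (corp, pcs, Company) are illustrative. */
data corp;
  length Company $ 16;
  input Company $ Sales Profits Assets;
  datalines;
General_Motors 126974 4224 173297
Ford 96933 3835 160893
Exxon 86656 3510 83219
IBM 63438 3758 77734
General_Electric 55264 3939 128344
Mobil 50976 1809 39080
Philip_Morris 39069 2946 38528
Chrysler 36156 359 51038
Du_Pont 35209 2480 34715
Texaco 32416 2413 25636
;
run;

proc princomp data=corp out=pcs;   /* out= adds Prin1-Prin3 to each row */
  var Sales Profits Assets;
run;

proc sort data=pcs;
  by descending Prin2;             /* largest Prin2 first */
run;

proc print data=pcs;
  var Company Prin2 Profits Sales Assets;
run;
```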
Prin1, since large values of the variables will be converted to summands
that may be just as distinctive to PCA. However, log transforming the
data might help to isolate the size effect better, and will also convert
differences to ratios that might be more meaningful.
(Use logvar=log10(var) instead of logvar=log(var). Note that
log10(10000)=4 but log(10000)=9.21.)
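The difference between the two functions can be checked directly in a short
data step. In SAS, log is the natural logarithm and log10 is the base-10
logarithm; the variable names below are illustrative only.

```sas
/* Writes both logarithms of 10000 to the SAS log; no dataset is created. */
data _null_;
  x = 10000;
  base10  = log10(x);   /* base-10 logarithm */
  natural = log(x);     /* natural (base-e) logarithm */
  put base10= natural=;
run;
```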
Make a Prin2*Prin1 plot for the log-transformed data with the first
letter of the company name as the plotting symbol in order to illustrate
the data. What can you conclude about the relationship between Prin2 and
Prin1 for the log-transformed data? Which company seems to be a
counterexample for this relationship?
(Hint for part (v): In SAS, proc plot; plot Y*X=Var; run; for a text
variable Var uses the first letter of Var as the plotting symbol.)
(Proc factor in SAS makes it easy to generate scree plots and has a
slightly nicer factor-loading table, but Proc princomp is easier for
other tasks. See the PCA examples on the Math475 Web site.)
Table 2 - Dimensions of a sample of 25 lizards
# From Johnson&Wichern, ``Applied Multivariate Statistical Analysis'',
# 5th ed, Table 1.3, p17, 2002
# Source: J&W say, data courtesy of Kevin E. Bonine
# Mass is in grams. SVL (snout-vent length) and HLS (hind-limb span)
# are in millimeters.
Obs    Mass    SVL    HLS
  1    5.526   59.0  113.5
  2   10.401   75.0  142.0
  3    9.213   69.0  124.0
  4    8.953   67.5  125.0
  5    7.063   62.0  129.5
  6    6.610   62.0  123.0
  7   11.273   74.0  140.0
  8    2.447   47.0   97.0
  9   15.493   86.5  162.0
 10    9.004   69.0  126.5
 11    8.199   70.5  136.0
 12    6.601   64.5  116.0
 13    7.622   67.5  135.0
 14   10.067   73.0  136.5
 15   10.091   73.0  135.5
 16   10.888   77.0  139.0
 17    7.610   61.5  118.0
 18    7.733   66.5  133.5
 19   12.015   79.5  150.0
 20   10.049   74.0  137.0
 21    5.149   59.5  116.0
 22    9.158   68.0  123.0
 23   12.132   75.0  141.0
 24    6.978   66.5  117.0
 25    6.890   63.0  117.0
Make a Prin2*Prin1 plot of the data in Table 2 with the observation
number next to each point in order to illustrate the data. Note that the
scale of Prin2 is more compressed than that of Prin1. What are the
Observation numbers of the largest and smallest lizards in this plot, as
measured by Prin1?
(Hint: plot Y*X='*' $ Obs; in proc plot will put the value of Obs next to
each point in the scatterplot, provided that Obs is a variable in the
dataset.)
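Put together, the labeled plot might be produced along the following lines.
This is a sketch under stated assumptions: the dataset names lizards and
lizpcs are invented for the illustration, and proc princomp is left at its
default of analyzing the correlation matrix.

```sas
/* Sketch: dataset names (lizards, lizpcs) are illustrative. */
data lizards;
  input Obs Mass SVL HLS;
  datalines;
1 5.526 59.0 113.5
2 10.401 75.0 142.0
3 9.213 69.0 124.0
4 8.953 67.5 125.0
5 7.063 62.0 129.5
6 6.610 62.0 123.0
7 11.273 74.0 140.0
8 2.447 47.0 97.0
9 15.493 86.5 162.0
10 9.004 69.0 126.5
11 8.199 70.5 136.0
12 6.601 64.5 116.0
13 7.622 67.5 135.0
14 10.067 73.0 136.5
15 10.091 73.0 135.5
16 10.888 77.0 139.0
17 7.610 61.5 118.0
18 7.733 66.5 133.5
19 12.015 79.5 150.0
20 10.049 74.0 137.0
21 5.149 59.5 116.0
22 9.158 68.0 123.0
23 12.132 75.0 141.0
24 6.978 66.5 117.0
25 6.890 63.0 117.0
;
run;

proc princomp data=lizards out=lizpcs;   /* out= adds Prin1-Prin3 to each row */
  var Mass SVL HLS;
run;

proc plot data=lizpcs;
  plot Prin2*Prin1='*' $ Obs;            /* '*' marks the point, Obs labels it */
run;
```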
4. A chemical manufacturer tests the output quality of a fermentation process as a function of the amount of 5 different additives that are given the names AA, BB, CC, DD, and EE. Unfortunately, the people who generated the data did not vary the amounts of the different additives in a regular way, which would have made their effects on Quality easier to interpret. The output quality and the amounts of the additives for 30 batches are given in Table 3.
Table 3: Fermentation Quality as a Function of Five Additives
-------------------------------------------------------------
Batch  Quality     AA    BB      CC     DD     EE
------------------------------------------------------
   1     2190    1.464    4    48.853    99   1.90
   2     2475    1.623    3    37.295   108   3.05
   3     2185    3.106    2    30.980   176   3.75
   4     1964    0.113    4    43.325    21   2.61
   5     2115    2.145    3    52.934   146   1.99
   6     1721    1.717    2    14.751   124   5.36
   7     2217    2.479    3    35.755   130   3.22
   8     2879    3.124    2    41.335   184   3.11
   9     2523    2.144    3    45.212   112   2.65
  10     2003    1.308    3    30.076    83   3.93
  11     2733    3.357    3    52.237   193   2.13
  12     2866    4.778    1    39.955   262   3.15
  13     2295    0.477    4    41.858    52   3.43
  14     1994    1.956    3    42.677   127   2.49
  15     2092    1.587    3    41.965   122   2.81
  16     2345    2.897    2    43.303   191   2.92
  17     2788    3.958    2    44.117   227   2.69
  18     2595    3.284    1    32.195   210   4.16
  19     2268    3.906    2    32.471   215   3.94
  20     3032    3.739    1    29.220   238   3.93
  21     2875    4.777    1    39.695   267   2.82
  22     2765    2.243    4    63.574   140   1.23
  23     1900    0.580    3    25.423    48   4.40
  24     1874    2.270    2    33.537   150   3.95
  25     2132    1.074    3    37.588    94   3.18
  26     2125    1.479    2    30.615   116   3.96
  27     2145    2.093    2    33.474   122   3.81
  28     2775    3.094    2    36.825   214   3.19
  29     1979    1.513    2    23.603    96   4.70
  30     2292    2.280    3    44.234   159   2.70
Quality on the 5 additives? What is the Model P-value? What is the Model
R^2? Which additives are significant in the Parameter Estimate or Type
III SS tables? Which are significant in the Type I SS table (for example
in SAS's proc glm output), using the order of the covariates in Table 3?
Which are highly significant in each case?
(See AppleRidge.sas on the Math475 Web site.)
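For reference, a regression that prints both Type I and Type III sums of
squares might be set up as in the sketch below. The dataset name ferment is
an assumption made for this illustration, and the code is not taken from
AppleRidge.sas. The covariates are listed in the order of Table 3, which
matters for the Type I (sequential) table.

```sas
/* Sketch: the dataset name ferment is illustrative. */
data ferment;
  input Batch Quality AA BB CC DD EE;
  datalines;
1 2190 1.464 4 48.853 99 1.90
2 2475 1.623 3 37.295 108 3.05
3 2185 3.106 2 30.980 176 3.75
4 1964 0.113 4 43.325 21 2.61
5 2115 2.145 3 52.934 146 1.99
6 1721 1.717 2 14.751 124 5.36
7 2217 2.479 3 35.755 130 3.22
8 2879 3.124 2 41.335 184 3.11
9 2523 2.144 3 45.212 112 2.65
10 2003 1.308 3 30.076 83 3.93
11 2733 3.357 3 52.237 193 2.13
12 2866 4.778 1 39.955 262 3.15
13 2295 0.477 4 41.858 52 3.43
14 1994 1.956 3 42.677 127 2.49
15 2092 1.587 3 41.965 122 2.81
16 2345 2.897 2 43.303 191 2.92
17 2788 3.958 2 44.117 227 2.69
18 2595 3.284 1 32.195 210 4.16
19 2268 3.906 2 32.471 215 3.94
20 3032 3.739 1 29.220 238 3.93
21 2875 4.777 1 39.695 267 2.82
22 2765 2.243 4 63.574 140 1.23
23 1900 0.580 3 25.423 48 4.40
24 1874 2.270 2 33.537 150 3.95
25 2132 1.074 3 37.588 94 3.18
26 2125 1.479 2 30.615 116 3.96
27 2145 2.093 2 33.474 122 3.81
28 2775 3.094 2 36.825 214 3.19
29 1979 1.513 2 23.603 96 4.70
30 2292 2.280 3 44.234 159 2.70
;
run;

proc glm data=ferment;
  /* ss1 and ss3 request the Type I and Type III sums of squares tables */
  model Quality = AA BB CC DD EE / ss1 ss3;
run;
quit;
```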