Click here for Prof. Sawyer's
home page
TAKEHOME FINAL due on or before Mon 12-20-2010 at 4 PM
NOTE: There should be NO COLLABORATION on the takehome final,
other than for the mechanics of using the computer.
Text references are to the textbook, Cody & Smith,
``Applied statistics and the SAS programming language'', 5th edn
NOTE: See the main Math475 Web page for how to organize a homework
assignment or takehome test using SAS. In particular,
ALWAYS INCLUDE YOUR NAME in a title statement in your SAS
programs, so that your name will appear at the top of each output page.
ALL HOMEWORKS MUST BE ORGANIZED in the following order:
(Part 1) First, your answers to all the problems in the homework,
whether you use SAS for that problem or not. If the problem asks you to
generate a graph or table, refer to the graph or table by page number in
the SAS output (see below). (Xeroxing a page or two from the SAS output or
cutting and pasting into a Word file or TeX source file is also OK.)
(Part 2) Second, all SAS programs that you used to obtain the output for
any of the problems. If possible, similar problems should be done with the
same SAS program. (In other words, write one SAS program for several
problems if that makes things easier, using title or title2 statements
to separate the problems in your output.)
(Part 3) Third, all output for all the SAS programs in the previous
step.
If an answer in Part 1 requires a table or a scatterplot that you need to
refer to, make sure that your SAS output has overall increasing (unique)
page numbers and make references to Part 3 by page number, such as
``The scatterplot for Problem 2 part (b) is on page #X in
the SAS output below.'' DO NOT say, ``see Page 3 in the SAS output''
if Part 3 has output from several SAS runs, each of which has its own
Page 3. In that case, either write your own (increasing) page numbers
on the SAS output, or else (for example) refer to ``Page 2-7 in the
SAS output'' (for page 7 in the second set of SAS output) and write
page numbers in the format ``2-7'' at the top of pages in your output.
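The two page-numbering schemes described above can be sketched in a few lines. This is only an illustration in Python, not part of the SAS workflow; the function names and the page counts per run are hypothetical.

```python
def run_page_labels(pages_per_run):
    """Return 'run-page' labels such as '2-7' for every page of every SAS run."""
    labels = []
    for run, n_pages in enumerate(pages_per_run, start=1):
        for page in range(1, n_pages + 1):
            labels.append(f"{run}-{page}")
    return labels


def overall_page_number(pages_per_run, run, page):
    """Convert (run, page) into one increasing page number across all runs."""
    return sum(pages_per_run[: run - 1]) + page


# Hypothetical example: three SAS runs with 4, 9, and 3 output pages.
pages_per_run = [4, 9, 3]
labels = run_page_labels(pages_per_run)
print(labels[:6])
print(overall_page_number(pages_per_run, run=2, page=7))
```

Either scheme gives every page a unique reference, which is the only requirement: a bare ``Page 3'' is ambiguous as soon as two runs each have a Page 3.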
Different parts of problems may not be equally weighted.
Five (5) problems.
Problem 1. Heights and weights for the employees of VaporLock
Software Services are recorded in Table 1. Each table entry has the
height, weight, and sex for one employee, in that order. The employees of
this company are known to be unusual.
Table 1 --- Height, Weight, Gender for 79 VaporLock Employees
69 149 M 66 189 M 82 134 M 60 144 F
71 113 F 69 98 F 72 179 M 65 198 M
58 147 F 74 83 F 61 125 F 69 191 M
70 98 F 66 98 F 64 198 M 61 117 F
68 191 M 68 105 F 74 137 M 72 181 M
77 129 M 70 145 M 64 126 F 75 132 M
78 139 M 75 149 M 74 138 M 74 135 M
72 80 F 61 114 F 66 113 F 67 160 M
73 150 M 70 115 F 72 91 F 61 90 F
68 79 F 76 149 M 67 94 F 69 90 F
59 104 F 61 118 F 69 86 F 68 95 F
57 134 F 56 139 F 70 180 M 78 165 M
68 114 F 73 88 F 58 124 F 63 121 F
69 174 M 65 126 F 77 128 M 79 136 M
66 92 F 67 136 F 66 123 F 78 149 M
68 139 M 70 94 F 62 105 F 71 117 F
65 112 F 77 148 M 70 177 M 59 125 F
76 179 M 63 139 F 70 97 F 69 88 F
76 170 M 72 143 M 71 143 M 80 135 M
78 161 M 58 131 F 69 178 M
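Table 1 packs several (height, weight, sex) triples onto each line, so the data must be read as a stream of triples rather than one employee per line. The sketch below shows the idea in Python on the first two rows only; it is an illustration, not the SAS program expected for the exam, and the names parse_triples and by_sex are hypothetical. (In SAS, the analogous device is a trailing @@ on the INPUT statement.)

```python
def parse_triples(text):
    """Split whitespace-separated text into (height, weight, sex) records."""
    tokens = text.split()
    if len(tokens) % 3 != 0:
        raise ValueError("expected a multiple of 3 tokens")
    records = []
    for i in range(0, len(tokens), 3):
        height, weight, sex = tokens[i : i + 3]
        records.append((int(height), int(weight), sex))
    return records


# First two rows of Table 1 only; the full table has 79 entries.
sample = """
69 149 M 66 189 M 82 134 M 60 144 F
71 113 F 69 98 F 72 179 M 65 198 M
"""

records = parse_triples(sample)

# Group the weights by sex, as needed for a two-sample comparison.
by_sex = {}
for height, weight, sex in records:
    by_sex.setdefault(sex, []).append(weight)

print(len(records), sorted(by_sex))
```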
(i) For the employees grouped into two samples by sex, what are the
two sample sizes? What are the two sample means for weight? Is there a
significant difference in weight between the two sexes? What is the value
of the t-statistic for the classical two-sample t-test? What is the
P-value?
The classical t-test assumes that the variances of the two samples are
the same. Is this a reasonable assumption for the weights in this case?
What is a P-value for a hypothesis based on this assumption? Does this
P-value mean that it is safe to assume that the variances are the same, or
the opposite?
(ii) What is the Pearson correlation coefficient between height and
weight for the individuals in Table 1, ignoring sex? Is it
significantly different from zero? What is the P-value?
How was the P-value for the Pearson correlation coefficient
calculated? What is the number of degrees of freedom of the test statistic
that is used to calculate the P-value? What is the formula that expresses
this test statistic in terms of rho (the sample correlation coefficient)?
(iii) What are the Pearson correlation coefficients between height and
weight for employees within each sex? Are they significant? Do they have
the same sign as the correlation coefficient in part (ii)? How can the
correlation coefficients have one sign within groups but a different sign
for the two groups combined? Construct a height by weight scatter plot
using sex as the plotting symbol to illustrate your answer.
Problem 2. Lengths and widths were measured for two types of aphids
(a small beetle) collected in a semitropical country. The entries in
Table 2 are the lengths and widths, respectively, for 56 aphids.
Units are in tenths of millimeters.
Table 2 --- Lengths and widths (0.1mm) for 56 aphids.
Type A (n=17):
258 237 273 226 287 210 289 231
304 237 309 207 311 237 314 234
319 197 330 216 333 185 335 187
342 189 352 195 357 200 365 201
371 185
Type B (n=39):
239 241 256 228 260 213 266 207
271 226 273 187 278 230 280 220
281 183 284 200 286 191 291 214
292 233 293 199 296 195 296 205
300 228 302 200 303 198 303 203
307 215 312 191 318 229 321 181
322 193 322 193 322 219 323 197
326 217 328 178 328 190 330 178
335 187 339 175 340 191 346 177
346 183 358 178 360 177
(i) (5 pts) Based on these samples, is there a significant difference
in lengths between the two samples of beetles? Is there a significant
difference in the widths? What is the number of degrees of freedom of the
t-distribution in both cases? Based on the sample means, does the longer
type of aphid also tend to be wider?
(ii) (15 pts) Is there a significant difference in length and width
together (that is, in the vectors (length,width) ) between the two
types of aphids? Use a MANOVA test to find out. (A MANOVA test for two
treatment groups is called a Hotelling T^2 test.)
What is the value of the equivalent F statistic? How many degrees of
freedom does this F statistic have in the numerator and how many in the
denominator? What is the P-value? (Hint: See the discussion in
Nreading.sas and Ncoffee.sas on the Math475 Web site, and also in
HotLizards.sas. In all three SAS example input files, the MANOVA code
is at the end.)
Problem 3. A manufacturing company with four factories wants to
control the number of defects in the main product that it manufactures. As
a first step, the company wants to know where most of the variation of the
defects is located: among factories, among groups (workgroups) working
within the same factory, or from month to month within the same workgroup.
The company collects observations on the number of products with defects
in a sample of 100 products for three randomly chosen different months
from workgroups within the four factories. The data is collected in the
following table.
Table 3 --- Product defects by factory and workgroup
Factory1:
Group1 30 15 17 Group2 22 6 31 Group3 21 26 15
Factory2:
Group1 32 30 32 Group2 31 27 21 Group3 32 31 35
Group4 27 50 36 Group5 21 29 34
Factory3:
Group1 20 30 29 Group2 21 27 21 Group3 28 23 33
Group4 14 14 25
Factory4:
Group1 20 26 26 Group2 23 19 17 Group3 36 30 32
Group4 11 29 14 Group5 17 22 35
Note that ``Group1'' does not refer to the same group in different
factories, but only to the first workgroup from that factory that
happened to send data to the parent company. (Treat the three
observations for each workgroup as independent and identically
distributed samples for that workgroup.)
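Because ``Group1'' in Factory1 and ``Group1'' in Factory2 are different workgroups, any analysis needs a workgroup label that is unique across factories. The sketch below shows one way to build such a label from Table 3. It is a Python illustration only; the variable names and the "Factory/Group" label format are assumptions, not something the exam prescribes.

```python
# Table 3 transcribed as factory -> group -> three monthly defect counts.
table3 = {
    "Factory1": {"Group1": [30, 15, 17], "Group2": [22, 6, 31],
                 "Group3": [21, 26, 15]},
    "Factory2": {"Group1": [32, 30, 32], "Group2": [31, 27, 21],
                 "Group3": [32, 31, 35], "Group4": [27, 50, 36],
                 "Group5": [21, 29, 34]},
    "Factory3": {"Group1": [20, 30, 29], "Group2": [21, 27, 21],
                 "Group3": [28, 23, 33], "Group4": [14, 14, 25]},
    "Factory4": {"Group1": [20, 26, 26], "Group2": [23, 19, 17],
                 "Group3": [36, 30, 32], "Group4": [11, 29, 14],
                 "Group5": [17, 22, 35]},
}

# One row per observation: (factory, unique workgroup label, defects).
rows = []
for factory, groups in table3.items():
    for group, defects in groups.items():
        workgroup = f"{factory}/{group}"  # hypothetical unique label
        for d in defects:
            rows.append((factory, workgroup, d))

workgroups = sorted({w for _, w, _ in rows})
group_names = sorted({w.split("/")[1] for w in workgroups})

# The bare names Group1..Group5 would wrongly merge distinct workgroups.
print(len(rows), len(workgroups), len(group_names))
```

Using the bare group name as a factor level would treat workgroups from different factories as the same unit, which is exactly what the note above warns against.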
(i) Using within-workgroup variation to estimate the error, was there
significant variation in the numbers of defects over the 12 or more
workgroups in the study, ignoring the factories that contain them? What
is the P-value? What are the degrees of freedom of the resulting F
statistic?
(ii) Analyze the appropriate ANOVA model taking into account both groups
and factories. Is there significant variation in the number of defects by
factory? Is there significant variation by workgroups within factories?
What are the P-values in each case? What are the degrees of freedom of
the two F statistics?
(iii) What are the MSS (Mean Sum of Squares) values for within-workgroup
variation, between-factory variation, and variation between workgroups
within factories? Do these values appear consistent with your answers to
part (ii)?
(iv) Is there significant variation in the number of defects by factory,
ignoring any group structure within each factory? (That is, assuming
that everybody in a factory is in the same workgroup.) What is the
P-value? What are the degrees of freedom of the F statistic? Why is this
P-value different from the P-value for factory in part (ii)?
(v) For the analysis in part (ii), which pairs of factories produced
output that was significantly different in quality? Use the Tukey
multiple-comparison procedure to find out. Does any one factory stand out?
Problem 4. An engineer is interested in the running
temperature of a mechanical device as a function of three variables:
Heat-shield type, with two levels (H1,H2), Fan size, with three types
(F1,F2,F3), and heat baffle type, with five levels (B1,B2,B3,B4,B5). One
observation of the running temperature is made for each set of levels of
the three variables. The running temperatures are listed in Table 4.
Table 4. Running Temperatures of a Device
         B1   B2   B3   B4   B5
F1  H1  199  175  187  169  189
    H2  196  196  221  196  244
F2  H1  203  182  178  181  193
    H2  176  179  217  245  244
F3  H1  177  173  178  184  174
    H2  166  204  207  205  284
(i) Run a full-factorial model for H=Heat-shield type, F=Fan size, and
B=Baffle type on the data in Table 4. Since there is only one
observation per cell, you will not obtain any P-values, but you can
compare the MS (mean sum-of-squares) terms for the 7 effects in a
full-factorial model with three factors. Which three effects have the
largest MS terms? Which three have the smallest MS terms?
(ii) Using H=Heat-shield type as the major variable, run a split-plot
analysis on the three factors H, F=Fan size, and B=Baffle type. Recall
that this procedure tests significance of the main effects of H, F, and B
and the two interactions H*F and H*B by using the sum of the SS terms for
F*B and H*F*B, or equivalently of the nested SS term F*B(H).
Does this seem like a reasonable procedure in this case, given the
relative size of the MS terms for F*B and H*F*B in part (i)? In the
resulting analysis, which of the five effects being tested are
significant? What are the P-values of the significant effects? Construct
an interaction plot for each interaction that is significant, and interpret
the interaction plot.
(iii) Note that F and all of its second and higher-order interactions are
non-significant in part (ii), and all have relatively small MS terms
in part (i). Use this information to declare that the factor F
is ``inactive'' and remove it from the model. This is equivalent to
assuming that values for different levels of F for a fixed (H,B) cell are
independent replications for that cell. Run a full factorial model for H
and B (only) on the data in Table 4. This should give you P-values
for H, B, and H*B, since there are only 10 (H,B) cells and there are 30
values in Table 4. Which of the effects for H, B, and H*B are now
significant? What are their P-values? How does this compare with your
results in part (ii)?
(iv) Again assuming that F=Fan size is inactive as in part (iii),
compare the 10 (H,B) combinations using the Tukey multiple-comparison
procedure. Which pairs of these 10 combinations are significantly
different? Does any pair stand out? (Hint: Do a one-way ANOVA on
the pairs (H,B), ignoring the factor F. You can define a variable for
(H,B) pairs by proceeding as in the last SAS procedure in ThreeWay.sas
on the Math475 Web site, or else by following one of the suggestions on
pages 219-220 in the textbook.)
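Dropping F as in part (iii) turns the three fan sizes into replicates within each (H,B) cell, and the hint in part (iv) asks for a single variable that labels those cells. The sketch below shows both steps on the Table 4 data. It is a Python illustration only, not the ThreeWay.sas or textbook method; the combined label format (for example "H1B3") is an assumption.

```python
baffles = ["B1", "B2", "B3", "B4", "B5"]

# Table 4 transcribed as (fan, heat shield) -> temperatures for B1..B5.
table4 = {
    ("F1", "H1"): [199, 175, 187, 169, 189],
    ("F1", "H2"): [196, 196, 221, 196, 244],
    ("F2", "H1"): [203, 182, 178, 181, 193],
    ("F2", "H2"): [176, 179, 217, 245, 244],
    ("F3", "H1"): [177, 173, 178, 184, 174],
    ("F3", "H2"): [166, 204, 207, 205, 284],
}

# Collapse over F: each (H,B) cell collects one value per fan size.
cells = {}
for (fan, shield), temps in table4.items():
    for baffle, temp in zip(baffles, temps):
        hb = shield + baffle  # hypothetical combined label, e.g. "H1B3"
        cells.setdefault(hb, []).append(temp)

for hb in sorted(cells):
    print(hb, cells[hb])
```

The result is 10 cells with 3 values each, which accounts for all 30 entries in Table 4; the combined label can then serve as the single factor in a one-way analysis.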
Problem 5. A study of nerve fibers is made for 5 normal and 5
diabetic rats. The experimenter wants to learn how the cross-sectional
areas of the nerve fibers of a particular nerve vary with the diabetic
state, and also how this varies with the position along the nerve fibers
(Proximal, Medial, or Distal). For definiteness, let Group be the factor
whose levels are Normal and Diabetic, and NvLoc a factor with levels
Proximal, Medial, and Distal. The nerve cross-sectional areas for the 10
rats are in Table 5.
Table 5 --- Cross-Sectional Areas of Nerve Fibers in 10 Rats
Group     Subj  Proximal  Medial  Distal
Diabetic    1.       529     446     373
Diabetic    2.       604     455     404
Diabetic    3.       523     500     378
Diabetic    4.       504     392     390
Diabetic    5.       518     486     375
Control     6.       394     360     513
Control     7.       352     395     529
Control     8.       370     317     571
Control     9.       261     370     586
Control    10.       348     400     530
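Table 5 has one row per rat with three measurements across the columns, whereas an ANOVA with NvLoc as a factor usually needs one row per measurement. The sketch below shows that reshaping. It is a Python illustration only, with hypothetical variable names, and it uses the table's own group labels (Diabetic, Control).

```python
locations = ["Proximal", "Medial", "Distal"]

# Table 5 transcribed as (group, subject, areas at the three locations).
table5 = [
    ("Diabetic", 1, [529, 446, 373]),
    ("Diabetic", 2, [604, 455, 404]),
    ("Diabetic", 3, [523, 500, 378]),
    ("Diabetic", 4, [504, 392, 390]),
    ("Diabetic", 5, [518, 486, 375]),
    ("Control", 6, [394, 360, 513]),
    ("Control", 7, [352, 395, 529]),
    ("Control", 8, [370, 317, 571]),
    ("Control", 9, [261, 370, 586]),
    ("Control", 10, [348, 400, 530]),
]

# Long format: one (group, subject, location, area) row per measurement.
long_rows = [
    (group, subject, loc, area)
    for group, subject, areas in table5
    for loc, area in zip(locations, areas)
]

for row in long_rows[:3]:
    print(row)
```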
(i) Are the two factors Group and NvLoc crossed, nested, or neither? If
they are nested, which is nested under which?
The data also has a third Subject factor, for the 10 individual rats.
Is Subject crossed with Group? nested within Group? crossed with NvLoc?
nested within NvLoc? Why?
(ii) Run a full factorial ANOVA model to test Group, NvLoc, and their
interaction. Use nested Subject effects in the standard way to carry out
the tests. (Hint: See the comments in NestedSubj2Fac.sas for the
appropriate decomposition of a full factorial ANOVA model in this case.
See comments in NCoffee.sas, NReading.sas, or the text for the
``standard way'' to test effects in nested subject models with one
observation per cell.)
Which of these three effects are significant? highly significant? For
the significant effects, what are the P-values, and what are the degrees
of freedom in the numerator and denominator for the F-distributions
involved?
(iii) Display an interaction plot for the two principal factors, Group and
NvLoc, with the factor with the larger number of levels on the X-axis. Is
an interaction suggested? Why? Is it significant?
(iv) Which of the levels of the main effects of Group and NvLoc are
distinct, using Tukey's method to allow for multiple comparisons? Which
are larger? What do the levels of the main effect of NvLoc mean? Are they
averages for normal rats? diabetic rats? or both?