Click here for Prof. Sawyer's
home page
HOMEWORK #2 due Tuesday 10-12
Six problems.
Text references are to the textbook, Cody & Smith,
``Applied statistics and the SAS programming language''
NOTE: See the main Math475 Web page for how to organize a homework
assignment using SAS. In particular,
ALWAYS INCLUDE YOUR NAME in a title statement in your SAS
programs, so that your name will appear at the top of each output page.
ALL HOMEWORKS MUST BE ORGANIZED in the following order:
(Part 1) First, your answers to all the problems in the homework,
whether you use SAS for that problem or not. If the problem asks you to
generate a graph or table, refer to the graph or table by page number in
the SAS output (see below). (Xeroxing a page or two from the SAS output or
cutting and pasting into a Word file or TeX source file is also OK.)
(Part 2) Second, all SAS programs that you used to obtain the output for
any of the problems. If possible, similar problems should be done with the
same SAS program. (In other words, write one SAS program for several
problems if that makes things easier, using title or title2 statements to
separate the problems in your output.)
(Part 3) Third, all output for all the SAS programs in the previous
step.
If an answer in Part 1 requires a table or a scatterplot that you need to
refer to, make sure that your SAS output has overall increasing (unique)
page numbers and make references to Part 3 by page number, such as
``The scatterplot for Problem 2 part (b) is on page #X in
the SAS output below.'' DO NOT say, ``see Page 3 in the SAS output''
if Part 3 has output from several SAS runs, each of which has its own
Page 3. In that case, either write your own (increasing) page numbers
on the SAS output, or else (for example) refer to ``Page 2-7 in the
SAS output'' (for page 7 in the second set of SAS output) and write
page numbers in the format ``2-7'' at the top of pages in your output.
1. A test is made of the effects of a new drug on people
who are occasional sufferers from a newly discovered allergy that affects
people only during the winter. Eighty (80) people are enrolled in the
study. Forty (40) subjects are first asked if they had allergic symptoms
during a particular year, then given the drug, and then asked again if
they had allergic symptoms after the following year. The other half (40)
are given the drug the first year but not the second year and, again,
asked if they had allergic symptoms with and without the drug. Thus, there are
two Yes-or-No responses from each enrollee, and, in particular,
8 individuals had no symptoms with the drug but did have symptoms
without the drug. This experimental design helps to control for variable
severity of the allergy among the subjects. The results were
Table 1. Numbers of individuals with allergic symptoms
with and without a drug over two seasons
                        Without Drug
                      Yes     No   Totals
With Drug   Yes        11     22     33
            No          8     39     47
------------------------------------
            Totals     19     61     80
(i) On the basis of these data, does the drug tend to change
significantly the incidence of allergy in vulnerable individuals?
(ii) If the drug has an effect, would you recommend the drug to
someone who suffers from this allergy? That is, does the drug help or
hurt?
(Warning: Although the data is in the form of a 2x2 contingency
table, the Pearson chi-square test may not be appropriate. For example, a
large number of (Yes,Yes) counts may simply mean that these particular
individuals would have allergic symptoms no matter what. Similarly, a
large number of (No,No) counts might be due to a subset of the sample who
are almost never affected. Thus all of the usable information in the table is
in the (Yes,No) and (No,Yes) counts. Before using either the Pearson or
Fisher exact tests, read about some of the other contingency-table tests
in Chapter 3 of the text.)
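One test that uses only the discordant (Yes,No) and (No,Yes) counts is McNemar's test. The sketch below, written in Python rather than SAS, computes its exact binomial form for Table 1 as a hand check. The choice of McNemar's test is an assumption about what the warning has in mind, and the function name mcnemar_exact is illustrative only.

```python
from math import comb

def mcnemar_exact(b, c):
    """Two-sided exact McNemar P-value from the two discordant counts."""
    n = b + c
    k = min(b, c)
    # Under H_0 each discordant pair falls in either cell with probability 1/2.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2.0 * tail)

# Table 1: 22 subjects had symptoms only with the drug, 8 only without it.
b, c = 22, 8
p_exact = mcnemar_exact(b, c)
chi_sq = (b - c) ** 2 / (b + c)   # large-sample McNemar statistic, 1 df
print(f"exact P = {p_exact:.4f}, chi-square = {chi_sq:.3f}")
```

The (Yes,Yes) and (No,No) counts never enter the calculation, which is the point of the warning above.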
2. Suppose that the same treatment is given to patients
suffering from four different but related diseases, which are labeled as
Dis#A, Dis#B, Dis#C, and Dis#D. The numbers of individuals surviving for
or dying within six months were collected in the following table.
Table 2. Morbidity results for four diseases
             Dis#A        Dis#B        Dis#C        Dis#D
           Surv  Die    Surv  Die    Surv  Die    Surv  Die
Treated     250  107     390  702     218  141     317  757
Control     454  240     173  390     488  436     113  348
Note that Dis#B and Dis#D appear to be more severe than the others,
although all four diseases have high mortality rates in both treatment
groups.
(i) Does the treatment have a significant overall positive or negative
effect on mortality over the four strata? Carry out a test that gives you
a single P-value for all four tables and that is not subject to Simpson's
Paradox. Do you accept or reject the hypothesis that treatment has no
effect on survival? Do you get the same results for each of the diseases
separately?
(ii) Is the effect of the treatment positive or negative? That is, do
relatively more treated individuals survive than control individuals?
(Hint: Consider the phi coefficient for each disease.)
(iii) Combine the diseases into one 2x2 table. What is the Pearson
Chi-Square P-value for this possibly-incorrect table? Is this consistent
with your answer to part (i)? What is the phi coefficient for the
combined table? Is it consistent with your results in part (ii)? In
the combined table, do relatively more treated individuals survive than
control individuals, or vice versa?
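As a hand check on the SAS output, the following Python sketch computes the phi coefficient for each disease and for the combined table, together with a Cochran-Mantel-Haenszel statistic over the four strata. It assumes that the Cochran-Mantel-Haenszel test (without continuity correction) is an acceptable stratified test for part (i); the helper names phi and cmh are illustrative.

```python
from math import sqrt, erfc

# Table 2 counts per disease: (treated surv, treated die, control surv, control die)
strata = {
    "Dis#A": (250, 107, 454, 240),
    "Dis#B": (390, 702, 173, 390),
    "Dis#C": (218, 141, 488, 436),
    "Dis#D": (317, 757, 113, 348),
}

def phi(a, b, c, d):
    """Phi coefficient of the 2x2 table [[a, b], [c, d]]."""
    return (a * d - b * c) / sqrt((a + b) * (c + d) * (a + c) * (b + d))

def cmh(tables):
    """Cochran-Mantel-Haenszel chi-square (1 df, no continuity correction)."""
    diff = var = 0.0
    for a, b, c, d in tables:
        n = a + b + c + d
        diff += a - (a + b) * (a + c) / n          # observed minus expected
        var += (a + b) * (c + d) * (a + c) * (b + d) / (n * n * (n - 1))
    stat = diff ** 2 / var
    return stat, erfc(sqrt(stat / 2))   # upper tail of chi-square with 1 df

phis = {name: phi(*t) for name, t in strata.items()}
combined = tuple(sum(col) for col in zip(*strata.values()))
phi_combined = phi(*combined)
cmh_stat, cmh_p = cmh(strata.values())

for name, value in phis.items():
    print(f"{name}: phi = {value:+.4f}")
print(f"combined table {combined}: phi = {phi_combined:+.4f}")
print(f"CMH chi-square = {cmh_stat:.2f}, P = {cmh_p:.2g}")
```

With treated survivors in the first cell, a positive phi means that relatively more treated than control individuals survive. Comparing the sign of each stratum's phi with the sign for the combined table shows whether Simpson's Paradox is at work.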
3. The average outputs in tons per acre in one season in
30 test plots for cotton grown at five different levels of an insecticide,
ordered by increasing levels of the insecticide, are
Table 3. Output in tons per acre in test plots
for five different levels of an insecticide
Level1 79 79 95 109 118 150
Level2 84 95 100 105 119 135
Level3 109 114 121 123 124 145
Level4 91 106 119 150 151 151
Level5 110 113 129 131 145 165
(i) Construct a plot of the output scores versus insecticide level,
with insecticide level on the X axis. Can you see differences in the
means of the treatment groups? Is there a trend in mean output with
increasing levels of the insecticide?
(ii) Do these scores show a significant variation by treatment group,
as measured by a standard one-way layout ANOVA test? What are the degrees
of freedom of the F-test (both numerator and denominator)? What is the
hypothesis H_0? What is the hypothesis H_1? What is the P-value?
(Hints: If you haven't seen or have forgotten one-way ANOVAs, see
Chapter 7 in the text. If there are k ``treatment groups'' and a
total of N observations over all treatment groups, then ``MS between'' (in
the text's notation) has k-1 degrees of freedom and ``MS error'' has N-k
degrees of freedom. NOTE: In the originally posted version of HW2, the
last sentence was stated in the incorrect form, ``If there are k
``treatment groups'' and n observations per treatment group, then ``MS
between'' (in the text's notation) has k-1 degrees of freedom and ``MS
error'' has n-k degrees of freedom.'' However, hints are not binding and
may occasionally be innocently misleading, and the two degrees of freedom
are in the SAS output.)
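The degrees-of-freedom count in the hint can be checked by hand. The Python sketch below (not SAS; variable names are illustrative) builds the one-way ANOVA sums of squares for Table 3 and the F statistic. It does not compute the P-value, which should be read from the SAS output.

```python
# Table 3: six test plots at each of five insecticide levels
groups = [
    [79, 79, 95, 109, 118, 150],
    [84, 95, 100, 105, 119, 135],
    [109, 114, 121, 123, 124, 145],
    [91, 106, 119, 150, 151, 151],
    [110, 113, 129, 131, 145, 165],
]

k = len(groups)                      # number of treatment groups
N = sum(len(g) for g in groups)      # total number of observations
grand_mean = sum(sum(g) for g in groups) / N

# Between-group and within-group (error) sums of squares
ss_between = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups)
ss_error = sum((y - sum(g) / len(g)) ** 2 for g in groups for y in g)

df_between, df_error = k - 1, N - k
f_stat = (ss_between / df_between) / (ss_error / df_error)
r_squared = ss_between / (ss_between + ss_error)

print(f"df = ({df_between}, {df_error}), F = {f_stat:.3f}, R^2 = {r_squared:.3f}")
```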
(iii) Do the individual observed output scores show a significant
increase with increasing amounts of insecticide?
(Hint: That is, do a regression of the observed output scores
on the insecticide level for the five treatment groups, with insecticide
level coded as 1, 2, 3, 4, 5. This assumes, as a rough approximation, that
the amount of insecticide per acre varies linearly over the five
insecticide levels.)
Is there a significant regression of output on insecticide level,
coded in this way? What are the degrees of freedom of the Model F-test,
both numerator and denominator? What is the hypothesis H_0? What is the
hypothesis H_1? What is the P-value?
In this regression, what proportion of the variation in output is
``explained'' by the insecticide level? Is this smaller or larger than
the proportion of variance ``explained'' by the ANOVA in part (ii)?
What is the reason for the difference in significance between the
conclusions of the two procedures?
(iv) Write down the estimated regression line in
part (iii). How much additional cotton output is predicted,
on the average, for each increase in insecticide level in this range of
insecticide levels?
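The regression in parts (iii) and (iv) can also be checked by hand from the usual least-squares formulas. This is a minimal Python sketch with insecticide level coded 1 to 5 as in the hint; the variable names are illustrative, and the P-value is again left to SAS.

```python
# Table 3 again, with insecticide level coded as 1, 2, 3, 4, 5
groups = [
    [79, 79, 95, 109, 118, 150],
    [84, 95, 100, 105, 119, 135],
    [109, 114, 121, 123, 124, 145],
    [91, 106, 119, 150, 151, 151],
    [110, 113, 129, 131, 145, 165],
]
xs = [level for level, g in enumerate(groups, start=1) for _ in g]
ys = [y for g in groups for y in g]
n = len(ys)

x_bar, y_bar = sum(xs) / n, sum(ys) / n
sxx = sum((x - x_bar) ** 2 for x in xs)
syy = sum((y - y_bar) ** 2 for y in ys)
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))

slope = sxy / sxx                     # predicted change in output per level
intercept = y_bar - slope * x_bar
r_squared = sxy ** 2 / (sxx * syy)    # proportion of variation explained
df_model, df_error = 1, n - 2         # one slope parameter, N - 2 for error

print(f"output = {intercept:.3f} + {slope:.3f} * level")
print(f"R^2 = {r_squared:.3f}, df = ({df_model}, {df_error})")
```

The regression uses a single degree of freedom for the model where the ANOVA uses k-1, so its R^2 cannot exceed the ANOVA R^2 on the same data.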
4. A zoo is interested in the dependence of blood
pressure on stress in gnus. Blood pressure and stress (yy and stress)
for each of 16 gnus under various conditions of stress are given in the
following table. (In each of the 16 pairs of data in Table 4, yy is the
first variable and stress the second variable.)
Table 4. Blood pressure and stress for 16 gnus
47 3.0 50 1.8 110 7.9 1655 15.7
179 9.1 55 5.2 1310 12.9 2773 15.1
56 3.6 62 2.9 3052 16.8 126 7.2
866 12.6 175 8.6 2731 16.7 249 9.0
(i) Is there a significant regression of yy on stress with this data?
What P-value does SAS report? What is the model R^2?
(ii) Construct a text plot of blood pressure yy versus stress. Include
the predicted values on the same plot with plot symbol P as a comparison.
Does the plot of yy versus stress look linear? How well does it
follow the predicted values? (Hint: It might look slightly bowed
down in the middle.)
(iii) Construct a plot of the residuals for the regression of
yy on stress against stress. Do the residuals look consistent with the
assumptions of a linear regression? Do their signs and absolute values
appear to be randomly distributed with respect to stress? (Hint: The
negative residuals may be bunched together in the center.)
(iv) Try regressing yy on both stress and stress*stress. (Hint:
Introduce a new SAS variable stress2 for stress*stress.) What is the new
model R^2? In a plot of yy on stress, do the predicted values appear to
match yy more closely? Do the residuals have a more random-looking plot
on stress? (Hint: Observations with higher values of stress may also
have larger residuals.)
(v) Try a regression of logyy=log(yy) on stress and stress*stress. What
is the new model R^2? Do the predicted values of logyy appear to match
the observed values more closely? Does the residual plot show less
dependence on stress?
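The three models in parts (i), (iv), and (v) can be compared outside SAS as a rough check on the reported R^2 values. The Python sketch below fits each model by solving the normal equations. The helpers solve and r_squared are illustrative names, and the plain Gaussian elimination is an assumption of convenience, not what SAS does internally.

```python
from math import log

# Table 4 as (yy, stress) pairs, read row by row
data = [
    (47, 3.0), (50, 1.8), (110, 7.9), (1655, 15.7),
    (179, 9.1), (55, 5.2), (1310, 12.9), (2773, 15.1),
    (56, 3.6), (62, 2.9), (3052, 16.8), (126, 7.2),
    (866, 12.6), (175, 8.6), (2731, 16.7), (249, 9.0),
]

def solve(a, b):
    """Solve the square system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for i in range(n):
        p = max(range(i, n), key=lambda r: abs(m[r][i]))
        m[i], m[p] = m[p], m[i]
        for r in range(i + 1, n):
            f = m[r][i] / m[i][i]
            for c in range(i, n + 1):
                m[r][c] -= f * m[i][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][c] * x[c] for c in range(i + 1, n))) / m[i][i]
    return x

def r_squared(rows, y):
    """R^2 of the least-squares fit of y on the columns of rows."""
    p = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * v for r, v in zip(rows, y)) for i in range(p)]
    beta = solve(xtx, xty)
    fitted = [sum(b * v for b, v in zip(beta, r)) for r in rows]
    y_bar = sum(y) / len(y)
    ss_res = sum((v - f) ** 2 for v, f in zip(y, fitted))
    ss_tot = sum((v - y_bar) ** 2 for v in y)
    return 1 - ss_res / ss_tot

yy = [float(v) for v, _ in data]
stress = [s for _, s in data]
linear = [[1.0, s] for s in stress]
quadratic = [[1.0, s, s * s] for s in stress]   # s * s plays the role of stress2

r2_linear = r_squared(linear, yy)
r2_quadratic = r_squared(quadratic, yy)
r2_log_quadratic = r_squared(quadratic, [log(v) for v in yy])
print(f"linear {r2_linear:.4f}, quadratic {r2_quadratic:.4f}, "
      f"log-quadratic {r2_log_quadratic:.4f}")
```

The R^2 for the logyy model measures variation on the log scale, so it is not directly comparable with the two R^2 values for yy; the residual plots remain the better guide.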
5. An experimenter is interested in how a quantity that
she calls zubricity depends on three other quantities called drubness,
viscosity, and speed. The experimenter is fairly certain that drubness has
a significant effect on zubricity, but is not sure about viscosity and
speed. Twenty measurements of zubricity, drubness, viscosity, and speed
are recorded in Table 5. For definiteness, call the variables ZUBRIC,
DRUBNESS, VISCOSTY, and SPEED.
Table 5: Zubricity and Covariates
-----------------------------------
OBS Zubric Drubn Visc Speed
-----------------------------------
1 310 16 27 12
2 210 17 36 10
3 450 24 40 20
4 390 24 44 15
5 780 26 44 8
6 330 28 53 18
7 580 39 55 19
8 330 22 56 24
9 400 29 57 16
10 230 28 58 17
11 470 34 60 24
12 510 35 61 17
13 490 37 66 20
14 450 36 68 11
15 630 46 73 21
16 400 38 78 6
17 760 34 80 22
18 590 47 83 17
19 520 43 84 12
20 540 44 89 17
(i) Is there a significant regression of ZUBRIC on the three covariates?
Use SAS (proc reg or proc glm) to find out.
What is the model P-value? What is the model R^2? What is the
value of the F-statistic that led to the model P-value? How many degrees
of freedom does it have in its numerator and denominator? How did SAS
arrive at these numbers?
(ii) Which covariates are significant in the Parameter Estimate table in
the output? What are their P-values? How many degrees of freedom do the
T-statistics have for the tests in this table? Is the experimenter correct
that Drubness has a significant effect on Zubricity?
(iii) Obtain residual plots (with the residuals as the Y variable) for the
predicted value, Drubness, Viscosity, and Speed. Do these look all right?
That is, do they look like the residuals are normally distributed with
values that are independent of the X-coordinates? Are there any noticeable
outliers? If so, which observations are they?
(iv) Obtain a list of Studentized residuals and Cook's D values for all
of the observations. Do any of these appear to be out of line? If so,
which ones? Use a criterion of either 3.0 for Studentized residuals or
else 0.700 for Cook's D (or both). (Note: You should be able to
get all of the information that you need for parts (i)-(iv) from either
one run of proc reg or else one run of proc glm plus a proc print
for associated variables.)
(Hints: You should be able to tell which are the offending
observations in the residual plots from the values of their residuals and
predicted values. However, an easier way is to tag the values in the plots
in such a way as to make them easy to identify.
For plots generated by plot statements within a proc reg procedure,
enter a ``paint'' command like (for example) paint obs=17 / symbol='X';
BEFORE the plot statement, where obs stands for the ordinal value of the
point (that is, the row number or OBS value in the data set), or
For plots generated by proc plot, enter the plot statement as (for
example) plot Y*X $ obs; or plot Y*X='*' $ obs; The $ obs option causes
the ordinal value to be displayed next to each plotted point.
NOTE: The originally posted form of HW2 had `ord' instead of `obs',
assuming that `obs' is the first column in Table 5. Both syntaxes
work with `ord' replaced by any SAS variable in the current dataset.)
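For reference, the Studentized residuals and Cook's D values asked for in part (iv) can be reproduced from the leverages. The Python sketch below uses Table 5 as originally recorded and assumes the internally Studentized form, that is, each residual divided by its estimated standard error. The helper inverse and the variable names are illustrative, and the values should be compared with the SAS listing rather than taken as authoritative.

```python
from math import sqrt

# Table 5 as originally recorded: (Zubric, Drubn, Visc, Speed), OBS 1 to 20
table5 = [
    (310, 16, 27, 12), (210, 17, 36, 10), (450, 24, 40, 20), (390, 24, 44, 15),
    (780, 26, 44, 8), (330, 28, 53, 18), (580, 39, 55, 19), (330, 22, 56, 24),
    (400, 29, 57, 16), (230, 28, 58, 17), (470, 34, 60, 24), (510, 35, 61, 17),
    (490, 37, 66, 20), (450, 36, 68, 11), (630, 46, 73, 21), (400, 38, 78, 6),
    (760, 34, 80, 22), (590, 47, 83, 17), (520, 43, 84, 12), (540, 44, 89, 17),
]

def inverse(a):
    """Invert a square matrix by Gauss-Jordan elimination with pivoting."""
    n = len(a)
    m = [row[:] + [float(i == j) for j in range(n)] for i, row in enumerate(a)]
    for i in range(n):
        p = max(range(i, n), key=lambda r: abs(m[r][i]))
        m[i], m[p] = m[p], m[i]
        piv = m[i][i]
        m[i] = [v / piv for v in m[i]]
        for r in range(n):
            if r != i:
                f = m[r][i]
                m[r] = [v - f * w for v, w in zip(m[r], m[i])]
    return [row[n:] for row in m]

y = [float(r[0]) for r in table5]
X = [[1.0, *map(float, r[1:])] for r in table5]   # intercept plus 3 covariates
n, p = len(X), len(X[0])

xtx = [[sum(r[i] * r[j] for r in X) for j in range(p)] for i in range(p)]
xtx_inv = inverse(xtx)
xty = [sum(r[i] * v for r, v in zip(X, y)) for i in range(p)]
beta = [sum(xtx_inv[i][j] * xty[j] for j in range(p)) for i in range(p)]

resid = [v - sum(b * x for b, x in zip(beta, r)) for r, v in zip(X, y)]
s2 = sum(e * e for e in resid) / (n - p)          # error mean square
# Leverage h_ii = x_i' (X'X)^{-1} x_i
lev = [sum(r[i] * xtx_inv[i][j] * r[j] for i in range(p) for j in range(p))
       for r in X]
student = [e / sqrt(s2 * (1 - h)) for e, h in zip(resid, lev)]
cooks_d = [t * t * h / (p * (1 - h)) for t, h in zip(student, lev)]

for obs, (t, d) in enumerate(zip(student, cooks_d), start=1):
    print(f"{obs:3d}  student = {t:7.3f}  Cook's D = {d:6.3f}")
```

Observations can then be screened against the 3.0 and 0.700 criteria given in part (iv).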
The experimenter was disappointed with the regression output, since
variables that she thought should have been significant were not
significant. After looking at the data again, the experimenter began to
wonder about observations #5, #10, and #17. After checking with her
technician, she found that the technician's handwriting had been misread,
and that the zubricity value of 780 in observation #5 should have
been 480 and the value of 760 in observation #17 should have been
460. Observation #10 was correct as it was originally recorded.
(v) Change the values of Zubricity in the two incorrect observations and
re-analyze the regression. What is the new model P-value? Model
R^2? Is it larger than before?
Which covariates are significant in the new Parameter Estimate table?
Does the output now support the experimenter's hypothesis that Drubness
has a significant effect on Zubricity?
6. For the corrected data in Problem 5, generate the
parameter estimates and the Student-t P-values using SAS's built-in matrix
language, proc iml. (Hint: See ThreeRegIml.sas on the Math475 Web site.)
(i) Did you get the same parameter estimates and P-values as SAS's
built-in regression procedure in part (v) in Problem 5?
(ii) What is the P-value for the significance of Drubness to one digit of
accuracy in exponential notation? (For example, 5x10^{-7} or 3x10^{-3} or
7x10^{-11}. NOTE: In the originally posted version of HW2, `one digit of
accuracy' was replaced by `one degree of freedom'. The examples are
clearer than either wording.)
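For orientation, the matrix quantities that such a proc iml program would typically compute are the standard least-squares formulas below. This is a summary of textbook results, not a transcription of ThreeRegIml.sas. Here X is assumed to be the 20 x 4 design matrix (a column of ones followed by DRUBNESS, VISCOSTY, and SPEED), y the vector of corrected ZUBRIC values, n = 20, and p = 4, so the t-statistics have n - p = 16 degrees of freedom.

```latex
\[
\begin{aligned}
\hat\beta &= (X^{T}X)^{-1}X^{T}y, \\
s^{2} &= \frac{(y - X\hat\beta)^{T}(y - X\hat\beta)}{n - p}, \\
\mathrm{se}(\hat\beta_j) &= \sqrt{s^{2}\,\bigl[(X^{T}X)^{-1}\bigr]_{jj}}, \\
t_j &= \frac{\hat\beta_j}{\mathrm{se}(\hat\beta_j)}, \\
P_j &= 2\,\Pr\bigl(T_{n-p} > |t_j|\bigr).
\end{aligned}
\]
```

The last line needs a Student-t distribution function; in SAS the probt function can supply it.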