Math 408 Take-Home Final - Spring 2009

TAKEHOME FINAL due Wednesday May 6 by 4:30 PM

Text references are to Hollander and Wolfe, "Nonparametric Statistical Methods", 2nd edn.

NOTES: Hand in your homework in the following order:
(a) Your written answers to all problems, with page references as needed to part (c) below,
(b) The computer source for any computer programs that you used,
(c) All output from the programs in part (b).
This will put the emphasis on what you think the answers should be and on your evidence for this. If a reader thinks that your answers are reasonable, then he or she may or may not want to look at your actual output and computer programs.

Six (6) problems. Not all parts of problems are of equal weight.

1. Failure times Y along with a predictor X were recorded in the following table for 40 trials.

    Table 1: Failure times Y and a predictor X
  ----------------------------------------------------
           X      Y                X      Y
         -----------             -----------
     1.   30      3         21.   40    168
     2.   21      5         22.   54    170
     3.   41      5         23.   27    180
     4.   83      6         24.   77    197
     5.   76      9         25.   90    217
     6.   89     17         26.   97    235
     7.   35     22         27.   93    250
     8.   78     23         28.   80    354
     9.   39     27         29.   73    368
    10.   38     31         30.   67    441
    11.   57     31         31.   72    486
    12.   98     38         32.   88    622
    13.   64     40         33.   94    642
    14.   34     42         34.   84    659
    15.   55     56         35.   86    773
    16.   44     62         36.   60    850
    17.   22     64         37.   99    902
    18.   56     66         38.   46   1090
    19.   74    142         39.   66   1658
    20.   43    159         40.   75   4032

(i) What is the Pearson correlation coefficient rho between X and Y for the data in the table? Are the variables Y and X significantly correlated as measured by rho (under the assumption of normally distributed data)? What is the (two-sided) P-value?
(ii) What is the Kendall correlation coefficient tau? Are the variables Y and X significantly correlated as measured by the Kendall test? What is the (two-sided) P-value? Find the two-sided P-value using the large-sample approximation, both with and without the appropriate tie correction. By how much does the tie correction change the two-sided large-sample P-value?
(iii) What is the Spearman correlation coefficient R? Are the variables Y and X significantly correlated as measured by R? What is the (two-sided) P-value? Use the large-sample approximation, ignoring the tie correction if you prefer. (See Section 8.5 in the text).
(iv) Use a permutation test to check the accuracy of the large-sample approximation for the two-sided P-value for the Spearman correlation in part (iii). Use 10,000 random permutations. What is the estimated P-value? What is a 95% confidence interval for the true Spearman P-value, based on your estimated value? Does this 95% confidence interval contain the large-sample approximate P-value from part (iii)? (Hint: Use T = abs(Sum(i=1,n) (R_i-Rbar)(S_i-Sbar)) where R_i are the ranks of the X values and S_i are the ranks of the Y values, and compare values of T for 10,000 random permutations with the observed value. See, for example, the program NonParmCorr on the Math408 Web site.)
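
The permutation test in part (iv) can be sketched as follows. This is a minimal illustration in Python, not the course program NonParmCorr: the function names midranks and spearman_permutation_test are hypothetical, ties are handled with mid-ranks, and the 95% interval uses the normal approximation to the binomial distribution of the number of permutations with T at least as large as the observed value. These choices are assumptions, not requirements of the problem.

```python
import math
import random

def midranks(values):
    """Return 1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman_permutation_test(x, y, n_perm=10_000, seed=0):
    """Estimate the two-sided permutation P-value for
    T = |sum_i (R_i - Rbar)(S_i - Sbar)| and a normal-approximation 95% CI."""
    rng = random.Random(seed)
    r, s = midranks(x), midranks(y)
    rbar, sbar = sum(r) / len(r), sum(s) / len(s)
    rc = [ri - rbar for ri in r]
    sc = [si - sbar for si in s]
    t_obs = abs(sum(a * b for a, b in zip(rc, sc)))
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(sc)  # random re-pairing of the Y ranks with the X ranks
        if abs(sum(a * b for a, b in zip(rc, sc))) >= t_obs - 1e-9:
            hits += 1
    p_hat = hits / n_perm
    half = 1.96 * math.sqrt(p_hat * (1 - p_hat) / n_perm)
    return p_hat, (max(0.0, p_hat - half), min(1.0, p_hat + half))

# Data transcribed from Table 1 (observations 1..40).
X = [30, 21, 41, 83, 76, 89, 35, 78, 39, 38,
     57, 98, 64, 34, 55, 44, 22, 56, 74, 43,
     40, 54, 27, 77, 90, 97, 93, 80, 73, 67,
     72, 88, 94, 84, 86, 60, 99, 46, 66, 75]
Y = [3, 5, 5, 6, 9, 17, 22, 23, 27, 31,
     31, 38, 40, 42, 56, 62, 64, 66, 142, 159,
     168, 170, 180, 197, 217, 235, 250, 354, 368, 441,
     486, 622, 642, 659, 773, 850, 902, 1090, 1658, 4032]

p_hat, ci = spearman_permutation_test(X, Y)
print("estimated two-sided P-value:", p_hat)
print("approximate 95% CI for the P-value:", ci)
```

If no permutation reaches the observed T, the estimate is 0 and the normal-approximation interval collapses to a point; in that case a different interval (or more permutations) would be needed.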

2. Responses to five stress conditions were measured for four different brands of a particular product. Three measurements were made for each combination of brand and stress, for a total of 5x4x3=60 observations. (See Table 2.)

    Table 2: Responses under Stress for Four Brands of Products
 ---------------------------------------------------------------------------
 Stress     Brand1           Brand2            Brand3           Brand4
   A:   3.01,3.04,3.03    3.47,3.10,3.37    3.85,3.87,3.47    3.41,3.11,3.09
   B:   2.85,2.51,2.45    3.49,3.45,3.23    3.64,3.19,3.21    3.02,3.33,3.53
   C:   2.62,2.60,2.67    3.11,2.88,2.97    3.52,3.49,3.44    3.08,3.11,3.06
   D:   2.63,2.64,2.51    2.83,3.15,2.81    3.21,3.65,3.22    2.96,2.97,3.11
   E:   2.58,2.60,2.62    3.12,2.71,2.66    3.28,3.25,3.25    2.67,3.12,3.22

(i) Using brands as blocks, is there a significant effect due to stress? Use the large-sample approximation for the nonparametric test described in Section 7.9 to find out. If there is a significant effect due to stress, which stress does it appear to be due to? That is, which of the stresses A-E appear to be associated with unusually small or unusually large responses?
(ii) Using stress instead of brands to define blocks, is there a significant variation in the response effect over the four brands? Use the same procedure to find out. If there is significant variation with brand, which brands appear to be associated with unusually large or unusually small responses?
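
A sketch of the computation for Problem 2, under the assumption that the Section 7.9 procedure is the Mack-Skillings test for a two-way layout with c replicates per cell. Observations are ranked within each block (mid-ranks for ties), S_j is the sum over blocks of the average rank of treatment j, and MS = 12/(k(N+n)) * Sum_j (S_j - (N+n)/2)^2 is compared with a chi-square distribution on k-1 degrees of freedom. The function names are hypothetical, no tie correction is applied, and the chi-square tail is computed from its closed form for integer degrees of freedom.

```python
import math

def midranks(values):
    """Return 1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks

def chi2_sf(x, df):
    """Upper-tail probability of a chi-square variable with integer df >= 1."""
    if x <= 0:
        return 1.0
    if df % 2 == 0:
        term = total = 1.0
        for i in range(1, df // 2):
            term *= (x / 2) / i
            total += term
        return math.exp(-x / 2) * total
    total, term = 0.0, math.sqrt(x)
    for i in range(1, (df - 1) // 2 + 1):
        if i > 1:
            term *= x / (2 * i - 1)
        total += term
    return math.erfc(math.sqrt(x / 2)) + math.sqrt(2 / math.pi) * math.exp(-x / 2) * total

def mack_skillings(cells):
    """cells[i][j] holds the c replicates for block i and treatment j.
    Returns (MS, large-sample P-value, list of S_j)."""
    n, k, c = len(cells), len(cells[0]), len(cells[0][0])
    big_n = n * k * c
    s = [0.0] * k
    for block in cells:
        ranks = midranks([v for cell in block for v in cell])  # ranks 1..c*k
        for j in range(k):
            s[j] += sum(ranks[j * c:(j + 1) * c]) / c
    ms = 12.0 / (k * (big_n + n)) * sum((sj - (big_n + n) / 2) ** 2 for sj in s)
    return ms, chi2_sf(ms, k - 1), s

# Table 2, one row per stress A..E, one cell per brand 1..4.
BY_STRESS = [
    [[3.01, 3.04, 3.03], [3.47, 3.10, 3.37], [3.85, 3.87, 3.47], [3.41, 3.11, 3.09]],
    [[2.85, 2.51, 2.45], [3.49, 3.45, 3.23], [3.64, 3.19, 3.21], [3.02, 3.33, 3.53]],
    [[2.62, 2.60, 2.67], [3.11, 2.88, 2.97], [3.52, 3.49, 3.44], [3.08, 3.11, 3.06]],
    [[2.63, 2.64, 2.51], [2.83, 3.15, 2.81], [3.21, 3.65, 3.22], [2.96, 2.97, 3.11]],
    [[2.58, 2.60, 2.62], [3.12, 2.71, 2.66], [3.28, 3.25, 3.25], [2.67, 3.12, 3.22]],
]
# Part (i): brands are blocks, stresses are treatments (transpose the table).
BY_BRAND = [[BY_STRESS[s][b] for s in range(5)] for b in range(4)]
ms_stress, p_stress, s_stress = mack_skillings(BY_BRAND)
# Part (ii): stresses are blocks, brands are treatments.
ms_brand, p_brand, s_brand = mack_skillings(BY_STRESS)
print("stress effect:", ms_stress, p_stress, s_stress)
print("brand effect: ", ms_brand, p_brand, s_brand)
```

Comparing each S_j with its null expectation (N+n)/2 gives a rough indication of which treatments have unusually small or large responses.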

3. Consider the paired (X,Y) data in Table 1.
(i) Find the coefficients beta and mu in the least-squares regression line Y_i=beta*X_i+mu. What is the P-value for H_0:beta=0, assuming that the data (X_i,Y_i) are normal, using Student-t methods?
(ii) Find the coefficients beta and mu in the regression line Y_i=beta*X_i+mu using Theil's nonparametric procedure. Given beta from Theil's method, estimate the intercept mu as the median of the 820 Walsh averages (40*41/2=820) of the 40 residuals Y_i-beta*X_i. Find the P-value for H_0:beta=0 using the large-sample approximation described in Section 9.1.
(iii) Compare the two regression lines in parts (i) and (ii) by computing
(A) the average absolute error, which is S_1/n for S_1=Sum(i=1,n) |Y_i-beta*X_i-mu| and
(B) the RMS error, which is the square root of S_2/n for S_2=Sum(i=1,n) (Y_i-beta*X_i-mu)^2
Which of the two regression lines does better under criterion (A)? under criterion (B)?
(iv) Find the coefficients beta and mu using the rank regression method discussed in Section 9.6 in the text. Given the slope beta, estimate the intercept mu as the median of the 820 Walsh averages (40*41/2=820) of the 40 residuals Y_i-beta*X_i.
(v) Compare the regression line in part (iv) with the two regression lines in part (iii). How does it compare using criterion (A)? Using criterion (B)?
(Hint: See the programs RankRegression, TheilRegression, and NonParmCorr on the Math408 Web site.)
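
The following sketch illustrates parts (i) to (iii) of Problem 3: a least-squares fit, the Theil slope taken as the median of all pairwise slopes, the intercept taken as the median of the Walsh averages of the residuals Y_i-beta*X_i, and the two error criteria (A) and (B). It is written in Python with hypothetical function names and is not the course program TheilRegression. The short data set in the example is made up for illustration (it is not Table 1); it contains one large outlier to show how the two fits can differ.

```python
import math
from statistics import mean, median

def least_squares(x, y):
    """Return (beta, mu) for the least-squares line y = beta*x + mu."""
    xbar, ybar = mean(x), mean(y)
    sxy = sum((a - xbar) * (b - ybar) for a, b in zip(x, y))
    sxx = sum((a - xbar) ** 2 for a in x)
    beta = sxy / sxx
    return beta, ybar - beta * xbar

def theil_slope(x, y):
    """Median of the pairwise slopes (Y_j - Y_i)/(X_j - X_i), skipping tied X values."""
    slopes = [(y[j] - y[i]) / (x[j] - x[i])
              for i in range(len(x)) for j in range(i + 1, len(x))
              if x[j] != x[i]]
    return median(slopes)

def walsh_intercept(x, y, beta):
    """Median of the n(n+1)/2 Walsh averages of the residuals Y_i - beta*X_i."""
    res = [b - beta * a for a, b in zip(x, y)]
    walsh = [(res[i] + res[j]) / 2
             for i in range(len(res)) for j in range(i, len(res))]
    return median(walsh)

def errors(x, y, beta, mu):
    """Return (average absolute error, RMS error) for the line y = beta*x + mu."""
    resid = [b - beta * a - mu for a, b in zip(x, y)]
    mae = sum(abs(r) for r in resid) / len(resid)
    rms = math.sqrt(sum(r * r for r in resid) / len(resid))
    return mae, rms

# Hypothetical illustration data (NOT Table 1): roughly linear with one outlier.
x_demo = [1, 2, 3, 4, 5, 6, 7, 8]
y_demo = [2.1, 3.9, 6.2, 8.1, 9.8, 12.2, 13.9, 60.0]

beta_ls, mu_ls = least_squares(x_demo, y_demo)
beta_th = theil_slope(x_demo, y_demo)
mu_th = walsh_intercept(x_demo, y_demo, beta_th)
mae_ls, rms_ls = errors(x_demo, y_demo, beta_ls, mu_ls)
mae_th, rms_th = errors(x_demo, y_demo, beta_th, mu_th)
print("least squares:", beta_ls, mu_ls, "A =", mae_ls, "B =", rms_ls)
print("Theil:        ", beta_th, mu_th, "A =", mae_th, "B =", rms_th)
```

Because the least-squares line minimizes the sum of squared residuals over all lines, no other line can have a smaller RMS error on the same data; criterion (A) is where the fits can genuinely trade places.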

4. Consider the paired (X,Y) data in Table 1.

(i) Find a 95% confidence interval for beta based on Theil's method, using the nonparametric procedure discussed in Section 9.3. Does the confidence interval exclude the value zero? Is this consistent with your answer in part (ii) of Problem 3?

(ii) Use the method of bootstrapping on residuals to find a 95% confidence interval for the estimate of beta for the least-squares regression in part (i) of Problem 3. How does this confidence interval compare with the confidence interval for beta that you would obtain using Student-t methods?

(iii) In part (ii) of this problem, what is the bootstrap bias in the bootstrap confidence interval? What is the bootstrap P-value? Is this comparable to the Student-t P-value from Problem 3?
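
Bootstrapping on residuals, as used in parts (ii) and (iii), can be sketched as follows. The sketch fits the least-squares line once, resamples the fitted residuals with replacement, adds them back to the fitted values, and refits. It reports a percentile interval and the bootstrap bias, taken here as the mean of the bootstrap slopes minus the original slope. The percentile interval, the number of resamples, and the function names are assumptions made for illustration; the bootstrap P-value is left out because its definition depends on the convention used in the course. The data in the example are hypothetical, not Table 1.

```python
import random
from statistics import mean

def least_squares(x, y):
    """Return (beta, mu) for the least-squares line y = beta*x + mu."""
    xbar, ybar = mean(x), mean(y)
    sxy = sum((a - xbar) * (b - ybar) for a, b in zip(x, y))
    sxx = sum((a - xbar) ** 2 for a in x)
    beta = sxy / sxx
    return beta, ybar - beta * xbar

def bootstrap_residuals(x, y, n_boot=2000, level=0.95, seed=0):
    """Residual bootstrap for the least-squares slope.
    Returns (beta_hat, (lower, upper) percentile interval, bootstrap bias)."""
    rng = random.Random(seed)
    beta, mu = least_squares(x, y)
    fitted = [beta * a + mu for a in x]
    resid = [b - f for b, f in zip(y, fitted)]
    betas = []
    for _ in range(n_boot):
        # New responses: fitted value plus a residual drawn with replacement.
        y_star = [f + rng.choice(resid) for f in fitted]
        betas.append(least_squares(x, y_star)[0])
    betas.sort()
    lo_idx = round((1 - level) / 2 * n_boot)      # approximate lower percentile
    hi_idx = round((1 + level) / 2 * n_boot) - 1  # approximate upper percentile
    bias = mean(betas) - beta
    return beta, (betas[lo_idx], betas[hi_idx]), bias

# Hypothetical illustration data (NOT Table 1).
x_demo = [1, 2, 3, 4, 5, 6, 7, 8]
y_demo = [2.1, 3.9, 6.2, 8.1, 9.8, 12.2, 13.9, 60.0]

beta_hat, ci, bias = bootstrap_residuals(x_demo, y_demo)
print("slope:", beta_hat)
print("95% percentile interval:", ci)
print("bootstrap bias:", bias)
```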

5. The following 100 observations were made of a random variable X, which were rounded to the nearest integer:

    Table 3: Observations of a random quantity
   -----------------------------------------------
   62   36   65   28   25   80   30   51   84   17
   78   29   41   65   29   25   36   28   88   23
   61   36   36   41   24   83   77   24   27   71
   63   50   81   60   24   64   33   29   48   30
   28   68   48   23   41   20   37   74   50   27
   30   36   74   25   21   19   35   69   70   40
   28   57   63   24   68   73   42   76   72   60
   30   60   59   28   65   69   65   37   66   32
   58   67   30   39   34   75   56   78   75   73
   66   75   31   66   19   84   37   82   74   61

Use these observations to estimate the density of X by using a kernel density estimator based on the standard Gaussian kernel. Calculate and plot the estimated density of X using bandwidths h=1, 4, 7, and 20. Assuming that the density that generated X was smooth to begin with, which bandwidth seems to give the most reasonable estimator of the density of X? Or would you prefer a bandwidth other than these four choices? (If so, which?) (Hint: See the program DensEst.m on the Math408 Web site. Note that DensEst.m does not use any of the Y values in its dataset.)
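
A minimal sketch of the Gaussian kernel density estimator, f(x) = (1/(n*h)) * Sum_i phi((x-X_i)/h) with phi the standard normal density, evaluated on the grid x=1,...,100 for the four bandwidths. It is written in Python and is not the course program DensEst.m; the function names are hypothetical. Plotting is omitted; as a rough substitute the sketch counts the local maxima of each estimate on the grid, which gives one crude indication of how wiggly or smooth it is.

```python
import math

def gaussian_kde(data, h, grid):
    """Gaussian kernel density estimate with bandwidth h at each grid point."""
    norm = 1.0 / (len(data) * h * math.sqrt(2 * math.pi))
    return [norm * sum(math.exp(-0.5 * ((x - xi) / h) ** 2) for xi in data)
            for x in grid]

def count_local_maxima(values):
    """Number of interior grid points strictly above both neighbours."""
    return sum(1 for i in range(1, len(values) - 1)
               if values[i] > values[i - 1] and values[i] > values[i + 1])

# Data transcribed from Table 3, row by row.
DATA = [62, 36, 65, 28, 25, 80, 30, 51, 84, 17,
        78, 29, 41, 65, 29, 25, 36, 28, 88, 23,
        61, 36, 36, 41, 24, 83, 77, 24, 27, 71,
        63, 50, 81, 60, 24, 64, 33, 29, 48, 30,
        28, 68, 48, 23, 41, 20, 37, 74, 50, 27,
        30, 36, 74, 25, 21, 19, 35, 69, 70, 40,
        28, 57, 63, 24, 68, 73, 42, 76, 72, 60,
        30, 60, 59, 28, 65, 69, 65, 37, 66, 32,
        58, 67, 30, 39, 34, 75, 56, 78, 75, 73,
        66, 75, 31, 66, 19, 84, 37, 82, 74, 61]

GRID = list(range(1, 101))
for h in (1, 4, 7, 20):
    dens = gaussian_kde(DATA, h, GRID)
    peak_x = GRID[dens.index(max(dens))]
    print("h =", h, "local maxima:", count_local_maxima(dens),
          "highest point at x =", peak_x)
```

The lists returned by gaussian_kde can be passed directly to any plotting tool to produce the four requested curves.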

6. For the paired (X,Y) data in Table 1, estimate the function mu(X) in the nonlinear regression

Y_i = mu(X_i) + error_i

by using a kernel smoother based on the standard Gaussian kernel. Compute and plot estimates of mu(X) based on the bandwidths h=4, 6, and 8. Assuming that the true function mu(X) is smooth, which of these three bandwidths seems to give the most reasonable estimate of mu(X)? (Or would you prefer to use a bandwidth other than h=4, 6, or 8? If so, which?)
(Hint: See the program NonParRegr1 on the Math408 Web site. HOWEVER, calculate and plot mu(X) for ALL VALUES of X in the range (for example) X=1,...,100, in each case using all observations (X_j,Y_j), instead of calculating mu(X) only at the values X=X_j, as was done in NonParRegr1.m. Note that the density f(X) in DensEst.m was estimated at X=1,...,100 instead of just at X=X_i. Ignore the code in NonParRegr1.m that constructs the linear smoother.)
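
One common form of Gaussian kernel smoother is the weighted average mu(x) = Sum_j K((x-X_j)/h)*Y_j / Sum_j K((x-X_j)/h), with K the standard Gaussian kernel. The sketch below assumes that this form (often called the Nadaraya-Watson estimator) is the one intended; it is written in Python, is not the course program NonParRegr1.m, and uses a hypothetical function name. As the hint asks, it evaluates mu(x) on the whole grid x=1,...,100 using all observations at each grid point. The data in the example are made up for illustration and are not Table 1.

```python
import math

def kernel_smoother(x, y, h, grid):
    """Gaussian-kernel weighted average of y at each grid point, bandwidth h."""
    out = []
    for g in grid:
        w = [math.exp(-0.5 * ((g - xj) / h) ** 2) for xj in x]
        total = sum(w)
        if total > 0:
            out.append(sum(wj * yj for wj, yj in zip(w, y)) / total)
        else:
            # All weights underflowed: the grid point is too far from the data.
            out.append(float("nan"))
    return out

# Hypothetical illustration data (NOT Table 1).
x_demo = [22, 30, 35, 41, 48, 55, 63, 70, 78, 85, 92, 99]
y_demo = [5, 9, 20, 31, 60, 62, 140, 200, 350, 600, 650, 900]

GRID = list(range(1, 101))
for h in (4, 6, 8):
    mu_hat = kernel_smoother(x_demo, y_demo, h, GRID)
    print("h =", h, "mu(25) =", mu_hat[24], "mu(75) =", mu_hat[74])
```

Each estimate is a weighted average of the observed Y values, so it always lies between the smallest and largest Y; far from the data it approaches the Y value of the nearest observation.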
