Math 460: Multivariate Statistical Analysis
Spring 2016

Instructor: Todd Kuffner (kuffner@math.wustl.edu)

Lecture: 2:30-4:00pm, Tuesday and Thursday, Cupples I, Room 218

Office Hours: Monday 3:00-4:00pm, Tuesday/Thursday 1:05-2:00pm in Room 18, Cupples I

Course Overview: This course introduces multivariate statistical analysis. The material will be presented at a level suitable for advanced undergraduate and master's degree students. Topics include: review of some important concepts (likelihood, quadratic forms, random vectors and matrices, multiple regression and variable selection), an overview of classical multivariate statistics, multivariate regression, dimensionality reduction, discriminant analysis and classification. Additional topics will be selected from modern statistical learning methodology. Emphasis will be given to applications using R.

Prerequisite: It is assumed that students are already familiar with probability at the level of Math 493, and have taken a course in linear models, such as Math 439. Familiarity with R is essential. A course in computer programming would be helpful. Knowledge of multivariate calculus and matrix algebra at the level of Math 233 and Math 309, respectively, is assumed.

Piazza: Make sure to enroll in this course on Piazza.

Textbook: There is no required textbook for the course, but I do recommend using additional references. Some good open-access books:
Another good (but not free) reference is Applied Multivariate Statistical Analysis (Sixth Edition) by Johnson and Wichern.

Computing: Familiarity with R is required. You can find many tutorials by clicking here. On the left side under Documentation, select Contributed to see a list of tutorials. Paul Hewson has compiled a wonderful resource page for R packages relevant for multivariate statistical analysis: click here. Also see his textbook link above, which includes material on matrices.

List of Topics (tentative):

Grades: 30% Homework, 35% Midterm, 35% Final

Exams: 1 midterm and 1 final.

Homework: The lowest homework grade will be dropped. Homework is due at the beginning of class on the specified due date.

Final Course Grade: The letter grades for the course will be determined according to the following numerical grades on a 0-100 scale.
A+  [98, 100]    B+  [87, 90)    C+  [77, 80)    D+  [67, 70)    F  [0, 60)
A   [93, 98)     B   [83, 87)    C   [73, 77)    D   [63, 67)
A-  [90, 93)     B-  [80, 83)    C-  [70, 73)    D-  [60, 63)
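
Illustration only: the Python sketch below shows one way the grade weights and the letter-grade intervals could be combined into a final course grade. It is not official course policy. It assumes one midterm weighted 35% (the syllabus lists 1 midterm and 1 final), that every score is on a 0-100 scale, that homework assignments count equally, and that no rounding is applied. The function names are hypothetical.

```python
# Hypothetical sketch of the grading arithmetic; not official course policy.
# Assumptions: all scores are on a 0-100 scale, homework assignments are
# weighted equally, there is one midterm, and no rounding is applied.

# Lower bound of each letter-grade interval, highest first.
LETTER_CUTOFFS = [
    (98, "A+"), (93, "A"), (90, "A-"),
    (87, "B+"), (83, "B"), (80, "B-"),
    (77, "C+"), (73, "C"), (70, "C-"),
    (67, "D+"), (63, "D"), (60, "D-"),
    (0, "F"),
]


def homework_average(scores):
    """Mean homework score after dropping the single lowest grade."""
    if len(scores) < 2:
        raise ValueError("need at least two homework scores to drop one")
    kept = sorted(scores)[1:]  # discard the lowest score
    return sum(kept) / len(kept)


def course_score(homework, midterm, final):
    """Weighted score: 30% homework, 35% midterm, 35% final."""
    return 0.30 * homework_average(homework) + 0.35 * midterm + 0.35 * final


def letter_grade(score):
    """Map a 0-100 score to a letter using the half-open intervals."""
    if not 0 <= score <= 100:
        raise ValueError("score must lie in [0, 100]")
    for cutoff, letter in LETTER_CUTOFFS:
        if score >= cutoff:
            return letter


# Made-up scores, for illustration only.
example = course_score([90, 80, 100, 60], midterm=80, final=90)
print(round(example, 2), letter_grade(example))
```

With these made-up scores, the homework grade of 60 is dropped, leaving a homework average of 90. The weighted score is 0.30(90) + 0.35(80) + 0.35(90) = 86.5, which falls in [83, 87) and corresponds to a B.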



Course Schedule: This will be updated regularly. Future assignment due dates are tentative and subject to change.
Week 1
01/18-01/22
Theme: Review
Types and visualizations of multivariate data; introduction to classical multivariate analysis; random vectors and multivariate normal; matrix decompositions; matrix norms; basics of numerical analysis: error sources (data, truncation, rounding); machine precision; ill-conditioning and condition numbers of matrices; examples in R
Week 2
01/25-01/29
Theme: Random Matrices
Random matrices; sample covariance matrix; Wishart distribution; Hotelling's T-squared; maximum likelihood estimation; application to distribution of eigenvalues
Week 3
02/01-02/05
Theme: Principal Components Analysis
Dimensionality reduction; biplots; scree plots; geometric interpretation; image compression; applications in R

Week 4
02/08-02/12
Theme: Acquiring Multivariate Data and Canonical Correlation Analysis
Web scraping; applications to Twitter; sentiment analysis; R package twitteR

Canonical variate and canonical correlation analysis; examples in R
Week 5
02/15-02/19
Theme: Linear Models Review
Example in R; matrix calculus; the hat matrix; review of vector spaces; geometric interpretation of least squares; decompositions of sums of squares (using orthogonal complements, and using projections); consistency of the normal equations; generalized inverses; projection matrices; Gauss-Markov theorem; properties of idempotent matrices; distributions of quadratic forms (Cochran's theorem); hypothesis testing and confidence intervals

Common problems: collinearity; transformations; omitted variables; non-constant variance; p>n
Week 6
02/22-02/26
Theme: Introduction to High-Dimensional Statistics
Curse of dimensionality and failure of local averaging; geometry of high-dimensional spaces; vanishing volumes of high-dimensional balls (and crust concentration); false positive control in linear regression; poor properties of empirical covariance matrix; computational complexity; inadequacy of classical asymptotics

Gaussian concentration inequality; Lipschitz functions; flattening of multivariate normal density in high dimensions
Week 7
02/29-03/04
Theme: Model Selection in High-Dimensional Linear Regression
Sparsity; Akaike Information Criterion; optimality and decision theory; oracle risk bounds; minimax risk bounds
Week 8
03/07-03/11
Theme: Variable Selection
Convex optimization; Karush-Kuhn-Tucker conditions; Lagrangian duality; subgradients and gradient descent; examples of estimators and convex programs (lasso, elastic net)

Algorithms; gradient descent; least angle regression; SCAD and nonconvex programs; examples in R

R packages: lars, glmnet, flare
Week 9
03/14-03/18
Spring Break
Week 10
03/21-03/25
Theme: Practical Issues
Tuning parameters; cross-validation; nonparametric bootstrap; bootstrap confidence intervals; more examples (Dantzig selector, square root lasso); dimension reduction for regression
Week 11
03/28-04/01
Theme: Post-Selection Inference and Multiple Testing
Selective inference, simultaneous inference; covariance test, spacing test; stability selection; polyhedral lemma; review of multiple testing; FDR, FWER, FCR; Benjamini-Hochberg procedure; sequential testing; ForwardStop

R package: selectiveInference
Week 12
04/04-04/08
Theme: Post-Selection Inference
High-dimensional inference; multi sample splitting; de-sparsified lasso; ridge projection
R packages: hdi, PoSI
Week 13
04/11-04/15
Theme: Multivariate Regression and Classification
Concepts in multivariate regression; testing; linear discriminant analysis; support vector machines
Week 14
04/18-04/22
Theme: Predictive Modeling
Classification and regression trees; bagging; boosting; AdaBoost
Week 15
04/25-04/29
Theme: Predictive Modeling
Artificial neural networks; problems in statistical inference for predictive models
Reading Period
05/02-05/04


Other Course Policies: Students are encouraged to look at the Faculty of Arts & Sciences policies.