Instructor: Todd Kuffner
Lecture: MWF 4:00-4:50pm
Course Description: Model selection is ubiquitous in modern
statistical applications. When the model is chosen after viewing
the data, classical procedures for statistical inference are no
longer valid. Post-selection inference, particularly inference
for linear regression coefficients after variable selection, is
one of the most active and important topics in statistics
today. In this course, we will explain the sources of the
problem, discuss the different perspectives on what the
inferential targets and goals are, and present cutting-edge
solutions to the problem of post-selection inference.
Paradigms to be studied include high-dimensional or
post-regularization inference, simultaneous inference intended
to control familywise error rates, and selective inference to
control false discovery rates for selected parameters. The
material will be taught at the level of advanced undergraduates,
and is also suitable for graduate students having the necessary
background.
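The claim that classical inference breaks down after selection can be checked with a small simulation. The sketch below is only an illustration under simplifying assumptions that are not part of the course materials: the candidate effects are independent standard normal z-statistics, every null hypothesis is true, and "selection" means keeping the candidate with the largest absolute z-statistic. The function name and parameters are hypothetical, and Python is used here purely for illustration (the course itself uses R).

```python
import random
from statistics import NormalDist


def naive_rejection_rate(num_candidates=10, num_sims=20000, alpha=0.05, seed=0):
    """Estimate how often a naive two-sided z-test rejects a true null
    when it is applied only to the candidate that looked most extreme."""
    rng = random.Random(seed)
    # Classical two-sided critical value (about 1.96 when alpha = 0.05).
    critical = NormalDist().inv_cdf(1 - alpha / 2)
    rejections = 0
    for _ in range(num_sims):
        # All nulls are true: each z-statistic is pure noise.
        z = [rng.gauss(0.0, 1.0) for _ in range(num_candidates)]
        # Selection step: keep the candidate with the largest |z|.
        winner = max(z, key=abs)
        # Naive inference step: test the winner as if it had been fixed in advance.
        if abs(winner) > critical:
            rejections += 1
    return rejections / num_sims


if __name__ == "__main__":
    print("no selection (1 candidate):  ", naive_rejection_rate(num_candidates=1))
    print("after selection (10 candidates):", naive_rejection_rate(num_candidates=10))
    # Under these assumptions the exact rate for 10 candidates is 1 - 0.95**10.
    print("theoretical value for 10:    ", 1 - 0.95 ** 10)
```

With a single candidate the naive test rejects at about the nominal 5% rate; after picking the best of ten, the rejection rate is close to 1 - 0.95^10, roughly 40%, even though no effect exists.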
Prerequisite:
Math 493, Math 494, Math 439, and experience using R.
Textbook:
There are no reference books on this topic, as it is very new.
Students will be required to read articles published in
peer-reviewed journals and/or on the arXiv. Lectures will fill
in gaps in students' background knowledge.
Course Topics: Here is a list of potential topics.
We will cover some subset of these, depending on time and
background of the students:
- The variable selection problem and variable selection
methods in linear regression (throughout the course).
- Elements of (i) multiple testing; (ii) bootstrap for
linear regression; and (iii) distribution theory for linear
regression.
- False discovery rate, Benjamini-Hochberg procedure,
familywise error, false coverage statements and selective
type I error.
- Concepts in selective and simultaneous inference, and
differences between full model and submodel viewpoints.
- The "winner's curse", file drawer effect, and other
problems caused by selection.
- Leeb & Pötscher's impossibility results, and the
question of whether one can consistently estimate model
selection probabilities.
- Data splitting and data carving.
- The PoSI procedure and refinements.
- Bootstrap inference using lasso estimators -- problems and
solutions.
- The CovTest for the lasso, selective inference after
affine selection procedures based on truncated Gaussian
statistics.
- Inference after marginal screening and inference after
forward selection.
- Procedures to transform sequences of p-values.
- Pseudovariables, knockoffs, and Model X approaches.
- Yekutieli's approach to Bayesian post-selection inference
via truncated likelihoods and selection-adjusted posteriors.
- Post-selection prediction.
- Assumption lean regression and model-robust approaches.
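For orientation on one of the listed topics, here is a minimal sketch of the Benjamini-Hochberg step-up procedure: sort the m p-values, find the largest rank k such that the k-th smallest p-value is at most k*q/m, and reject the hypotheses with the k smallest p-values. The function name, argument names, and example p-values are hypothetical and chosen only for illustration; Python is used for the sketch, although the course itself works in R.

```python
def benjamini_hochberg(pvalues, q=0.05):
    """Return the sorted indices of hypotheses rejected by the
    Benjamini-Hochberg step-up procedure at target FDR level q."""
    m = len(pvalues)
    # Indices of the hypotheses, ordered from smallest to largest p-value.
    order = sorted(range(m), key=lambda i: pvalues[i])
    largest_passing_rank = 0
    for rank, index in enumerate(order, start=1):
        # Compare the rank-th smallest p-value with its threshold rank * q / m.
        if pvalues[index] <= rank * q / m:
            largest_passing_rank = rank
    # Step-up rule: reject everything up to the largest passing rank,
    # even if some earlier ranks missed their own thresholds.
    return sorted(order[:largest_passing_rank])


if __name__ == "__main__":
    example = [0.01, 0.04, 0.03, 0.005]  # made-up p-values
    print(benjamini_hochberg(example, q=0.05))
```

In this made-up example all four hypotheses are rejected, whereas a Bonferroni correction at the same level (threshold 0.05/4 = 0.0125) would reject only the two smallest p-values.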
Important Dates and Course Schedule: Details will be posted on Canvas. I
will probably update the table below later in the semester to
detail what was covered for future reference.
Jan. 13 | First day of classes
Jan. 20 | No class (Martin Luther King Holiday)
Jan. 23 | Last day to drop/add
March 9-13 | No classes (Spring Break)
April 24 | Last day of classes
Course Policies and Grades
Canvas:
During the semester, all course-related materials and
announcements will be posted to Canvas and/or sent by email to
registered students.
Grades:
Homework 35%, Paper Discussion 20%, Participation 15%, Final
Project & Group Presentation 30%
Homework: Roughly 1 homework for every 5-6 lectures.
You may discuss problems with other students, but the solutions
you submit must be entirely your own work. Explanations
detailing the steps of proofs or other mathematical arguments
are required for full credit. You are encouraged, but not
required, to write your solutions in TeX/LaTeX and submit the
printed version. I will drop the lowest homework grade only if
you have submitted every homework and genuinely attempted all
of the problems.
Homework assignments may include the following tasks: (i) data
analysis and implementation of post-selection inference
procedures in R (using the newest packages); (ii) designing
simulation experiments and writing R code; (iii) mathematical
derivations; (iv) reproducing simulations or analyses in
academic papers; (v) writing critical analyses of studies
published in applied journals; (vi) critical analysis of
historical statistical literature.
Paper Discussion: For many of the academic papers
that we examine during lectures, we will hold discussions during
lecture. For each of these papers, I will ask two or three
students (separately) to prepare brief comments and questions
(amounting to about 3-5 minutes of speaking) to facilitate the
class discussions; it's best if these students do not talk about
the paper beforehand, to maximize the number of unique points of
view. Each student will be asked to do this for 2 or 3 papers,
depending on the actual pace of the course.
Participation:
Attendance and participation are required for all lectures.
Attendance is not enough. Participation includes: (i) reading
the relevant paper or background material before lecture, and
bringing it with you for reference; (ii) answering
questions that I ask the class, and participating in the class
discussions; (iii) providing a summary,
definition, or result from the previous lecture when I ask you
to.
Final Project & Group Presentation: Groups will be assigned
after the drop/add deadline. It's a good idea to start
on this project early in the semester, though it cannot be
completed until late in the semester.
- Each group must do their own literature search to find at
least 3 real-world examples of situations in which inference
has been performed after model selection or variable
selection, but without taking selection into account. These
examples must come from either (i) reputable academic
journals in any field (check with me when you think you've
found something); or (ii) verifiable cases where policy
decisions were made on the basis of such inferences after
model selection. In order to qualify, you must be able to
obtain the data set used in the example, and either have the
author's code or a sufficiently detailed description of the
analysis that you can confidently implement what is
described (whether it ultimately agrees with what has been
published or not). By "policy decisions" I specifically mean
decisions by governments, institutions, companies, or
regulatory authorities that had a meaningful impact on the
world or the people in it.
- Reproduce the analysis, i.e. the statistical inference
reported in the academic paper or underlying the policy
decision, without taking selection into account.
- Using at least two post-selection inference methods
learned in this course, carry out post-selection inference
for the same problem. If the qualitative findings are the
same, give an explanation for why you think this is the
case. If the findings are different, e.g. the policy
decision would change if selection is taken into account,
give an explanation for why you think this is the case. Part
of this process is to check if your post-selection
inferences are robust to violations of modeling assumptions.
- The group must submit a 10-20 page report, written in
LaTeX, as well as the R code and data sets, and prepare a
25-minute presentation about one of the examples for the rest
of the class using slides (made with the Beamer document
class in LaTeX). The speaking roles in the presentation must
be shared equally among all members of the group. The final
report and presentation will be due during the final two
weeks of classes.
Final Course Grade: The letter grades for the course will be
determined according to the following numerical grades on a
0-100 scale.
A+ | impress me | B+ | [87, 90) | C+ | [77, 80) | D+ | [67, 70) | F | [0, 60)
A  | 93+        | B  | [83, 87) | C  | [73, 77) | D  | [63, 67) |   |
A- | [90, 93)   | B- | [80, 83) | C- | [70, 73) | D- | [60, 63) |   |
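As a rough illustration of how the component weights and the letter-grade cutoffs combine, the sketch below computes a weighted course score and maps it to a letter. The function names, dictionary keys, and sample scores are hypothetical. The sketch assumes each component is scored on a 0-100 scale and that no rounding is applied, it ignores the rule about dropping the lowest homework grade, and it never returns A+ because that grade is awarded at the instructor's discretion.

```python
# Component weights in percent, as listed under "Grades".
WEIGHTS_PERCENT = {
    "homework": 35,
    "paper_discussion": 20,
    "participation": 15,
    "final_project": 30,
}

# Lower cutoffs for each letter grade, from highest to lowest.
CUTOFFS = [
    (93, "A"), (90, "A-"),
    (87, "B+"), (83, "B"), (80, "B-"),
    (77, "C+"), (73, "C"), (70, "C-"),
    (67, "D+"), (63, "D"), (60, "D-"),
]


def course_score(component_scores):
    """Weighted average of component scores, each on a 0-100 scale."""
    total = sum(WEIGHTS_PERCENT[name] * component_scores[name]
                for name in WEIGHTS_PERCENT)
    return total / 100


def letter_grade(score):
    """Map a 0-100 course score to a letter using the half-open intervals."""
    for cutoff, letter in CUTOFFS:
        if score >= cutoff:
            return letter
    return "F"


if __name__ == "__main__":
    sample = {"homework": 92, "paper_discussion": 85,
              "participation": 100, "final_project": 80}  # made-up scores
    score = course_score(sample)
    print(score, letter_grade(score))
```

For the made-up scores above, the weighted score is (35*92 + 20*85 + 15*100 + 30*80) / 100 = 88.2, which falls in [87, 90) and so corresponds to a B+.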
Other Course Policies: Students are encouraged to look at the
Faculty of Arts & Sciences policies.
- Academic integrity:
Students are expected to adhere to the University's policy on
academic integrity.
- Auditing: There is an option to audit, but this still
involves enrolling in the course. See the Faculty of Arts &
Sciences policy on auditing. Auditing students will still be
expected to attend all lectures and complete all required
coursework and exams. A course grade of 75 is required for a
successful audit.
- Collaboration:
Students are encouraged to discuss homework with one
another, but each student must submit separate solutions,
and these must be the original work of the student.
- Exam conflicts:
Read the University policy.
The exam dates for this course are posted before the
semester begins, and thus you are expected to be present at
all exams.
- Late homework:
Only by prior arrangement. If a valid reason for an
exception is not presented at least 36 hours before a
homework due date, then the homework will not be accepted
late (a zero will be given for that assignment).
- Missed exams:
There are no make-up exams. For valid excused absences from
midterm exams - such as medical, family, transportation and
weather-related emergencies - the contribution of that
midterm to the final course grade will be redistributed
equally to the other midterm exam and final exam. Students
missing both midterm exams and/or the final exam cannot earn
a passing grade for the course.