Statistics Seminar: "Towards a Better Understanding of Least Squares Linear Regression"

Arun Kumar Kuchibhotla, University of Pennsylvania

Abstract: For the last two decades, high-dimensional data and methods have proliferated throughout the statistics literature.  However, the classical technique of linear regression has not lost its importance in applications.  Although the current literature is more focused on high-dimensional parameter estimation, there is more to understand about the least squares linear regression technique in order to bridge the gap between theory and practice.

The talk consists of three parts, each answering a fundamental question related to linear regression.  Suppose we have $n$ random vectors (regression-type data) and the least squares linear regression algorithm is applied to these data.  Under what minimal assumptions can one make sense of the slope estimator?  Is there a meaningful quantity being estimated?  This is the first fundamental question, and we answer it under minimal assumptions that do not even include independence.
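
To make the first question concrete, here is a minimal sketch in notation that is not part of the abstract itself; the symbols $(X_i, Y_i)$, $p$, $\hat{\beta}_n$ and $\beta_n$ are assumptions introduced only for illustration.  Writing the data as $(X_i, Y_i)$, $i = 1, \dots, n$, with $X_i \in \mathbb{R}^p$, the least squares slope estimator and one natural candidate for the quantity it estimates are

\[
\hat{\beta}_n = \arg\min_{\theta \in \mathbb{R}^p} \frac{1}{n} \sum_{i=1}^{n} \left( Y_i - X_i^{\top} \theta \right)^2,
\qquad
\beta_n = \arg\min_{\theta \in \mathbb{R}^p} \frac{1}{n} \sum_{i=1}^{n} \mathbb{E}\left[ \left( Y_i - X_i^{\top} \theta \right)^2 \right].
\]

Neither definition requires a linear model, identical distributions, or independence; both need only finite second moments and a unique minimizer.  Whether this is exactly the target used in the talk is not stated in the abstract.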

The second question concerns the nature of the ordinary least squares (OLS) estimator when the regressors have been sub-selected by some variable selection procedure.  We answer this question in full generality by proving a deterministic uniform-in-model result about linear regression, which provides an interpretation of the estimator irrespective of the data-dependent variable selection procedure used.

The final question we consider is how to perform statistical inference using the OLS estimator obtained from a variable selection procedure. This problem is exactly the problem of valid Post-Selection Inference (PoSI).  The problem of valid PoSI for the OLS estimator can be solved in multiple ways.  This talk will focus on one approach based on an asymptotic linear representation and a high-dimensional central limit theorem.

All our results are proved without assuming any probability model, and they allow for non-identically distributed random vectors.  In addition, they apply equally to independent and functionally dependent data.

Joint work with the Wharton Linear Models Group including Lawrence Brown, Andreas Buja, Edward George and Linda Zhao.  Some of this talk is based on

Host: Todd Kuffner