Statistics and Data Science Seminar: "New results in model validation from cross-validation to concentration inequalities"

Speaker: Morgane Austern, Department of Statistics, Harvard University

Abstract: Estimating and evaluating the generalization capabilities of an estimator is a fundamental task of statistical inference. In this talk, we are interested in better understanding how well the cross-validated risk estimates the true risk, and in improving finite-sample generalization bounds. In the first part of the talk, we study cross-validation, a ubiquitous method for risk estimation, and establish its asymptotic properties for a large class of models and an arbitrary number of folds. Under stability conditions, we establish a central limit theorem and Berry-Esseen bounds for the cross-validated risk, which enable us to compute asymptotically accurate confidence intervals. Using these results, we study the statistical speed-up offered by cross-validation compared to a train-test split procedure, reveal some surprising behavior of the cross-validated risk, and establish the statistically optimal choice for the number of folds. In the second part of the talk, we note that concentration inequalities are fundamental tools for obtaining finite-sample generalization guarantees. However, these bounds are known to be loose, which can be a serious limitation for their use in reinforcement learning and other machine learning applications. Limit theorems, on the other hand, provide bounds that are asymptotically tight but not valid for finite samples. Motivated by this observation, we propose a new method for deriving concentration inequalities that are both valid in finite samples and asymptotically optimal. We demonstrate that the resulting bounds improve on classical concentration inequalities such as the Bernstein and Azuma inequalities.
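
For readers unfamiliar with the setup, the following is a minimal sketch (not the speaker's method) of a K-fold cross-validated risk estimate paired with a naive normal-approximation confidence interval, which is the kind of interval a central limit theorem for the cross-validated risk would justify asymptotically. The model, loss, and simulated data are illustrative assumptions.

```python
# Illustrative sketch: K-fold cross-validated risk with a naive
# normal-approximation confidence interval. Model/loss/data are placeholders.
import numpy as np
from scipy import stats
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold

rng = np.random.default_rng(0)
n, d = 500, 10
X = rng.normal(size=(n, d))
y = X @ rng.normal(size=d) + rng.normal(size=n)

K = 5                 # number of folds
losses = np.empty(n)  # per-observation squared-error losses
for train_idx, test_idx in KFold(n_splits=K, shuffle=True, random_state=0).split(X):
    model = Ridge(alpha=1.0).fit(X[train_idx], y[train_idx])
    losses[test_idx] = (y[test_idx] - model.predict(X[test_idx])) ** 2

cv_risk = losses.mean()
# Naive plug-in standard error; a CLT under stability conditions is what
# justifies intervals of this form asymptotically.
se = losses.std(ddof=1) / np.sqrt(n)
half_width = stats.norm.ppf(0.975) * se
print(f"cross-validated risk: {cv_risk:.3f}, "
      f"95% CI: ({cv_risk - half_width:.3f}, {cv_risk + half_width:.3f})")
```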
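
To illustrate the tension between finite-sample validity and asymptotic tightness mentioned in the second part, the sketch below (an assumption-laden toy comparison, not the speaker's new bounds) contrasts the deviation width guaranteed by Bernstein's inequality for bounded i.i.d. variables with the corresponding asymptotic normal quantile.

```python
# Toy comparison: Bernstein deviation width (finite-sample valid, loose)
# versus the CLT quantile (asymptotically tight, not finite-sample valid),
# for a sum of n i.i.d. mean-zero variables bounded by M with variance sigma^2.
import numpy as np
from scipy import stats

n, sigma, M, delta = 1000, 1.0, 2.0, 0.05
# Bernstein: P(|S_n| >= t) <= 2 exp(-t^2 / (2 (n sigma^2 + M t / 3))).
# Setting the right-hand side to delta and solving the quadratic in t:
L = np.log(2 / delta)
t_bernstein = M * L / 3 + np.sqrt((M * L / 3) ** 2 + 2 * n * sigma**2 * L)
# CLT approximation: |S_n| is of order z_{1 - delta/2} * sigma * sqrt(n).
t_clt = stats.norm.ppf(1 - delta / 2) * sigma * np.sqrt(n)
print(f"Bernstein width: {t_bernstein:.1f}, CLT width: {t_clt:.1f}")
```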

Host: Debashis Mondal