Titles and Abstracts
Titles/abstracts for the Third Workshop on Higher-Order Asymptotics and Post-Selection Inference, (WHOA-PSI)^{3}. More information is available on the main conference page. Contact: Todd Kuffner, email: kuffner@wustl.edu
Talks
Karim Abadir, Imperial College London / American University in Cairo
Title: Link of moments before and after transformations, with an application to resampling from fat-tailed distributions
Abstract: Let
x be a transformation of y, whose distribution is unknown. We derive an
expansion formulating the expectations of x in terms of the
expectations of y. Apart from the intrinsic interest in such a
fundamental relation, our results can be applied to calculate E(x) from
the low-order moments of a transformation which can be chosen to give a
good approximation to E(x). To do so, we generalize the approach of
bounding the terms in expansions of characteristic functions, and use
our result to derive an explicit and accurate bound for the remainder
when a finite number of terms are taken. We illustrate one of the
implications of our method by providing accurate naive bootstrap
confidence intervals for the mean of a fat-tailed distribution with an
infinite variance, in which case currently-available bootstrap methods
are asymptotically invalid and unreliable in finite samples.
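For a concrete point of reference, the naive percentile-bootstrap interval mentioned above can be sketched as follows (a minimal Python illustration with assumed Pareto data; the function name and settings are hypothetical, and the talk's contribution is precisely in correcting such intervals when the variance is infinite):

import numpy as np

rng = np.random.default_rng(0)

def naive_bootstrap_ci(x, level=0.95, B=2000):
    """Percentile bootstrap CI for the mean: resample with replacement,
    recompute the mean, and take empirical quantiles of the replicates."""
    boot_means = np.array([rng.choice(x, size=x.size, replace=True).mean()
                           for _ in range(B)])
    alpha = 1 - level
    return np.quantile(boot_means, [alpha / 2, 1 - alpha / 2])

# Fat-tailed example: Pareto with tail index 1.5 (finite mean, infinite variance)
x = rng.pareto(1.5, size=500) + 1.0
print(naive_bootstrap_ci(x))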
Genevera Allen,
Rice University
Title: Inference, Computation, and Visualization for Convex Clustering and Biclustering
Abstract: Hierarchical
clustering enjoys wide popularity because of its fast computation, ease
of interpretation, and appealing visualizations via the dendrogram and
cluster heatmap. Recently, several authors have proposed and studied
convex clustering and biclustering which, similar in spirit to
hierarchical clustering, achieve cluster merges via convex fusion
penalties. While these techniques enjoy superior statistical
performance, they suffer from slower computation and are not generally
conducive to representation as a dendrogram. In the first part of
the talk, we present new convex (bi)clustering methods and fast
algorithms that inherit all of the advantages of hierarchical
clustering. Specifically, we develop a new fast approximation and
variation of the convex (bi)clustering solution path that can be
represented as a dendrogram or cluster heatmap. Also, as one
tuning parameter indexes the sequence of convex (bi)clustering
solutions, we can use these to develop interactive and dynamic
visualization strategies that allow one to watch data form groups as
the tuning parameter varies. In the second part of this talk, we
consider how to conduct inference for convex clustering solutions that
addresses questions like: Are there clusters in my data set? Or, should
two clusters be merged into one? To achieve this, we develop a
new data decomposition in terms of Hotelling's T^2-test that allows us
to use the selective inference paradigm to test multivariate hypotheses
for the first time. We can use this approach to test hypotheses
and calculate confidence ellipsoids on the cluster means resulting from
convex clustering. We apply these techniques to examples from
text mining and cancer genomics. This is joint work with John Nagorski,
Michael Weylandt, and Frederick Campbell.
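For context, the fusion-penalty formulation referred to above is, in its standard form (not the new variants developed in the talk),

$$\min_{u_1,\dots,u_n} \; \frac{1}{2}\sum_{i=1}^{n} \|x_i - u_i\|_2^2 \;+\; \lambda \sum_{i<j} w_{ij}\,\|u_i - u_j\|_q,$$

where each observation $x_i$ has a centroid $u_i$, the $w_{ij}$ are nonnegative weights, and observations are assigned to the same cluster when their centroids fuse; increasing $\lambda$ along the solution path produces progressively coarser clusterings.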
Rina Foygel Barber,
University of Chicago
Title: Robust inference with the knockoff filter
Abstract: In
this talk, I will present ongoing work on the knockoff filter for
inference in regression. In a high-dimensional model selection problem,
we would like to select relevant features without too many false
positives. The knockoff filter provides a tool for model selection by
creating knockoff copies of each feature and testing the model selection
algorithm's ability to distinguish true from false covariates, thereby
controlling the false positives. In practice, the modeling assumptions that
underlie the construction of the knockoffs may be violated, as we
cannot know the exact dependence structure between the various
features. Our ongoing work aims to determine and improve the robustness
properties of the knockoff framework in this setting. We find that when
knockoff features are constructed using estimated feature distributions
whose errors are small in a KL divergence type measure, the knockoff
filter provably controls the false discovery rate at only a slightly
higher level. This work is joint with Emmanuel Candes and Richard
Samworth.
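For context, the standard (non-robust) knockoff filter computes a statistic $W_j$ for each feature, with large positive values indicating evidence that feature $j$ is relevant, and selects $\{j : W_j \ge \tau\}$ using the data-dependent threshold

$$\tau \;=\; \min\Big\{ t > 0 \;:\; \frac{1 + \#\{j : W_j \le -t\}}{\max\big(\#\{j : W_j \ge t\},\, 1\big)} \;\le\; q \Big\},$$

which controls the FDR at level $q$ when the knockoff copies are constructed from the exact feature distribution; the results described above quantify how much this guarantee degrades when that distribution is only estimated.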
Heather Battey, Imperial College London
Title: Large numbers of explanatory variables
Abstract: The
lasso and its variants are powerful methods for regression analysis
when there are a small number of study individuals and a large number
of potential explanatory variables. The result is a single model, although
there may be several models equally compatible with the data. I will
outline a different approach, whose aim is essentially a confidence set
of effective simple representations. A probabilistic assessment of the
method is given and post-selection inference is discussed in connection
with the resulting `confidence set' of models.
Pierre Bellec, Rutgers University
Title: Model selection, model averaging?
Abstract: TBD
Jelena Bradic, UC San Diego
Title: Semi-supervised high-dimensional learning: in search of optimal inference
Abstract: to follow
Andreas Buja, University of Pennsylvania
Title: PoSI under Misspecification in High Dimensions and Construction of PoSI Statistics
Abstract: Berk
et al. (2013) provided valid post-selection inference under a classical
Gaussian linear model. In this talk, I will first present some recent
advances for PoSI under misspecification as well as a diverging number of
covariates. After this discussion, we present some deficiencies of the
``max-|t|'' PoSI statistic and provide some remedies. From this, three
different PoSI confidence regions arise which will be compared. Joint
work with Lawrence D. Brown, Arun K. Kuchibhotla, Ed George, Linda
Zhao, Junhui Cai.
Emmanuel Candes, Stanford University
Title: What do we really know about logistic regression? A modern maximum-likelihood theory
Abstract: Logistic
regression is the most popular model in statistics and machine learning
to fit binary outcomes and assess the statistical significance of
explanatory variables. Alongside, there is a classical theory of
maximum likelihood (ML) estimation, which is used by all statistical
software packages to produce inference. In the common modern setting
where the number of explanatory variables is not negligible compared to
the sample size, we show that this theory leads to inferential
conclusions that cannot be trusted. We develop a new theory that
provides expressions for the bias and variance of the ML estimate and
characterizes the asymptotic distribution of the likelihood-ratio
statistic under some assumptions regarding the distribution of the
explanatory variables. This novel theory can be used to provide valid
inference. If time allows, we will also explain how our theory can deal
with regularized logistic regression such as the logistic ridge or the
logistic LASSO.
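A quick simulation conveys the phenomenon (this is an illustrative sketch, not taken from the talk; the sample sizes and signal strengths are arbitrary choices): with p/n held at a fixed fraction, the unpenalized ML estimates of the nonzero coefficients are systematically inflated relative to the truth, even though classical theory says they should be approximately unbiased.

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
n, p = 2000, 400                       # p/n = 0.2: not negligible
X = rng.standard_normal((n, p)) / np.sqrt(n)
beta = np.zeros(p)
beta[:50] = 10.0                       # hypothetical nonzero signals
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-X @ beta)))

fit = sm.Logit(y, X).fit(disp=0)       # classical (unpenalized) ML fit
# Average fitted value on the signal coordinates vs. the true value 10:
print(fit.params[:50].mean())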
Hongyuan Cao, Florida State University
Title: Statistical Methods for Integrative Analysis of Multi-Omics Data
Abstract: Genome-wide
complex trait analysis (GCTA) was developed and applied to heritability
analyses on complex traits and more recently extended to mental
disorders. However, besides the intensive computation, previous
literature also limits the scope to a univariate phenotype, which ignores
mutually informative but partially independent pieces of information
provided in other phenotypes. Our goal is to use such auxiliary
information to improve power. We show that the proposed method leads to
a large power increase, while controlling the false discovery rate,
both empirically and theoretically. Extensive simulations demonstrate
the advantage of the proposed method over several state-of-the-art
methods. We illustrate our methods on a dataset from a schizophrenia
study.
Yunjin Choi, National University of Singapore
Title: Community detection via fused penalty
Abstract: In
recent years, community detection has been an active research area in
various fields including machine learning and statistics. While a
plethora of works has been published over the past few years, most of the
existing methods depend on a predetermined number of communities. Given
the situation, determining the proper number of communities is directly
related to the performance of these methods. Currently, there does not
exist a golden rule for choosing the ideal number and people usually
rely on their background knowledge of the domain to make their choices.
To address this issue, we propose a community detection method which
also finds the number of underlying communities. Central to our
method is a fused $\ell_1$ penalty applied to a graph induced from the given
data. This method yields hierarchically structured communities. At each
level, we use a hypothesis test based on the post-selection inference
framework to investigate whether the detected community at the given
level correctly captures the true population-level community.
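Schematically, a fused penalty over a graph $G = (V, E)$ takes the form

$$\min_{\theta} \; \sum_{i \in V} (y_i - \theta_i)^2 \;+\; \lambda \sum_{(i,j) \in E} |\theta_i - \theta_j|,$$

so that nodes whose fitted values fuse are grouped together; the method above applies a penalty of this flavor to a graph induced from the data, with the details (and the post-selection tests) as developed in the talk.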
Will Fithian,
UC Berkeley
Title: AdaPT: An interactive procedure for multiple testing with side information
Abstract: We
consider the problem of multiple hypothesis testing with generic side
information: for each hypothesis we observe both a p-value and some
predictor encoding contextual information about the hypothesis. For
large-scale problems, adaptively focusing power on the more promising
hypotheses (those more likely to yield discoveries) can lead to much
more powerful multiple testing procedures. We propose a general
iterative framework for this problem, called the Adaptive p-value
Thresholding (AdaPT) procedure, which adaptively estimates a Bayes-optimal p-value rejection threshold and controls the false discovery rate (FDR) in finite samples. At each iteration of the procedure, the analyst proposes a rejection threshold and observes partially censored p-values,
estimates the false discovery proportion (FDP) below the threshold, and
either stops to reject or proposes another threshold, until the
estimated FDP is below $\alpha$. Our procedure is adaptive in an
unusually strong sense, permitting the analyst to use any statistical
or machine learning method she chooses to estimate the optimal
threshold, and to switch between different models at each iteration as
information accrues.
This is joint work with Lihua Lei.
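A minimal sketch of the estimate that drives the iteration, in the simplified case of a single constant threshold s rather than the covariate-dependent thresholds s(x) that AdaPT actually updates (function names are illustrative):

import numpy as np

def fdp_hat(pvals, s):
    """AdaPT-style FDP estimate for the rejection region {p_i <= s}:
    p-values above 1 - s serve as a mirror estimate of false discoveries."""
    rejections = np.sum(pvals <= s)
    mirror = np.sum(pvals >= 1 - s)
    return (1 + mirror) / max(rejections, 1)

def constant_threshold_adapt(pvals, alpha=0.1):
    """Shrink a constant threshold until the estimated FDP drops below alpha."""
    for s in np.linspace(0.45, 0.0005, 200):
        if fdp_hat(pvals, s) <= alpha:
            return s              # reject all hypotheses with p_i <= s
    return None                   # no threshold meets the target level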
Jan Hannig, UNC Chapel Hill
Title: Model Selection without penalty using Generalized Fiducial Inference
Abstract: R.
A. Fisher, the father of modern statistics, developed the idea of
fiducial inference during the first half of the 20th century.
While his proposal led to interesting methods for quantifying
uncertainty, other prominent statisticians of the time did not accept
Fisher's approach as it became apparent that some of Fisher's bold
claims about the properties of fiducial distribution did not hold up
for multi-parameter problems. Beginning around the year 2000, the
authors and collaborators started to re-investigate the idea of
fiducial inference and discovered that Fisher's approach, when properly
generalized, would open doors to solve many important and difficult
inference problems. They termed their generalization of Fisher's
idea as generalized fiducial inference (GFI). The main idea of GFI is
to carefully transfer randomness from the data to the parameter space
using an inverse of a data generating equation without the use of Bayes
theorem. The resulting generalized fiducial distribution (GFD) can then
be used for inference. After more than a decade of investigations, the
authors and collaborators have developed a unifying theory for GFI, and
provided GFI solutions to many challenging practical problems in
different fields of science and industry. Overall, they have
demonstrated that GFI is a valid, useful, and promising approach for
conducting statistical inference.
Standard penalized methods of variable selection and
parameter estimation rely on the magnitude of coefficient estimates to
decide which variables to include in the final model. However,
coefficient estimates are unreliable when the design matrix is
collinear. To overcome this challenge an entirely new perspective
on variable selection is presented within a generalized fiducial
inference framework. This new procedure is able to effectively
account for linear dependencies among subsets of covariates in a
high-dimensional setting where $p$ can grow almost exponentially in
$n$, as well as in the classical setting where $p \le n$. It is
shown that the procedure very naturally assigns small probabilities to
subsets of covariates which include redundancies by way of explicit
$L_{0}$ minimization. Furthermore, with a typical sparsity
assumption, it is shown that the proposed method is consistent in the
sense that the probability assigned to the true sparse subset of covariates
converges to 1 as $n \to \infty$, or as $n \to \infty$
and $p \to \infty$. Very reasonable conditions are needed, and
little restriction is placed on the class of possible subsets of
covariates to achieve this consistency result.
(Joint work with Jonathan Williams)
Lucas Janson,
Harvard University
Title: Should We Model X in High-Dimensional Inference?
Abstract: For
answering questions about the relationship between a response variable
Y and a set of explanatory variables X, most statistical methods focus
their assumptions on the conditional distribution of Y given X (or Y |
X for short). I will describe some benefits of shifting those
assumptions from the conditional distribution Y | X to the joint
distribution of X, especially when X is high-dimensional. First,
assuming a model for X can often more closely match available domain
knowledge, and allows for model checking and robustness that is
unavailable when modeling Y | X. Second, there are substantial
methodological payoffs in terms of interpretability, flexibility of
models, and adaptability of algorithms for quantifying a hypothesized
effect, all while being guaranteed exact (non-asymptotic) inference. I
will briefly mention some of my recent and ongoing work on methods for
high-dimensional inference that model X instead of Y | X, as well as
some challenges and interesting directions for the future in this area.
Jessie Jeng, NC State University
Title: Efficient Signal Inclusion in Large-Scale Data Analysis
Abstract: This
work addresses the challenge of efficiently capturing a high proportion
of true signals for subsequent data analyses when signals are
detectable but not identifiable. We develop a new analytic framework
focusing on false negative control under dependence. We propose the
signal missing rate as a new measure to account for the variability of
false negative proportion. Novel data-adaptive procedures are developed
to control signal missing rate without incurring unnecessary false
positives under dependence. The proposed methods are applied to a GWAS
of human height to effectively remove irrelevant SNPs while retaining a
high proportion of relevant SNPs for subsequent polygenic analysis.
Tracy Ke, Harvard University
Title: Covariate assisted variable ranking
Abstract: TBD
Eric Laber, NC State University
Title: Sample Size Calculations for SMARTs
Abstract: Sequential
Multiple Assignment Randomized Trials (SMARTs) are considered the gold
standard for estimation and evaluation of treatment regimes. SMARTs are
typically sized to ensure sufficient power for a simple comparison,
e.g., the comparison of two fixed and non-overlapping treatment
sequences. Estimation of an optimal treatment regime is conducted
as part of a secondary and hypothesis-generating analysis with formal
evaluation of the estimated optimal regime deferred to a follow-up
trial. However, running a follow-up trial to evaluate an estimated
optimal treatment regime is costly and time-consuming; furthermore, the
estimated optimal regime that is to be evaluated in such a follow-up
trial may be far from optimal if the original trial was underpowered
for estimation of an optimal regime. We derive sample size
procedures for a SMART that ensure: (i) sufficient power for comparing
the optimal treatment regime with standard of care; and (ii) the
estimated optimal regime is within a given tolerance of the true
optimal regime with high-probability. We establish asymptotic validity
of the proposed procedures and demonstrate their finite sample
performance in a series of simulation experiments.
Soumendra Lahiri, NC State University
Title: On limit horizons in high dimensional inference
Abstract: We
consider a common situation arising in many high dimensional
statistical inference problems where the dimension $d$ diverges with
the sample size $n$ and the statistic of interest is given by a
function of component-wise summary statistics. The limit distribution
of the statistic of interest is often influenced by an intricate
interplay of underlying dependence structure of the component-wise
summary statistics. Here, we introduce a new concept, called limit
horizon (L.H.) that gives the boundary of the growth rate of $d$ as a
function of $n$ where the natural approach to deriving the limit law by
iterated limits works. Further, for $d$ growing at a faster rate beyond
the L.H., the natural approach breaks down. We investigate the L.H. in
some specific high dimensional problems.
Liza Levina, University of Michigan
Title: Matrix completion in network analysis
Abstract: Matrix
completion is an active area of research in itself, and a natural tool
to apply to network data, since many real networks are observed
incompletely and/or with noise. However, developing effective
matrix completion algorithms for networks requires taking into account
network- and task-specific missing data patterns. This talk will
discuss three examples of matrix completion used for network
tasks. First, we discuss the use of matrix completion for
cross-validation on networks, a long-standing problem in network
analysis. Two other examples focus on reconstructing incompletely
observed networks, with structured missingness resulting from network
sampling mechanisms. One scenario we consider is egocentric
sampling, where a set of nodes is selected first and then their
connections to the entire network are observed. Another
scenario focuses on data from surveys, where people are asked to name a
given number of friends. We show that matrix
completion can generally be very helpful in solving network problems,
as long as the network structure is taken into account.
This talk is based on joint work with Tianxi Li, Yun-Jhong Wu, and Ji Zhu.
Joshua Loftus, New York University
Title: Model selection bias invalidates goodness of fit tests
Abstract: We
study goodness of fit tests in a variety of model selection settings
and find that selection bias generally makes such tests conservative.
Since selection methods choose the "best" model, a goodness of fit test
will usually fail to reject, even when the incorrect model has been
chosen. This is troubling, as it implies these tests in practice do not
actually provide evidence in favor of the chosen model. We also explore
post selection inference methods for adjusting goodness of fit tests
analytically for simple examples and with simulations in more realistic
settings.
Po-Ling Loh,
University of Wisconsin
Title: Scale calibration for high-dimensional robust regression
Abstract: We
present a new method for high-dimensional linear regression when a
scale parameter of the error is unknown. The proposed estimator is
based on a penalized Huber M-estimator, for which theoretical results
on estimation error have recently been established in the high-dimensional
statistics literature. However, the variance of the error term in the
linear model is intricately connected to the parameter governing the
shape of the Huber loss. The main idea is to use an adaptive technique,
based on Lepski's method, to overcome the difficulties in solving a
joint nonconvex optimization problem with respect to the location and
scale parameters.
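Concretely, the estimator under discussion is (up to notational details) a penalized Huber M-estimator of the form

$$\hat{\beta} \in \arg\min_{\beta} \; \frac{1}{n}\sum_{i=1}^{n} \ell_{\tau}\big(y_i - x_i^{\top}\beta\big) + \lambda \|\beta\|_1, \qquad \ell_{\tau}(u) = \begin{cases} u^2/2, & |u| \le \tau, \\ \tau |u| - \tau^2/2, & |u| > \tau, \end{cases}$$

where the robustification parameter $\tau$ is entangled with the unknown error scale; the Lepski-type adaptation described above is used to choose $\tau$ without knowing that scale.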
Taps Maiti, Michigan State University
Title: High Dimensional Discriminant Analysis for Spatially Dependent Data
Abstract: Linear discriminant analysis (LDA) is one of the most classical and popular classification techniques. However, it performs poorly in high-dimensional classification. Many sparse discriminant methods have been proposed to make LDA applicable in the high-dimensional case. One issue
with those methods is that the covariance structure among the features is
ignored. We propose a new procedure for high dimensional discriminant
analysis for spatially correlated data. Penalized maximum likelihood
estimation (PMLE) is developed for feature selection and parameter
estimation. A tapering technique is applied to reduce the computational load.
The theory shows that the proposed method can achieve consistent
parameter estimation, feature selection, and asymptotically optimal
misclassification rate. Extensive simulation study shows a significant
improvement in classification performance under spatial dependence.
Xiao-Li Meng,
Harvard University
Title: Was there ever a pre-selection inference?
Abstract: This talk is dedicated to the memory of Larry Brown. Post-selection
inference has become a buzz word in statistics, which seems to imply
that there was an era of pre-selection inference. But statistical
inference has always been post-selection even in the narrow sense of
model selection. Any goodness-of-fit test, for example, restricts our
model class by empirical data, and hence it alters the relevant
replications for generating our inferential statements. In general
practice, we have at least seven S(ins) to worry about: selection in
hypotheses; selection in data; selection in methodologies; selection in
due diligence and debugging; selection in publications; selection in
reporting and summary; and selection in understanding and
interpretation. Any such selection, if not accounted for, threatens
the reproducibility and replicability of the inferential findings. Yet
none of them can be reasonably quantified for the purposes of making
post-selection adjustment. One way to combat this seemingly hopeless
problem is to adopt the "expiration date" mentality. The expiration
date of a medication has to be set as a lower bound on the duration of
efficacy, not some average duration, in order to guarantee the quality
of the treatment. Hence using bounds is not much about being
conservative, but about ensuring our procedures deliver what they
promise, e.g., verifiably realizing their claimed confidence coverage,
as in Berk, Brown, Buja, Zhang, and Zhao (2013, Annals of Statistics).
The simple strategy of doubling variance will be used to illustrate
this emphasis on quality assurance, in the context of guarding against
model misspecification for constructing confidence intervals as well as
uncongeniality in multiple imputation (Xie and Meng, 2017, Statistica
Sinica).
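As a schematic illustration of the doubling-variance strategy (the general idea only, not the talk's specific derivations): instead of the nominal interval, one reports

$$\bar{x} \;\pm\; z_{\alpha/2}\,\sqrt{2}\;\widehat{\mathrm{se}}(\bar{x}),$$

inflating the estimated variance by a factor of two so that the claimed coverage acts as a guaranteed lower bound under a range of unmodelled selections or misspecifications, rather than an average-case calibration.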
Aaditya Ramdas, Carnegie Mellon University
Title: Towards "simultaneous selective inference": a new framework for multiple testing
Abstract: Modern
data science is often exploratory in nature, with hundreds or thousands
of hypotheses being regularly tested on scientific datasets. The false
discovery rate (FDR) has emerged as a dominant error metric in multiple
hypothesis testing over the last two decades. I will argue that both
(a) the FDR error metric, as well as (b) the current framework of
multiple testing, where the scientist picks an arbitrary target error
level (like 0.05) and the algorithm returns a set of rejected null
hypotheses, may be rather inappropriate for exploratory data analysis.
I will show that, luckily, most existing FDR algorithms
(BH, STAR, LORD, AdaPT, Knockoffs, and several others) naturally
satisfy a more uniform notion of error, yielding simultaneous
confidence bands for the false discovery proportion through the entire
path of the algorithm. This makes it possible to flip the traditional
roles of the algorithm and the scientist, allowing the scientist to
make post-hoc decisions after seeing the realization of an algorithm on
the data. For example, the scientist can instead achieve an error
guarantee for all target error levels simultaneously (and hence for any
data-dependent error level). Remarkably, there is a relatively small
price for this added flexibility, the analogous guarantees being less
than a factor of 2 looser than if the error level was prespecified. The
theoretical basis for this advance is founded in the theory of
martingales: we move from optional stopping (used in FDR proofs) to
optional spotting by proving uniform concentration bounds on relevant
exponential supermartingales. This is joint work with Eugene
Katsevich.
Nancy Reid, University of Toronto
Title: A new look at F-tests
Abstract: Directional inference for vector parameters based on higher order approximations in likelihood inference is discussed in Davison
et al. (JASA, 2014) and Fraser et al. (Biometrika, 2016). Here we
explore examples of directional inference where the calculations can be
simplified, and find that in several classical situations the
directional test is equivalent to the usual F-test. This is joint work
with Andrew McCormack, Nicola Sartori and Sri-Amirthan Theivendran.
Alessandro Rinaldo, Carnegie Mellon University
Title: Optimal Rates For Density-Based Clustering Using DBSCAN
Abstract: We study the problem of optimal estimation of the density cluster tree under various assumptions on the underlying density. We
formulate a new notion of clustering consistency which is better suited
to smooth densities, and derive minimax rates of consistency for
cluster tree estimation for Hölder smooth densities. We present a
computationally efficient, rate optimal cluster tree estimator based on
a straightforward extension of the popular density-based clustering
algorithm DBSCAN. The resulting optimal rates for cluster tree
estimation depend on the degree of smoothness of the underlying density
and, interestingly, match minimax rates for density estimation under
the supremum norm. We also consider level set estimation and
cluster consistency for densities with jump discontinuities, where the
sizes of
the jumps and the distance among clusters are allowed to vanish as the
sample size increases. We demonstrate that our DBSCAN-based algorithm
remains minimax rate optimal in this setting as well. Joint work with
Daren Wang and Xinyang Lu.
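For reference, the DBSCAN algorithm that the estimator builds on is widely available; a minimal usage sketch (parameter values are illustrative only, and sweeping the density threshold over levels is what traces out a cluster tree):

import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(2)
# Two well-separated high-density blobs plus uniform background noise
X = np.vstack([rng.normal(0, 0.3, (200, 2)),
               rng.normal(4, 0.3, (200, 2)),
               rng.uniform(-2, 6, (50, 2))])

# eps ~ connectivity radius, min_samples ~ density threshold
labels = DBSCAN(eps=0.4, min_samples=10).fit_predict(X)
print(np.unique(labels))   # -1 marks points classified as noise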
Richard Samworth, University of Cambridge
Title: Classification with imperfect training labels
Abstract: We
study the effect of imperfect training data labels on the
performance of classification methods. In a general setting, where
the probability that an observation in the training dataset is
mislabelled may depend on both the feature vector and the true label,
we bound the excess risk of an arbitrary classifier trained with
imperfect labels in terms of its excess risk for predicting a noisy
label. This reveals conditions under which a classifier trained with
imperfect labels remains consistent for classifying uncorrupted test
data points. Furthermore, under stronger conditions, we derive detailed
asymptotic properties for the popular $k$-nearest neighbour (knn),
Support Vector Machine (SVM) and Linear Discriminant Analysis (LDA)
classifiers. One consequence of these results is that the knn and SVM
classifiers are robust to imperfect training labels, in the sense that
the rate of convergence of the excess risks of these classifiers
remains unchanged; in fact, it even turns out that in some cases,
imperfect labels may improve the performance of these methods. On the
other hand, the LDA classifier is shown to be typically inconsistent in
the presence of label noise unless the prior probabilities of each
class are equal.
Joint work with Tim Cannings and Yingying Fan.
Ana-Maria Staicu, NC State University
Title: Variable selection in functional linear model with varying smooth effects
Abstract: State-of-the-art
robotic hand prosthetics generate finger and wrist movement through
pattern recognition (PR) algorithms using features of forearm
electromyogram (EMG) signals, but require extensive training and are
prone to poor predictions for conditions outside the training data
(Scheme et al., 2010; Peerdeman et al., 2011). We propose a novel
approach to develop a dynamic robotic limb by utilizing the recent
history of EMG signals in a model that accounts for physiological
features of hand movement which are ignored by PR algorithms. We do
this by viewing EMG signals as functional covariates and develop a
functional linear model that quantifies the effect of the EMG signals
on finger/wrist velocity through a bivariate coefficient function that
is allowed to vary with current finger/wrist position. The model is
made parsimonious and interpretable through a two-step variable
selection procedure, called Sequential Adaptive Functional Empirical
group LASSO (SAFE-gLASSO). Numerical studies show excellent selection
and prediction properties of SAFE-gLASSO compared to popular
alternatives. For our motivating dataset, the method correctly
identifies the few EMG signals that are known to be important for an
able-bodied subject with negligible false positives and the model can
be directly implemented in a robotic prosthetic.
Jonathan Taylor,
Stanford University
Title: Approximate selective inference via maximum likelihood
Abstract: We
consider an approximate version of the conditional approach to
selective inference (after randomization). Approximation is used to
bypass potentially expensive MCMC sampling in moderate dimensions. We
use a large-deviations approximation from arxiv.org/1703.06176
(Panigrahi and Taylor), which leads to tractable estimating equations
for the (approximate) maximum likelihood estimator and observed Fisher
information. Through simulations we investigate the promise of this
approach in low and higher dimensions. One clear upside of this
approximation is that it allows the data analyst to pose several
questions of the data before forming a target of interest, with
questions being derived from convex problems as described in
arxiv.org/1609.05609 (Tian et al.).
In terms of downside, theoretical justification seems difficult,
particularly due to the lack of parameters to tweak having already
reached asymptopia.
This is joint work with Snigdha Panigrahi.
Rob Tibshirani, Stanford University
Title: Some new ideas for post selection inference and model assessment
Abstract: TBD
Ryan Tibshirani, Carnegie Mellon University
Title: The LOCO parameter: the good, the bad, and the ugly (or: How I learned to stop worrying and love prediction)
Abstract: Assumption-free
or assumption-lean inference has been gaining more and more attention
these days. A unsettled question, at least in the presenter's
mind, is: what is an interesting parameter to study, when no real model
is assumed to be correct? This will be a mostly non-technical
talk, discussing different approaches for selective inference, and what
parameters (and "models", if any) they are centered around. A
focus will be the LOCO parameter proposed in Lei et al. (2018), its
strengths, weaknesses, and a natural population-level analog.
Lan Wang, University of Minnesota
Title: A Tuning-free Approach to High-dimensional Regression
Abstract: We
introduce a new tuning-free approach for high-dimensional regression
with theoretical guarantees. The new procedure possesses several
appealing properties simultaneously. Computationally, it can be
efficiently solved via linear programming with an easily simulated
tuning parameter, which automatically adapts to both the unknown random
error distribution and the correlation structure of the design matrix.
It is robust with substantial efficiency gain for heavy-tailed random
errors while maintaining high efficiency for normal random errors. It
enjoys an essential scale-equivariance property that permits coherent
interpretation when the response variable undergoes a scale
transformation, a desirable property possessed by the classical least
squares estimator but lost by Lasso and its variants. Under weak
conditions for the random error distribution, we establish a
finite-sample error bound with a near-oracle rate for the new estimator
with the simulated tuning parameter. (Joint work with Bo Peng, Jelena
Bradic, Runze Li and Yunan Wu)
Cun-Hui Zhang, Rutgers University
Title: Higher Criticism, SPRT and Test of Power One
Abstract: We
develop a one-sided sequential probability ratio test for multiple null
hypotheses with nearly optimal power in detecting the presence of
signals which are rare and weak. This makes an interesting connection
between test of power one and higher criticism, both involving the law
of the iterated logarithm. The sequential test guarantees the prescribed
probability of type-I error. Nonlinear renewal theory is applied to
show that the test is not overly conservative. This is joint work with
Wenhua Jiang.
Hao Helen Zhang, University of Arizona
Title: Oracle P-value and Variable Screening
Abstract: The P-value,
first proposed by Fisher to measure the inconsistency of data with a
specified null hypothesis, plays a central role in statistical
inference. For classical linear regression analysis, it is a standard
procedure to calculate P-values for regression coefficients based on
least squares estimator (LSE) to determine their significance. However,
for high dimensional data when the number of predictors exceeds the
sample size, ordinary least squares is no longer applicable and there is
no valid definition of P-values based on the LSE. It is also
challenging to define sensible P-values for other high dimensional
regression methods such as penalization and resampling methods. In this
paper, we introduce a new concept called oracle P-value to generalize
traditional P-values based on LSE to high dimensional sparse regression
models. Then we propose several estimation procedures to approximate
oracle P-values for real data analysis. We show that the oracle P-value
framework is useful for developing new tools in high dimensional data
analysis, including variable ranking, variable selection, and screening
procedures with false discovery rate (FDR) control. Numerical examples
are then presented to demonstrate performance of the proposed methods.
This is joint work with Ning Hao.
Kai Zhang,
UNC Chapel Hill
Title: BET on Independence
Abstract: We
study the problem of nonparametric dependence detection. Many existing
methods suffer severe power loss due to non-uniform consistency,
which we illustrate with a paradox. To avoid such power loss, we
approach the nonparametric test of independence through the new
framework of binary expansion statistics (BEStat) and binary expansion
testing (BET), which examine dependence through a novel binary
expansion filtration approximation of the copula. Through a Hadamard
transform, we find that the cross interactions of binary variables in
the filtration are complete sufficient statistics for dependence. These
interactions are also uncorrelated under the null. By utilizing these
interactions, the BET avoids the problem of non-uniform consistency and
improves upon a wide class of commonly used methods (a) by achieving
the minimax rate in sample size requirement for reliable power and (b)
by providing clear interpretations of global relationships upon
rejection of independence. The binary expansion approach also connects
the test statistics with the current computing system to facilitate
efficient bitwise implementation. We illustrate the BET with a study of
the distribution of stars in the night sky and with an exploratory data
analysis of the TCGA breast cancer data.
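A depth-one caricature of the binary expansion idea (a heavily simplified sketch, not the BET itself, which uses deeper expansions and all cross interactions): transform each margin to approximate copula scale via ranks, take the first binary-expansion bit of each, and test whether the two bits interact.

import numpy as np
from scipy.stats import rankdata, binomtest

def bet_depth_one(x, y):
    """Sign-interaction test: under independence, the count of points whose
    rank-transformed coordinates agree in sign (both above or both below 1/2)
    is approximately Binomial(n, 1/2)."""
    n = len(x)
    u = rankdata(x) / (n + 1)
    v = rankdata(y) / (n + 1)
    a = np.where(u > 0.5, 1, -1)
    b = np.where(v > 0.5, 1, -1)
    agree = int(np.sum(a * b == 1))
    return binomtest(agree, n, 0.5)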
Linda Zhao,
University of Pennsylvania
Title: Generalized Cp (GCp) in a model lean framework
Abstract: Linear
models as working models have performed very well in practice. But most
often the theoretical properties are obtained under the usual linear
model assumptions such as linearity, homoscedasticity and normality.
Using the least squares estimators, we justify their desirable
properties under much broader model assumptions, namely a model lean
framework. Generalized Cp (GCp) is proposed to estimate the prediction
errors (testing errors). It is asymptotically unbiased. We study its
properties, especially the distribution of the difference between two
sub-models.
Joint work with L. Brown, J. Cai, A. Kuchibhotla and the Wharton group
Posters
Mona Azadkia, Stanford University
Title: Matrix denoising with unknown noise variance
Stephen Bates, Stanford University
Title: Model-X Knockoffs for Graphical Models
Abstract: Modern
scientific applications require statistical methods for identifying
relevant explanatory variables from a large number of possible
explanatory variables with statistical guarantees that the number of
spurious discoveries is controlled. The model-X knockoff framework
provides false discovery rate guarantees for the selected features for
any conditional distribution of the response given the features. The
procedure requires a known distribution of the covariates X and a
knockoff sampling mechanism for this distribution. In this work, we
greatly expand the class of distributions for which model-X knockoffs
can be sampled by introducing a knockoff sampler for arbitrary
graphical models. Our proposed sampler is computationally tractable for
graphs that have low treewidth, i.e. they are not too complex.
Furthermore, we show that our sampler is able to generate knockoffs
from any valid knockoff distributions, which means that the sampler can
generate knockoffs with higher power than those from previously known
samplers.
Thomas Berrett, University of Cambridge
Title: Efficient integral functional estimation via k-nearest neighbour distances
Ran Dai, University of Chicago
Title: Post-selection inference on high-dimensional varying-coefficient quantile regression model
Abstract: Quantile
regression has been successfully used to study heterogeneous and heavy
tailed data. In this work, we study high-dimensional
varying-coefficient quantile regression model that allows us to capture
non-stationary effects of the input variables across time. We develop
new tools for statistical inference that allow us to construct valid
confidence bands and honest tests for nonparametric coefficient
functions of time and quantile. Our focus is on inference in a
high-dimensional setting where the number of input variables exceeds
the sample size. Performing statistical inference in this regime is
challenging due to the use of model selection techniques in estimation.
Nevertheless, we are able to develop valid inferential tools that are
applicable to a wide range of data generating processes and do not
suffer from biases introduced by model selection. The statistical
framework allows us to construct a confidence interval at a fixed
point in time and a fixed quantile based on a Normal approximation, as
well as a uniform confidence band for the nonparametric coefficient
function based on a Gaussian process approximation. Joint work with Rina Foygel Barber and Mladen Kolar.
Eugene Katsevich, Stanford University
Title: Reconciling FDR control with post hoc filtering
Abstract: The
false discovery rate (FDR) is a popular error criterion for large-scale
multiple testing problems. A notable pitfall of the FDR is that
filtering (i.e. subsetting) the rejection set post hoc might invalidate
the FDR guarantee. In some applied settings, however, filtering is
standard practice. For example, post hoc filtering is often employed in
gene ontology enrichment analysis (where hypotheses have a directed
acyclic graph structure) to remove redundancy among the set of rejected
hypotheses. We propose Filtered BH, a filter-aware extension of the BH
procedure. Assuming the filter can be specified in advance, Filtered BH
takes as input this filter as well as a set of p-values and outputs a
rejection set. This rejection set, when filtered, provably controls the
FDR. Existing domain-specific filters can be easily integrated into
Filtered BH, allowing scientists to continue the practice of filtering
without sacrificing rigorous Type I error control.
Byol Kim, University of Chicago
Title: Statistical Inference for High-Dimensional Differential Networks
John Kolassa, Rutgers
University
Title: Conditional Likelihood Techniques applied to Partial Likelihood Regression for Survival Data
Abstract: Proportional
hazards regression shares the possibility of infinite parameter
estimation with logistic and multinomial regression. This poster
demonstrates how to perform conditional inference on finite components
of the proportional hazards regression model in the presence of
infinite estimates for nuisance parameters, by employing optimization
techniques to reduce the data set to one yielding conditional inference
approximating that of the desired regression model.
Lihua Lei, UC Berkeley
Title: TBD
Abstract: TBD
Keith Levin, University of Michigan
Title: Inferring Low-Rank Population Structure from Multiple Network Samples
Abstract: In
increasingly many settings, particularly in neuroscience, data sets
consist of multiple samples from a population of networks, in which a
notion of vertex correspondence across networks is present. For
example, in the case of neuroimaging data, fMRI data yields graphs
whose vertices correspond to brain regions that are common across
subjects. The behavior of these vertices can thus be sensibly compared
across graphs. We consider the problem of estimating parameters of the
network population distribution under this setting. In particular, we
consider the case where the observed networks share a low-rank
structure, but may differ in the noise structure on their edges. Our
approach exploits this shared low-rank structure to denoise edge-level
measurements of the observed networks and estimate the desired
population-level parameters. We also explore the extent to which
complexity of the edge-level error structure influences estimation and
downstream inference.
Haoyang Liu, University of Chicago
Title: Between hard and soft thresholding: optimal iterative thresholding algorithms
Abstract: Iterative
thresholding algorithms seek to optimize a differentiable objective
function over a sparsity or rank constraint by alternating between
gradient steps and thresholding steps. This work examines the choice of
the thresholding operator. We develop the notion of relative concavity
of a thresholding operator, a quantity that characterizes the
convergence performance of any thresholding operator on the target
optimization problem. Surprisingly, we find that commonly used
thresholding operators, such as hard thresholding and soft
thresholding, are suboptimal in terms of convergence guarantees.
Instead, a general class of thresholding operators, lying between hard
thresholding and soft thresholding, is shown to be optimal with the
strongest possible convergence guarantee among all thresholding
operators.
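For concreteness, a minimal sketch of the classical iterative hard-thresholding scheme for sparse least squares, the kind of algorithm whose thresholding step the poster analyzes (the step size and sparsity level here are illustrative):

import numpy as np

def hard_threshold(v, k):
    """Keep the k largest-magnitude entries of v and zero out the rest."""
    out = np.zeros_like(v)
    idx = np.argsort(np.abs(v))[-k:]
    out[idx] = v[idx]
    return out

def iht(X, y, k, n_iter=200):
    """Iterative hard thresholding for min ||y - Xb||^2 subject to ||b||_0 <= k."""
    step = 1.0 / np.linalg.norm(X, 2) ** 2       # conservative gradient step size
    b = np.zeros(X.shape[1])
    for _ in range(n_iter):
        grad = X.T @ (X @ b - y)
        b = hard_threshold(b - step * grad, k)   # gradient step, then threshold
    return b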
Miles Lopes, UC Davis
Title: Bootstrapping spectral statistics in high dimensions
Abstract: Spectral
statistics play a central role in many multivariate testing problems.
It is therefore of interest to approximate the distribution of
functions of the eigenvalues of sample covariance matrices. Although
bootstrap methods are an established approach to approximating the laws
of spectral statistics in low-dimensional problems, these methods are
relatively unexplored in the high-dimensional setting. The aim of this
paper is to focus on linear spectral statistics (LSS) as a class of
"prototype statistics" for developing a new bootstrap method in the
high-dimensional setting. In essence, the method originates from the
parametric bootstrap, and is motivated by the notion that, in high
dimensions, it is difficult to obtain a non-parametric approximation to
the full data-generating distribution. From a practical standpoint, the
method is easy to use, and allows the user to circumvent the
difficulties of complex asymptotic formulas for LSS. In addition to
proving the consistency of the proposed method, we provide encouraging
empirical results in a variety of settings. Lastly, and perhaps most
interestingly, we show through simulations that the method can be
applied successfully to statistics outside the class of LSS, such as
the largest sample eigenvalue and others. (Joint work with Alexander
Aue and Andrew Blandino.)
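A heavily simplified sketch of the parametric-bootstrap idea (the actual method uses a careful estimate of the population spectrum rather than the raw sample eigenvalues used here, and the statistic shown is just an arbitrary example of an LSS):

import numpy as np

def bootstrap_lss(X, stat, B=500, seed=0):
    """Resample Gaussian data whose population covariance is taken to be the
    diagonalized sample covariance, and recompute the spectral statistic."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    evals = np.clip(np.linalg.eigvalsh(np.cov(X, rowvar=False)), 0, None)
    reps = []
    for _ in range(B):
        Xb = rng.standard_normal((n, p)) * np.sqrt(evals)  # cov = diag(evals)
        reps.append(stat(np.linalg.eigvalsh(np.cov(Xb, rowvar=False))))
    return np.array(reps)

# Example linear spectral statistic: the sum of squared sample eigenvalues
# reps = bootstrap_lss(X, stat=lambda ev: np.sum(ev ** 2))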
Yet Nguyen, Old Dominion University
Title: Identifying relevant covariates in RNA-seq analysis by pseudo-variable augmentation
Abstract: RNA-sequencing
(RNA-seq) technology enables the detection of differentially expressed
genes, i.e., genes whose mean transcript abundance levels vary across
conditions. In practice, an RNA-seq dataset often contains some
explanatory variables that will be included in analysis with certainty
in addition to a set of covariates that are subject to selection. Some
of the covariates may be relevant to gene expression levels, while
others may be irrelevant. Either ignoring relevant covariates or
attempting to adjust for the effect of irrelevant covariates can be
detrimental to identifying differentially expressed genes. We address
this issue by proposing a covariate selection method using
pseudo-covariates to control the expected proportion of selected
covariates that are irrelevant. We show that the proposed method can
accurately choose the most relevant covariates while holding the false
selection rate below a specified level. We also show that our method
performs better than methods for detecting differentially expressed
genes that do not take covariate selection into account, or methods
that use surrogate variables instead of the available covariates.
Chathurangi Pathiravasan, SIU Carbondale
Title: Bootstrapping hypothesis tests
Cornelis Potgieter, Southern Methodist University
Title: Simulation-Selection-Extrapolation: Estimation for High Dimensional Errors-in-Variables Models
Abstract: Errors-in-variables
models in a high-dimensional setting present a two-fold challenge: The
presence of measurement error in the covariates can result in severely
biased parameter estimates, while the high-dimensional nature of the
data can obscure the covariates that are relevant to the outcome of
interest. A new estimation procedure called SIMSELEX
(SIMulation-SELection-EXtrapolation) is proposed. This procedure
augments the traditional SIMEX approach with a variable selection step
based on the group lasso. The SIMSELEX approach is shown to perform
well in variable selection and has significantly lower estimation error
than the naive estimator that ignores measurement error. Furthermore,
SIMSELEX can be applied in a variety of errors-in-variables settings,
including linear regression, logistic regression, and the Cox
proportional hazards model. The SIMSELEX procedure is compared to the
matrix uncertainty selector and the conic programming estimator for a
linear model, and to the generalized matrix uncertainty selector for a
logistic regression model. Finally, the method is applied to analyze a
microarray dataset that contains gene expression measurements of
favorable histology Wilms tumors.
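For orientation, the SIMEX scaffold that SIMSELEX augments with a group-lasso selection step works by adding extra measurement error at several levels and extrapolating back to the error-free case; a minimal linear-regression sketch (assumed known error scale sigma_u, quadratic extrapolation, illustrative settings):

import numpy as np

def simex_linear(W, y, sigma_u, lambdas=(0.0, 0.5, 1.0, 1.5, 2.0), B=50, seed=0):
    """Basic SIMEX: perturb the error-prone covariates W at noise levels lambda,
    average the refitted coefficients, and extrapolate each one to lambda = -1."""
    rng = np.random.default_rng(seed)
    n, p = W.shape
    lam_grid = np.array(lambdas)
    coefs = []
    for lam in lam_grid:
        fits = []
        for _ in range(B):
            W_b = W + np.sqrt(lam) * sigma_u * rng.standard_normal((n, p))
            beta_b, *_ = np.linalg.lstsq(W_b, y, rcond=None)
            fits.append(beta_b)
        coefs.append(np.mean(fits, axis=0))
    coefs = np.array(coefs)
    # Quadratic extrapolation of each coefficient path back to lambda = -1
    return np.array([np.polyval(np.polyfit(lam_grid, coefs[:, j], 2), -1.0)
                     for j in range(p)])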
Martin Spindler, University of Hamburg
Title: Uniform Inference in High-Dimensional Gaussian Graphical Models
Abstract: Graphical
models have become a very popular tool for representing dependencies
within a large set of variables and are key for representing causal
structures. We provide results for uniform inference on
high-dimensional graphical models with the number of target parameters
being possibly much larger than the sample size. This is particularly
important when certain features or structures of a causal model should
be recovered. Our results highlight how in high-dimensional settings
graphical models can be estimated and recovered with modern machine
learning methods in complex data sets. We also demonstrate in a
simulation study that our procedure has good small sample properties.
Joint work with Jannis Kuck and Sven Klaassen.
Lei Sun, University of Chicago
Title: Empirical Bayes Normal Means with Correlated Noise
Abstract: Recent
technological advances have allowed scientists to perform large-scale
simultaneous inference on ever-growing massive data sets. Many of these
pursuits can be formulated statistically as multiple testing in the
classic high-dimensional normal means problem, and a variety of methods
have been developed in the past decade, among which empirical Bayes is
a viable tool commonly applied. However, like many other multiple
testing methods, this approach is prone to distortion by correlation
which is ubiquitous in real-world statistical analysis. We develop
Correlated Adaptive Shrinkage (CASH) to account for unknown
correlation, detect elusive signals, and control false discoveries. Our
methodology compares favorably in realistic simulations and real data
analyses with popular multiple testing methods and sheds new light on
the effect of correlation. Joint work with Matthew Stephens.
Zhipeng Wang, Genentech
Title: TBD
Abstract: TBD
Andrew Womack, Indiana University
Title: Horseshoes with heavy tails
Abstract: Locally
adaptive shrinkage in the Bayesian framework is achieved through the
use of local-global prior distributions that model both the global
level of sparsity as well as individual shrinkage parameters for mean
structure parameters. The most popular of these models is the Horseshoe
prior and its variants due to their spike and slab behavior involving
an asymptote at the origin and heavy tails. In this paper, we present
an alternative Horseshoe prior that exhibits both a sharper asymptote
at the origin and heavier tails, which we call the Heavy-tailed
Horseshoe prior. We prove that mixing on the shape parameters provides
improved spike and slab behavior as well as better reconstruction
properties than other Horseshoe variants. Joint work with Zikun Yang.
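For reference, the baseline horseshoe prior that the heavy-tailed variant modifies is the local-global scale mixture

$$\beta_j \mid \lambda_j, \tau \;\sim\; N(0, \lambda_j^2 \tau^2), \qquad \lambda_j \;\sim\; \mathrm{C}^{+}(0,1), \qquad \tau \;\sim\; \mathrm{C}^{+}(0,1),$$

where $\mathrm{C}^{+}(0,1)$ is the standard half-Cauchy distribution; $\tau$ controls the global level of sparsity while the local scales $\lambda_j$ let individual coefficients escape shrinkage.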
Chiao-Yu Yang, UC Berkeley
Title: TBD
Abstract: TBD
Qiyiwen Zhang, Washington
University in St. Louis
Title: Bayesian variable selection and frequentist post-selection inference
Abstract: TBD