"On Data Reduction of Big Data"

Speaker: Min Yang, University of Illinois-Chicago

Abstract: The big data paradigm has drawn a significant amount of attention in recent years as costs of acquiring and storing data have plummeted. Instead, bottlenecks have been shifted to fast and in-depth analysis. However, this shift has created its own set of problems, the most obvious one is that large datasets are often computationally expensive to process. Proven statistical methods are no longer applicable with extraordinary large data sets due to computational limitations. A critical step in Big Data analysis is data reduction. In this presentation, I will review some existing approaches in data reduction and introduce a new strategy called information-based optimal subdata selection (IBOSS). Under linear and nonlinear models set up, theoretical results and extensive simulations demonstrate that the IBOSS approach is superior to other approaches in term of parameter estimation and predictive performance. The tradeoff between accuracy and computation cost is also investigated. When models are mis-specified, the performance of different data reduction methods are compared through simulation studies. Some ongoing research work as well as some open questions will also be discussed.

 

Host: Nan Lin

 

Tea will be served in room 200 @ 3:45