Statistical Design, Sampling, and Analysis for Large Scale Experiments

$129,450 | FY2019 | MPS | NSF

Illinois Institute of Technology, Chicago, IL

Investigators

Abstract

In the big data paradigm, even controlled experiments can become large-scale, in the sense that the sample size is massive and the dimension of the input variables is high. Such "big data" problems challenge many statistical approaches and significantly increase the amount of computation required for estimation and inference. In this project, the PI focuses on specific instances of large-scale experiments and develops a set of novel theories and methodologies for experimental design, sampling, and analysis. The research has two major parts. In Part 1, the PI focuses on experiments that involve high-dimensional covariates. For example, in a clinical trial, the covariates can be patients' rich medical histories. How should the treatment settings be assigned to each patient? The PI provides the answer through a general experimental design framework so that the treatment effects are estimated accurately despite the influence of the covariates. In Part 2, the PI focuses on Gaussian Process (GP) regression, one of the most popular statistical learning tools. The computation it requires is prohibitive for analyzing large-scale experiments such as climate model simulations. The PI develops a dimension reduction framework and an active learning method that significantly improve the efficiency and accuracy of the GP model.

Three major methodologies are considered. In Part 1, the PI introduces a new discrepancy-based design to achieve covariate balance for experiments with high-dimensional covariates. The discrepancy criterion also has appealing theoretical properties that lead to more accurate estimation of the parameters, including both treatment effects and covariates' effects. Optimal design algorithms are developed for both offline and online experiments. In Part 2, the PI develops a novel dimension reduction method that finds the optimal convex combination of low-dimensional kernel functions for the GP model. It is shown that the proposed method requires significantly less computation and gives a more accurate approximation of certain types of underlying functions. Also in Part 2, an active learning method based on the generalized Cook's Distance is developed for GP regression. It is more efficient than the standard random sampling method. The research is novel in ideas, rigorous in theory, and useful in practice, and will open new directions in the statistical design and analysis of experiments. The PI has a detailed education plan to develop new course modules, tutorials, and workshops based on the research products from this project. The research outcomes are readily applicable to a variety of scientific, engineering, medical, and other fields where large-scale data collection and analysis are in demand.

This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.
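
As a rough illustration of the dimension reduction idea mentioned in the abstract, the sketch below builds a GP covariance as a convex combination of one-dimensional kernels, one per input coordinate. It is not the PI's method: the abstract does not say how the optimal weights are found, so the weights here are fixed by hand. The squared-exponential kernel, the function names (`rbf_1d`, `combined_kernel`, `gp_posterior_mean`), and the toy data are all assumptions made for illustration.

```python
import numpy as np

def rbf_1d(a, b, lengthscale=1.0):
    """Squared-exponential kernel on a single input coordinate (assumed kernel choice)."""
    d = a[:, None] - b[None, :]
    return np.exp(-0.5 * (d / lengthscale) ** 2)

def combined_kernel(X1, X2, weights, lengthscale=1.0):
    """Convex combination of one-dimensional kernels, one per input coordinate."""
    weights = np.asarray(weights, dtype=float)
    if np.any(weights < 0) or not np.isclose(weights.sum(), 1.0):
        raise ValueError("weights must be non-negative and sum to 1")
    K = np.zeros((X1.shape[0], X2.shape[0]))
    for j, w in enumerate(weights):
        K += w * rbf_1d(X1[:, j], X2[:, j], lengthscale)
    return K

def gp_posterior_mean(X_train, y_train, X_test, weights, noise=1e-4):
    """Zero-mean GP posterior mean under the combined kernel."""
    K = combined_kernel(X_train, X_train, weights) + noise * np.eye(len(X_train))
    K_star = combined_kernel(X_test, X_train, weights)
    alpha = np.linalg.solve(K, y_train)
    return K_star @ alpha

# Toy data (hypothetical): an additive function of the first two coordinates only.
rng = np.random.default_rng(0)
X = rng.uniform(-2, 2, size=(40, 3))
y = np.sin(X[:, 0]) + 0.5 * np.cos(X[:, 1])

# Hand-picked weights; an actual method would optimize these over the simplex.
w = np.array([0.5, 0.5, 0.0])
pred = gp_posterior_mean(X, y, X, w)
```

Each component kernel depends on a single coordinate, and a zero weight drops that coordinate from the model entirely, which is one way such a combination can act as dimension reduction. This sketch still solves a full linear system in the number of training points, so it does not by itself demonstrate the computational savings the abstract describes.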
