Collaborative Research: Design-Based Optimal Subdata Selection Using Mixture-of-Experts Models to Account for Big Data Heterogeneity

$149,961 · FY2022 · MPS · NSF

University of North Carolina Greensboro, Greensboro, NC

Abstract

With technological advances, it has become easy to collect massive amounts of data for most areas of research. But with the size of datasets measured in terabytes or even petabytes, analyzing such datasets can become an expensive computational challenge and may be impossible on a typical desktop or laptop computer. However, for making impactful discoveries, it may be unnecessary to analyze an entire dataset. Consequently, there is great interest in developing and studying methods for selecting a subset from a massive dataset and for drawing conclusions based on the much smaller selected dataset. Such methods are known as subdata selection or subsampling methods. One obvious subsampling method consists of randomly selecting data from the entire dataset. While this is often the simplest and fastest option, it has been established that better options are often available. In this project, the principal investigators (PIs) aim to develop and study a rigorous framework and new methods for optimal subdata selection by using models that account for heterogeneity in the data, which is often present in large datasets. Research findings will be incorporated in topical courses to train graduate students in large-scale data analysis. The work will also be disseminated via the PIs’ collaborations in public health, biomedical science, and business.

Rather than assuming a multiple regression model, the PIs plan to develop and study subdata selection methods based on mixture-of-experts (ME) models, which can account for heterogeneity in the data. The PIs will initially develop and study subdata selection methods for a subclass of the ME models, known as clusterwise linear regression models, for which the gate functions are constant. This will be followed by studying logistic-normal mixture models, in which the gate functions depend on the regression variables. For both cases, the investigators plan to develop information-based optimal subdata selection methods, first for continuous response variables and then for binary response variables, study their statistical properties, and develop efficient algorithms for the methods that will be made available in an R package.

This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.
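
As a rough illustration of the contrast the abstract draws between uniform random subsampling and information-based selection, the sketch below compares the two for an ordinary multiple linear regression model. It is not the project's method: the mixture-of-experts setting is not covered, the selection rule is a simplified extreme-value heuristic chosen for illustration, and all function names, sample sizes, and simulated data are assumptions. Python with NumPy is used here only for convenience; the abstract states that the project's own software will be an R package.

```python
import numpy as np


def uniform_subsample(n, k, rng):
    """Indices of k rows drawn uniformly at random without replacement."""
    return rng.choice(n, size=k, replace=False)


def extreme_value_subsample(X, k):
    """Illustrative information-based rule (an assumption, not the PIs' method).

    For each covariate in turn, keep the r rows with the smallest values and
    the r rows with the largest values among rows not yet selected,
    where r = k // (2 * p).
    """
    n, p = X.shape
    r = k // (2 * p)
    available = np.ones(n, dtype=bool)
    chosen = []
    for j in range(p):
        idx = np.flatnonzero(available)
        order = idx[np.argsort(X[idx, j])]  # remaining rows sorted by covariate j
        picked = np.concatenate([order[:r], order[-r:]])
        chosen.append(picked)
        available[picked] = False
    return np.concatenate(chosen)


def log_det_information(X):
    """Log-determinant of Z'Z, where Z is X with an intercept column."""
    Z = np.column_stack([np.ones(len(X)), X])
    _, logdet = np.linalg.slogdet(Z.T @ Z)
    return logdet


def ols(X, y):
    """Least-squares coefficients (intercept first)."""
    Z = np.column_stack([np.ones(len(X)), X])
    beta, *_ = np.linalg.lstsq(Z, y, rcond=None)
    return beta


# Simulated full data (hypothetical sizes and coefficients).
rng = np.random.default_rng(0)
n, p, k = 100_000, 3, 600
X = rng.standard_normal((n, p))
beta_true = np.array([1.0, 2.0, -1.0, 0.5])
y = beta_true[0] + X @ beta_true[1:] + rng.standard_normal(n)

idx_uniform = uniform_subsample(n, k, rng)
idx_extreme = extreme_value_subsample(X, k)

print("log det information, uniform:", log_det_information(X[idx_uniform]))
print("log det information, extreme:", log_det_information(X[idx_extreme]))
print("OLS on uniform subdata:", ols(X[idx_uniform], y[idx_uniform]))
print("OLS on extreme subdata:", ols(X[idx_extreme], y[idx_extreme]))
```

Under this simulated setup, the subdata chosen from the extremes of each covariate carries more information about the regression coefficients, as measured by the determinant of the information matrix, than a uniform sample of the same size. A rule of this kind relies on a single regression model holding across the whole dataset, which is the assumption the project relaxes by using ME models.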
