Confidence Distribution (CD) and Efficient Approaches for Combining Inferences from Massive Complex Data

$442,227 | FY2015 | MPS | NSF

Rutgers University New Brunswick, New Brunswick NJ

Abstract

Modern data acquisition technology has greatly facilitated the collection of massive data, often with heterogeneous and complex structures, in many domains. These data are used to draw inferences for scientific discovery or marketing value. In practice, the data often involve multiple studies or subpopulations, each with its own data source, targeting the same hypotheses or parameters. Such cases call for coherent and efficient overall inference methods that combine the findings from individual studies. The need to combine inferences efficiently also arises in the development and implementation of algorithmic and high-performance computing methods for big data. The goal of this project is to apply the statistical concept of confidence distribution to develop novel, efficient approaches for combining multiple inferences from different sources in massive complex data settings. Such approaches should be timely and useful for many domains, including health and medicine, market analysis, information retrieval, aviation safety, and homeland security, to name a few.

This project builds on recent exciting developments around the so-called "confidence distribution" to develop fusion learning for massive data. It focuses on three specific developments: 1) Efficient nonparametric fusion learning: an efficient nonparametric approach for combining individual inferences from multiple studies, with implementable algorithms and full theoretical support. The development is nonparametric and data driven, and is therefore broadly applicable with few model assumptions. 2) Efficient fusion learning for the split-conquer-combine approach to handling massive and possibly heterogeneous data: this research uses the idea of parallel computing to develop several split-conquer-combine schemes for the analysis of massive data. The approach can substantially reduce computational expense and still achieve the oracle inference outcome associated with the entire data. 3) Fusion learning in prediction and testing: the research develops and generalizes the theoretical framework of inference for prediction and testing based on confidence distributions. This development helps mitigate several well-known difficulties surrounding multiple testing and model selection, especially in the setting of big data. Overall, the project involves in-depth theoretical development and real problem solving with complex data. It is ideally suited for collaborative research and active participation from students.
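
The split-conquer-combine idea can be illustrated with a minimal sketch. It is not the project's method: it assumes the simplest possible setting (estimating a normal mean with a known standard deviation `sigma`) and uses one standard combining recipe from the confidence distribution literature, in which block-level confidence distributions are merged on the normal-quantile scale with inverse-standard-error weights. The function names `normal_cd`, `combine_cds`, and `split_conquer_combine` are hypothetical and chosen only for this illustration.

```python
import random
from statistics import NormalDist, fmean

STD_NORMAL = NormalDist()  # standard normal, used for Phi and its inverse


def normal_cd(estimate, std_error):
    """Normal-based confidence distribution H(theta) = Phi((theta - estimate) / std_error)."""
    def cd(theta):
        return STD_NORMAL.cdf((theta - estimate) / std_error)
    return cd


def combine_cds(cds, weights):
    """Combine CDs as Phi( sum_i w_i * Phi^{-1}(H_i(theta)) / sqrt(sum_i w_i^2) )."""
    norm = sum(w * w for w in weights) ** 0.5

    def combined(theta):
        z = 0.0
        for w, h in zip(weights, cds):
            # Clamp away from 0 and 1 so inv_cdf never receives an invalid probability.
            p = min(max(h(theta), 1e-12), 1.0 - 1e-12)
            z += w * STD_NORMAL.inv_cdf(p)
        return STD_NORMAL.cdf(z / norm)

    return combined


def split_conquer_combine(data, n_blocks, sigma):
    """Split data into blocks, build one CD per block, then combine them.

    Assumes a known standard deviation `sigma` (an illustrative simplification).
    """
    blocks = [data[i::n_blocks] for i in range(n_blocks)]         # split
    estimates = [fmean(b) for b in blocks]                        # conquer
    std_errors = [sigma / len(b) ** 0.5 for b in blocks]
    cds = [normal_cd(e, s) for e, s in zip(estimates, std_errors)]
    weights = [1.0 / s for s in std_errors]                       # combine
    return combine_cds(cds, weights)


random.seed(0)
data = [random.gauss(2.0, 1.0) for _ in range(1000)]
combined_cd = split_conquer_combine(data, n_blocks=4, sigma=1.0)
full_data_cd = normal_cd(fmean(data), 1.0 / len(data) ** 0.5)
```

In this toy setting the combined confidence distribution agrees with the one computed from the full data set, up to floating-point error. That is the sense in which a split-conquer-combine scheme can recover the "oracle" inference. Real applications with unknown variances, heterogeneous blocks, or nonparametric confidence distributions need more care than this sketch provides.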
