Empirical Process Theory for Complex Statistical Data Integration

$204,446 | FY2020 | MPS | NSF

University Of Maryland, College Park, College Park MD

Abstract

Organizations now routinely collect data sets from numerous sources. Combining these data sets improves the quality of inference and can accelerate scientific discovery. Statistical analysis of merged data is challenging, however, because each data set often represents only part of the target population and because combined data contain unidentified duplicate records when data sets partially share data sources. This research provides theoretical and methodological foundations for addressing the unavoidable bias in data integration that arises from heterogeneity and duplication in merged data. With the proposed data integration technique, findings previously limited to smaller populations can be combined and generalized to a broader population. The proposed methodology also supports privacy protection by avoiding record linkage, which identifies duplicates through private information. Another benefit is that it overcomes the shortage of relevant information in individual data sources without collecting costly (and possibly small) independent and identically distributed data all over again. Expected outcomes from this project will encourage the efficient and socially appropriate use of massive data in modern data analysis. The graduate student support will be used for interdisciplinary activities and writing code.

The project lies at the intersection of empirical process theory, semi- and non-parametric inference, and sampling theory. Existing theory and methods do not provide sufficient tools to study complex data integration problems characterized by bias and dependence due to heterogeneity and duplication. Inverse probability-weighted empirical process theory requires a special independence structure on weights and variables. Semi- and non-parametric inference often relies on the availability of an independent and identically distributed sample. Sampling theory handles dependence in a specific design but, in a finite population framework, focuses on a parametric model without accounting for randomness in the collected variables. To address the paucity of probabilistic tools and techniques, the PI will develop a unified framework in connection with a weighted empirical process motivated by multiple frame surveys. This weighted empirical process is computable without identifying duplicated selections. The proposed tools and techniques will play a critical role in studying general sample selection and missing data mechanisms, such as convenience samples, semiparametric estimation with misspecified models, and multiple observations for duplicated subjects in overlapping data sources. The particular problems under investigation include (a) uniform limit theorems under general missingness mechanisms, (b) robust M-estimation under model misspecification for data integration, and (c) a general theory for integrating multiple probability measures that correspond to heterogeneous data sources.

This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.
