CAREER: New data integration approaches for efficient and robust meta-estimation, model fusion and transfer learning

$172,186 · FY2024 · MPS · NSF

North Carolina State University, Raleigh NC

Investigators

Abstract

Statistical science aims to learn about natural phenomena by drawing generalizable conclusions from an aggregate of similar experimental observations. With the recent “Big Data” and “Open Science” revolutions, scientists have shifted their focus from aggregating individual observations to aggregating massive publicly available datasets. This endeavor is premised on the hope of improving the robustness and generalizability of findings by combining information from multiple datasets. For example, combining data on rare disease outcomes across the United States can paint a more reliable picture than basing conclusions only on a small number of cases in one hospital. Similarly, combining data on disease risk factors across the United States can distinguish local from national health trends. To date, statistical approaches to these data aggregation objectives have been restricted to simple settings of limited practical utility. In response to this gap, this project develops new methods for aggregating information from multiple datasets in three distinct data integration problems grounded in scientific practice. The developed approaches are intuitive, principled and robust to substantial differences between datasets, and are broadly applicable in the medical, economic and social sciences, among others. For instance, the project will deliver new tools to extract health insights from large electronic health records databases. The project will support undergraduate and graduate student training, course development, and the recruitment and professional mentoring of underrepresented minorities in statistics. Further, the project will impact STEM education through a data science teacher training program in underserved communities.
This project develops intuitive, principled, robust and efficient methods in three essential data integration problems: meta-analysis, model fusion and transfer learning. First, the project delivers a set of meta-analysis methods for privacy-preserving one-shot estimation and inference using a new notion of dataset similarity. The primary novelty in the approach is the joint estimation of both dataset-specific parameters and a combined parameter that bears some similarity to the classic meta-estimator. Second, the project establishes model fusion methods that learn the clustering of similar datasets. The methods’ distinguishing feature is a fusion mechanism that dials data integration along a spectrum from more to less fusion, so that model parameters from clustered datasets are not forced to be exactly equal. Third, the project develops flexible and robust transfer learning approaches that leverage historical information for improved statistical efficiency in a target dataset of interest. An important element of these approaches is a flexible specification of the type of models fit to the source datasets. All three sets of methods place a premium on interpretability, statistical efficiency and robustness of the inferential output. The project unifies the three sets of proposed methods under a formal data integration framework formulated around two axioms of data integration. Data integration ideas pervade every field of scientific study in which data are collected, and so the research contributes to scientific endeavors in the medical, economic and social sciences, among others. This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.
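
For readers unfamiliar with the "classic meta-estimator" that the abstract uses as a point of comparison, the following is a minimal sketch of the standard inverse-variance weighted (fixed-effect) combination of dataset-specific estimates. It is not the project's method, which the abstract does not specify. The function name `inverse_variance_meta` and the site-level numbers are hypothetical, chosen only for illustration. The sketch assumes each dataset contributes a single scalar estimate with a known standard error, and that the datasets are independent.

```python
import math


def inverse_variance_meta(estimates, std_errors):
    """Classic fixed-effect meta-estimator (illustrative, hypothetical helper).

    Combines one scalar estimate per dataset, weighting each by the
    inverse of its squared standard error. Only summary statistics are
    needed, so no individual-level data leaves a site.
    """
    if len(estimates) != len(std_errors) or len(estimates) == 0:
        raise ValueError("need one positive standard error per estimate")
    if any(se <= 0 for se in std_errors):
        raise ValueError("standard errors must be positive")

    # Weight for each dataset: 1 / variance of its estimate.
    weights = [1.0 / se**2 for se in std_errors]
    total_weight = sum(weights)

    # Weighted average of the dataset-specific estimates.
    combined = sum(w * est for w, est in zip(weights, estimates)) / total_weight
    # Standard error of the combined estimate under independence.
    combined_se = math.sqrt(1.0 / total_weight)
    return combined, combined_se


# Hypothetical summary statistics from three sites (made-up numbers).
site_estimates = [0.42, 0.55, 0.38]
site_std_errors = [0.10, 0.20, 0.15]

combined, combined_se = inverse_variance_meta(site_estimates, site_std_errors)
print(f"combined estimate: {combined:.3f} (SE {combined_se:.3f})")
```

This classic estimator pools every dataset into one common parameter. The abstract describes approaches that instead estimate dataset-specific and combined parameters jointly, and that allow for substantial differences between datasets.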
