DMS/NIGMS 2: Statistical Methods and Computational Algorithms for Biobank Data

$971,814 · FY2021 · MPS · NSF

University of California-Los Angeles, Los Angeles, CA

Investigators

Abstract

Biobank data are characterized by their volume, velocity, variety, and veracity (the 4Vs). Two prime examples are the Million Veteran Program (MVP) at the US Department of Veterans Affairs (VA) and the UK Biobank. The data are big, with up to a million subjects and terabytes of storage (volume). Their sample sizes and data content keep increasing (velocity). They draw on heterogeneous sources of information: genomes, electronic health records (EHR), wearable devices, images, and, most recently, COVID-19 data (variety). Furthermore, they are fraught with missingness and inaccuracy (veracity). This project seeks to develop novel statistical methods and computational algorithms that address specific aspects of the 4Vs. The methods are motivated by the principal investigators' (PIs') recent experience in analyzing MVP and UK Biobank data, and are generalizable to any biobank or other generic big data. The methods provide solutions to some of the most pressing issues in biobank data analysis. The work will push forward several frontiers in statistics, optimization, and genetics. The research will be integrated with substantial education and outreach activities, including developing new courses and software and mentoring students. These activities aim to expose a diverse set of students, including women and minorities, to state-of-the-art statistical and computational techniques for big data analysis.

Three sets of problems are to be investigated. (1) Electronic health records and wearable devices generate a vast amount of longitudinal data in biobanks. In many studies, the within-subject variability of a longitudinal outcome is the primary scientific interest. Motivated by studies of the impacts of blood pressure variability and glycemic variability on diabetes complications, the PIs propose a robust and scalable method for estimation and inference of the effects of both time-varying and time-invariant predictors on within-subject variance. Compared to existing approaches, the method is robust to distributional misspecification and orders of magnitude faster. This computational scalability makes it a powerful tool for studying trait variability in massive longitudinal biobank data. (2) The PIs will develop a new class of online learning algorithms that combine the majorization-minimization principle in statistics with the stochastic proximal iteration algorithm. The new algorithms apply to a broader class of models and are demonstrably more stable and robust. They help address the volume challenge and will be applied to genome-wide association studies of massive biobank data. (3) The PIs propose a bag of little bootstraps (BLB) approach for estimating massive variance component models, which play a central role in genetics and biostatistics. Fitting such models is computationally prohibitive for biobank data because it requires inverting a giant covariance matrix. The BLB approach breaks the massive variance component model into many smaller ones, which are bootstrapped in parallel and then averaged. The new method will enable quantifying the heritability and genetic correlation of complex traits in biobank data.

This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.
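
As a rough illustration of the bag of little bootstraps idea in problem (3), the sketch below applies the generic BLB recipe to a much simpler estimator than a variance component model: the standard error of a sample mean. It is not the PIs' method. The function name `blb_standard_error` and the parameters `n_subsets`, `n_resamples`, and `gamma` are hypothetical choices made for this example only.

```python
import numpy as np


def blb_standard_error(x, n_subsets=10, n_resamples=50, gamma=0.7, seed=0):
    """Bag of little bootstraps estimate of the standard error of the mean.

    Illustrative only: the sample mean stands in for the far more
    expensive variance component estimator discussed in the abstract.
    """
    rng = np.random.default_rng(seed)
    n = x.shape[0]
    b = int(np.ceil(n ** gamma))  # subset size, much smaller than n
    subset_ses = []
    for _ in range(n_subsets):
        # Step 1: draw a small subset of b points without replacement.
        subset = x[rng.choice(n, size=b, replace=False)]
        # Step 2: multinomial counts that sum to n emulate a full-size
        # bootstrap resample while touching only the b subset points.
        counts = rng.multinomial(n, np.full(b, 1.0 / b), size=n_resamples)
        # Step 3: evaluate the estimator (a weighted mean) per resample.
        estimates = counts @ subset / n
        # Step 4: summarize the spread of the estimates within this subset.
        subset_ses.append(estimates.std(ddof=1))
    # Step 5: average the per-subset assessments.
    return float(np.mean(subset_ses))
```

For independent draws with unit variance, the result should land near the textbook value of 1/sqrt(n). The subsets are independent of one another, so in a larger setting the loop over subsets is the natural place to parallelize.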
