Towards precision risk stratification, diagnosis, and treatment: statistical and computational machinery for synthesizing information across massive, diverse sources of genomic and clinical data

$426,479 | R35 | FY2025 | GM | NIH

Johns Hopkins University, Baltimore MD

Investigators

Abstract

Biobanks, electronic health records, and clinical registries have collectively amassed a massive amount of data on people's genetic makeups and their clinical trajectories. Information in these data sources, if properly synthesized, provides an opportunity to identify treatments precisely tailored to individual patient profiles. However, heterogeneity across sites in patient populations, data encoding procedures, and clinical practices poses significant statistical challenges, compounded by the computational challenges that data of this scale pose. We will meet these statistical challenges by combining Bayesian hierarchical models, biomedical ontological systems, and high-dimensional transfer learning methods: these tools will allow us to bring together larger amounts of data from diverse sources to investigate biomedical questions beyond the scope of any single source. We will scale the developed statistical framework to the sizes of modern genetic and clinical data by combining state-of-the-art computational algorithms with hardware optimization techniques, including the use of graphics processing units (GPUs). We will deliver the end products as open-source software packages for deployment in research and clinical settings. We will demonstrate the proposed statistical and computational innovations through direct applications to genetic risk prediction for under-represented populations and to subgroup effect estimation for comparative effectiveness studies of type 2 diabetes treatments using a federated network of healthcare databases. The machinery is more broadly applicable, however; for example, it allows integrating insights across data sources that provide different but functionally related metabolites and/or proteins. The proposed development thus constitutes an important step towards extracting more scientific insights on disease mechanisms and effective treatment strategies by synthesizing information across heterogeneous data sources.
