Algorithms to identify non-coding mutational burden and disease-relevant pathways

$272,474 | R56 | FY2019 | DK | NIH

University Of Pennsylvania, Philadelphia PA

Abstract

Type 2 diabetes mellitus (T2D) is a disease with a complex, polygenic etiology and numerous contributing mechanisms. To devise new therapeutics to combat this epidemic, we need to identify the causal genes and variants for T2D and related cardiometabolic traits implicated by human genetic association studies. Recently, large-scale DNA biobanks linked to electronic health records have facilitated extensive phenotyping in surprisingly large samples (>500,000 subjects), importantly in diverse ancestries. These data have enabled a dramatic expansion in the number of bona fide associations for T2D and related traits. While the increase in statistical power is certainly welcome, making full use of these data requires new computational methods and analytical pipelines. In this renewal, we focus on three areas of methods development, in which we will create new methods and subsequently deploy them to accelerate the genetic dissection of cardiometabolic disease. First, the number of associated loci now available permits us to learn directly from the data which non-coding sequences functionally relate to T2D risk. We propose to use machine learning techniques to make predictions for T2D and related causal traits, and to use those predictions to identify and prioritize causal variants and functional elements that are disease-predictive. Second, the quantity and pace at which these data are being produced are outstripping the rate at which even highly expert quantitative scientists can explore the data and extract novel insights. To combat this problem, we propose to develop an informatics toolkit with apps that perform important, compute-intensive analyses and visualizations of these data, tethered to cloud-based or local computational infrastructure.
Finally, one key observation from biobank-based data analysis is that, at each physically distinct associated locus, numerous additional conditionally independent associations segregate nearby. These allelic series can be identified with existing methods, but their use in causal inference approaches (i.e., Mendelian randomization) has not been extensively explored. Here, we will evaluate their utility and develop statistical pipelines that use this spectrum of variation to perform new causal inference studies.
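
As an illustration of how several independent variants can feed a single causal estimate, the sketch below computes the standard fixed-effect inverse-variance weighted (IVW) Mendelian randomization estimate from per-variant summary statistics. It is not the project's pipeline, which the abstract does not describe. The function name `ivw_estimate` and the input values are hypothetical. The sketch assumes the variants are uncorrelated instruments and that the exposure effects are measured without error.

```python
import math


def ivw_estimate(beta_exposure, beta_outcome, se_outcome):
    """Fixed-effect IVW causal estimate from per-variant summary statistics.

    beta_exposure: variant effects on the exposure (e.g., a risk-factor trait)
    beta_outcome:  variant effects on the outcome (e.g., T2D log-odds)
    se_outcome:    standard errors of the outcome effects
    Assumes uncorrelated variants and no measurement error in beta_exposure.
    """
    if not (len(beta_exposure) == len(beta_outcome) == len(se_outcome)):
        raise ValueError("all inputs must have the same length")
    if not beta_exposure:
        raise ValueError("at least one variant is required")

    # Each variant is weighted by the inverse variance of its outcome effect.
    weights = [1.0 / se ** 2 for se in se_outcome]
    numerator = sum(w * bx * by
                    for w, bx, by in zip(weights, beta_exposure, beta_outcome))
    denominator = sum(w * bx * bx for w, bx in zip(weights, beta_exposure))
    if denominator == 0:
        raise ValueError("exposure effects are all zero")

    estimate = numerator / denominator
    std_error = math.sqrt(1.0 / denominator)
    return estimate, std_error


# Hypothetical summary statistics for three independent variants at one locus.
estimate, std_error = ivw_estimate(
    beta_exposure=[0.1, 0.2, 0.3],
    beta_outcome=[0.05, 0.10, 0.15],
    se_outcome=[0.01, 0.01, 0.01],
)
print(f"IVW estimate: {estimate:.3f} (SE {std_error:.4f})")
```

With conditionally independent alleles at the same locus, the outcome and exposure effects would need to come from a joint (conditional) model for the independence assumption to be plausible. Otherwise an approach that accounts for linkage disequilibrium between variants would be required.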
