Statistical inference in genome-wide association and sequencing studies

$421,414 | R01 | FY2015 | HL | NIH

Fred Hutchinson Cancer Research Center, Seattle WA

Abstract

DESCRIPTION (provided by applicant): Despite the success of genome-wide association studies in identifying hundreds of loci associated with common, complex diseases, significant challenges remain for statistical inference in these high-dimensional data. Specifically, rare variants identified by emerging genome-wide sequencing studies may explain the missing heritability, but they pose a challenge to the traditional locus-by-locus approach. Studies of gene-environment interactions have not generated many successes, possibly because of limitations of existing analytical methods. Mediation of genetic effects by intermediate outcomes is an emerging topic of interest that may lead to disease prevention or treatment. Existing statistical methods for inferring mediation effects, however, remain underdeveloped. In this proposal, we plan to build novel statistical methods to address these challenges.
The methodological research is motivated by, but not limited to, the genome-wide association studies and the sequencing project in the Women's Health Initiative (WHI), including the Genomics and Randomized Trials Network (GARNET), Population Architecture of Genes and Environment (PAGE) and the Exome Sequencing Project (ESP). A distinguishing feature of this proposal is that the PI and co-investigators are themselves conducting these studies, so the proposed methodological innovations can be applied immediately to scientific questions of interest.
A number of statistical methods for rare variant analysis have been proposed recently. None of the existing methods accounts for the presence of neutral variants, i.e., alleles that have no functional influence on the trait. Including neutral variants in gene-set tests of rare variants dilutes power. We propose a class of finite mixture models that explicitly tease out neutral variants to improve power.
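The dilution of power by neutral variants can be illustrated with a small simulation. The sketch below is not the finite mixture model proposed here; it assumes a simple burden score (the sum of rare alleles across a gene set) and hypothetical values for sample size, allele frequency and effect size, and it only compares the association signal with and without neutral variants in the set.

```python
import numpy as np

def burden_z(burden, trait):
    """Approximate z-statistic for association between a burden score and a trait."""
    r = np.corrcoef(burden, trait)[0, 1]
    return float(r * np.sqrt(len(trait)))

# All values below are illustrative assumptions, not figures from the proposal.
rng = np.random.default_rng(1)
n, n_causal, n_neutral, maf = 5000, 5, 45, 0.01

# Rare-variant genotypes coded as 0/1/2 copies of the minor allele.
causal = rng.binomial(2, maf, size=(n, n_causal))
neutral = rng.binomial(2, maf, size=(n, n_neutral))

# Only the causal variants influence the trait.
trait = 0.5 * causal.sum(axis=1) + rng.normal(size=n)

# Burden test on the causal variants alone versus the whole gene set.
z_causal_only = burden_z(causal.sum(axis=1), trait)
z_all_variants = burden_z(causal.sum(axis=1) + neutral.sum(axis=1), trait)

print(f"z (causal variants only): {z_causal_only:.2f}")
print(f"z (all variants in set):  {z_all_variants:.2f}")
```

Under these assumptions the neutral variants add variance to the burden score without adding signal, so the statistic for the full set is expected to be smaller than the one based on causal variants alone.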
The main challenge in identifying gene-environment interactions is a lack of power due to limited sample sizes and the typically small magnitude of interactions. Dimension reduction, such as gene-set based inference, is critical to reduce the number of hypothesis tests and enrich weak genetic effects. We will develop a suite of gene-set based, two-stage filtering procedures for detecting gene-environment interactions. We will also develop a multivariate sparse gene-set testing framework with an L1 penalty to assemble weak genetic effects in a gene or a pathway.
The difficulty in inferring mediation of genetic effects on diseases by intermediate outcomes lies in controlling for unknown confounders. Current approaches exploit Mendelian randomization, the random segregation of alleles, and use known genetic risk alleles as instrumental variables to infer causality. Limitations of the existing framework, mainly its overly restrictive assumptions and its inability to model causal effects on binary outcomes, have limited the applicability of such inference. We will revamp the instrumental variable framework, originally developed in econometrics, to better suit genetic studies.
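As a point of reference for the existing framework, the sketch below shows the standard Wald ratio estimator used in Mendelian randomization, with a single genetic variant as the instrument and a continuous outcome. It is not the revised framework proposed here. The simulated data, effect sizes and function names are all hypothetical, and the simulation assumes a valid instrument (the variant affects the outcome only through the intermediate outcome).

```python
import numpy as np

def slope(a, b):
    """OLS slope from regressing b on a, with an intercept."""
    a_centered = a - a.mean()
    return float(a_centered @ (b - b.mean()) / (a_centered @ a_centered))

def wald_ratio(genotype, exposure, outcome):
    """Instrumental variable estimate: gene-outcome slope over gene-exposure slope."""
    return slope(genotype, outcome) / slope(genotype, exposure)

def simulate(n=50000, causal_effect=0.5, seed=0):
    """Simulate a genotype, an intermediate outcome and a disease trait (assumed model)."""
    rng = np.random.default_rng(seed)
    genotype = rng.binomial(2, 0.3, size=n).astype(float)
    confounder = rng.normal(size=n)  # unmeasured in a real study
    exposure = 0.4 * genotype + confounder + rng.normal(size=n)
    outcome = causal_effect * exposure + confounder + rng.normal(size=n)
    return genotype, exposure, outcome

genotype, exposure, outcome = simulate()

# Naive regression is biased because the confounder drives both variables.
naive_estimate = slope(exposure, outcome)
# The Wald ratio uses only the variation in exposure explained by the genotype.
iv_estimate = wald_ratio(genotype, exposure, outcome)

print(f"naive OLS estimate: {naive_estimate:.3f}")
print(f"Wald ratio (IV) estimate: {iv_estimate:.3f}")
```

In this simulation the naive estimate is inflated by the unmeasured confounder, while the Wald ratio should land near the assumed causal effect of 0.5. The ratio form relies on linearity, which is one reason binary outcomes are harder to handle in this framework.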
