Collaborative Research: Development of New Statistical Methods for Genome-Wide Association Studies

$157,024 | FY2020 | MPS | NSF

Virginia Polytechnic Institute And State University, Blacksburg VA

Abstract

Advances in high-throughput sequencing technologies now make cost-effective analysis of whole genomes possible. The genomes of any two humans are 99.9% identical, and differences in the remaining 0.1% determine the diversity of human traits. For example, DNA sequence differences account for 80% of the variability in human height. Current technology can identify these sequence polymorphisms between individuals, which can then be correlated with differences in a given trait. When carried out genome-wide in a large population of individuals, such genome-wide association studies (GWAS) can be a useful tool for identifying key genes that control specific traits. However, this approach requires powerful and accurate statistical and computational methods that can search through massive amounts of sequencing data and correctly identify the DNA differences associated with the phenotypic trait of interest. The project will (1) provide statistical methods to understand relationships between DNA sequence differences and the full range of diversity observed in a population, and (2) provide corresponding computational tools suitable for use by biologists and biomedical specialists in their specific population studies. The research will produce intermediate methodological and theoretical results that lay the foundation for the final output, and it will apply the developed methods to real experimental data to demonstrate their utility. In addition to these research outcomes, the project will support the training of students in the field, including women and underrepresented minorities.
GWAS estimates the correlation between phenotypic traits and sequence polymorphisms to identify genetic variants highly associated with specific traits. Single nucleotide polymorphisms (SNPs) are the most common type of genetic variant, and sequencing technologies allow for large-scale collection of SNP information. The project team will develop new GWAS models and methods to find trait-affecting variants with more power and accuracy. First, the new methods will improve on existing approaches by allowing observed traits to be modeled with any probability distribution in the exponential family. This extension ensures that the statistical models are biologically meaningful and interpretable. Second, the new methods will exploit different Bayesian priors, especially contemporary Bayesian priors for ultra-high-dimensional model selection, that share information across the entire genome for stable statistical inference. Theoretical results for the Bayesian priors used in these methods will also be developed. Third, a stochastic search algorithm will be developed to search the massive model space efficiently during model selection. This keeps the new methods practical and useful, since an analysis can be completed within a reasonably short time frame. It also eliminates the subjective significance thresholds that are commonly used in GWAS despite having no theoretical support. The methods will be implemented in software tools that will be freely available to statisticians, biologists, and biomedical researchers. This project is funded jointly by the Division of Mathematical Sciences Mathematical Biology Program and the Statistics Program. This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.
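
To make the ideas of an exponential-family trait model and a stochastic search over SNP subsets more concrete, the following is a minimal illustrative sketch, not the project's method. It assumes a count-valued trait modeled with a Poisson log-link regression, scores each candidate SNP subset with BIC as a crude stand-in for a Bayesian posterior model probability, and explores subsets with a simple Metropolis-style add/remove move. All function names, the simulated data, and the parameter values are hypothetical.

```python
import numpy as np

def poisson_bic(X, y, n_steps=50, tol=1e-8):
    """BIC of a Poisson log-link regression fitted by Newton's method.
    X must already contain an intercept column."""
    n, p = X.shape
    beta = np.zeros(p)
    beta[0] = np.log(y.mean() + 1e-8)          # start at the null-model fit
    for _ in range(n_steps):
        eta = np.clip(X @ beta, -20, 20)       # guard against overflow
        mu = np.exp(eta)
        grad = X.T @ (y - mu)
        hess = X.T @ (X * mu[:, None]) + 1e-8 * np.eye(p)
        step = np.linalg.solve(hess, grad)
        beta += step
        if np.max(np.abs(step)) < tol:
            break
    eta = np.clip(X @ beta, -20, 20)
    # log-likelihood up to the log(y!) term, which is the same for every model
    loglik = np.sum(y * eta - np.exp(eta))
    return -2.0 * loglik + p * np.log(n)

def stochastic_search(G, y, n_iter=2000, seed=0):
    """Metropolis-style search over SNP subsets (assumed scheme, BIC score)."""
    rng = np.random.default_rng(seed)
    n, m = G.shape
    ones = np.ones((n, 1))

    def score(mask):
        return poisson_bic(np.hstack([ones, G[:, mask]]), y)

    current = np.zeros(m, dtype=bool)           # start from the empty model
    current_score = score(current)
    best, best_score = current.copy(), current_score
    for _ in range(n_iter):
        proposal = current.copy()
        j = rng.integers(m)
        proposal[j] = ~proposal[j]              # add or remove one SNP
        s = score(proposal)
        # accept with probability min(1, exp(-(BIC_new - BIC_old) / 2))
        if np.log(rng.random()) < (current_score - s) / 2.0:
            current, current_score = proposal, s
            if s < best_score:
                best, best_score = proposal.copy(), s
    return best, best_score

def simulate_counts(n=300, m=20, causal=(3, 11), effect=0.5, seed=1):
    """Hypothetical data: additive genotypes (0/1/2) and a Poisson trait."""
    rng = np.random.default_rng(seed)
    G = rng.binomial(2, 0.3, size=(n, m))
    eta = 0.2 + effect * G[:, list(causal)].sum(axis=1)
    return G, rng.poisson(np.exp(eta))

G, y = simulate_counts()
best, best_score = stochastic_search(G, y)
print("selected SNP indices:", np.flatnonzero(best))
```

Because candidate models are compared with each other rather than each SNP being tested against a fixed p-value cutoff, the sketch needs no per-SNP significance threshold. A real analysis would involve far more SNPs, the priors described in the abstract rather than BIC, and a more careful search.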