Mathematical Models and Statistical Methods for Large-Scale Population Genomics

$303,504R01FY2015GMNIH

University Of California Berkeley, Berkeley CA

Investigators

Linked publications & trials

Abstract

? DESCRIPTION (provided by applicant): Technological advances in DNA sequencing have dramatically increased the availability of genomic variation data over the past few years. This development offers a powerful window into understanding the genetic basis of human biology and disease risk. To facilitate achieving this goal, it is crucial to develop efficient analytical methods that will allow researchers to more fuly utilize the information in genomic data and consider more complex models than previously possible. The central goal of this project is to tackle this important challenge, by carrying out te following Specific Aims: In Aim 1, we will develop efficient inference tools for whole-genome population genomic analysis by extending our ongoing work on coalescent hidden Markov models and apply them to large-scale data. The methods we develop will enable researchers to analyze large samples under general demographic models involving multiple populations with population splits, migration, and admixture, as well as variable effective population sizes and temporal samples (ancient DNA). Multi-locus full-likelihood computation is often prohibitive in most population genetic models with high complexity. To address this problem, we will develop in Aim 2 a novel likelihood-free inference framework for population genomic analysis by applying a highly active area of machine learning research called deep learning. We will apply the method to various parameter estimation and classification problems in population genomics, particularly joint inference of selection and demography. In addition to carrying out technical research, we will develop a useful software package that will allow researchers from the population genomics community to utilize deep learning in their own research. It is becoming increasingly more popular to utilize time-series genetic variation data at the whole-genome scale to infer allele frequency changes over a time course. This development creates new opportunities to identify genomic regions under selective pressure and to estimate their associated fitness parameters. In Aim 3, we will develop new statistical methods to take full advantage of this novel data source at both short and long evolutionary timescales. Specifically, we will develop and apply efficient statistical inference methods for analyzing time-series genomic variation data from experimental evolution and ancient DNA samples. Useful open-source software will be developed for each specific aim. The novel methods developed in this project will help to analyze and interpret genetic variation data at the whole-genome scale.

View original record on NIH RePORTER →