IIBR Informatics: Taming Complexity Through Simulations: Scalable Inference Under the Coalescent with Recombination
William Marsh Rice University, Houston TX
Investigators
Abstract
Species have been evolving, diverging, and adapting to their environments for billions of years. While we have no direct access to the history of species, their genomes provide considerable signal that allows us to reconstruct this history. Understanding the evolution of genomes helps shed light on how species evolve and diverge, how genes emerge and evolve, and how traits evolve. However, genome evolution is a very complex process, and it can leave different regions of a genome with different evolutionary histories. One process that leads to such a scenario is recombination. This project aims to develop methods for inferring evolutionary histories of genes and genomes in the presence of recombination. Currently, this task is not feasible for large sets of genomes because of the difficulty of deriving mathematical models and computationally feasible inference solutions. This project will enable the task by making it possible to automatically derive and infer the evolutionary history of a set of genomes in the presence of recombination. The project will support graduate student and post-doc mentoring, and will help broaden participation in computing, especially given its interdisciplinary nature. Results obtained by this project will facilitate new types of genomic analyses and, consequently, biological discoveries.
The aim of this project is to devise methods that make inference of evolutionary histories (topologies and parameters) practical and scalable under a model called the multispecies coalescent with recombination and migration (MSC-RM). This model allows for analyzing data consisting of genomic sequences from different species and from different individuals within species, while simultaneously accounting for recombination, incomplete lineage sorting, and gene flow, in addition to various models of DNA sequence evolution.
To infer the topology of the species phylogeny, a deep learning approach is taken, in which a neural network is trained on simulated data. To infer the phylogeny’s parameters (divergence times and population sizes), a hidden Markov model is built from simulated data, and a proxy for the likelihood is computed by means of the Forward algorithm, whose running time is quadratic in the number of hidden states. This combination of novel techniques helps achieve automated and scalable inference under the MSC-RM model. All methods will be implemented and released publicly as open source, and all results will be disseminated via publications, public lectures, and tutorials. Results of this project will be available at http://bioinfocs.rice.edu. This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.
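The Forward-algorithm step mentioned above can be illustrated with a minimal sketch. It is not the project's implementation: it assumes a generic discrete hidden Markov model whose initial, transition, and emission probabilities are already given (in the project these would be derived from simulated data), and the function and parameter names (`forward_log_likelihood`, `init`, `trans`, `emit`) are hypothetical. The recursion runs in log space to avoid underflow on long sequences, and it costs on the order of T × K² operations for T observations and K hidden states, which is the quadratic dependence on the number of states.

```python
import math


def _safe_log(p):
    """Natural log that maps probability 0 to -inf instead of raising."""
    return math.log(p) if p > 0 else -math.inf


def _log_sum_exp(values):
    """Numerically stable log(sum(exp(v) for v in values))."""
    m = max(values)
    if m == -math.inf:
        return -math.inf
    return m + math.log(sum(math.exp(v - m) for v in values))


def forward_log_likelihood(init, trans, emit, observations):
    """Log-probability of an observation sequence under a discrete HMM.

    Hypothetical parameterisation (an assumption of this sketch):
      init[i]     -- probability of starting in hidden state i
      trans[i][j] -- probability of moving from state i to state j
      emit[i][o]  -- probability that state i emits symbol o
      observations -- sequence of integer symbol indices
    """
    if not observations:
        return 0.0  # the empty sequence has probability 1

    n_states = len(init)

    # Initialisation: alpha_1(i) = init[i] * emit[i][o_1], in log space.
    alpha = [
        _safe_log(init[i]) + _safe_log(emit[i][observations[0]])
        for i in range(n_states)
    ]

    # Recursion: alpha_t(j) = sum_i alpha_{t-1}(i) * trans[i][j] * emit[j][o_t].
    # The inner sum over i for every j is the K^2 cost per observation.
    for obs in observations[1:]:
        alpha = [
            _log_sum_exp(
                [alpha[i] + _safe_log(trans[i][j]) for i in range(n_states)]
            )
            + _safe_log(emit[j][obs])
            for j in range(n_states)
        ]

    # Termination: sum over the final hidden state.
    return _log_sum_exp(alpha)


# Toy two-state model with two observable symbols (values are illustrative).
toy_init = [0.6, 0.4]
toy_trans = [[0.7, 0.3], [0.4, 0.6]]
toy_emit = [[0.9, 0.1], [0.2, 0.8]]
toy_loglik = forward_log_likelihood(toy_init, toy_trans, toy_emit, [0, 1, 0])
```

In a setting like the one described here, the hidden states might correspond to local genealogies along the genome and the observations to alignment site patterns. That mapping is an assumption of this note and not a detail stated in the abstract.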