Inferring Species Relationships under the Network Multispecies Coalescent Model: Theory and Practical Methods
University of Alaska Fairbanks Campus, Fairbanks, AK
Investigators
Abstract
Because the genome of an organism is a product of its evolutionary history, evidence of the relationships between different organisms can be found in the similarities and differences between their sequenced genomes. However, many biological processes have shaped genomes in complicated ways, so extracting relationships from genomic data is far from straightforward. Hybridization and other forms of gene transfer between lineages pose a particular challenge: a simple depiction of organism relationships by an evolutionary tree becomes inadequate, and a web-like network describes their history more accurately. Current computational methods for analyzing genomes in this setting are inefficient, can process only small datasets, and leave unexamined the wealth of potential information in already collected data. In this project, new computational and statistical methods to infer network-like relationships will be developed and implemented in practical and efficient software tools. Based on detailed mathematical models of evolving genomes, and implemented for easy use by scientists, these new methodologies will yield sound statistical conclusions. They will be applicable to a broad range of biological studies, such as those aimed at understanding the network of life, agriculturally important crops, conservation biology, and biomedically relevant pathogens. Further project work includes interdisciplinary training of graduate students in mathematics and biology and outreach to high school students through a summer research academy.
While network relationships between organisms lead to the evolutionary histories of individual genetic loci being represented by different trees, this signal of reticulation is confounded with that of incomplete lineage sorting, a population genetic process that produces differing gene trees even when the species-level relationships are tree-like. These processes will be investigated jointly, through use of the mathematical network multispecies coalescent model. Since the model's use with standard likelihood and Bayesian statistical approaches to data analysis poses excessive computational demands, this project will develop faster, more feasible approaches. One difficulty arises from the size of the space of potential networks, which is vastly larger than the already large space of trees. By utilizing new distance-based approaches, however, this project develops methods that circumvent slow explorations of network space while still obtaining statistically consistent network estimates. In particular, quartet-based and rooted-triple-based network distances will be used to leverage the inference of small network units to yield large network estimates through appropriate interpretation of a splits graph. Subprojects include development of coalescent-based network inference from concatenated genomic sequences, thereby avoiding inference of individual gene trees; extension of theory and practice beyond the class of level-1 networks; and better treatment of the covariances between quartet or rooted-triple statistics arising from the model. This project is jointly funded by the Mathematical Biology program and Life Science Venture funds in DMS, and the Established Program to Stimulate Competitive Research (EPSCoR). This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.
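As a rough illustration of the quartet statistics mentioned above, the following Python sketch tallies how often each of the three possible unrooted topologies for a set of four taxa appears in a collection of gene trees. It is not the project's software and does not compute any network distance; the nested-tuple tree encoding, the function names, and the toy gene trees are all assumptions made for this example, and every gene tree is assumed to contain all four taxa of the quartet.

```python
from collections import Counter


def clades(tree):
    """Return (leaf set of tree, leaf sets of all its internal nodes)."""
    if isinstance(tree, str):          # a leaf is just a taxon name
        return frozenset([tree]), []
    leaves = frozenset()
    found = []
    for child in tree:                 # post-order: children before parent
        child_leaves, child_found = clades(child)
        leaves |= child_leaves
        found.extend(child_found)
    found.append(leaves)
    return leaves, found


def quartet_topology(tree, quartet):
    """Topology the tree displays on four taxa, e.g. 'a,b|c,d', or None."""
    q = frozenset(quartet)
    _, internal = clades(tree)
    for clade in internal:
        inside = clade & q
        if len(inside) == 2:           # clade separates two taxa from the other two
            sides = sorted(",".join(sorted(s)) for s in (inside, q - inside))
            return "|".join(sides)
    return None                        # quartet is unresolved in this tree


def quartet_counts(gene_trees, quartet):
    """Count resolved quartet topologies across the gene trees."""
    tops = (quartet_topology(t, quartet) for t in gene_trees)
    return Counter(t for t in tops if t is not None)


# Hypothetical gene trees on taxa a-e, written as nested tuples.
gene_trees = [
    (((("a", "b"), "c"), "d"), "e"),
    ((("a", "b"), "e"), ("c", "d")),
    (((("a", "c"), "e"), "b"), "d"),
    ((("a", "b"), ("c", "e")), "d"),
]

counts = quartet_counts(gene_trees, ("a", "b", "c", "d"))
total = sum(counts.values())
frequencies = {top: n / total for top, n in counts.items()}
print(frequencies)
```

On these toy trees the topology a,b|c,d appears in three of the four gene trees and a,c|b,d in one. Empirical frequencies of this kind, gathered over every four-taxon subset, are the sort of small-unit summary from which quartet-based approaches start.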