Improved Bayesian phylogenetic inference based on approximate conditional independence

$279,808 | FY2014 | BIO | NSF

University of Wisconsin-Madison, Madison, WI

Investigators

Abstract

The proposed research will use a new mathematical method, together with DNA sequence data, to develop software for determining genealogical relationships (phylogeny) among species of organisms. This will be done with greater accuracy and speed than has been possible in the past. This work is important because phylogenies have many valuable applications. In addition to their central importance in studying the history of life on Earth, phylogenies are used by scientists who study viruses that affect human health, such as HIV and influenza, by scientists who want to predict how species will react to a changing climate, and even by scientists analyzing forensic evidence in court cases. This research intends to improve a broad class of computational tools that many scientists use when they study evolutionary history and mechanisms.

Among the most accurate statistical methods for reconstructing evolutionary relationships are likelihood-based methods, which model the random process by which DNA sequences evolve over time. These methods, however, are highly computationally intensive, especially when there are both large numbers of species and long DNA sequences, as is now common. The proposed research includes the formation of a team to design, implement, test, and distribute novel algorithms and software for the purpose of transforming the standard current approaches to Bayesian phylogenetic inference. The current state of the art for Bayesian phylogenetic inference involves the use of Markov chain Monte Carlo (MCMC) methods to sample phylogenetic trees from a posterior distribution that describes evolutionary histories of related species informed by DNA sequences measured in living individuals. By design, MCMC produces dependent samples: each tree sampled in sequence is likely to be very similar to, or even exactly the same as, the previously sampled tree. As a result, extremely large samples are needed to represent the full posterior distribution, and the computational burden becomes prohibitive.

This research will exploit a recent discovery of a new method to estimate with high accuracy the posterior probabilities of trees on the basis of conditional clade distributions rather than simple sample frequencies. A consequence of this result is the possibility of obtaining truly independent samples of phylogenetic trees from a distribution that approximates the true distribution closely, and of obtaining correct inference from the desired posterior distribution via importance sampling. The objectives are to fully develop and test this sampling approach with efficient algorithms that will be shared with the public and to implement the approach in new free and open source software.
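
The idea of estimating tree probabilities from conditional clade distributions can be illustrated with a small sketch. It is not the project's software: it assumes rooted binary trees written as nested tuples of taxon names, and the function names (`splits`, `fit_ccd`, `ccd_probability`, `sample_tree`) and the five-taxon example are hypothetical. Under the approximate conditional independence assumption, a tree's probability is taken to be the product, over its clades, of the estimated probability that each clade divides into its two observed subclades.

```python
import random
from collections import Counter

def splits(tree):
    """Return (leaf set, list of (clade, split)) for a nested-tuple tree."""
    if isinstance(tree, str):              # a leaf: no internal splits
        return frozenset([tree]), []
    left, right = tree
    lset, lsplits = splits(left)
    rset, rsplits = splits(right)
    clade = lset | rset
    # A split is the unordered pair of child clades.
    return clade, lsplits + rsplits + [(clade, frozenset([lset, rset]))]

def fit_ccd(sample):
    """Count how often each clade, and each split of that clade, is seen."""
    clade_counts, split_counts = Counter(), Counter()
    for tree in sample:
        for clade, split in splits(tree)[1]:
            clade_counts[clade] += 1
            split_counts[(clade, split)] += 1
    return clade_counts, split_counts

def ccd_probability(tree, clade_counts, split_counts):
    """Product of estimated conditional split probabilities for the tree."""
    p = 1.0
    for clade, split in splits(tree)[1]:
        if clade_counts[clade] == 0:       # clade never seen in the sample
            return 0.0
        p *= split_counts[(clade, split)] / clade_counts[clade]
    return p

def sample_tree(clade, split_counts, rng):
    """Draw one tree top-down; each call is independent of earlier draws."""
    if len(clade) == 1:
        return next(iter(clade))
    options = [(s, c) for (cl, s), c in split_counts.items() if cl == clade]
    split = rng.choices([s for s, _ in options],
                        weights=[c for _, c in options])[0]
    left, right = split
    return (sample_tree(left, split_counts, rng),
            sample_tree(right, split_counts, rng))

# Toy "posterior sample" of two trees on taxa A-E (made-up data).
t1 = ((("A", "B"), "C"), ("D", "E"))
t2 = ((("A", ("B", "C")), "D"), "E")
clade_counts, split_counts = fit_ccd([t1, t2])

# A tree that never appears in the sample still gets nonzero probability,
# because each of its clade splits was observed in some sampled tree.
unsampled = (("A", ("B", "C")), ("D", "E"))
print(ccd_probability(unsampled, clade_counts, split_counts))  # 0.25

rng = random.Random(0)
draw = sample_tree(frozenset("ABCDE"), split_counts, rng)
print(draw, ccd_probability(draw, clade_counts, split_counts))
```

In this toy case the simple sample frequency of the unsampled tree is zero, while the conditional clade estimate is 0.25. Trees drawn with `sample_tree` come from the approximating distribution, so correcting toward the true posterior by importance sampling would additionally require weighting each draw by its unnormalized posterior density divided by its `ccd_probability`, which is not shown here.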
