Estimating the Bayesian Phylogenetic Information Content of Systematic Data

$600,000 | FY2014 | BIO | NSF

University Of Connecticut, Storrs CT

Investigators

Paul O Lewis (contact), Louise A Lewis, Lynn Kuo, Ming-Hui Chen

Abstract

This research will develop and test new analytical methods for estimating the quality and information content of biological data sets used to determine genealogical relationships (phylogenies) among species. Phylogenies are crucial in many areas of biology, from identifying emerging pathogens to studying the mechanisms of evolutionary change and the functioning of communities of species in nature. The methods developed in this project will be published in scientific journals and provided in free, open-source software that enables researchers to apply them to their own data. This project will also facilitate interdisciplinary training of postdoctoral associates, graduate students, and faculty in biology and statistics. High school students will be involved in developing mobile apps that will provide useful, free tools to the scientific community.

The Bayesian statistical framework is widely used in phylogenetics and molecular evolution; however, the means for estimating the information content of data remain poorly developed. One reason may be that estimating Kullback-Leibler (KL) divergences requires marginal likelihoods and topology posterior probabilities, and accurate means of estimating these quantities have only recently been achieved. Correspondingly, the primary objectives for this research are:

(1) Evaluate the utility of KL-based information content measurement for answering a diversity of questions important to systematists, including: (a) How much information about tree topology is present in a data set? (b) Can topological information be separated from information about substitution model parameters? (c) How much information does one data subset have compared to a different data subset? (d) Do two data subsets contain conflicting phylogenetic information? (e) How much information is there about particular model parameters (e.g., a divergence time of interest)? (f) How much information is there for resolving particular clades?

(2) Explore issues related to information content, such as: (a) a definition of saturation based on topological information content; (b) a KL-based method for estimating variable-tree marginal likelihoods for purposes of model selection; and (c) polytomy analyses.

(3) Implement KL estimation of information content in existing Bayesian phylogenetics software to make it freely available to the systematics community.

These objectives will be pursued using the variable-tree IDR method to provide accurate estimates of all marginal likelihoods needed for KL estimation. The recently published conditional clade method allows accurate posterior probabilities of tree topologies to be estimated from sample conditional clade frequencies. Computer simulation experiments and analyses of relevant real-world data sets will be used to evaluate the effectiveness of KL in measuring information content and in providing answers to the questions posed above.
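As a rough illustration of question (1a), the sketch below computes the KL divergence of a topology posterior from a prior over tree topologies. It assumes a discrete uniform prior over unrooted binary topologies and takes the posterior probabilities as given; both function names are hypothetical, and the sketch does not reproduce the IDR marginal likelihood or conditional clade estimators named in the abstract.

```python
import math


def num_unrooted_topologies(n_taxa: int) -> int:
    """Count unrooted binary tree topologies for n_taxa >= 3: (2n - 5)!!."""
    if n_taxa < 3:
        raise ValueError("need at least 3 taxa")
    count = 1
    for k in range(3, 2 * n_taxa - 4, 2):  # odd factors 3, 5, ..., 2n - 5
        count *= k
    return count


def topology_kl_from_uniform(posterior, n_topologies: int) -> float:
    """KL(posterior || uniform prior) in nats.

    `posterior` lists the probabilities of the topologies with nonzero
    posterior mass; all other topologies are assumed to have probability 0.
    The uniform prior is an assumption made for this illustration.
    """
    if abs(sum(posterior) - 1.0) > 1e-9:
        raise ValueError("posterior probabilities must sum to 1")
    # Each term is p * log(p / (1 / N)) = p * log(p * N); zero terms drop out.
    return sum(p * math.log(p * n_topologies) for p in posterior if p > 0)


if __name__ == "__main__":
    n_trees = num_unrooted_topologies(5)
    kl = topology_kl_from_uniform([0.8, 0.15, 0.05], n_trees)
    # Fraction of the maximum possible topological information, log(N).
    print(n_trees, kl, kl / math.log(n_trees))
```

Under a uniform prior the divergence is bounded by the log of the number of topologies: it is zero when the posterior equals the prior and reaches the bound when all posterior mass sits on a single tree, so dividing by that bound gives a fraction between 0 and 1.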
