Experimental and Theoretical Analyses of Tree Distance Distributions
Cuny City College, New York NY
Investigators
Abstract
Large data sets arise in a wide range of settings across scientific and engineering disciplines. Organizing such large data sets into forms where they can be effectively searched or understood can be very difficult. A common method is based upon the "divide and conquer" approach, in which successive division of large data sets leads, after some steps, to the amount of data in each piece being much more tractable. For example, after successively halving a data set with a trillion entries 20 times, there would be about a thousand entries in each remaining piece. Such successive division processes can be represented naturally by a branching structure known as a binary tree. It is important for efficiency that each step divides the pieces roughly in half; otherwise not all of the pieces will shrink to more manageable sizes. Trees also are a fundamental structure showing hierarchical relationships in a broad range of settings, from the data storage and searching technique just described to modeling evolutionary processes in biology. Tree distances measure how different two trees are and are important in assessing the degree to which two possible trees agree or disagree. There a number of different tree distances, which vary in what aspects of tree commonality they measure, their importance in different applications, and in their difficulty of calculation. This project seeks to better understand the distribution of multiple measures of distance on trees. This project studies both ordered trees, used in computational settings such as data and disk storage techniques, and unordered trees, arising in biological and other scientific studies. The statistical properties of the distribution, including averages, variance and asymptotic properties, present difficult challenges and will be approached both experimentally and via exact and asymptotic combinatorial methods. The project also works on new approaches to compute tree distances that are both important for phylogenetic applications and computationally feasible. Greater understanding of rotation operations used in tree balancing may lead to improved algorithms for trees used in broad classes of computation, underlying the efficiency of many large database structures. The project develops understanding of how to scale tree distances by understanding the distribution of distances, giving scientists studying biological questions better understanding for interpreting results of tree reconstruction efforts. The project involves undergraduate researchers in cutting-edge research projects, both in the development of large-scale computational experiments and in developing the theoretical foundations of the answers to these questions. This award by the Computational Mathematics Program of the Division of Mathematical Sciences is co-funded by the Biodiversity and Ecosystem Dynamics Program in the DEB Division of the BIO Directorate.
View original record on NSF Award Search →