CAREER: Algorithms for Gene Family Evolution with Gene Duplication, Loss, and Coalescence
Harvey Mudd College, Claremont CA
Abstract
Evolution of genes and genomes is responsible for the immense biological diversity on our planet. However, despite its central role as the most fundamental property of life, the process of evolution remains poorly understood, and current models have typically been unable to span the diversity of scales at which evolution can act. This work addresses this shortcoming by developing new models and algorithms that simultaneously account for the most prevalent processes in eukaryotic gene family evolution: gene duplication, loss, and coalescence. The new computational framework and methods will enable researchers to systematically interpret data sets, make substantially more reliable and robust inferences, and improve our understanding of genome evolution. Because these evolutionary histories form the basis of many genomic studies, the results will, in turn, benefit many areas of biology. Additionally, the PI is committed to educating the next generation of scientists. As part of this project, the PI will provide compelling research experiences for a substantial number of undergraduates, develop undergraduate courses in Data Science, and continue engaging in several college-wide and external initiatives aimed at broadening participation in STEM.
This research develops models and algorithms in the field of phylogenetic reconciliation, which compares a gene tree with its species tree to infer the evolutionary events that link them. For eukaryotic organisms, the most popular reconciliation methods either allow for gene duplications and gene losses, which is appropriate only for species sampled at large evolutionary distances, or allow for coalescences, which is appropriate only for species sampled at close evolutionary distances. That is, each model provides only a partial view of evolution, which limits its applicability and accuracy. By bridging these two models, this research will impact how gene family evolution is represented and how reconciliations are inferred and analyzed.
Specifically, this work addresses three key problems in the field: algorithmic challenges of scaling to large datasets, statistical challenges of distinguishing biological signal from noise, and modeling challenges of generalizing across genomes. Expected contributions include novel algorithms and heuristics for the reconciliation problem, methods for resolving multifurcating trees, and models that can account for multiple samples per species and for species hybridization. In addition, the joint evolutionary models and inference algorithms developed here may motivate further unified approaches in the field.
This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.
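As an illustration of what reconciliation means in the abstract above, the following minimal Python sketch maps each gene-tree node to the lowest common ancestor (LCA) of the species its descendants come from and counts the implied duplications. It covers only the classical duplication-loss view, not the joint duplication, loss, and coalescence model this project proposes, and it is not taken from the project's software. The tree encoding (nested tuples with string leaves), the function names, and the example data are all assumptions made for illustration.

```python
# Illustrative sketch only: LCA-based reconciliation under a duplication-loss model.
# Trees are nested 2-tuples with string leaves (an assumed, hypothetical encoding).

def index_species_tree(tree, parent=None, depth=0, table=None):
    """Record (parent, depth) for every species-tree node, keyed by the node itself."""
    if table is None:
        table = {}
    table[tree] = (parent, depth)
    if not isinstance(tree, str):
        for child in tree:
            index_species_tree(child, tree, depth + 1, table)
    return table

def lca(a, b, table):
    """Walk the deeper node upward until both nodes coincide."""
    while a != b:
        if table[a][1] >= table[b][1]:
            a = table[a][0]
        else:
            b = table[b][0]
    return a

def reconcile(gene_tree, leaf_to_species, table):
    """Return (species node this gene node maps to, duplications at or below it)."""
    if isinstance(gene_tree, str):
        return leaf_to_species[gene_tree], 0
    left, right = gene_tree
    m_left, d_left = reconcile(left, leaf_to_species, table)
    m_right, d_right = reconcile(right, leaf_to_species, table)
    here = lca(m_left, m_right, table)
    # A node is inferred to be a duplication if it maps to the same
    # species node as one of its children; otherwise it is a speciation.
    is_duplication = here in (m_left, m_right)
    return here, d_left + d_right + int(is_duplication)

# Hypothetical example: three species, one gene family with two copies in A and B.
species_tree = (("A", "B"), "C")
gene_tree = ((("a1", "b1"), ("a2", "b2")), "c1")
leaf_to_species = {"a1": "A", "a2": "A", "b1": "B", "b2": "B", "c1": "C"}

table = index_species_tree(species_tree)
root_mapping, duplications = reconcile(gene_tree, leaf_to_species, table)
print(root_mapping, duplications)
```

In this toy example the two (A, B) gene clades both map to the ancestor of species A and B, so their parent is inferred to be a single duplication that predates the A-B split. A coalescence-only model would instead explain gene tree and species tree disagreement through incomplete lineage sorting, which this sketch does not attempt.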