Collaborative Research: CIBR: CloudForest: A Portable Cyberinfrastructure Workflow To Advance Biological Insight from Massive, Heterogeneous Phylogenomic Datasets

$231,564 | FY2019 | BIO | NSF

Florida State University, Tallahassee FL

Investigators

Abstract

Variation across inferred gene trees is arguably the most consistent and striking observation from empirical phylogenomic studies, yet many unanswered questions remain about the causes of this variation. The questions persist in part because modern phylogenetic inference is still deeply influenced by a decades-old paradigm. Under that paradigm, data from one or a few genes were typically gathered at the same time, combined into a single dataset, and analyzed by a single program that estimated a shared tree. While the size and complexity of datasets have changed radically in recent years, many aspects of this general workflow persist. Most current approaches do not naturally integrate inferences from different sources, whether different studies or software packages, and even cutting-edge methods that model differences in gene histories still summarize these histories as a single "species tree" topology. More versatile tools are needed to understand the heterogeneity inherent to modern genomic datasets. Key to this versatility is the ability to move flexibly and seamlessly between different stages of a phylogenetic workflow, from inference of individual gene trees to exploration of the genome-wide phylogenetic landscape and, ultimately, to learning about the biological processes that have shaped variation across the genome. Each of these stages may rely on different analytical tools and software. The major aim of this project is to develop a cyberinfrastructure workflow called CloudForest to address outstanding challenges in phylogenomics and provide researchers with a set of streamlined tools to explore and understand variation in evolutionary history across different regions of the genome (i.e., gene tree variation). CloudForest will allow users to leverage diverse computing resources that range from laptops to HPC clusters to cloud-based resources like JetStream or Amazon Web Services.
CloudForest will meet many of the outstanding needs of empirical phylogenomic studies, such as (1) visualizing variation across gene trees, (2) revealing structure in sets of trees (forests), (3) conducting hypothesis tests regarding the causes of gene-tree variation, and (4) detecting genes that may have outlying (and potentially aberrant) histories. By addressing these challenges in a consistent way across computing platforms, CloudForest will allow biologists to make efficient use of any computational resource at their disposal, with workflows appropriate for addressing a variety of important, unresolved questions in both evolutionary biology and other applied fields. This project also aims to advance broader goals by (1) supporting broad educational and training opportunities for researchers from around the world in the use of advanced computing solutions, (2) actively promoting the involvement and achievements of researchers from underrepresented groups in computational biology, (3) providing unique, interdisciplinary training opportunities for graduate students at the intersection of computing, math, and biology, (4) contributing to the development of an interactive and visually rich website for learning about phylogenetics and phylogenomics, and (5) facilitating applied phylogenetic research that will advance human health and well-being. A public-facing website for this project can be found at https://github.com/jwilgenb/CloudForest. This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.
