CAREER: Advancing evolutionary genomics and eukaryotic biodiversity research through accurate, scalable, and flexible frameworks for structural genome annotation
University of Connecticut, Storrs, CT
Investigators
Abstract
A high-quality annotation and its associated genome assembly are necessary to understand how genes in a given organism work. Variation associated with genes, and their structure, provides a framework for examining morphological, physiological, and behavioral traits. In the era of high-throughput sequencing, the size and complexity of the genomes attempted have increased dramatically. Despite this, over 91% of these genomes contain a multitude of gene annotation errors. The Earth BioGenome Project intends to sequence 1.5 million eukaryotic genomes in the next ten years. Related projects, such as the Vertebrate Genomes Project, the Global Invertebrate Genome Alliance, and the 10,000 Plant Genomes Project, will add substantially to the genomic resources available for biodiversity research. Reliable, efficient, and well-integrated software that maintains connectivity to community data standards will be critical for handling the tremendous volume of data generated by these initiatives. The EASEL (Efficient, Accurate, Scalable Eukaryotic modeLs) framework will ease the burden on researchers, many of whom are attempting to assemble and annotate genomes with small teams. Collaborations with these small teams will support the development of an annotation platform that implements machine learning in a user-friendly package. At the same time, EASEL will improve the efficiency and accuracy of annotation by responding to the needs of larger and more complex genomes. Collaborations with these large-scale initiatives will support an intensive undergraduate internship program to connect biology students to big data, bioinformatics, and machine learning in the context of genome annotation. This project will develop EASEL as an integrated and accessible deep learning framework for the annotation of eukaryotic reference genomes with limited or extensive external evidence.
The software will improve both evidence-based and ab initio gene models through a full workflow that runs from repeat identification through gene model annotation. Software development will be paired with research partnerships representing over 30 new eukaryotic genomes, including insects, plants, and animals. Following successful implementation, EASEL will be translated into a framework compatible with the Galaxy Tool Shed so that it can be freely installed and executed through any local Galaxy instance. A Tripal/Galaxy database module will be developed for installation on any Tripal clade or model organism web-based repository, providing analytical capacity close to the genomic resources housed in community databases. Integration at the database level will be evaluated first within TreeGenes, the forest tree genomics and phenomics resource. Software development will be integrated into a multi-disciplinary, research- and education-driven model. A new undergraduate summer training opportunity, Genome Assembly and Annotation, will guide students through a three-module research experience that will culminate in an Annotation-thon and a total of ten new genome annotations. Software and results from this project will be distributed here: https://gitlab.com/PlantGenomicsLab/HBEF/-/tree/master/Annotation

This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.