Building Better Genome Assemblies and Gene Models with RNA-Seq Data

$458,778 | FY2014 | BIO | NSF

Johns Hopkins University, Baltimore MD

Investigators

Abstract

This project will produce novel and critical bioinformatics algorithms and tools to improve genome assembly and annotation. The tools will be released free of charge and without restrictions to help biologists create better reference genomes and gene annotations for their species of interest. Integrated into a teaching module, the tools will also be available to enhance student research and education. For outreach and training, the project will contribute to the larger effort to recruit students from diverse backgrounds into interdisciplinary science by creating summer research internship opportunities, particularly for students from underrepresented groups and Baltimore inner-city high schools, in collaboration with Johns Hopkins University's Center for Talented Youth. The software, documentation, teaching module and analysis data for the project will be accessible from http://ccb.jhu.edu/people/florea/research/ and from public repositories such as SourceForge and the iPlant Collaborative environment.

Genome assembly and annotation are the first and most important steps in decoding the genetic makeup of an organism. Next-generation sequencing (NGS) is fueling a wide range of genome sequencing efforts for a growing number of plant and animal species. However, bioinformatics methods have been slow to adapt to the tremendous increase in pace and data volume. Many assembly projects simply forgo painstaking but much-needed curation steps, even as evidence mounts that the quality of an assembly affects its annotations and subsequent analyses. The project will build a much-needed suite of automated tools that leverage rich RNA-Seq resources to improve, first, the quality of the genome assembly and, second, its gene and transcript annotations. The first set of tools will use contiguity properties of RNA-Seq reads to recruit unmapped contigs into a draft genome assembly, improving its completeness, and to discover assembly errors. The second set of tools will combine RNA-Seq and traditional (Sanger) cDNA sequences to produce a comprehensive set of gene and transcript annotations along the genome. The research will generate new graph-based algorithms that computational biologists can use in their future tool development efforts, and it will provide critical insights into the power and limitations of NGS data to help guide future plant genome sequencing projects. By applying the methods to data from ongoing sequencing projects, including that for the loblolly pine genome, the project will contribute to better reference genomes for these species.
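
As a rough illustration of the first idea, recruiting unplaced contigs through RNA-Seq read contiguity, the following is a minimal sketch and not the project's actual algorithm. It assumes that read alignments have already been reduced to pairs of contig names, one pair per read (or read pair) that spans two contigs. The function name `recruit_contigs`, the `min_support` threshold, and the contig names are all hypothetical, introduced only for this example.

```python
from collections import Counter


def recruit_contigs(links, placed, min_support=3):
    """Suggest an anchor in the draft assembly for each unplaced contig.

    links       -- iterable of (contig_a, contig_b) pairs, one per RNA-Seq
                   read or read pair whose alignment spans two contigs
                   (assumed to be precomputed from the alignments)
    placed      -- set of contig names already in the draft assembly
    min_support -- hypothetical cutoff: minimum number of supporting reads
    """
    support = Counter()
    for a, b in links:
        # Only links joining one placed and one unplaced contig are useful
        # here; self-links and links within the same group are skipped.
        if a == b or (a in placed) == (b in placed):
            continue
        unplaced, anchor = (b, a) if a in placed else (a, b)
        support[(unplaced, anchor)] += 1

    best = {}
    for (unplaced, anchor), n in support.items():
        if n < min_support:
            continue
        # Keep the anchor with the most supporting reads; on a tie the
        # anchor seen first is kept.
        if unplaced not in best or n > best[unplaced][1]:
            best[unplaced] = (anchor, n)
    return {u: anchor for u, (anchor, _) in best.items()}


# Toy input: ctgA is linked to scaf1 by four reads and to scaf2 by two;
# ctgB has a single link, which falls below the default threshold.
links = (
    [("scaf1", "ctgA")] * 4
    + [("ctgA", "scaf2")] * 2
    + [("scaf2", "ctgB")]
    + [("ctgA", "ctgB")] * 5
)
print(recruit_contigs(links, placed={"scaf1", "scaf2"}))  # {'ctgA': 'scaf1'}
```

A real tool would also need to account for read orientation, distance along the transcript, and conflicting links, which may point to assembly errors rather than to missing sequence.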
