Computational Methods for Gene Discovery, Genome Annotation, and Genome Assembly
Johns Hopkins University, Baltimore MD
Abstract
Project Summary: Thousands of new human genomes are being sequenced each year in efforts to understand the genetic causes of human diseases, and thousands of animal and plant genomes are being sequenced to answer a broad range of biological questions. In parallel with this increase in whole-genome DNA sequencing, RNA sequencing has grown rapidly as well, owing to its power to characterize gene expression in a multitude of cell types and conditions and its potential to reveal new genes and new splice variants. These enormous data sets require highly efficient and accurate computational methods for analysis, and they also present opportunities for discovery. To properly analyze the many diverse humans being sequenced, we need accurate, comprehensive catalogs of genes and transcripts, and we also need to move beyond our reliance on a single reference genome that is missing much of the variation found in the human population. We propose to address these challenges in four specific ways.
First, we will develop new and improved gene discovery and genome annotation methods. This effort will include development of new algorithms for recognizing splice sites and protein-coding regions; a new eukaryotic gene finder based on a convolutional neural network; expansion of our strategy for using protein structure prediction to identify functional protein isoforms; and a new whole-genome annotation pipeline that will make extensive use of RNA-seq data to find both coding and noncoding genes and transcripts.
Second, we propose to extend and enhance the CHESS (Comprehensive Human Expressed SequenceS) database, the only human gene database that has direct evidence of gene expression associated with nearly all of its genes. This effort will include mapping CHESS onto the complete T2T-CHM13 human genome, which will enhance the value of CHM13 as a new human reference, and mapping CHESS onto multiple other individual human genomes. Our effort to expand the database will include analysis of many thousands of novel genes and transcripts proposed by other sources, which we will evaluate for inclusion in CHESS using a combination of protein structure prediction, transcriptional evidence, evolutionary conservation, and ab initio gene prediction methods. We will also create an ancillary searchable database, CHESS+, that will contain millions of transcripts that have been assembled but not yet included in CHESS, as a community resource for gene discovery.
Third, we will build upon existing high-quality draft assemblies to assemble gap-free versions of multiple new individual human genomes, chosen to increase the diversity of those previously published. We will develop an improved cross-genome annotation mapping system that will use both DNA and protein alignment, and we will use this system to annotate all of the new human genomes, which we will then compare to identify mutations affecting gene-containing regions.
Finally, we will apply our latest genome assembly decontamination methods to identify contaminating DNA, which currently affects thousands of published genomes, and release corrected versions of all genomes in which we find contamination.