Computational Methods for Gene Discovery, Genome Annotation, and Genome Assembly

$2,842,773 · R01 · FY2025 · HG · NIH

Johns Hopkins University, Baltimore MD

Abstract

Project Summary

Thousands of new human genomes are being sequenced each year in efforts to understand the genetic causes of human diseases, and thousands of animal and plant genomes are being sequenced to answer a broad range of biological questions. In parallel with this increase in whole-genome DNA sequencing, RNA sequencing has exploded as well, due to its power to characterize gene expression in a multitude of cell types and conditions, and to its potential to discover new genes and new splice variants. These enormous data sets require highly efficient and accurate computational methods for analysis, and they also present opportunities for discovery. To properly analyze the many diverse humans being sequenced, we need accurate, comprehensive catalogs of genes and transcripts, and we also need to move beyond our reliance on a single reference genome that is missing much of the variation found in the human population. We propose to address these challenges in four ways.

First, we will develop new and improved gene discovery and genome annotation methods. This effort will include development of new algorithms for recognizing splice sites and protein-coding regions; a new eukaryotic gene finder based on a convolutional neural network; expansion of our strategy for using protein structure prediction to identify functional protein isoforms; and a new whole-genome annotation pipeline that will make extensive use of RNA-seq data to find both coding and noncoding genes and transcripts.

Second, we propose to extend and enhance the CHESS (Comprehensive Human Expressed SequenceS) database, the only human gene database that has direct evidence of gene expression associated with nearly all of its genes. This effort will include mapping CHESS onto the complete T2T-CHM13 human genome, enhancing the value of CHM13 as a new human reference, and mapping CHESS onto multiple other individual human genomes. Our effort to expand the database will include analysis of many thousands of novel genes and transcripts proposed by other sources, which we will evaluate for inclusion in CHESS using a combination of protein structure prediction, transcriptional evidence, evolutionary conservation, and ab initio gene prediction methods. We will also create an ancillary searchable database, CHESS+, that will contain millions of transcripts that have been assembled but not yet included in CHESS, as a community resource for gene discovery.

Third, we will build upon existing high-quality draft assemblies to assemble gap-free versions of multiple new individual human genomes, chosen to broaden the diversity of the human genomes published so far. We will develop an improved cross-genome annotation mapping system that will use both DNA and protein alignment, and use this system to annotate all of the new human genomes, which we will then compare to identify mutations affecting gene-containing regions.

Finally, we will apply our latest genome assembly decontamination methods to identify contaminating DNA, which currently affects thousands of published genomes, and release corrected versions of all genomes in which we find contamination.
