CAREER: De novo assembly of duplicated sequences in vertebrate genomes
University of Southern California, Los Angeles, CA
Investigators
Abstract
Many genomes contain long, low-copy repeated sequences called segmental duplications. These arise from errors in genome replication and, when they overlap genes or other functional DNA, can affect the fitness of an organism and the evolution of a species. Duplicated sequences have historically been difficult to study. Sequencing instruments can read only short fragments of DNA, so entire genomes must be reconstructed from how the fragments overlap, much like assembling a jigsaw puzzle. Repeated sequences longer than the fragments an instrument can read cannot be correctly included in an assembly. Improvements in sequencing technology, which have increased the length of fragments that can be read, and in genome assembly algorithms have increased the resolution of repeats, creating an opportunity to study segmental duplication in diverse species. The aim of this project is to develop computational tools to study segmental duplications in nonhuman genomes, first by constructing methods to catalog duplicated genes in multiple genomes, and then by improving methods for assembling duplicated sequences. This can improve our understanding of how genomes have evolved through duplication and shed light on which genes have played a role in adaptation. This project also supports the creation of a USC Summer Genome Program, in which undergraduate students from the greater Southern California area sequence, annotate, and publish on a genome in preparation for continuing to an advanced postgraduate degree.
The initial focus of the project is to develop a method to curate duplicated genes in vertebrate genomes. This will be accomplished using a computational pipeline that detects duplicated genes that have been resolved in multiple copies in an assembly and identifies genes that have missing copies. Genes with missing copies are identified through excess coverage of reads mapped back to the assembly: reads from every true copy pile up on the copies that were assembled, so their depth exceeds that of single-copy sequence.
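The excess-coverage idea can be illustrated with a minimal sketch. This is not the project's pipeline: the function name, the input dictionaries, the use of a median over single-copy control genes as the baseline, and the rounding rule are all assumptions made for illustration. It assumes per-gene mean read depth has already been computed from reads mapped back to the assembly.

```python
import math
from statistics import median


def estimate_missing_copies(gene_depth, assembled_copies, baseline_depth):
    """Flag genes whose read depth suggests copies absent from the assembly.

    gene_depth: mean read depth per assembled copy of each gene (hypothetical input).
    assembled_copies: number of copies of each gene present in the assembly.
    baseline_depth: expected depth of single-copy sequence.
    Returns {gene: estimated number of missing copies} for genes with at least one.
    """
    missing = {}
    for gene, depth in gene_depth.items():
        n_assembled = assembled_copies.get(gene, 1)
        # Depth relative to single-copy sequence, scaled by the assembled copies,
        # approximates the total copy number in the sequenced genome.
        estimated_total = math.floor(depth / baseline_depth * n_assembled + 0.5)
        extra = estimated_total - n_assembled
        if extra > 0:
            missing[gene] = extra
    return missing


# Illustrative values only; none of these numbers come from real data.
control_depths = [30.0, 29.0, 31.0]      # assumed single-copy control genes
baseline = median(control_depths)         # 30.0

gene_depth = {"geneA": 30.0, "geneB": 61.0, "geneC": 29.0, "geneD": 45.0}
assembled_copies = {"geneA": 1, "geneB": 1, "geneC": 1, "geneD": 2}

result = estimate_missing_copies(gene_depth, assembled_copies, baseline)
print(result)
```

In this toy input, geneB has roughly twice the baseline depth with one assembled copy, and geneD has 1.5 times the baseline depth across two assembled copies, so each is flagged as likely missing one copy. A real analysis would also need to account for coverage biases and mapping ambiguity, which this sketch ignores.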
When the missing copies of genes have paralog-specific variants, additional copies of the duplicated gene may be resolved using a more sensitive approach to assembly. SDA, a previously developed method for resolving missing genes that contain paralog-specific variants, will be improved to operate on new data types and to address more complex duplication organizations. Finally, a combination of duplication resolution and curation will be applied to quantify gene duplication across more than 100 genomes sequenced by the Vertebrate Genomes Project. The results of this project will be listed under publications and resources at chaissonlab.usc.edu.
This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.