Detect small and very small in-space inversions and improve alignment quality

$316,455 · R15 · FY2010 · HG · NIH

Northern Illinois University, DeKalb, IL

Investigators

Linked publications & trials

Paper 25273068, Paper 21994225

Abstract

DESCRIPTION (provided by applicant): Multi-species alignments at a genome-wide level have been valuable resources for annotating the human genome. The current alignment programs are geared toward identifying substitutions and insertions/deletions, but fail to find short inversions. The core computational models in most aligners explicitly account for substitutions and insertions/deletions that are as small as one base. However, an inversion needs to form a strong alignment to be included in the final results. An in-space inversion is a genomic rearrangement where a sequence interval is replaced by its reverse complement. Small in-space inversions can easily confuse aligners and cause misalignment. This type of misalignment becomes more serious when there are more species to be aligned, since the chance that a multi-alignment block contains at least one in-space inversion increases with the number of species. In addition, very small inversions cannot be detected by current aligners due to limitations of their computational methodologies. The prevalence and characteristics of these very small inversions have not been studied. The large discrepancy among previous results on small inversions between human and chimpanzee indicates that detecting small inversions is very difficult even for closely related species.

We have identified a set of small inversions (in ENCODE regions) that are misaligned in existing alignments (e.g. on the UCSC or Ensembl genome browsers). Some of these inversions are relatively large (up to 278 bases), and some are very short (just a few bases). We propose to develop efficient computational tools to identify small (e.g. <300 bases) and very small (e.g. <30 bases) in-space inversions between human and model organisms genome-wide, and subsequently to correct the alignments that contain verified inversions to improve their accuracy. We plan to reuse existing tools (e.g. LASTZ) to produce backbone alignments, and update the multiple alignment package TBA/Multiz to align in-space inversions properly. We will implement our own heuristic procedures to identify small and very small in-space inversions, since there are not yet adequate solutions for this problem. We will use phylogeny information and evolutionary properties of the species, together with probability analysis, to verify very small inversions.

Upon completion of the project, we expect to obtain a better understanding of the characteristics of small inversions across the evolutionary span of mammals. This pipeline of efficient software tools will also be able to detect small inversion polymorphisms among human individuals on a large scale. At the same time, misalignments caused by in-space inversions will be corrected, and alignment accuracy will be significantly improved. These corrections will be updated in the UCSC Genome Browser. The NHGRI currently has an on-going program to "characterize the human genome and the genomes of selected model organisms", and we intend to characterize the small and very small in-space inversions within these genomes. More importantly, the more accurate alignments will facilitate any downstream research in comparative genomics that is based on sequence alignments. After all, wrong alignments lead to wrong conclusions.

PUBLIC HEALTH RELEVANCE: Inversion is one type of genomic rearrangement that causes differences between human and other species, including chimpanzee, as well as among individual humans, and may cause genetic diseases. We will create computational methods that will greatly increase our knowledge of these genomic differences, and these results will help to understand the origin of some diseases.
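
Illustration (not part of the applicant's abstract): the definition above, in which an in-space inversion replaces a sequence interval by its reverse complement, can be made concrete with a short Python sketch. The function names, the half-open interval convention, and the `min_len` threshold are all hypothetical choices made for this example; the check shown is a naive one for two gap-free, equal-length sequences and is not the heuristic procedure the project proposes.

```python
from typing import Optional, Tuple

# Assumption: sequences contain only the unambiguous bases A, C, G, T.
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def apply_inversion(seq: str, start: int, end: int) -> str:
    """Replace seq[start:end] (half-open) by its reverse complement."""
    if not 0 <= start <= end <= len(seq):
        raise ValueError("interval out of range")
    return seq[:start] + reverse_complement(seq[start:end]) + seq[end:]


def single_inversion_interval(
    ref: str, other: str, min_len: int = 4
) -> Optional[Tuple[int, int]]:
    """Naive check: can one inversion explain every difference?

    Returns the half-open interval spanning the first and last mismatch
    if `other` equals the reverse complement of `ref` on that interval,
    else None. `min_len` (hypothetical threshold) rejects spans so short
    that a plain substitution would look identical.
    """
    if len(ref) != len(other):
        raise ValueError("sequences must have equal length")
    mismatches = [i for i, (a, b) in enumerate(zip(ref, other)) if a != b]
    if not mismatches:
        return None
    start, end = mismatches[0], mismatches[-1] + 1
    if end - start < min_len:
        return None
    if other[start:end] == reverse_complement(ref[start:end]):
        return (start, end)
    return None


ref = "AATCTTGACGGA"
inverted = apply_inversion(ref, 3, 9)  # invert the interval "CTTGAC"
print(inverted)
print(single_inversion_interval(ref, inverted))
```

A single A-to-T substitution is indistinguishable from a one-base inversion under this check, which is one way to see why the abstract calls for phylogeny and probability analysis to verify very small inversions; the `min_len` cutoff here is only a crude stand-in for that verification.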
