IMPROVE GENOME ANNOTATION USING MULTIPLE SEQUENCE ALIGNMENT RELIABILITY SCORES

$191,200 · R21 · FY2013 · HG · NIH

University Of Illinois At Urbana-Champaign, Urbana IL

Investigators

Linked publications, trials & patents

Paper 25273068 Paper 25144359 Paper 24259430 Paper 24222208 Paper 23852726 Paper 23307812 Paper 23281792 Paper 23254332 Paper 23151582 Paper 22751099

Abstract

DESCRIPTION (provided by applicant): Comparative genomics is a powerful tool for discovering functional elements in the human genome. The foundation of cross-species comparative genomics is multiple sequence alignment (MSA). Despite the progress of the past decade, MSA remains a difficult and error-prone task. Alignment errors can directly affect downstream analyses and may lead to incorrect biological conclusions. Many biomedical researchers use publicly available, precomputed MSAs in the Ensembl Browser and the UCSC Genome Browser to conduct various comparative genomic analyses, but these MSAs contain errors. Users often do not ask how reliable an alignment is, or do not know how to measure its reliability quantitatively. A preliminary study suggests that a considerable number of conserved elements in the current UCSC Genome Browser might be false positives introduced by unreliable MSAs. The impact of problematic alignments on genome annotation may be much greater than previously thought. In this project, novel probabilistic sampling-based scores to measure the reliability of multiple sequence alignments will be developed. Context-dependent substitution models and more realistic models of insertions and deletions will be employed so that the method can be applied at genome-wide scale and can handle deep alignments of a large number of sequences. In addition, the alignment reliability scores will be used to improve genome annotation. Data on functional elements in the human genome from the ENCODE project will be used to refine the model. The method will also be applied to identify functional elements that were originally missed because of uncertainty in the alignment. Improvements to other types of genome annotation (e.g., RNA genes, positive selection) will also be explored.
These new methods for capturing MSA reliability will greatly reduce the false positives that alignment errors introduce into comparative genomics analyses. If successful, the general methodology of comparative genomics can be improved, and laboratory experiments that rely on MSA-based computational studies will be much more effective. Results from the project will be integrated into the UCSC Genome Browser to benefit other researchers who use MSAs to discover disease-related signatures. The method will potentially have a meaningful impact on ENCODE, TCGA, Genome 10K, and other large-scale comparative genomics projects. This innovative project in computational biology will potentially have an important impact on the genomics community and enable advances in biomedical research.
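As a rough illustration of what a sampling-based alignment reliability score can look like, the sketch below scores each column of a reference alignment by how often its aligned residue pairs reappear in a set of alternative alignments. This is not the project's method, which the abstract does not specify. The pair-agreement measure, the function names `column_pairs` and `column_reliability`, and the assumption that sampled alignments are already available from some probabilistic sampler are all illustrative choices.

```python
from itertools import combinations


def column_pairs(alignment):
    """Return, for each column, the set of residue pairs aligned in it.

    `alignment` is a list of equal-length gapped strings ('-' marks a gap).
    A residue is identified as (row index, ungapped position in that row).
    """
    counters = [0] * len(alignment)
    columns = []
    for col in zip(*alignment):
        residues = []
        for row, ch in enumerate(col):
            if ch != "-":
                residues.append((row, counters[row]))
                counters[row] += 1
        columns.append(set(combinations(residues, 2)))
    return columns


def column_reliability(reference, samples):
    """Average fraction of each reference column's pairs found in the samples.

    Columns with fewer than two residues have no pairs, so their score is
    reported as None (undefined) rather than as a number.
    """
    ref_columns = column_pairs(reference)
    # All residue pairs aligned anywhere in each sampled alignment.
    sample_pair_sets = [set().union(*column_pairs(s)) for s in samples]
    scores = []
    for pairs in ref_columns:
        if not pairs or not samples:
            scores.append(None)
            continue
        recovered = sum(len(pairs & sp) for sp in sample_pair_sets)
        scores.append(recovered / (len(pairs) * len(samples)))
    return scores


# Hypothetical toy data: one sample agrees with the reference, one shifts it.
reference = ["AC-G", "A-TG"]
samples = [["AC-G", "A-TG"], ["-ACG", "ATG-"]]
scores = column_reliability(reference, samples)
```

Columns with low scores are those whose residue pairings are unstable across the sampled alignments. A downstream analysis such as conserved-element detection could down-weight or mask them, which is the general idea the abstract describes.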
