Algorithms and Software for Provably Accurate De Novo RNA-Seq Assembly

$458,756 · R01 · FY2016 · HG · NIH

University Of California Berkeley, Berkeley CA

Investigators

Lior S Pachter (contact), Sreeram Kannan, David Tse

Linked publications & trials

Paper 39143318 Paper 37738608 Paper 36713252 Paper 36712127 Paper 33051648 Paper 32484809 Paper 32135093 Paper 32085701 Paper 32034137 Paper 31521605 Paper 31142858 Paper 31042711 Paper 30566439 Paper 30034032 Paper 29949988 Paper 29523077 Paper 27905880 Paper 27230763

Abstract

DESCRIPTION (provided by applicant): RNA-Seq has revolutionized transcriptomics and is one of the most important high-throughput sequencing assays invented in recent years. The key computational problem is that of de novo assembly: the reconstruction of the transcripts and their abundances from tens to hundreds of millions of short reads. The problem is challenging due to a confluence of several factors: a large number of different transcripts (tens of thousands), long repeats across transcripts due to alternative splicing, widely varying abundances across transcripts, and the presence of read errors. Existing assemblers are mostly designed based on heuristic considerations and implement ad hoc methods that lead to unreliable transcriptome reconstructions. An accurate RNA-Seq assembler would enable more accurate identification of fusions in cancer transcriptomes, better gene annotations in model and non-model organisms, and more complete analyses of the dynamics of alternative splicing driving developmental and regulatory programs. In this proposal, we offer a systematic approach to the design of RNA-Seq assemblers based on information theoretic principles. We start by determining conditions on the data that guarantee that there is enough information to reconstruct the transcriptome, and then propose an assembly algorithm that can reconstruct the transcriptome with this minimal information. This algorithm optimally uses the available read information to resolve repeats and disambiguate isoforms. A key insight derived from the information theoretic approach is that widely varying abundances across transcripts, rather than being a complication, can actually be exploited as signatures of different transcripts to disambiguate among them. Based on our initial ideas, we have built an initial prototype and evaluated it against several existing software packages on both real and simulated data.
The encouraging results provide evidence that our approach, which we will fully develop, implement and evaluate during the funded period, can significantly outperform existing software. Additional functionalities, such as mixed short/long read assembly, genome-assisted assembly and joint processing of multiple RNA samples, will be designed and incorporated into the software as part of the proposed project.
