A Modular Framework for Accurate, Interpretable, and Reproducible Analysis of Long Read RNA-Seq Data

$638,112 | R01 | FY2025 | HG | NIH

Univ Of North Carolina Chapel Hill, Chapel Hill NC

Abstract

PROJECT SUMMARY

RNA sequencing is extensively used in biological and biomedical research to assess the expression of transcripts within and across biological samples. Recent technological advances allow sequencing of entire transcripts, enhancing isoform identification and allele expression analysis. As with short read RNA-seq, iterative algorithm development with design considerations towards the data and metadata infrastructure is key to supporting accurate scientific conclusions and facilitating computational reproducibility for users. Many specialized bioinformatic methods for long read data can detect novel transcripts and adapt the quantification of transcript abundance to the particular characteristics of long reads. However, there is a need for easy-to-use pipelines for multi-sample analysis that can be run by biologists or other analysts without advanced bioinformatic training, covering read processing, feature discovery, quantification, feature aggregation, and inference/visualization. We aim to provide such a modular pipeline with enhanced accuracy, interpretability, and reproducibility for long read RNA-seq, easily deployable on cloud platforms such as AnVIL. The team of investigators has substantial experience developing efficient and easy-to-use tools for transcriptomics analysis, having designed ultra-fast algorithms for sequence read processing, modular pipelines leveraging best-in-class methods from abundance estimation to statistical analysis, and end-user workflows that are widely used in research labs in academic institutes and industry. The goal of the project is to develop and enhance RNA-seq analysis tools for new long read datasets, focusing on improving accuracy and interpretability, in particular for multi-sample datasets.
In this proposal, we will 1) expand quantification tools for long read RNA-seq, including novel models for quantification and correction for errors and technical bias; 2) simplify multi-sample analysis with high-level tools that augment processing steps, extend our automatic provenance detection framework to support new use cases in long read data, and support data-driven feature aggregation; and 3) broaden user engagement by publishing workflows and tutorials, and hosting in-person and virtual workshops on AnVIL. The impact of this work will be to deliver software for long read RNA-seq that is user-friendly, integrates with existing pipelines, and is made accessible on cloud platforms like AnVIL.
