ProteoSeq - An Integrative Computational Framework for Proteotranscriptomics

$351,280 | R01 | FY2017 | GM | NIH

University Of California Los Angeles, Los Angeles CA

Linked publications & trials

Paper 32781045 Paper 32589925 Paper 29304370 Paper 28754146

Abstract

PROJECT SUMMARY: In eukaryotes, one gene can give rise to multiple protein isoforms through various types of alternative pre-mRNA processing (e.g., alternative splicing), contributing significantly to proteome complexity. Differential isoform expression manifests in the pathogenesis of diseases from heart failure to neurodegeneration, as well as in cellular responses to environmental stress, including alcohol and oxidative damage. Advances in RNA-seq technology have led to the discovery of many novel alternative isoforms, but their biological impact is often unclear in the absence of protein information. Conversely, shotgun proteomics technology enables large-scale characterization of proteins, but the limitations of "one-gene, one-product" databases restrict its utility in protein isoform identification. Deeper insights into the biology of alternative isoforms require combining the complementary strengths of transcriptomics and proteomics. Accordingly, the integration of technical platforms from mRNA to protein has become an indispensable step in advancing a holistic portrait of gene products. Among the key challenges are the segregation of proteomics and transcriptomics repositories and the disconnect between the respective data analysis pipelines and expertise. Despite recent progress, there is an urgent and unmet need for well-integrated and user-friendly computational platforms that can support everyday biomedical researchers in harnessing diverse data types for multi-omics studies. The central goal of this project is to create a unified platform to decode alternative isoforms from RNA-seq/Ribo-seq data, and to guide shotgun proteomics characterization of protein isoforms. Our approach capitalizes on the rapid evolution of Big Data sciences in recent times, where new frontiers in multi-omics integration now make it possible to traverse heterogeneous computational resources and data types seamlessly.
We will design, construct, and implement an integrative proteotranscriptomics framework (ProteoSeq), which will combine novel analytical models and custom proteomics workflows to unite transcriptomics and proteomics data for large-scale characterization of alternative protein isoforms. Our proposal details three data science aims, which will (i) develop methods to infer full-length mRNA and protein isoforms from hybrid (short-read/long-read) RNA-seq and Ribo-seq data; (ii) engineer an integrative platform for users to analyze protein isoforms from proteotranscriptomics data on the cloud; and (iii) validate and accrue protein evidence for alternative isoforms in diverse high-value datasets. Our efforts aim to synergize two currently fragmented omics fields and thereby empower inquiries into the regulation of alternative isoforms in health and disease. We envision that the proposed computational tools will be generalizable to multiple biomedical disciplines and will serve the broad scientific community for routine multi-omics investigations in translational medicine.
