Integration of comprehensive cancer mutation and expression-associated data for biomarker evaluation and discovery

$449,998 | U01 | FY2017 | CA | NIH

George Washington University, Washington DC

Investigators

Linked publications & trials

Paper 35925813 Paper 34015823 Paper 33784373 Paper 33037820 Paper 32940334 Paper 32935101 Paper 32142370 Paper 30384176 Paper 29860481

Abstract

Current technologies for cancer genomics research generate petabytes of data that are dispersed across multiple archives in a non-standard fashion. This dispersal poses major challenges to comprehensive analyses based on the integration of such data. Two common types of secondary data generated from sequencing-based studies involve mutation and gene expression associated with the cancer state, as inferred from comparing matched tumor and normal samples. Massive collaborations like The Cancer Genome Atlas (TCGA) and the International Cancer Genome Consortium (ICGC) are instrumental in facilitating the generation of the sequence data and providing a modicum of standardization through best practices, but they do not always follow the same standards between projects. Moreover, proprietary databases like the Catalogue of Somatic Mutations in Cancer (COSMIC) generally store and annotate data in a format uniquely optimized for their own database to meet individual business needs. Thus, integrating mutation and expression data across resources is a massive undertaking, with efforts devoted to data curation, unification, harmonization, and appropriate annotation for proper representation at a central location. Additionally, it is difficult to comprehensively collect and map protein functional sites to the mutation sites from a variety of databases such as UniProt, RefSeq, and many others because the underlying sequences in these databases can be different. To address this challenge, the Early Detection Research Network (EDRN) Associate Membership funded the development of BioMuta and BioXpress, cancer-associated mutation and expression databases, respectively, to provide access to unified data from several popular cancer repositories and functional data from well-known molecular biology resources. Links to BioMuta are available through the EDRN portal and UniProt.
The focus of the proposed project is to provide a custom portal encompassing up-to-date releases of BioMuta and BioXpress leveraging the existing EDRN framework and data. This will provide a broader understanding of the cancer landscape moving toward the proteomic space and working synergistically with other ITCR resources. To supplement these data, we further propose to integrate normal expression data across several species that can be used to derive a deeper understanding of the cancer-associated expression profiles. Text-mining support will also be applied to the identified cancer-related mutation and expression profiles for evidence to aid in interpretation of the findings. It is expected that such large-scale integration of cancer data and supporting information will not only benefit cancer research, but will also become a critical necessity for ensuring the most efficient synthesis of information and therefore the earliest detection methods possible.
