Theoretical Foundations and Software Infrastructure for Biological Network Databases

$481,900 | U01 | FY2015 | CA | NIH

Case Western Reserve University, Cleveland OH

Abstract

DESCRIPTION (provided by applicant): Ever-increasing amounts of physical, functional, and statistical interaction data among biomolecules, including DNA regulatory regions, functional RNAs, proteins, metabolites, and lipids, as well as among genomic variants, offer unprecedented opportunities for computational discovery and for constructing a unified systems view of the cellular machinery. These data and associated formalisms have enabled systems approaches that have led to unique advances in the biomedical sciences. However, storage schemes, data structures, representations, and query mechanisms for network data are considerably more complex than those for flat or low-dimensional data representations (e.g., sequences or molecular expression). This complexity is even more evident when we consider the heterogeneity of interactions that can occur in the cell. For example, a pair of protein-coding genes can interact in a variety of ways: i) physical interaction between their gene products (protein-protein interaction), ii) interaction of one gene's product with the promoter/enhancer/silencer region of the other gene, or iii) genetic interaction, in which double mutants show a phenotype significantly different from the combined effects of the single mutations. The complexity grows further when one considers different versions of datasets, different techniques used for assaying and gathering the interactions between molecules, linkages across data, and interfaces with other tools.

This project seeks to answer a number of fundamental questions that relate to efficient utilization of large network-structured datasets:
- What are (provably) optimal storage schemes for large network-structured databases?
- How should multiple versions of the same or related datasets be stored?
- How does one trade off compression against query efficiency?
- How does one suitably abstract network data so that users can interactively interrogate them using front-ends such as Cytoscape?

This project aims to answer these questions by developing theoretically grounded and computationally validated storage schemes, algorithms, and software that will enable efficient and effective storage, update, processing, and querying of biological networks. We will develop compression techniques for efficient storage, version control mechanisms that allow users to create their own versions of networks, algorithms for efficient query processing on these networks, and implementations of these algorithms in broadly accessible and user-friendly software. This research will result in novel computational tools that will be disseminated to the community as open-source, public-domain software.

Our tools will render network data fundamentally more accessible to the broader biomedical sciences community. This will make the use of network data more commonplace in applications including the identification of composite prognostic and diagnostic markers, disease gene prioritization, modeling of tumor heterogeneity and progression in cancers, informing treatment, identification of therapeutic targets, and drug repositioning. In these respects, the algorithms and software are expected to have a far-reaching and deep impact.
