Similarity-Based Indexing and Integration of Protein Sequence and Structure Databases

$498,117 | FY2008 | BIO | NSF

Ohio State University Research Foundation -Do Not Use, Columbus OH

Investigators

Hakan Ferhatosmanoglu (contact), Chenglong Li, Yusu Wang

Abstract

The Ohio State University is awarded a grant to develop database indexing and similarity search technologies to manage, analyze, and integrate protein sequence and structure databases. Searching for similar sequences and structures in genomic and proteomic databases is a fundamental task in bioinformatics. As the size of the available data increases rapidly, it is essential to build indexing schemes so that integrated maintenance and querying of both sequence and structure data can be achieved effectively. To address this challenge, this project uses a unified theme for both types of data: extracting key features and mapping them into compact feature vector spaces to facilitate construction of integrated index structures with sensitive, accurate, and efficient querying capabilities. For the sequence data, the project will develop novel feature extraction methods that involve physicochemical properties of the amino acids and detect low levels of similarity. For the structural data, the project will develop methods to capture local structural motifs using contact maps and spatial motifs. In both cases, compact representations of features will be constructed, as well as efficient structures to index them. The approach incorporates biochemical properties of molecules into feature extraction to discover functional sites of proteins and to return biologically relevant query results. Finally, based on the unified feature representation and indexing framework, the project will develop methods to integrate sequence and structure data effectively at various levels. A holistic approach combining sequence and structure data would help to overcome the limitations of each and provide more accurate query results. The results of the project will benefit a wide range of application areas in natural and health sciences, including comparative and functional genomics, protein modeling and design, drug development, and preventative and personalized medicine.
Software developed in this project will facilitate large-scale genome-wide research projects which require iterative and interactive querying of available sequence and structure databases. The novel representations and sensitive motif extraction methods developed are also applicable to biological data visualization, classification, and multiple alignment problems. The software and the results of this project will be available at the website: http://bio.cse.ohio-state.edu.
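
As a rough illustration of the feature-vector idea described in the abstract, the sketch below maps a protein sequence to a short numeric vector using one physicochemical property and then finds the closest database entry by Euclidean distance. It is not the project's method: the choice of the Kyte-Doolittle hydropathy scale, the equal-width binning, the linear scan in place of a real index structure, and the names `feature_vector` and `nearest` are all assumptions made for this example.

```python
import math

# Kyte-Doolittle hydropathy values, used here as a single example
# physicochemical property (an assumption; the abstract names no scale).
KD = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def feature_vector(seq, bins=4):
    """Return `bins` values: the mean hydropathy of each equal-width segment."""
    values = [KD[aa] for aa in seq.upper()]
    n = len(values)
    if n < bins:
        raise ValueError("sequence is shorter than the number of bins")
    vec = []
    for b in range(bins):
        # Integer arithmetic keeps segments contiguous and non-empty.
        segment = values[b * n // bins:(b + 1) * n // bins]
        vec.append(sum(segment) / len(segment))
    return vec


def nearest(query, database, bins=4):
    """Return the name of the database sequence closest to `query`.

    This is a linear scan; a real system would place the vectors in a
    multidimensional index instead of comparing against every entry.
    """
    q = feature_vector(query, bins)
    return min(
        database,
        key=lambda name: math.dist(q, feature_vector(database[name], bins)),
    )


if __name__ == "__main__":
    # Hypothetical toy database with two short peptides.
    db = {"hydrophobic": "ILVILVIL", "charged": "KDEKRDEK"}
    print(nearest("LLVVIILL", db))
```

Because every sequence is reduced to a vector of the same length, sequences of different lengths become directly comparable, which is what makes a shared index over them possible. The cost is that a single averaged property discards most residue-level detail, so a practical scheme would combine several properties and finer-grained features.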
