Structural Biology Information Resources
National Library Of Medicine
Abstract
Three-dimensional protein structures are drawn from the Protein Data Bank, an international collaboration supported in part by the NIH. Records are processed at NCBI to provide precise definitions of sequences, structures, and molecular interactions. Protein structure records are compared to all NCBI protein sequence records using the BLAST algorithm, and to one another using the VAST structure-comparison algorithm. These automated comparisons provide the cross-references needed to link protein and gene sequences in the NCBI collection to the biological function annotation provided by protein structure records.

Informatics projects were needed this year to address the "remediation" project undertaken by the Protein Data Bank. The remediation modified all of the more than 50,000 structure records in the collection and provided revised sequences and structures for over 30% of them. It necessitated an update of the entire NCBI database and recalculation of neighbor (similarity) relationships for a large number of structure records. Other informatics projects automated weekly updates and provided improved molecular graphics summaries for structure records in the Entrez "docsum" and "Structure Summary" displays. A new project, still under way, clusters molecular interactions observed in related structures to provide a concise summary of biological functions.

The NCBI "Conserved Domains" Entrez database is in part drawn automatically from protein family alignments prepared by others. These include, for example, the "Pfam" collection prepared at the Wellcome Trust Sanger Institute and the "Protein Clusters" prepared at NCBI/IEB. Another component of the "Conserved Domains" database consists of expert-curated protein family alignments prepared by the staff of the project.
Alignments consistent with known three-dimensional structures are prepared using algorithms within the "Cn3D" program, and conserved subfamilies consistent with phylogenetic evidence are derived using the "CDTree" program. Curators record biological functions as indicated by interactions observed in three-dimensional structures within the family, or by other sources such as experimental studies in the literature. Protein sequences in the NCBI collection are automatically compared to conserved domain records using the "Reverse PSI-BLAST" algorithm.

An informatics project undertaken this year developed an algorithm for linking protein sequences to any subfamily that specifically includes that protein. Cross-validation experiments showed that "Reverse PSI-BLAST" scores within the range shown by the representative sequences in the curated subfamily alignment are near-perfect indicators of subfamily membership. Links from protein sequences to "Conserved Domains" have been modified to include ranking by this algorithm, so that the correct, subfamily-specific biological function is presented first, at the top of the list. Another project includes interaction sites drawn from three-dimensional structures in the default "Conserved Domains" link from sequence records. A project still in progress aims to provide protein-structure-based alignments for a large fraction of conserved superfamilies, to improve biological function annotation more rapidly than is possible by phylogenetic characterization of all subfamilies.
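
The subfamily-linking idea described above can be illustrated with a minimal Python sketch. It is not the NCBI implementation. The names `DomainHit`, `score_range`, `is_specific`, and `rank_hits` are hypothetical, the scores are made-up illustrative numbers, and the sketch assumes that "within the range" can be approximated by "at or above the lowest score seen among the curated representative sequences".

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DomainHit:
    """One hit of a query protein against a conserved-domain model (hypothetical record)."""
    model: str          # identifier of the domain model that was hit
    score: float        # Reverse PSI-BLAST score of the query against that model
    is_subfamily: bool  # True if the model is a curated subfamily alignment


def score_range(representative_scores):
    """Return the (lowest, highest) score shown by a subfamily's representative sequences."""
    return min(representative_scores), max(representative_scores)


def is_specific(hit, ranges):
    """Assumption: a hit is subfamily-specific if its score reaches the lowest
    score observed among that subfamily's curated representatives."""
    if not hit.is_subfamily or hit.model not in ranges:
        return False
    low, _high = ranges[hit.model]
    return hit.score >= low


def rank_hits(hits, ranges):
    """Order hits so that subfamily-specific hits come first, then by descending score."""
    return sorted(hits, key=lambda h: (not is_specific(h, ranges), -h.score))


# Illustrative data only: representative scores for one hypothetical subfamily.
ranges = {"sub_A": score_range([210.0, 250.0, 305.5])}
hits = [
    DomainHit("super_X", 400.0, False),  # superfamily-level model
    DomainHit("sub_A", 220.0, True),     # falls within sub_A's representative range
    DomainHit("sub_B", 500.0, True),     # no curated range available in this sketch
]
ranked = rank_hits(hits, ranges)
```

In this sketch the specific subfamily hit is listed first even though other models score higher, which mirrors the stated goal of presenting the subfamily-specific function at the top of the list.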