Structural Bioinformatics Information Resources

$4,766,549 · Z01 · FY2007 · LM · NIH

National Library Of Medicine

Investigators

Linked publications & trials

Paper 19591200 Paper 19454035 Paper 18984618 Paper 18940862 Paper 17868687 Paper 17596268 Paper 17425794 Paper 17170002 Paper 17166515 Paper 17135202 Paper 17135201 Paper 17105653 Paper 16707662 Paper 16385001 Paper 16381840 Paper 15608222 Paper 12520055 Paper 11752307 Paper 10592236

Abstract

Structure-based alignment and annotation of functional sites are accomplished using the "Cn3D" viewer/editor. Residue sites conserved across a protein domain family, which typically make up the conserved core structural elements, are identified on a reference protein structure. Aligned "columns" are then used for phylogenetic tree calculation and functional site annotation. Continuing work has added a fast "block alignment" algorithm and an automated alignment "refiner." These procedures are in daily use by the CDD curator team. The Cn3D program itself is also widely distributed, though it is used largely as a structure and structural alignment viewer rather than as an alignment editor. The program was downloaded 560,000 times during the first seven months of 2007.

Phylogenetic analysis and construction of family hierarchies are accomplished using the CDTree editor. Given a multiple alignment, the tool displays "sequence trees" computed by a variety of algorithms and links these to "tree of life" displays of taxonomic distribution. Using CDTree, CDD curators identify ancient conserved families based on sequence similarity and the presence of sequences from "sentinel" taxonomic groups. Drawing on links to sequences, literature, domain architecture, and other information, curators author functional annotation for each family. CDTree also supports "updates": it functions as a client that searches the sequence database with family models, and it provides tools to select representative sequences and distribute them appropriately across the family hierarchy. This functionality has been refined through frequent exchanges between the CDTree bioinformatics and curator teams. Used primarily as an interactive tree viewer by outside users, CDTree has been downloaded 30,000 times since its release in October 2006.

Web and other information services based on CDD have also been the subject of continuing work.
In collaboration with the BLAST team, the profile-construction and sequence-profile comparison algorithms used in CDD have been reconciled and merged with the PSI-BLAST libraries, facilitating export of reverse PSI-BLAST search tools and the CDD profile library. Over the past year the CDD library has been downloaded 17,000 times for in-house searching. Protein-CDD links within Entrez have been improved by development of a continuous neighboring system, whereby domain annotations for new sequences are usually made available within minutes. Compact representations of domain annotation, convenient for display in the GenPept reports used by Entrez's protein sequence database, are updated at the same time, as are compact functional site annotations from curated CDD families.

To date the CDD curator team has created roughly 4,200 family alignments, falling into 500 separate superfamilies or hierarchies. In addition to authored functional synopses, this set includes 7,300 functional site annotations, citations to 21,000 PubMed articles, and citations to 4,400 online textbook chapters. It is difficult to summarize this work briefly, other than to comment that the synopses, trees, and annotations associated with family hierarchies take on the character of thorough review articles and are a rich source of information on evolutionary diversification and biological function. One can conclude that the curator team is able to produce detailed functional annotation at a reasonable rate.

The content of NCBI's macromolecular structure database is derived from the Protein Data Bank. Structures are cast into a computer-friendly ASN.1/XML format suitable for molecular graphics visualization and automated comparative analysis. Structure records are interlinked within Entrez to the corresponding sequences, PubMed articles, conserved domain records, and the chemical structures of ligands in PubChem.
Protein structures are automatically neighbored by direct structure-structure comparison, and an on-the-fly structure search service is provided for structural biologists. Entrez's macromolecular structure database currently contains 45,000 structures, divided for neighboring into 203,000 structural domains. Neighboring currently generates 370,000,000 structure-structure alignments and superposition matrices. Detailed structure and structure neighbor summaries are currently accessed about 19,000 times per day by roughly 6,000 users. About a dozen on-the-fly structure searches are performed per day by structural biologists, approximately equal to the number of new structures deposited daily into the Protein Data Bank.

Recent development work has focused primarily on re-engineering for efficiency. Processing of Protein Data Bank files has been made completely automatic. While this results in a small information loss, automation has allowed my colleagues and me to perform updates more frequently and to process the very large volume of "remediated" files the Protein Data Bank has released this year. Structure neighbor calculations have been re-engineered for computation on a "farm" of rack-mounted processors. Though structure-structure comparison is a relatively slow calculation, parallel processing has eliminated the bottleneck in structure neighbor updates for the foreseeable future. Through collaboration with the sequence "data flow" team, "related-structure" links from Entrez protein sequences to the protein structure database are now updated continuously. Revisions of Entrez/Structure, related structures, and structure neighbors have also been deployed.

Colleague Lewis Geer, postdoctoral fellow Ming Xu, and I have further developed the Open Mass Spectrometry Search Algorithm, OMSSA.
The goal is to provide a method for matching spectra to sequence databases that is extensible by virtue of open-source coding and scoring statistics adaptable to new fragmentation regimes. This project has been successful: it has been accepted by the specialist proteomics community, with over 500 downloads of different versions of OMSSA and its result browser. Some independent studies also comment favorably on performance, for example Balgley et al., who say "The open source OMSSA algorithm from NCBI is exceptional in that its performance significantly exceeded that of each of the other algorithms by almost every measure." My colleagues and I have also developed a web service for conducting OMSSA searches against NCBI sequence databases. However, this has received only light use, a couple of searches per day on average, as most proteomics groups prefer to download the software and search locally. Future plans are to continue to support OMSSA as research software.
