UKRI/BBSRC-NSF/BIO: Unifying Pfam protein sequence and ECOD structural classifications with structure models

$1,146,310 | FY2022 | BIO | NSF

University Of Texas Southwestern Medical Center, Dallas TX

Abstract

Proteins are complex organic molecules essential for all functions of life. The study of a protein begins with learning about its evolutionary relatives. Proteins tend to retain their function as they evolve. Therefore, a more closely studied relative can inform researchers about the properties of a less understood relative, suggesting hypotheses to be tested. A protein’s relatives are cataloged in public, freely available classification databases. Although amino-acid sequences are known for most proteins, spatial structures have been experimentally determined for only a small fraction. Protein classification can be based on either protein sequences (including those whose 3D structure is not yet known) or protein spatial structures (augmented with sequences). Structure-based classifications are more accurate but include fewer proteins, necessarily missing those whose structures are unknown. Recently developed structure prediction methods, which can produce accurate 3D models for any protein sequence, bridge these sequence-only and structure/sequence classifications. Now is the time to bring the two classification types together through a synergistic collaboration. The teams in the United States (ECOD: Evolutionary Classification of protein Domains database, mostly structure-based) and the United Kingdom (Pfam: Protein families database, mostly sequence-based) will work together to make their two databases consistent with each other and more accurate for the benefit of scientists and the broader community. The results of these protein classifications are readily incorporated into many other resources, such as Wikipedia pages, and are thus widely used in science and education.

The ECOD and Pfam teams will collaboratively classify more than 1 million protein structure models generated by AlphaFold and RoseTTAFold into protein families of close evolutionary relatives. Existing families will be expanded with additional proteins. New families will be defined by sequence profile similarity aided by structure analysis in a manner consistent with the Pfam classification standards. The project requires upgrading the software infrastructure to process millions of models and to synchronize the two classifications in terms of domain identifiers and family names. Synchronization of the ECOD and Pfam classifications will be achieved in four ways: 1) by defining new families in Pfam for domains currently present only in ECOD; 2) by rectifying the ECOD classification such that all domains are classified into a Pfam family (defining new ones where necessary); 3) by splitting Pfam domain families containing multiple ECOD domains into multiple families containing single domains; and 4) by making consistent collaborative decisions about domain boundaries and family classification using proteins with 3D models in both databases. These consistent domain definitions and classifications will facilitate functional inference and the detection of evolutionary insights by the scientific community and the public at large. Lastly, the internet architecture will be upgraded to serve these domain data to the broader scientific community through web portals. The results of this project can be found incorporated into both ECOD <http://prodata.swmed.edu/ecod> and Pfam <http://pfam.xfam.org>. This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.
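As a purely illustrative aside, the first and third synchronization steps named in the abstract (finding domains present only in ECOD, and finding Pfam families that span several ECOD domains) can be pictured as an interval comparison. The sketch below is not the project's software: the record layout, the identifiers, the coordinates, and the `min_overlap` threshold of 30 residues are all assumptions invented for this example.

```python
# Hypothetical toy data; every identifier and coordinate is invented.
# Each entry: (name, protein, start, end), residue ranges inclusive.
ecod_domains = [
    ("ecod_dom_1", "protA", 1, 120),
    ("ecod_dom_2", "protA", 121, 260),
    ("ecod_dom_3", "protB", 10, 150),
]
pfam_regions = [
    ("pfam_fam_X", "protA", 5, 250),
]


def overlap(a_start, a_end, b_start, b_end):
    """Number of residues shared by two inclusive ranges (0 if disjoint)."""
    return max(0, min(a_end, b_end) - max(a_start, b_start) + 1)


def compare(ecod, pfam, min_overlap=30):
    """Return (ECOD domains with no Pfam region, Pfam families spanning >1 ECOD domain).

    min_overlap is an assumed cutoff chosen only for this illustration.
    """
    covered = set()
    multi_domain = {}
    for fam, fam_prot, fam_start, fam_end in pfam:
        # ECOD domains on the same protein that overlap this Pfam region
        hits = [
            name
            for name, dom_prot, dom_start, dom_end in ecod
            if dom_prot == fam_prot
            and overlap(fam_start, fam_end, dom_start, dom_end) >= min_overlap
        ]
        covered.update(hits)
        if len(hits) > 1:
            multi_domain[fam] = hits  # candidate for splitting (step 3)
    # Domains seen only in ECOD: candidates for new Pfam families (step 1)
    ecod_only = [name for name, *_ in ecod if name not in covered]
    return ecod_only, multi_domain


ecod_only, multi_domain = compare(ecod_domains, pfam_regions)
print("ECOD-only domains:", ecod_only)
print("Pfam families spanning several ECOD domains:", multi_domain)
```

In this toy input, the single Pfam region covers both ECOD domains on the first protein, so it is flagged as a splitting candidate, while the domain on the second protein has no Pfam counterpart. The curated decisions about boundaries and family membership described in the abstract involve expert judgment that a comparison like this does not capture.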
