UniProt - Protein sequence and function embeddings for AI/Machine Learning readiness
European Molecular Biology Laboratory, Heidelberg
Abstract
PROJECT SUMMARY: UNIPROT PROTEIN SEQUENCE AND FUNCTION EMBEDDINGS FOR AI/MACHINE LEARNING READINESS - SUPPLEMENT REQUEST 2022

Artificial intelligence and machine learning (AI/ML) have the potential to advance biomedical research. The overall goals of this Supplement application for the UniProt parent grant (U24HG007822) are to (i) support the AI/ML community by providing protein sequence embeddings and use these embeddings for accurate and fast protein clustering in UniRef production, (ii) explore methods of embedding UniProt functional annotation data, and (iii) engage with the AI/ML community to advance the AI/ML readiness of UniProt data. UniProt has been a leader in the provision of protein sequence and annotation data since its inception in 2002. UniProt provides gold-standard training data for hundreds of AI/ML applications in biomedical research.

Protein sequence embeddings show enormous promise for protein clustering and for structural and functional analysis and prediction. By providing UniProt protein sequence embeddings, we will increase the accessibility of sequence embeddings, reduce duplication of effort in the community, and establish a standard that can facilitate evaluation and comparison of models. We will test different embedding methods, focusing on the most widely adopted methods as determined by our recent survey of the user community. We will also investigate using sequence embeddings to speed up sequence clustering in UniRef production. Similarly, many functional annotations in UniProt are amenable to embedding for use in AI/ML models. As a test case, we will explore embedding of Rhea biochemical reaction annotations for enzymes and transporters. In addition to disseminating the embeddings, we will develop methods to visualize them and compare them to existing enzyme classification systems.
Finally, to ensure that our work aligns with community needs, we will organise a workshop to work with the community on their use cases and applications for embeddings. We will invite various stakeholders, including researchers who participated in the embeddings survey, participants in the metal binding site prediction challenge currently underway as part of the parent grant, and NIH representatives. Overall, the work in this proposal will enhance the readiness of UniProt for use in AI/ML and will integrate AI/ML methods into UniProt production.