UniProt - Text Mining and Large Language Models to Enhance Functional Annotation in UniProt of Protein Variants Associated with Alzheimer's Disease

$324,090 | U24 | FY2024 | HG | NIH

European Molecular Biology Laboratory, Heidelberg

Abstract

Alzheimer's Disease (AD) is a common and devastating neurodegenerative disorder that currently has no effective treatment. Many genetic variants associated with AD have been identified and, in some cases, the molecular and clinical consequences of these variants have been elucidated and reported in the scientific literature. As the premier resource for protein sequence and function information, UniProt is in an ideal position to curate and disseminate this AD variant information to support research on novel therapies. Here, we propose to use a combination of text mining strategies, including our previously developed relation extraction tool eMIND, which captures information about the functional impact of variants in AD, as well as a cutting-edge GPT Large Language Model (LLM), to enable high-throughput curation of variants of high interest to the AD community.

eMIND is a transformer-based text mining system that extracts relations between AD variants and functional impacts from the literature, and we have demonstrated that it performs well on a selected set of AD-related abstracts. In this work, we will develop a pipeline to automatically process PubMed abstracts at full scale using eMIND and output variant-impact relations together with the abstracts and individual sentences from which they were derived. This information will be used in conjunction with the LLM for several purposes.

First, we will use the LLM to create natural language summaries of the impact of AD-related variants based on sentences and abstracts identified by eMIND (Aim 1). By focusing the LLM on text already flagged as relevant by eMIND, we will increase accuracy while taking advantage of the LLM's ability to synthesize information across sentences from multiple papers and produce a highly readable product. All summaries will be manually evaluated by expert curators.

Next, we will develop a set of high-confidence variant-impact relations that can be used in applications, such as knowledge graph representation and learning, that require information in a structured format (Aim 2). We will use the LLM in two ways: (i) to assess the validity of the relations extracted by eMIND and (ii) to extract relations de novo from eMIND-positive sentences and abstracts. The results of both approaches will be manually assessed and compared.

Finally, the most relevant relations and summaries of the functional impact of AD-related variants obtained in Aim 1 and Aim 2 will be made available to users through the UniProt API and website (Aim 3). We will seek feedback from the AD community about the quality and presentation of the information. Taken together, this work will enable scalable curation of AD variant information of high value to the AD community and serve as a model for a fully automated variant annotation workflow that can be generalized to other diseases.
