Machine Learning and Natural Language Processing for Biomedical Applications

$3,873,366 · ZIA · FY2023 · LM · NIH

National Library Of Medicine

Linked publications & trials

Paper 39169867, Paper 39114977, Paper 39043988, Paper 38630520, Paper 38572754, Paper 38514400, Paper 37994677, Paper 37388909, Paper 37268776, Paper 37254254, Paper 37171899, Paper 37131884, Paper 36882099, Paper 36694118, Paper 36471749, Paper 36350613

Abstract

Over the last decade, online search for biological information has progressed rapidly and has become an integral part of the scientific discovery process. Today, it is virtually impossible to conduct R&D in biomedicine without relying on the kind of Web resources developed and maintained by the NCBI. Indeed, each day millions of users search for biological information via NCBI's updated online PubMed system. However, finding data relevant to a user's information need is not always easy. Improving our understanding of the growing population of Entrez users, their information needs, and the ways in which they meet these needs opens opportunities to improve the information services and information access provided by NCBI. For instance, queries with similar information needs tend to have similar document clicks, especially in biomedical literature search engines, where queries are generally short and the top documents account for most of the total clicks. Motivated by this, we present a novel architecture for biomedical literature search, namely Log-Augmented DEnse Retrieval (LADER), a simple plug-in module that augments a dense retriever with the click logs retrieved from similar training queries. Specifically, LADER uses a dense retriever to find both documents and training queries that are similar to the given query. Then, LADER scores the relevant (clicked) documents of the similar queries, weighted by each query's similarity to the input query. Our results demonstrate that LADER achieves new state-of-the-art (SOTA) performance on TripClick, a recently released benchmark for biomedical literature retrieval. Using advanced machine-learning and NLP techniques, we are able to provide enhanced access to special topics in the biomedical literature. One such example is tracking variant-related information in the relevant genomic literature, a crucial task for genomic research and precision medicine.
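The LADER scoring step described above can be illustrated with a minimal sketch. It is not the authors' implementation: the cosine similarity, the function name `log_augmented_scores`, the top-`k` cutoff, and the interpolation weight `alpha` are all assumptions made for illustration, and the vectors stand in for the output of a real dense retriever.

```python
from collections import defaultdict
from math import sqrt


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def log_augmented_scores(query_vec, doc_vecs, train_queries, k=2, alpha=0.5):
    """Blend dense document scores with click-log scores from similar queries.

    doc_vecs: {doc_id: vector}
    train_queries: list of (vector, [clicked doc_ids]) taken from training logs
    k, alpha: illustrative hyperparameters (assumed, not from the source)
    """
    # Dense retriever: similarity between the input query and every document.
    dense = {d: cosine(query_vec, v) for d, v in doc_vecs.items()}

    # Retrieve the k training queries most similar to the input query.
    similar = sorted(
        ((cosine(query_vec, qv), clicks) for qv, clicks in train_queries),
        key=lambda pair: pair[0],
        reverse=True,
    )[:k]

    # Each clicked document is credited with the similarity of its query.
    log = defaultdict(float)
    for sim, clicks in similar:
        for d in clicks:
            log[d] += sim

    # Interpolate the two score sources (this blending rule is an assumption).
    doc_ids = set(dense) | set(log)
    return {
        d: (1 - alpha) * dense.get(d, 0.0) + alpha * log.get(d, 0.0)
        for d in doc_ids
    }


# Toy data: two documents and two logged training queries.
docs = {"d1": [1.0, 0.0], "d2": [0.0, 1.0]}
logs = [([1.0, 0.0], ["d2"]), ([0.0, 1.0], ["d1"])]
scores = log_augmented_scores([1.0, 0.0], docs, logs, k=1, alpha=0.5)
dense_only = log_augmented_scores([1.0, 0.0], docs, logs, k=1, alpha=0.0)
```

In the toy data, `d2` is dissimilar to the query in embedding space, but it was clicked for a training query identical to the input, so the log-based term raises its score relative to the dense-only ranking.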
We previously developed LitVar, a semantic search system that makes use of advanced text- and data-mining techniques to identify and normalize variant information in full-length articles. In 2022, we launched LitVar 2.0, a significantly improved system that features several major expansions over its predecessor, including: (1) improved variant recognition accuracy; (2) the inclusion of variant information from article supplementary data; (3) more powerful search capabilities; and (4) a redesigned user interface for more convenient results navigation. Another successful example is LitCovid, a literature database of COVID-19-related papers in PubMed that was first launched in 2020. To date, LitCovid has accumulated over 360,000 articles and received millions of accesses since its inception. In 2023, several thousand new articles are added to LitCovid every month. In response to the continuing evolution of the COVID-19 pandemic, significant updates to LitCovid have been made over the last two years. First, we introduced the long COVID collection, consisting of articles on COVID-19 survivors experiencing ongoing multisystemic symptoms, including respiratory issues, cardiovascular disease, cognitive impairment, and profound fatigue. Second, we provided new annotations on the latest COVID-19 strains and vaccines mentioned in the literature. Third, we improved several existing features with more accurate machine learning algorithms for annotating topics and classifying articles relevant to COVID-19. In addition to providing enhanced access to specific literature information as discussed above, directly extracting useful knowledge from the biomedical literature holds potential for accelerating literature-based discovery, automating biological data curation, and many other scientific tasks.
We have therefore focused on recognizing various types of biological entities in free text, such as genes/proteins, diseases/conditions, and drugs/chemicals, as well as their relationships. Manually labeling training data for building biomedical named entity recognition (BioNER) algorithms is costly, due to the significant domain expertise required for accurate annotation. As a result, current BioNER approaches are prone to overfitting and suffer from limited generalizability. In response, we proposed a novel all-in-one (AIO) scheme that uses external data from existing annotated resources to enhance the accuracy and stability of BioNER models. Specifically, we introduced AIONER, a general-purpose BioNER tool based on cutting-edge deep learning and our AIO scheme. We evaluated AIONER on 14 BioNER benchmark tasks and showed that AIONER is effective, robust, and compares favorably to other state-of-the-art approaches such as multi-task learning. We further demonstrated the practical utility of AIONER in three independent tasks to recognize entity types not previously seen in training data, as well as the advantages of AIONER over existing methods for processing biomedical text at a large scale (e.g., the entire PubMed literature). Since late 2022, ChatGPT, a general-purpose chatbot developed by OpenAI, has been widely reported to have the potential to revolutionize how people interact with information online. Like other large language models (LLMs), ChatGPT has been trained on a large text corpus to predict probable words from the surrounding context. ChatGPT, however, has received substantial popular attention for generating human-like conversational responses, and new developments are occurring rapidly. Recent work has discussed applications of ChatGPT for medical education and clinical decision support.
However, health care professionals should be aware of the drawbacks and limitations, as well as the potential capabilities, of using ChatGPT and similar LLMs to interact with medical knowledge. In a recent perspective, we envision that a "retrieve, summarize, and verify" paradigm could greatly benefit biomedical information seeking. By combining LLMs with search engines, this approach leverages the impressive capability of LLMs to generate high-level summaries while minimizing the risk of directly using false or fabricated information. Augmenting LLMs with domain-specific tools such as database utilities is another way to facilitate easier and more precise access to specialized knowledge. To this end, we developed GeneGPT, a novel method for teaching LLMs to use the Web APIs of the National Center for Biotechnology Information (NCBI) for answering genomics questions. Specifically, we prompt Codex to solve the GeneTuring tests with NCBI Web APIs by in-context learning and an augmented decoding algorithm that can detect and execute API calls. Experimental results show that GeneGPT achieves state-of-the-art performance on eight tasks in the GeneTuring benchmark with an average score of 0.83, substantially surpassing retrieval-augmented LLMs such as the new Bing (0.44), biomedical LLMs such as BioMedLM (0.08) and BioGPT (0.04), as well as GPT-3 (0.16) and ChatGPT (0.12).
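The idea of a decoding loop that detects and executes API calls can be sketched as follows. This is only an illustration under stated assumptions, not GeneGPT's actual algorithm: the `[url]->` call syntax, the function `augmented_decode`, and the `max_calls` limit are hypothetical, and the language model and the Web API are replaced by local stubs so the example runs without network access.

```python
import re

# Hypothetical call syntax: the model writes "[<url>]->" and waits for a result.
CALL_PATTERN = re.compile(r"\[(https?://[^\]\s]+)\]->$")


def augmented_decode(generate, fetch, prompt, max_calls=5):
    """Alternate between text generation and API execution.

    generate(text) -> str: continues `text`, stopping right after "->" or
        when the answer is complete (stand-in for an LLM)
    fetch(url) -> str: performs the call and returns the raw response
        (stand-in for an HTTP request to a Web API)
    """
    text = prompt
    for _ in range(max_calls):
        text += generate(text)
        match = CALL_PATTERN.search(text)
        if match is None:
            return text  # no pending call: the model has finished
        # Execute the detected call and splice the result into the context.
        text += "[" + fetch(match.group(1)) + "]\n"
    return text


# Stubs for demonstration only; the URL and the response are placeholders.
calls_made = []


def fake_generate(text):
    if "->[" not in text:
        return "[https://example.org/lookup?term=demo]->"
    return "Answer: done"


def fake_fetch(url):
    calls_made.append(url)
    return "canned-response"


result = augmented_decode(fake_generate, fake_fetch, "Q: demo\n")
```

The design choice worth noting is that the API result is appended to the running context, so the next generation step can condition on it; `max_calls` simply bounds the loop in case the generator keeps requesting calls.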
