Query Log Analysis for Improving User Access to NCBI Web Services

$1,997,904 | ZIA | FY2021 | LM | NIH

National Library of Medicine

Abstract

Over the last decade, the online search for biological information has progressed rapidly and has become an integral part of any scientific discovery process. Today, it is virtually impossible to conduct R&D in biomedicine without relying on the kind of Web resources developed and maintained by the NCBI. Indeed, each day millions of users search for biological information via NCBI's updated online PubMed system. However, finding data relevant to a user's information need is not always easy. Improving our understanding of the growing population of Entrez users, their information needs, and the way in which they meet these needs opens opportunities to improve the information services and information access provided by NCBI. The unfortunate arrival of SARS-CoV-2 and the COVID-19 pandemic has led to unprecedented, focused biomedical research and new opportunities to distribute the information learned. Search is important to ensure that resources and relevant articles are available when needed. As full text becomes available for more and more articles, it becomes an important resource for identifying all relevant articles. We are working on algorithms to balance the retrieval value of the title, the abstract, and each full-text section. A challenge in finding all the articles relevant to a search query is that some articles use different terms, or synonyms, to describe the same concept. Many synonyms are listed in terminology collections. But as researchers make advances in multiple fields, new concepts, new terms, and new synonyms are continually created. We developed a method that uses multiple word-embedding methods and converts their similarity scores to probabilities for reliable comparison. Using this method, we created PubTermVariants2.0: a large, automatically extracted set of synonym pairs that has augmented PubMed searches for a couple of years. However, simple keyword search can never capture all the subtleties that determine which articles are truly relevant to our needs.
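The idea of converting similarity scores to probabilities so that different word-embedding methods can be compared may be illustrated with a minimal sketch. The abstract does not say how the conversion is done; the equal-width binning with Laplace smoothing used here, the function names `fit_calibrator` and `combine`, and the toy labeled pairs are all assumptions made for illustration only.

```python
from bisect import bisect_right


def fit_calibrator(scores, labels, n_bins=5):
    """Return a function mapping a raw similarity score to an empirical
    probability that a term pair is a synonym pair.

    Assumption: a small labeled sample (1 = synonym, 0 = not) is available
    for each embedding method. Scores are grouped into equal-width bins and
    each bin's probability is its smoothed fraction of positive labels.
    """
    lo, hi = min(scores), max(scores)
    width = (hi - lo) / n_bins or 1.0  # guard against identical scores
    edges = [lo + width * i for i in range(1, n_bins)]
    positives = [0] * n_bins
    totals = [0] * n_bins
    for score, label in zip(scores, labels):
        b = bisect_right(edges, score)
        positives[b] += label
        totals[b] += 1
    # Laplace smoothing keeps empty or one-sided bins away from 0 and 1.
    probs = [(p + 1) / (t + 2) for p, t in zip(positives, totals)]

    def calibrate(score):
        return probs[bisect_right(edges, score)]

    return calibrate


def combine(probabilities):
    """Average calibrated probabilities from several embedding methods."""
    return sum(probabilities) / len(probabilities)
```

Once each method has its own calibrator, a score of 0.9 from one method and 85 from another can both be expressed on the same 0 to 1 scale and then averaged with `combine`, which is the kind of comparison raw scores do not allow.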
Sophisticated AI and NLP algorithms can do much better, but typical tools require too much specialized knowledge. LitSuggest is much more convenient. Given a set of interesting articles, it can provide a list of additional articles that might be of interest. With feedback on which proposed articles are genuinely of interest, future lists will match the searcher's needs even more closely. Properly identifying genes and chemicals in a research article is essential for understanding the article and retrieving it when needed. Thanks to an exhaustive annotation effort by PubMed indexers and creative work by our team, we have developed NLM-Gene and NLM-Chem to identify genes and chemicals in the full text of an article, not just in the title and abstract. This is challenging because there are more confounding terms and phrases in the article body. It is important because many relevant genes or chemicals are mentioned only in the body. SARS-CoV-2 has been a shock to the entire world. The medical research community has responded with a flood of studies on COVID-19 covering prevention, diagnosis, treatment, and other areas. LitCovid provides quick, direct access to these articles. The articles are organized by broad area and can be searched for more specific topics. We continued LitCovid development in 2021. We are not the only ones using NLP and AI to investigate the COVID-19 medical literature, and we provided a review of much of this work. We detail work on four core NLP tasks: information retrieval, named entity recognition, literature-based discovery, and question answering. We also describe work that directly addresses aspects of the pandemic through four additional tasks: topic modeling, sentiment and emotion analysis, caseload forecasting, and misinformation detection. To understand this collection better, we applied NER and other NLP tools, including those mentioned above, to provide an overview.
We identified bioentities such as diseases, internal body organs, symptoms, and comorbidities. Their relationship to COVID-19 was determined via co-occurrence. We also automatically clustered articles by topic, then identified emerging topics and tracked their growth. These tools could be used to understand any collection of articles more fully. Finally, we developed COVID-19-CT-CXR, a public database of COVID-19 chest X-ray (CXR) and computed tomography (CT) images automatically extracted from COVID-19-relevant articles in the PubMed Central Open Access (PMC-OA) Subset. We believe that this work is complementary to existing resources and hope that it will contribute to medical image analysis of the COVID-19 pandemic.
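The co-occurrence step described in the abstract can be sketched roughly as follows. This is only an illustration: it assumes that an upstream NER step has already produced a list of entity strings per article and that co-occurrence is counted at the article level, neither of which the abstract specifies. The function name `cooccurrence_with` is hypothetical.

```python
from collections import Counter


def cooccurrence_with(target, documents):
    """Count, for every other entity, the number of documents in which it
    appears together with `target`.

    `documents` is an iterable of entity lists, one list per article
    (assumed to come from a separate NER step). Repeated mentions inside
    one article are counted once.
    """
    counts = Counter()
    for entities in documents:
        unique = set(entities)
        if target in unique:
            counts.update(unique - {target})
    return counts
```

Calling `cooccurrence_with("COVID-19", docs)` on a list of per-article entity lists returns a `Counter`, and `counts.most_common(10)` then gives the entities most often mentioned alongside the target. Raw counts favor entities that are frequent everywhere, so a real analysis would likely normalize them, for example by each entity's overall document frequency.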
