Query Log Analysis for Improving User Access to NCBI Web Services

$1,791,061 | ZIA | FY2022 | LM | NIH

National Library Of Medicine

Investigators

Linked publications & trials

Paper 35961013 Paper 35623021 Paper 35536809

Abstract

Over the last decade, the online search for biological information has progressed rapidly and has become an integral part of any scientific discovery process. Today, it is virtually impossible to conduct R&D in biomedicine without relying on the kind of Web resources developed and maintained by the NCBI. Indeed, each day millions of users search for biological information via NCBI's updated online PubMed system. However, finding data relevant to a user's information need is not always easy. Improving our understanding of the growing population of Entrez users, their information needs, and the way in which they meet these needs opens opportunities to improve the information services and information access provided by NCBI.
The unfortunate arrival of SARS-CoV-2 and the COVID-19 pandemic has led to unprecedented focused biomedical research and new opportunities to distribute the information learned. The rapid growth of biomedical literature poses a significant challenge for curation and interpretation, and this has become more evident during the COVID-19 pandemic. LitCovid, a literature database of COVID-19 related papers in PubMed, has accumulated over 180,000 articles with millions of accesses. Approximately 10,000 new articles are added to LitCovid every month. A main curation task in LitCovid is topic annotation, in which an article is assigned up to eight topics, e.g., Treatment and Diagnosis. The annotated topics have been widely used both in LitCovid (e.g., accounting for 18% of total uses) and in downstream studies such as network generation. However, topic annotation has been a primary curation bottleneck due to the nature of the task and the rapid literature growth. In response, we developed LITMC-BERT, a transformer-based multi-label classification method for biomedical literature. It uses a shared transformer backbone for all the labels while also capturing label-specific features and the correlations between label pairs.
We compare LITMC-BERT with three baseline models on two datasets. Its micro-F1 and instance-based F1 are 5% and 4% higher than the current best results, respectively, and it requires only 18% of the inference time of the Binary BERT baseline. In addition, we organized the BioCreative LitCovid track to call for a community effort to tackle automated topic annotation for COVID-19 literature. The BioCreative LitCovid dataset, consisting of over 30,000 articles with manually reviewed topics, was created for training and testing. It is one of the largest multi-label classification datasets in biomedical scientific literature. Nineteen teams worldwide participated and made 80 submissions in total. Most teams used hybrid systems based on transformers. The highest performing submissions achieved 0.8875, 0.9181, and 0.9394 for macro F1-score, micro F1-score, and instance-based F1-score, respectively. Notably, these scores are substantially higher (e.g., 12% higher for macro F1-score) than the corresponding scores of the state-of-the-art multi-label classification method.
In 2022, we also benchmarked five deep learning (DL) models, Convolutional Neural Network, BioSentVec, BioBERT, BlueBERT, and ClinicalBERT, for the task of semantic textual similarity (STS). We evaluated a random forest model as an additional baseline. For each model, we repeated the experiment 10 times, using the official training and testing sets. We reported the 95% CI of the Wilcoxon rank-sum test on the average Pearson correlation (the official evaluation metric) and running time. We further evaluated Spearman correlation, R, and mean squared error as additional measures. Using only the official training set, all models obtained highly effective results. BioSentVec and BioBERT achieved the highest average Pearson correlations (0.8497 and 0.8481, respectively).
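For readers unfamiliar with the macro, micro, and instance-based F1-scores reported for the LitCovid topic annotation task, the following is a minimal sketch of how the three variants differ. It uses the standard textbook definitions and is not the official BioCreative evaluation script; the function name `multilabel_f1` and the toy documents are assumptions made for illustration only.

```python
from typing import Dict, List, Set


def f1(tp: int, fp: int, fn: int) -> float:
    """F1 from raw counts; defined as 0.0 when there is nothing to score."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def multilabel_f1(gold: List[Set[str]], pred: List[Set[str]]) -> Dict[str, float]:
    """Macro, micro, and instance-based F1 for multi-label predictions.

    gold[i] and pred[i] are the sets of topics for document i.
    """
    labels = sorted(set().union(*gold, *pred))
    # Per-label counts: [true positives, false positives, false negatives].
    counts = {label: [0, 0, 0] for label in labels}
    for g, p in zip(gold, pred):
        for label in p & g:
            counts[label][0] += 1
        for label in p - g:
            counts[label][1] += 1
        for label in g - p:
            counts[label][2] += 1

    # Macro: average the per-label F1 scores (every label weighs the same).
    macro = sum(f1(*c) for c in counts.values()) / len(labels) if labels else 0.0
    # Micro: pool the counts over all labels, then compute a single F1.
    micro = f1(
        sum(c[0] for c in counts.values()),
        sum(c[1] for c in counts.values()),
        sum(c[2] for c in counts.values()),
    )
    # Instance-based: compute F1 per document, then average over documents.
    per_doc = [f1(len(g & p), len(p - g), len(g - p)) for g, p in zip(gold, pred)]
    instance = sum(per_doc) / len(per_doc) if per_doc else 0.0
    return {"macro": macro, "micro": micro, "instance": instance}


# Toy example (hypothetical documents, not LitCovid data).
gold = [{"Treatment", "Diagnosis"}, {"Prevention"}, {"Treatment"}]
pred = [{"Treatment"}, {"Prevention"}, {"Treatment", "Diagnosis"}]
scores = multilabel_f1(gold, pred)
print(scores)
```

In this toy case the rarely predicted label (Diagnosis) is always wrong, which pulls the macro score down more than the micro score; this is the usual reason macro F1 is the lowest of the three numbers on imbalanced topic sets.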
Finally, we made use of simple natural language processing programs and robust statistical tests in a collaborative project with researchers at NHGRI studying the evolving use of ancestry, ethnicity, and race in genetics research. Our computational method allowed them to analyze tens of thousands of pages easily and find associations between words. Similarly, in another collaboration with NCI researchers, we applied our machine learning research to classify literature and extract data at the intersection of three fields: liver cancer, health disparities, and epidemiology.
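As an illustration of what a word-association analysis of this kind might look like, the sketch below counts how often two words co-occur across documents and applies a one-sided Fisher's exact test. The abstract does not name the statistical tests that were actually used, so Fisher's exact test is an assumption swapped in here for illustration; the function names and the toy documents are likewise hypothetical.

```python
from math import comb
from typing import List, Set, Tuple


def cooccurrence_table(docs: List[Set[str]], w1: str, w2: str) -> Tuple[int, int, int, int]:
    """2x2 document counts: (both words, only w1, only w2, neither)."""
    both = sum(1 for d in docs if w1 in d and w2 in d)
    only_w1 = sum(1 for d in docs if w1 in d and w2 not in d)
    only_w2 = sum(1 for d in docs if w1 not in d and w2 in d)
    neither = len(docs) - both - only_w1 - only_w2
    return both, only_w1, only_w2, neither


def fisher_one_sided(a: int, b: int, c: int, d: int) -> float:
    """P(co-occurrence count >= a) under the hypergeometric null of independence."""
    n = a + b + c + d
    row1 = a + b  # documents containing w1
    col1 = a + c  # documents containing w2
    upper = min(row1, col1)
    favourable = sum(
        comb(row1, k) * comb(n - row1, col1 - k) for k in range(a, upper + 1)
    )
    return favourable / comb(n, col1)


# Toy corpus: each document is reduced to its set of lowercase tokens.
docs = [
    {"ancestry", "european"},
    {"ancestry", "european"},
    {"ancestry", "european"},
    {"race"},
    {"ethnicity"},
    {"race", "ethnicity"},
]
table = cooccurrence_table(docs, "ancestry", "european")
p_value = fisher_one_sided(*table)
print(table, p_value)
```

Exact integer arithmetic via `math.comb` keeps the sketch dependency-free; for tens of thousands of pages a library implementation would likely be a more practical choice.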

View original record on NIH RePORTER →