Semi-Automating Data Extraction for Systematic Reviews
Northeastern University, Boston MA
Investigators
Linked publications & trials
Abstract
? DESCRIPTION (provided by applicant): Evidence-based medicine (EBM) looks to inform patient care with the totality of available relevant evidence. Systematic reviews are the cornerstone of EBM and are critical to modern healthcare, informing everything from national health policy to bedside decision-making. But conducting systematic reviews is extremely laborious (and hence expensive): producing a single review requires thousands of person-hours. Moreover, the exponential expansion of the biomedical literature base has imposed an unprecedented burden on reviewers, thus multiplying costs. Researchers can no longer keep up with the primary literature, and this hinders the practice of evidence-based care. The long term aim of this work is to develop computational tools and methods that optimize the practice of EBM. The proposed work thus builds upon our previous successful efforts developing computational approaches that reduce the workload in EBM. More speci?cally, we aim to develop tools that semi-automate the laborious task of data extraction - identifying and extracting the information of interest (e.g., trial sample size, interventions and outcomes) from the free-texts of biomedical articles - via novel machine learning methods. Semi-automating this task will drastically reduce reviewer workload, thus enabling the practice of EBM in an age of information overload. Previous efforts to automate data extraction from articles describing clinical trials have shown promise, but lack the accuracy and scope necessary for real-world use. These approaches have been impeded by the absence of a large corpus of annotated clinical trials, and by the dif?culty of constructing models to automatically extract all of the variables necessary for synthesis. We describe methodological innovations to overcome these hurdles. First, to train our machine learning models we propose leveraging large existing databases that contain structured information about clinical trials, in lieu of the usual approach of collecting expensive manual annotations. Practically, this means we will be able to exploit a very large `pseudo-annotated' dataset that is an order of magnitude bigger than what has been used in previous efforts, thus substantially improving model performance. Our extensive preliminary work demonstrates the promise and feasibility of this approach. Second, we propose novel machine learning models appropriate for the tasks of article categorization and data extraction for EBM. These models will speci?cally be designed to perform extraction of multiple, correlated data elements of interest while simultaneously classifying articles into clinically salient categories useful for EBM. We will rigorously evaluate the developed methods to assess their practical utility, speci?cally y comparing automated extraction accuracy to that achieved by trained systematic reviewers. And to make these methods useful to end-users (systematic reviewers), we will develop and evaluate open-source software and tools, including a web-based extraction tool that integrates our machine learning models to automatically extract information from uploaded articles (PDFs). We will conduct a user study to evaluate the utility and usability of this tool in practice.
View original record on NIH RePORTER →