Leveraging Unlabeled and Pseudo Data for Clinical Information Extraction

$414,798R15FY2019LMNIH

George Mason University, Fairfax VA

Investigators

Linked publications & trials

Paper 39225779 Paper 37550244 Paper 36754129 Paper 36050787 Paper 35953524 Paper 34586386 Paper 34333635

Abstract

Project Summary/Abstract Electronic Health Records (EHRs) contain significant information that can benefit many downstream uses. However, most of this information is in unstructured narrative form and is inaccessible to computerized methods that rely on structured representations for exploring, retrieving, and presenting the information. Natural language processing (NLP) and information extraction (IE) open this trove of information to studies that would otherwise be without. Over the past decades, many IE systems have been developed. These systems have typically focused on one task at a time. In addition, most have studied only specific types of records, e.g., discharge summaries, and addressed their task on data from a single institution. Performances achieved by the state-of-the-art IE systems developed under these conditions ranged from 44% F-measure to 99% F-measure. This observed variation can be attributed to the nature of the tasks: some target entities like dates tend to be better represented in the data and also more rigidly stick to known patterns of expression as opposed to reasons for medication administration which are relatively sparse in the data and can show wider linguistic diversity. However, this may not be the only reason: the data used can also explain the performance variation. Narratives of EHRs vary in their style, format, and content going from one department to another, from one hospital to another. Even the same record type in two different hospitals can be very different in narrative style and pose different challenges for IE. Understanding IE performance therefore requires studies of multiple tasks on multiple record types that come from multiple institutions. One major bottleneck for evaluation of IE systems on such a large scale is annotation. The same bottleneck also limits system development. This proposal aims to address this bottleneck for both evaluation and development. It first generates a multi-institution corpus consisting of multiple record types from five institutions. It studies four different IE tasks that broadly represent IE in clinical records and can inform the field of IE as a whole: de-identification, clinical concept extraction, medication extraction, and adverse drug event extraction. Within the context of these IE tasks, the proposal then puts forward methods that learn from unlabeled or pseudo data that can help alleviate reliance on annotated data for development. It evaluates these methods both for performance and generalizability on multiple types of records from multiple institutions. As a result of these activities, this proposal generates de-identified data, annotations, methods, software, and machine learning models which it then makes available to the research community.

View original record on NIH RePORTER →