Regional Oncology Research Center (LLMs for Unstructured Data Extraction)
Johns Hopkins University, Baltimore MD
Abstract
Artificial intelligence (AI) has the potential to revolutionize healthcare by leveraging clinical data to advance research and improve oncology practice. Free-text pathology reports contain crucial information about primary cancer diagnoses and evolving molecular features. Extracting and interpreting this information accurately is essential for determining cancer stage, which plays a decisive role in prognosis and in guiding clinical management. Although natural language processing (NLP) techniques have been applied to extract focused information from pathology reports, adaptable, generalizable, and interpretable strategies for clinical data abstraction are still needed. To address this need, we propose a multidisciplinary approach to develop an integrative clinical information extraction pipeline. This work aims to assess and improve the abstraction of relevant features of pathological diagnosis from pathology reports by leveraging large language models.
Our research design involves several steps. First, we will establish a diverse and equitable cohort of patients from our Cancer Registry and collect their free-text pathology reports, along with structured clinical data obtained from the Johns Hopkins School of Medicine Precision Medicine Analytics Platform (PMAP) Data Commons. Next, we will employ an information extraction platform to identify pathological features from the reports. This platform will use a suite of models provided by Microsoft, including BERT-like models, GPT-3.5, and GPT-4, to identify key cancer attributes. Subsequently, we will evaluate the output of individual models using the CASPER interactive model development framework, refining the results through heuristics and weak supervision. The augmented model output will be presented through a web-based user interface, allowing expert curators to provide further input.
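As a rough illustration of how heuristics and weak supervision can refine a model's output, the sketch below combines a model-extracted label with votes from simple rule-based heuristics. It is not the CASPER framework's actual method, which the abstract does not describe. The heuristic functions, the grade labels (`G1` to `G3`), and the majority-vote rule are all assumptions chosen for illustration.

```python
import re
from collections import Counter

ABSTAIN = None  # a heuristic returns ABSTAIN when it has no opinion


def lf_grade_keyword(report: str):
    """Hypothetical heuristic: read a histologic grade written as 'grade N'."""
    match = re.search(r"\bgrade\s+([1-3])\b", report, flags=re.IGNORECASE)
    return f"G{match.group(1)}" if match else ABSTAIN


def lf_differentiation(report: str):
    """Hypothetical heuristic: map differentiation wording to a grade."""
    text = report.lower()
    if "poorly differentiated" in text:
        return "G3"
    if "moderately differentiated" in text:
        return "G2"
    if "well differentiated" in text:
        return "G1"
    return ABSTAIN


def combine(model_label, report, heuristics):
    """Majority vote over the model label and all non-abstaining heuristics.

    Returns (label, agreement), where agreement is the share of votes that
    match the winning label. The model's vote is listed first, so it wins ties.
    """
    votes = [model_label] if model_label is not ABSTAIN else []
    votes += [v for v in (h(report) for h in heuristics) if v is not ABSTAIN]
    if not votes:
        return ABSTAIN, 0.0
    label, count = Counter(votes).most_common(1)[0]
    return label, count / len(votes)


# Synthetic example text, not a real pathology report.
report = "Invasive ductal carcinoma, poorly differentiated. Nottingham grade 3."
label, agreement = combine("G2", report, [lf_grade_keyword, lf_differentiation])
```

In this example both heuristics disagree with the model's label, so the combined label follows the heuristics. A low agreement score is the kind of signal that could flag a report for expert curation.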
We will then compare the effectiveness of each CASPER-augmented model and its derived pathological features against the established gold-standard annotations from the Cancer Registry. Finally, we will refine the GPT-based language models using the results of the assessment, curation, and comparison process, employing prompt engineering techniques to improve performance and mitigate bias.
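The comparison against gold-standard annotations could, for example, be scored per extracted field with precision, recall, and F1. The following is a minimal sketch under stated assumptions: the record layout (a dictionary keyed by report ID), the field name `stage`, and the exact-match criterion are hypothetical, and the abstract does not specify which metrics the project will use.

```python
def score_field(predicted: dict, gold: dict, field: str) -> dict:
    """Exact-match precision, recall, and F1 for one extracted field.

    A wrong value counts as both a false positive and a false negative;
    a missing prediction counts only as a false negative.
    """
    tp = fp = fn = 0
    for report_id, gold_record in gold.items():
        gold_value = gold_record.get(field)
        pred_value = predicted.get(report_id, {}).get(field)
        if pred_value is not None and pred_value == gold_value:
            tp += 1
        else:
            if pred_value is not None:
                fp += 1
            if gold_value is not None:
                fn += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"precision": precision, "recall": recall, "f1": f1}


# Synthetic records for illustration only.
gold = {"r1": {"stage": "II"}, "r2": {"stage": "III"}, "r3": {"stage": "I"}}
predicted = {"r1": {"stage": "II"}, "r2": {"stage": "II"}, "r3": {}}
scores = score_field(predicted, gold, "stage")
```

Computing the same scores separately for each model, and for each patient subgroup in the cohort, would give one simple way to compare models and to look for performance gaps that may indicate bias.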