Regional Oncology Research Center (LLMs for Unstructured Data Extraction)

$300,000 | P30 | FY2023 | CA | NIH

Johns Hopkins University, Baltimore MD

Investigators

Linked publications & trials

Paper 39763905 Paper 39723564 Paper 39700179 Paper 39666157 Paper 39569333 Paper 39472976 Paper 39435380 Paper 39434828 Paper 39433569 Paper 39414312 Paper 39403935 Paper 39395205 Paper 39392349 Paper 39390609 Paper 39363117 Paper 39361354 Paper 39341835 Paper 39320351 Paper 39314987 Paper 39314954 Paper 39298276 Paper 39281885 Paper 39269310 Paper 39240063 Paper 39211816 Paper 39198884 Paper 39198404 Paper 39186002 Paper 39164522 Paper 39156183 Paper 39154228 Paper 39142659 Paper 39113096 Paper 39107309 Paper 39093033 Paper 39092435 Paper 39079500 Paper 39078934 Paper 39076107 Paper 39040171 Paper 39008151 Paper 38959339 Paper 38955178 Paper 38940666 Paper 38935814 Paper 38921935 Paper 38915758 Paper 38894678 Paper 38889303 Paper 38885246 Paper 38865181 Paper 38829053 Paper 38826404 Paper 38798374 Paper 38782065 Paper 38740967 Paper 38739014 Paper 38735585 Paper 38714355 Paper 38701369 Paper 38698674 Paper 38688277 Paper 38659935 Paper 38641986 Paper 38640196 Paper 38619005 Paper 38589764 Paper 38587552 Paper 38584702 Paper 38551513 Paper 38547863 Paper 38538786 Paper 38538744 Paper 38522569 Paper 38478628 Paper 38460130 Paper 38456492 Paper 38450798 Paper 38443917 Paper 38421139 Paper 38382007 Paper 38370699 Paper 38360902 Paper 38355777 Paper 38352348 Paper 38330147 Paper 38309503 Paper 38301482 Paper 38298522 Paper 38272356 Paper 38266106 Paper 38260999 Paper 38250582 Paper 38201308 Paper 38185452 Paper 38180338 Paper 38167882 Paper 38132181 Paper 38112776 Paper 38112617

Abstract

Artificial intelligence (AI) has the potential to revolutionize healthcare by leveraging clinical data to advance research and improve oncology practice. Free-text pathology reports contain crucial information about primary cancer diagnoses and evolving molecular features. Extracting and interpreting this information accurately is essential for determining cancer stage, which plays a decisive role in prognosis and in guiding clinical management. Although natural language processing (NLP) techniques have been applied to extract focused information from pathology reports, there is still a need for adaptable, generalizable, and interpretable strategies to enhance clinical data abstraction.

To address this need, we propose a multidisciplinary approach to develop an integrative clinical information extraction pipeline. This work aims to improve, assess, and enhance the abstraction of relevant features of pathological diagnosis from pathology reports by leveraging large language models.

Our research design involves several steps. First, we will establish a diverse and equitable cohort of patients from our Cancer Registry and collect free-text pathology reports, along with structured clinical data obtained from the Johns Hopkins School of Medicine Precision Medicine Analytics Platform (PMAP) Data Commons. Next, we will employ an information extraction platform to identify pathological features from the reports. This platform will utilize a suite of models, including BERT-like models, GPT-3.5, and GPT-4, provided by Microsoft, specifically designed for identifying key cancer attributes. Subsequently, we will evaluate the output of individual models using the CASPER interactive model development framework, enhancing and refining the results through heuristics and weak supervision. The augmented model output will be presented through a web-based user interface, allowing expert curators to provide further input. We will then compare the effectiveness of each CASPER-augmented model and its derived pathological features against the established gold standard annotations from the Cancer Registry. Finally, we will enhance the GPT-based language models based on the assessment, curation, and comparison process, employing prompt engineering techniques to improve performance and mitigate bias.
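The combine-then-compare steps described in the abstract can be illustrated with a small sketch. This is not the CASPER framework or the project's actual pipeline: majority voting is used here only as a simple stand-in for weak-supervision label aggregation, and the model names, report IDs, and stage labels are all hypothetical toy values invented for illustration.

```python
from collections import Counter

ABSTAIN = None  # a source that extracts nothing for a report abstains


def majority_vote(votes):
    """Return the most common non-abstaining label, or ABSTAIN on a tie or no votes."""
    counts = Counter(v for v in votes if v is not ABSTAIN)
    if not counts:
        return ABSTAIN
    top = counts.most_common(2)
    if len(top) == 2 and top[0][1] == top[1][1]:
        return ABSTAIN  # no clear winner among the sources
    return top[0][0]


def accuracy(predicted, gold):
    """Fraction of gold-annotated reports matched exactly; abstentions count as misses."""
    if not gold:
        raise ValueError("gold annotations are empty")
    hits = sum(1 for rid, label in gold.items() if predicted.get(rid) == label)
    return hits / len(gold)


# Hypothetical toy data: registry gold labels and three model outputs per report.
gold = {"r1": "pT2", "r2": "pT1", "r3": "pT3", "r4": "pT2"}
outputs = {
    "model_a": {"r1": "pT2", "r2": "pT1", "r3": "pT2", "r4": "pT2"},
    "model_b": {"r1": "pT2", "r2": "pT2", "r3": "pT3", "r4": ABSTAIN},
    "model_c": {"r1": "pT1", "r2": "pT1", "r3": "pT3", "r4": "pT2"},
}

# Aggregate the per-report votes, then score every source against the gold labels.
combined = {
    rid: majority_vote([out.get(rid) for out in outputs.values()]) for rid in gold
}

for name, predicted in {**outputs, "combined": combined}.items():
    print(f"{name}: accuracy = {accuracy(predicted, gold):.2f}")
```

Exact-match accuracy is the simplest possible comparison; a real evaluation would more likely report per-attribute precision and recall, and a weak-supervision framework would typically weight sources by estimated reliability rather than count votes equally.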
