Machine learning methods in data science for clinical prediction and characterization

$826,301 · ZIA · FY2025 · LM · NIH

National Library of Medicine

Abstract

Clinical risk scores can degrade in new environments when the inputs are collected in a different way, and the problem may be magnified by changes to collection practices over time. To address unknown levels of degradation in machine learning models, which could in principle result in harm, we develop models that are robust to anticipated variations over time, both by accounting for those variations during training and by training suites of models with known dependencies on reliable data elements. In other domains, modeling approaches such as data augmentation and dropout are used to improve model robustness to expected dataset shifts such as substitutions, missingness, and translations. Within this framework, we develop models that are specifically robust to the effects of time and censorship in electronic health record (EHR) data, with applications to modeling inpatient and critical care processes and to pharmacovigilance in national claims data. For longitudinal tasks such as risk forecasting and reasoning about disease progression, EHR data quality can be improved by using both tabular and textual data, with the caveat that the textual data arrives after a delay. Better longitudinal models may be constructed if individual concepts, rather than the full text, are each associated with times. While previous investigators have focused primarily on time in terms of temporal relations, our work temporally aligns concepts from text with tabular data to enrich the data available for forecasting. To achieve this, we have investigated human-in-the-loop annotation and machine learning modeling, applied to corpora of 300,000 de-identified notes and 125,000 case reports. Our work draws upon and advances research in longitudinal visualization, large language modeling, and active learning.
Our studies demonstrate increased precision in identifying concept event times, with high levels of agreement with temporal relations work, and large language model frameworks that are competitive with clinical experts in extracting and timestamping clinical events. The main aim of this line of research is to identify the best ways to create high-quality, large-scale resources that enable improved clinical reasoning and longitudinal research.
