Generative AI for synthetic data: A framework to expand health data reach for research and ensure algorithmic fairness

$85,166 | K99 | FY2025 | LM | NIH

Vanderbilt University Medical Center, Nashville TN

Abstract

Dr. Yan received his Ph.D. in computer science with a focus on developing privacy-preserving technologies to defend information systems against data breaches. His growing enthusiasm for maximizing the potential of AI/ML to deepen the understanding of patient data has oriented him toward a career in biomedical informatics. Throughout this transition, Dr. Yan has been dedicated to devising new methodologies to optimize the utility of patient data and encourage its ethical use. Specifically, Dr. Yan has developed and evaluated multiple generative AI algorithms to produce high-fidelity and privacy-respecting synthetic health data for critical downstream tasks. Over the past several years, synthetic electronic health record (EHR) data generation powered by generative AI technologies has gained substantial attention in the health domain due to its ability to protect privacy, promote data sharing, and improve the performance of medical AI by providing larger and more comprehensive datasets. Despite this potential, there are significant gaps between the current state of the technology and its maximal value: methods for evaluating synthetic EHR data are inadequate; its development can neither leverage data sources privately held by multiple institutions nor integrate established medical knowledge, which substantially limits the quality of synthetically generated EHR data; and it is unknown how to use synthetic data to produce fair and reliable medical AI. This proposal aims to innovate computational methods, realized in open-source software, to enable the assessment, development, and utilization of synthetic health data. Aim 1 focuses on the development of a multi-dimensional, customizable evaluation framework that appraises synthetic health data in terms of its utility, privacy, and fairness. This framework will further inform synthetic data creators of the appropriateness of a data generation model for a particular use case.
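To make the idea of a multi-dimensional, customizable evaluation concrete, the following is a minimal sketch and not the framework proposed in Aim 1. It assumes binary-coded records and uses three simple stand-in metrics chosen for illustration only: a per-feature prevalence gap (utility), an exact-match rate against real records (privacy), and a group representation gap (fairness). All function names, metric choices, and toy data are hypothetical.

```python
# Illustrative sketch only: the metrics below are simple stand-ins, not the
# measures defined by the proposal. Lower values are better for all three.

def prevalence_gap(real, synth):
    """Utility proxy: mean absolute difference in per-feature prevalence."""
    n_features = len(real[0])
    total = 0.0
    for j in range(n_features):
        p_real = sum(row[j] for row in real) / len(real)
        p_synth = sum(row[j] for row in synth) / len(synth)
        total += abs(p_real - p_synth)
    return total / n_features

def exact_match_rate(real, synth):
    """Privacy proxy: fraction of synthetic records that copy a real record."""
    real_set = {tuple(row) for row in real}
    return sum(tuple(row) in real_set for row in synth) / len(synth)

def representation_gap(real_groups, synth_groups):
    """Fairness proxy: largest absolute difference in any group's share."""
    groups = set(real_groups) | set(synth_groups)
    return max(
        abs(real_groups.count(g) / len(real_groups)
            - synth_groups.count(g) / len(synth_groups))
        for g in groups
    )

def evaluate(real, synth, real_groups, synth_groups,
             metrics=("utility_gap", "privacy_exact_match", "fairness_gap")):
    """Run only the requested metrics, so a use case can select its dimensions."""
    available = {
        "utility_gap": lambda: prevalence_gap(real, synth),
        "privacy_exact_match": lambda: exact_match_rate(real, synth),
        "fairness_gap": lambda: representation_gap(real_groups, synth_groups),
    }
    return {name: available[name]() for name in metrics}

# Hypothetical toy data: three binary features per record, one group label each.
real = [(1, 0, 1), (0, 1, 0), (1, 1, 0), (0, 0, 1)]
synth = [(1, 0, 1), (1, 1, 1), (0, 1, 0), (1, 0, 0)]
real_groups = ["F", "F", "M", "M"]
synth_groups = ["F", "M", "M", "M"]

report = evaluate(real, synth, real_groups, synth_groups)
print(report)
```

Passing a shorter `metrics` tuple restricts the report to the dimensions that matter for a given use case, which is one simple way such a framework could be made customizable.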
Aim 2 will develop an ML architecture that allows multiple institutions to collaboratively train knowledge-integrated synthetic health data generation models using privately held datasets without data sharing. Aim 3 focuses on the development of a fairness-aware pipeline that 1) utilizes synthetic data to balance data distributions, 2) embeds fairness constraints into the model training process, and 3) is agnostic to the AI/ML models relied upon. With the assistance of a multidisciplinary mentoring team from VUMC, Penn Medicine, and Weill Cornell Medicine, Dr. Yan will leverage EHR data from these institutions, as well as the All of Us and MIMIC EHR data, to develop and evaluate the proposed computational methods. Dr. Yan will expand his expertise through training in fairness design, federated learning algorithms, biomedical informatics, and statistical methods. This training will help Dr. Yan become an independent investigator in biomedical informatics who develops computational methods that maximize the potential of biomedical data in supporting reliable health research and applications that benefit all patients.
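As an illustration of how institutions might train a shared model without exchanging patient records, the sketch below shows federated averaging (FedAvg), a common aggregation rule in federated learning. The abstract does not state which aggregation method Aim 2 will use, so this choice is an assumption, and the function name, site sizes, and parameter values are hypothetical. Each site sends only its locally trained parameters and its record count; raw data never leaves the site.

```python
# Illustrative sketch only: FedAvg is assumed here for demonstration and is
# not confirmed as the method of the proposed architecture.

def federated_average(site_updates):
    """Combine per-site model parameters, weighting each site by its record count.

    site_updates: list of (num_records, parameters) pairs, one per institution,
    where parameters is a list of floats of the same length for every site.
    """
    total_records = sum(n for n, _ in site_updates)
    dim = len(site_updates[0][1])
    return [
        sum(n * params[i] for n, params in site_updates) / total_records
        for i in range(dim)
    ]

# Hypothetical updates from two sites after one round of local training.
site_updates = [
    (100, [0.2, 1.0]),  # smaller site
    (300, [0.6, 2.0]),  # larger site, so its parameters carry more weight
]

global_params = federated_average(site_updates)
print(global_params)
```

In a full training loop, the averaged parameters would be sent back to every site as the starting point for the next round of local training.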
