Enhancing Synthetic Data Techniques for Practical Applications

$400,000 | FY2022 | SBE | NSF

Duke University, Durham NC

Investigators

Abstract

This research project will advance statistical and computational methods for releasing high-quality synthetic data as public use files. In the face of high and expanding risks of unintended and/or illegal disclosures, many data stewards are considering synthetic public use files. These comprise simulated records, with values generated from statistical models estimated with the confidential data. This can reduce disclosure risks, since it can be difficult to re-identify individuals and their sensitive attributes when the released values are simulated. Despite growing interest in synthetic data solutions for data dissemination, there are significant gaps in the theory and methods of synthetic data that complicate and hinder practical implementations. This project will address three critical yet unresolved topics in synthetic data, namely (1) assessing data subjects' disclosure risks, (2) facilitating data analysts' evaluation of their synthetic data inferences, and (3) generating synthetic datasets in surveys with complex designs. The results of this research will offer federal agencies, survey organizations, research centers, and other data producers the means to create safer and more analytically useful synthetic data products. In turn, this will help data stewards better meet the challenges of public use data dissemination. The project will train Ph.D. and undergraduate students to become researchers in data privacy protection methods, thereby contributing to the pipeline of experts in data privacy and in statistics and data science more broadly. The project will also develop and disseminate software code that implements the various approaches.

This research project will address three main questions. First, the project will develop computational techniques for estimating Bayesian posterior probabilities of disclosures; that is, probabilities that sensitive values can be learned from the synthetic data releases. These techniques will facilitate disclosure risk assessment on datasets with many observations and many variables, thereby allowing agencies to replace current ad hoc assessments with principled and quantifiable measures of disclosure risk. Second, the project will develop novel verification measures that data stewards can use to provide feedback to secondary data analysts on the quality of their particular inferences without leaking too much information about the confidential data. The new measures will enable formally private verification of common survey-weighted estimation tasks. Third, the project will develop new synthesis and inferential methods and recommend best practices for incorporating complex survey designs in synthetic data. These new methods will adapt Bayesian bootstraps, multiple imputation, and multilevel regression and poststratification for data synthesis, while also enabling the use of popular techniques from machine learning for generating synthetic data in complex samples.

This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.
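As a rough illustration of the general idea described in the abstract (simulated records whose values are generated from a statistical model estimated with the confidential data), the following minimal Python sketch synthesizes a single categorical variable. It is not the project's method: the Dirichlet-multinomial model, the function names `draw_dirichlet` and `synthesize_categorical`, the prior value, and the toy labor-force variable are all assumptions made for this example.

```python
import random
from collections import Counter


def draw_dirichlet(alphas, rng):
    """Draw one probability vector from Dirichlet(alphas) via normalized gamma draws."""
    gammas = [rng.gammavariate(a, 1.0) for a in alphas]
    total = sum(gammas)
    return [g / total for g in gammas]


def synthesize_categorical(confidential, n_synthetic, m=3, prior=1.0, seed=0):
    """Return m synthetic datasets for one categorical variable.

    Hypothetical helper: assumes a multinomial model with a symmetric
    Dirichlet(prior) prior on the category probabilities.
    """
    rng = random.Random(seed)
    categories = sorted(set(confidential))
    counts = Counter(confidential)
    # Posterior for the category probabilities is Dirichlet(prior + counts).
    posterior_alphas = [prior + counts[c] for c in categories]
    datasets = []
    for _ in range(m):
        # Draw parameters from the posterior, then simulate new records from them.
        probs = draw_dirichlet(posterior_alphas, rng)
        datasets.append(rng.choices(categories, weights=probs, k=n_synthetic))
    return datasets


# Toy "confidential" data (made up for illustration only).
confidential = ["employed"] * 60 + ["unemployed"] * 10 + ["not in labor force"] * 30
synthetic_sets = synthesize_categorical(confidential, n_synthetic=100, m=3)
```

Each synthetic dataset here contains only simulated values, and drawing fresh parameters for each of the `m` datasets lets the variation between datasets reflect uncertainty about the model. A realistic synthesizer would model many variables jointly and would need to account for the survey design, which this sketch ignores.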
