Improving Statistical Disclosure Methods for Protecting Confidential VA Data
VA Puget Sound Healthcare System, Seattle, WA
Abstract
DESCRIPTION (provided by applicant): Federal initiatives are promoting greater transparency and broader access to Federal data sources. The dissemination of Veteran data can facilitate advances in research, inform public policy, and further citizens' knowledge, but as the demand for data access increases, there are growing concerns regarding the privacy of Veterans' identities. Ideally, publicly available data should be released in a way that realizes the benefits of sharing data while ensuring the privacy of Veterans' confidential health information. Organizations have turned to de-identification, a technique for removing identifying information, such as personal names and social security numbers, from a data set before release. Organizations typically employ the de-identification policies in the Privacy Rule [2] of the Health Insurance Portability and Accountability Act (HIPAA). Although it is widely believed that de-identified patient-level health data are sufficiently resistant to privacy threats, published studies have documented the relative ease of re-identifying individuals by linking de-identified patient datasets with other available health data and applying widely available computing and statistical capabilities [3,4]. Statisticians have a long history of studying ways to protect the confidentiality of data while providing information to policymakers and the public, but there are limitations and methodological gaps in the existing statistical disclosure literature for estimating re-identification risk for a given dataset. The goal of this proposal is to develop a set of novel statistical approaches for estimating re-identification risk in the context of data-sharing policies associated with the HIPAA Privacy Rule, and to incorporate these results in subsequent recommendations to VA policy makers. To fill the gaps and overcome the limitations, the following four aims are proposed.
Aim 1 develops new and more flexible statistical methods to assess overall global risk (i.e., risk for the entire released data file) and per-record risk (i.e., risk for each individual Veteran in the released data file); Aim 2 assesses the performance of new and existing estimation methods using imitation-data and real-data simulations; Aim 3 applies the newly developed methods to estimate the re-identification risk of a de-identified VA dataset; and Aim 4 develops recommendations for the VA Re-identification Risk subcommittee on elements the VA must consider before releasing data, and actions the VA can take when faced with privacy threats. Aim 1 methods entail the development of new and more flexible parametric, non-parametric, semi-parametric, and Bayesian models to estimate re-identification risk measures. Methods to develop the newly proposed models include the following: improved distributional assumptions for parametric models; development of a Bayesian risk estimator using the likelihood function and posterior distribution; and goodness-of-fit tests to choose the best-fitting model. Aim 2 methods involve assessing the efficiency and robustness of existing and newly developed estimation methods via imitation-data and real-data simulation. Aim 3 methods illustrate the application of existing and newly developed risk methods to a VA dataset. Finally, Aim 4 involves summarizing results from Aims 1-3 and literature reviews of other elements an agency must consider prior to releasing data; this entails working with the VA Re-identification Risk subcommittee to develop recommendations for VA disclosure control.
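To make the distinction between per-record and global risk concrete, the following is a minimal sketch of a simple sample-based baseline, not the models proposed in Aim 1. It assumes that a record's risk can be approximated as one over the number of released records sharing its combination of quasi-identifiers, and that global risk is the average of the per-record values. The field names (age_band, sex, zip3), the function names, and the toy records are hypothetical and chosen only for illustration.

```python
from collections import Counter


def per_record_risk(records, quasi_identifiers):
    """Approximate each record's risk as 1 / (size of its equivalence class).

    An equivalence class is the set of released records that share the same
    values on every quasi-identifier. This is a naive sample-based measure
    (an assumption of this sketch); it ignores the population the sample
    was drawn from.
    """
    keys = [tuple(rec[q] for q in quasi_identifiers) for rec in records]
    class_sizes = Counter(keys)
    return [1.0 / class_sizes[key] for key in keys]


def global_risk(risks):
    """Average per-record risk over the whole released file."""
    return sum(risks) / len(risks)


# Hypothetical toy file with three quasi-identifiers.
records = [
    {"age_band": "60-69", "sex": "M", "zip3": "981"},
    {"age_band": "60-69", "sex": "M", "zip3": "981"},
    {"age_band": "70-79", "sex": "F", "zip3": "982"},
    {"age_band": "60-69", "sex": "M", "zip3": "981"},
]

risks = per_record_risk(records, ["age_band", "sex", "zip3"])
overall = global_risk(risks)
print(risks)
print(overall)
```

In this toy file the third record is unique on its quasi-identifiers, so it receives the maximum per-record risk of 1, while the other three share a class of size three. Model-based estimators of the kind described in the abstract would typically go further by accounting for how common each combination is in the wider population, which this sketch does not attempt.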