Privacy-protecting distributed analysis of biomedical big data

$475,315 · U01 · FY2017 · EB · NIH

Harvard Pilgrim Health Care, Inc., Boston MA

Abstract

Advances in technology, bioinformatics, and data science have made it possible to analyze large and complex databases to generate evidence that improves public health and accelerates the development of precision medicine. However, the advent of big data has also raised concerns about privacy and confidentiality. This application focuses on data privacy in vertically partitioned data, a data environment in which information about an individual is held in two or more data sources. This type of data structure is common in biomedical research and is expected to grow exponentially as information from the same individual is increasingly collected in multiple sources, such as insurance claims databases, electronic health records, registries, social media, wearables, and mobile devices. Combining multiple databases provides a more complete health profile of the patient and generates more robust evidence. However, concerns about data privacy, confidentiality, and security, together with constraints in governance and institutional agreements, make it highly challenging or sometimes impossible to physically pool different data sources. We propose to develop an open-source, freely available software tool that will employ a cutting-edge method, distributed regression, to analyze vertically partitioned datasets. The method does not require data to be combined physically, yet it produces results statistically equivalent to those that would be obtained if the datasets were linked and pooled centrally at one site. Instead of sharing patient-level information, participating sites will transfer only a non-identifiable information matrix (a design matrix used in fitting statistical models) and other summary-level statistics needed in the statistical modeling process. This approach offers much greater protection for data privacy while still allowing sophisticated statistical analysis.
The software tool will be developed, tested, and fine-tuned using both simulated datasets and real-world data from Optum Labs, which houses one of the largest vertically partitioned datasets in the U.S., with claims and electronic health record data from over 5 million patients. The tool will be made compatible with PopMedNet™, an open-source data-sharing platform currently used by several large national initiatives such as the NIH Health Care Systems Research Collaboratory Distributed Research Network, the PCORI-funded National Patient-Centered Clinical Research Network (PCORnet), and the FDA-funded Sentinel program. The tool is therefore highly scalable and can have an immediate impact on real-world big data analysis. The multidisciplinary study team includes researchers who pioneered some of the distributed regression approaches and experts who have extensive experience in multi-center studies. The distributed regression method has great potential to shift the paradigm of multi-center big biomedical research from transferring potentially identifiable patient-level data to sharing non-identifiable summary-level information. The proposed software tool will be a major step towards real-world application of this state-of-the-art privacy-protecting analytic approach.
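
The abstract does not specify the algorithm, so the following is only a minimal sketch of one known way to fit a regression on vertically partitioned data without pooling covariates. It assumes a ridge-penalized linear model in dual form (a swapped-in technique, not necessarily the project's method), two sites holding different columns for the same patients in the same row order, and an analysis center that holds the outcome. Each site shares only its patient-by-patient matrix of inner products, not its covariates. All names and numbers are hypothetical, and the sketch makes no claim about whether such a matrix is non-identifiable in practice.

```python
# Illustrative sketch only: dual-form ridge regression on vertically
# partitioned data. Names and data are hypothetical, not from the project.

def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x

def local_gram(x):
    """Computed at a site: n-by-n inner products X_k X_k^T."""
    return [[sum(p * q for p, q in zip(r, s)) for s in x] for r in x]

def local_coefficients(x, alpha):
    """Computed at a site: beta_k = X_k^T alpha."""
    cols = range(len(x[0]))
    return [sum(row[j] * a for row, a in zip(x, alpha)) for j in cols]

def distributed_ridge(site_blocks, y, lam):
    """Center sums the sites' Gram matrices, solves for alpha, sends it back."""
    n = len(y)
    k = [[0.0] * n for _ in range(n)]
    for x in site_blocks:
        g = local_gram(x)  # only this matrix leaves the site
        for i in range(n):
            for j in range(n):
                k[i][j] += g[i][j]
    for i in range(n):
        k[i][i] += lam
    alpha = solve(k, y)
    return [local_coefficients(x, alpha) for x in site_blocks]

def pooled_ridge(site_blocks, y, lam):
    """Reference fit on physically pooled columns, for comparison only."""
    x = [sum((blk[i] for blk in site_blocks), []) for i in range(len(y))]
    p = len(x[0])
    xtx = [[sum(r[a] * r[b] for r in x) + (lam if a == b else 0.0)
            for b in range(p)] for a in range(p)]
    xty = [sum(r[a] * yi for r, yi in zip(x, y)) for a in range(p)]
    return solve(xtx, xty)

# Hypothetical data: site A holds an intercept and one covariate,
# site B holds one binary covariate, for the same five patients.
site_a = [[1.0, 0.2], [1.0, 1.5], [1.0, -0.7], [1.0, 0.9], [1.0, 2.1]]
site_b = [[0.0], [1.0], [1.0], [0.0], [1.0]]
y = [1.1, 3.9, 1.8, 2.2, 5.0]
lam = 0.1  # ridge penalty (also applied to the intercept in this sketch)

betas = distributed_ridge([site_a, site_b], y, lam)
distributed = [b for block in betas for b in block]
pooled = pooled_ridge([site_a, site_b], y, lam)
max_gap = max(abs(d - p) for d, p in zip(distributed, pooled))
```

In this sketch, `max_gap` measures how far the distributed coefficients are from the pooled fit. The two agree up to floating-point error because the dual and primal ridge solutions are algebraically identical for a positive penalty.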
