Methods and Software to Enhance Genomic Privacy and Sharing of RNA-Seq Data

$242,576 | U01 | FY2017 | EB | NIH

Yale University, New Haven CT

Investigators

Linked publications & trials

Paper 32693796 Paper 32487214 Paper 31797643 Paper 28381262 Paper 26828419

Abstract

Privacy is receiving much attention with the unprecedented increase in the breadth and depth of biomedical datasets, particularly personal genomics datasets. Most studies on genomic privacy focus on protecting variants in personal genomes. Molecular phenotype datasets, however, can also contain a substantial amount of sensitive information. Although they contain no explicit genotypic information, subtle genotype-phenotype correlations can be used to statistically link phenotype and genotype datasets. We will study methodologies for analyzing sensitive information leakage from phenotype datasets. We will focus on RNA-seq datasets and the associated sources of sensitive information leakage. These leakages are mediated by expression quantitative trait loci (eQTLs). We will approach the privacy analysis under 3 aims. In the first aim, we will propose statistical metrics for quantifying the sensitive information leakage from phenotype datasets. These quantifications can be used to evaluate the risks of privacy breaches. In the second aim, we will focus on the systematic analysis of how linking attacks can be instantiated and analyzed. We will study how linking attacks can be generalized so that privacy researchers can study the associated risks more systematically. We will then evaluate different models of genotype prediction and assess how these can be used in linking attacks. We will focus, specifically, on outlier gene expression levels and evaluate how the outliers can be used for genotype prediction and in linking attacks. In the third aim, we will develop tools that implement the quantification, risk estimation, and risk management methodologies and integrate them into a coherent software suite for comprehensive privacy analysis, which enables protecting RNA-seq datasets at different levels of summarization, e.g., reads and gene and transcript quantifications.
We will aim to increase the number of software tools available for genomic privacy analysis. We will study different algorithmic approaches to tackle the high computational complexity of anonymization techniques in the literature. We will study sources of sensitive information leakage other than gene expression levels, e.g., splicing and non-coding transcription. These sources will be studied in the context of the risk quantification and management strategies presented in the previous aims. Finally, we will use the tools to quantify the sensitive information in publicly available datasets from large sequencing projects, for example ENCODE, 1000 Genomes, TCGA, GEUVADIS, and GTEx.
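
To make the idea of a linking attack concrete, the following is a minimal illustrative sketch, not the project's software or its actual method. It assumes a simple rank-based rule: for each eQTL, an individual whose expression is far from the middle of the cohort is predicted to be homozygous for one allele or the other, and the individual is then linked to the genotype record with the fewest mismatches. All names (`rank_extremity`, `predict_genotypes`, `link`), the threshold, and the toy data are hypothetical.

```python
def rank_extremity(values):
    """Scale each value's rank within the cohort into [-0.5, 0.5].

    Assumes at least two individuals.
    """
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    ext = [0.0] * n
    for rank, i in enumerate(order):
        ext[i] = rank / (n - 1) - 0.5
    return ext


def predict_genotypes(expr, signs, threshold):
    """Predict genotypes from expression extremity (illustrative rule).

    expr[g][i] is the expression of eQTL gene g in individual i.
    signs[g] is +1 if the alternate allele raises expression, -1 if it lowers it.
    Returns pred[i][g] in {0, 2, None}; None means "not extreme enough".
    """
    n_ind = len(expr[0])
    pred = [[None] * len(expr) for _ in range(n_ind)]
    for g, values in enumerate(expr):
        for i, e in enumerate(rank_extremity(values)):
            if abs(e) >= threshold:
                pred[i][g] = 2 if e * signs[g] > 0 else 0
    return pred


def link(pred, panel):
    """Link each expression profile to the panel genotype with fewest mismatches."""
    links = []
    for p in pred:
        def mismatches(j):
            return sum(
                1 for g, v in enumerate(p)
                if v is not None and panel[j][g] != v
            )
        links.append(min(range(len(panel)), key=mismatches))
    return links


# Toy genotype panel (rows: persons A, B, C, D; columns: three eQTL SNPs).
panel = [
    [2, 0, 2],  # A
    [0, 2, 0],  # B
    [2, 2, 0],  # C
    [0, 0, 2],  # D
]
# Toy expression data, with individuals in the order C, A, D, B.
expr = [
    [9.0, 8.5, 1.0, 1.5],  # gene 0, positive eQTL effect
    [2.0, 7.0, 8.0, 1.0],  # gene 1, negative eQTL effect
    [1.0, 9.5, 6.0, 3.0],  # gene 2, positive eQTL effect
]
signs = [+1, -1, +1]

pred = predict_genotypes(expr, signs, threshold=0.1)
links = link(pred, panel)
print(links)  # panel row index chosen for each expression profile
```

On this toy data every expression profile is linked back to its own genotype record, even though the expression table contains no genotypes. With real cohorts one would likely use a stricter threshold so that only outlier expression levels drive the prediction, and the fraction of correctly linked individuals could serve as one rough measure of leakage.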
