New methods for quantitative modeling of protein-DNA interactions

$349,990 | R01 | FY2017 | GM | NIH

Duke University, Durham NC

Investigators

Linked publications & trials

Paper 33975875 Paper 33087930 Paper 31170499 Paper 31114870 Paper 29730780 Paper 29605182 Paper 28691125 Paper 28211871 Paper 27087856

Abstract

Accurate predictions of transcription factor (TF)-DNA interactions across the human genome are critical for deciphering transcriptional regulatory networks in healthy and diseased cells, as well as for understanding the phenotypic effects of polymorphisms in non-coding genomic regions. However, the most widely used model of TF-DNA binding affinity, the position weight matrix (PWM), is known to provide only an approximation of the true sequence specificity of TFs, because it assumes independence among the base pairs in TF binding sites. More complex binding models have been proposed, but their improvement over PWMs was marginal, either because of limitations of the training data (i.e., strong biases, noise, artifacts, or confounding factors) or because the models were not flexible enough to capture complex dependencies in TF binding sites. As a result, current DNA binding models have a limited ability to predict the effects of non-coding genetic variation on TF binding, and they cannot be used to resolve functional differences between closely related TFs with similar DNA binding domains but distinct regulatory roles in the cell. The objective of this application is to overcome these limitations by generating high-quality data that will be used to train flexible statistical models to generate TF-DNA binding affinity predictions with accuracies similar to those of experimental in vitro assays. The central hypothesis, based on preliminary results and previous work, is that both better affinity data and better statistical models are needed in order to predict TF-DNA interactions in human cells with significantly higher accuracy than current models. High-quality binding affinity data for 40 human TFs will be generated in Aim 1 using a unique combination of in vitro assays carefully designed to minimize bias and noise, thus making the data ideal for training complex models.
Novel TF-DNA binding models will be developed in Aim 2 using state-of-the-art statistical methods: support vector regression, nonparametric Bayes modeling, and conditional tensor factorization. The models will be tested experimentally in vitro, and by leveraging in vivo data from the ENCODE project. In Aim 3, the new binding models will be used in two applications: 1) to predict the quantitative effects of non-coding single nucleotide polymorphisms on TF binding affinities and TF binding levels, and 2) to predict differential in vivo DNA binding of closely related TFs with similar DNA binding domains but distinct regulatory functions in the cell. Such applications are not possible using current models. Overall, we anticipate that the binding affinity models developed in this project will allow for much more accurate predictions of regulatory TF-DNA interactions than is possible using current models, which is significant because it will lead to a better understanding of gene regulatory programs and their misregulation during disease, including understanding the cascade of events that link genetic variation to human disease.
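
To make the PWM independence assumption mentioned in the abstract concrete, the following is a minimal Python sketch of standard log-odds PWM scoring. It is not the project's method. The count matrix, the uniform background frequency, the pseudocount, and the function names (`build_pwm`, `score`, `snp_effect`) are all hypothetical and chosen only for illustration.

```python
import math

# Hypothetical 4-position count matrix (illustrative values, not project data).
COUNTS = {
    "A": [8, 1, 0, 9],
    "C": [1, 7, 0, 0],
    "G": [0, 1, 10, 0],
    "T": [1, 1, 0, 1],
}
BACKGROUND = 0.25   # assumed uniform background base frequency
PSEUDOCOUNT = 1.0   # assumed pseudocount to avoid log(0)


def build_pwm(counts, pseudocount=PSEUDOCOUNT, background=BACKGROUND):
    """Convert per-position base counts into log2-odds weights."""
    length = len(next(iter(counts.values())))
    pwm = []
    for i in range(length):
        total = sum(counts[b][i] for b in "ACGT") + 4 * pseudocount
        column = {
            b: math.log2(((counts[b][i] + pseudocount) / total) / background)
            for b in "ACGT"
        }
        pwm.append(column)
    return pwm


def score(pwm, site):
    """Score a site as a sum of per-position weights.

    The sum is where the independence assumption lives: each base
    contributes the same amount regardless of its neighbors.
    """
    if len(site) != len(pwm):
        raise ValueError("site length must match PWM length")
    return sum(column[base] for column, base in zip(pwm, site))


def snp_effect(pwm, site, position, alt):
    """Predicted score change when the base at `position` becomes `alt`."""
    variant = site[:position] + alt + site[position + 1:]
    return score(pwm, variant) - score(pwm, site)


pwm = build_pwm(COUNTS)
print(score(pwm, "ACGA"))
print(snp_effect(pwm, "ACGA", 1, "T"))
```

Under this additive model, the predicted effect of a single-nucleotide change depends only on the changed position, never on the flanking bases. The abstract identifies this as the limitation that more flexible models are meant to overcome.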
