Bioinformatics
National Institute of Environmental Health Sciences
Abstract
My main focus has been on methods for identifying transcription factor binding sites in DNA sequences. For a transcription factor with a set of experimentally verified binding motif sequences, a position weight matrix (PWM) may be constructed by aligning the known sequences and calculating, at each position, the proportion of sequences in which each base (A, C, G, or T) occurs. Once a PWM is constructed, it can be used to scan sequences for putative binding sites: a sliding window as wide as the PWM moves along the sequence, and each segment in the window is scored by how well it matches the PWM. A site is declared when the score passes a predefined cutoff. While this approach has provided useful hits to experimental investigators, one practical problem is that the false positive rate is often high, because short motifs are easily found by chance in long sequences. The commonly used PWMs assume that the positions within a motif are mutually independent, i.e., that a motif sequence follows a product of multinomial distributions. Thus, the observed frequencies of A, C, G, and T in each column are the maximum likelihood (ML) estimates of the multinomial parameters for that column, regardless of the contents of nearby columns. Furthermore, the number of known instances of a transcription factor binding site in public databases such as TRANSFAC is typically small. The ML estimates may therefore be poor, as the estimators are vulnerable to overfitting when based on insufficient data, and the resulting PWM models may be ineffective in distinguishing a true motif from a random segment. A further complication arises from the choice of cut point for declaring a site to be a motif: a less stringent cut point yields a large number of false positives, whereas a more stringent cut point eliminates true positives.
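The PWM construction and sliding-window scan described above can be sketched as follows. This is a minimal illustration, not the project's software: the function names, the example sites, the pseudocount of 0.5, the uniform background, and the log2-odds score are all assumptions introduced here, since the abstract does not specify a scoring function.

```python
import math

BASES = "ACGT"

def build_pwm(sites, pseudocount=0.5):
    """Column-wise base proportions from aligned, equal-length sites.

    pseudocount=0 gives the plain ML estimates (observed proportions).
    The default of 0.5 is an assumption made here so that bases unseen
    in a small set of sites do not get probability zero.
    """
    n = len(sites)
    width = len(sites[0])
    pwm = []
    for pos in range(width):
        column = [site[pos] for site in sites]
        total = n + 4 * pseudocount
        pwm.append({b: (column.count(b) + pseudocount) / total for b in BASES})
    return pwm

def scan(sequence, pwm, cutoff, background=0.25):
    """Slide a PWM-width window along the sequence; return (start, score) hits.

    The score is a log2-odds ratio against a uniform background (an
    assumption); a hit is declared when the score reaches the cutoff.
    """
    width = len(pwm)
    hits = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        score = sum(math.log2(pwm[i][base] / background)
                    for i, base in enumerate(window))
        if score >= cutoff:
            hits.append((start, score))
    return hits

# Made-up aligned sites, for illustration only.
known_sites = ["TATAAT", "TATAAA", "TATGAT", "TACAAT"]
pwm = build_pwm(known_sites)
hits = scan("GGCTATAATGCC", pwm, cutoff=5.0)
```

Lowering `cutoff` in this sketch admits more windows as hits, which is the trade-off between false positives and missed true sites noted above.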
GADEM: A genetic algorithm guided formation of spaced dyads coupled with an EM algorithm for motif discovery
Genome-wide analyses of protein binding sites generate large amounts of data; a ChIP dataset might contain 10,000 sites. Unbiased motif discovery in such datasets is not generally feasible using current methods without restricting the initial motif profiles. We propose an efficient method, GADEM, which combines spaced dyads and an expectation-maximization (EM) algorithm. Candidate words (4-6 nucleotides) for constructing spaced dyads are prioritized by their degree of overrepresentation in the input sequence data. Spaced dyads are converted into starting position weight matrices (PWMs). GADEM then employs a genetic algorithm (GA), with an embedded EM algorithm that improves the starting PWMs, to guide the evolution of a population of spaced dyads toward one whose entropy scores are more statistically significant. Spaced dyads whose entropy scores reach a pre-specified significance threshold are declared motifs. We applied GADEM to six genome-wide ChIP datasets. Approximately 15 to 30 motifs of various lengths were identified in each dataset. Remarkably, without any prior motif information, the expected known motif (e.g., P53 in P53 data) was identified every time. Unlike other de novo motif discovery tools, GADEM can also identify long (>40bp) motifs. For instance, in the P53 ChIP data, it identified several abundant long motifs (70-100bp) that happened to correspond to retroelements. GADEM discovered motifs of various lengths (6-100bp) and characteristics in these datasets, which contained from 0.5 to >13 million nucleotides, with run times of 3 to 72 hours. We believe that GADEM is an efficient tool for de novo motif discovery in large-scale genome-wide data.
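The first two steps above (ranking candidate words by overrepresentation and converting a spaced dyad into a starting PWM) might look roughly like the sketch below. It is not GADEM's implementation: the observed/expected ratio under a zero-order base-composition background, the `peak` probability of 0.85 for word positions, the uniform spacer columns, and the function names are all assumptions made for illustration. The GA and EM refinement stages are not shown.

```python
from collections import Counter

BASES = "ACGT"

def rank_words(sequences, k):
    """Rank observed k-mers by observed/expected count (highest first).

    The expected count assumes independent bases drawn at the input's
    overall base frequencies (an assumption, not GADEM's statistic).
    Sequences are assumed to contain only A, C, G, and T.
    """
    word_counts = Counter()
    base_counts = Counter()
    n_windows = 0
    for seq in sequences:
        base_counts.update(seq)
        for i in range(len(seq) - k + 1):
            word_counts[seq[i:i + k]] += 1
            n_windows += 1
    total = sum(base_counts.values())
    freq = {b: base_counts[b] / total for b in BASES}

    def ratio(word):
        expected = n_windows
        for b in word:
            expected *= freq[b]
        return word_counts[word] / expected

    return sorted(word_counts, key=ratio, reverse=True)

def dyad_to_pwm(word1, spacer, word2, peak=0.85):
    """Starting PWM for the spaced dyad word1 - N{spacer} - word2.

    Word positions put probability `peak` on the word's base and split
    the rest evenly; spacer positions are uniform. Both choices are
    hypothetical defaults.
    """
    def word_column(base):
        rest = (1.0 - peak) / 3
        return {b: (peak if b == base else rest) for b in BASES}

    uniform = {b: 0.25 for b in BASES}
    return ([word_column(b) for b in word1]
            + [dict(uniform) for _ in range(spacer)]
            + [word_column(b) for b in word2])

ranked = rank_words(["AAAACCCC"], k=2)
start_pwm = dyad_to_pwm("ACG", 2, "TT")
```

A starting matrix of this kind is deliberately soft: the non-word bases keep some probability, which leaves room for a refinement step such as EM to move the columns toward the data.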
Gene set enrichment analysis for non-monotone association and multiple experimental categories
Recently, microarray data analyses that use functional pathway information, e.g., gene set enrichment analysis (GSEA) and significance analysis of function and expression (SAFE), have gained recognition as a way to identify biological pathways/processes associated with a phenotypic endpoint. In these analyses, a local statistic, such as a t-statistic, is used to assess the association between the expression level of a gene and the value of a phenotypic endpoint. These gene-specific local statistics are then combined to evaluate association for pre-selected sets of genes. Commonly used local statistics include t-statistics for binary phenotypes and correlation coefficients that assume a linear or monotone relationship between a continuous phenotype and gene expression level. Methods applicable to continuous non-monotone relationships are needed. Furthermore, when there are multiple experimental categories, multiple GSEA/SAFE analyses are carried out, and methods that combine these analyses are needed. We use as the local statistic the coefficient of multiple determination R² (i.e., the square of the multiple correlation coefficient) from fitting natural cubic spline models to the phenotype-expression relationship. Next, we incorporate this association measure into the GSEA/SAFE framework to identify significant gene sets. Furthermore, we describe a procedure for inference across multiple GSEA/SAFE analyses. We illustrate our approach using a biomarker for liver injury (blood alanine transaminase) and gene expression in liver and blood samples from rats treated with eight hepatotoxicants under multiple time and dose combinations. We set out to identify biological pathways/processes common to the eight compounds that are associated with liver injury.
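To illustrate the local statistic described above, the sketch below computes R² from a least-squares fit of a natural cubic spline and compares it with the R² of a straight-line fit on a made-up U-shaped relationship. It is only an illustration under stated assumptions: the truncated-power form of the spline basis, the knot placement, the choice of which variable is treated as the predictor, and all function names are introduced here and are not taken from the project. The later steps (combining R² values within gene sets and across analyses) are not shown.

```python
def natural_spline_row(x, knots):
    """Natural cubic spline basis (truncated-power form) at a scalar x.

    Returns len(knots) basis values: 1, x, and one term per interior
    knot; the fitted curve is linear beyond the boundary knots.
    """
    last = len(knots) - 1

    def cube_plus(u):
        return u ** 3 if u > 0 else 0.0

    def d(k):
        return ((cube_plus(x - knots[k]) - cube_plus(x - knots[last]))
                / (knots[last] - knots[k]))

    return [1.0, x] + [d(k) - d(last - 1) for k in range(last - 1)]

def solve_linear(a, b):
    """Solve a small linear system by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[piv] = m[piv], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = s / m[i][i]
    return x

def r_squared(design, y):
    """R² = 1 - SSE/SST for a least-squares fit of y on the design rows."""
    p = len(design[0])
    xtx = [[sum(r[i] * r[j] for r in design) for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * yi for r, yi in zip(design, y)) for i in range(p)]
    beta = solve_linear(xtx, xty)
    fitted = [sum(b * v for b, v in zip(beta, r)) for r in design]
    y_bar = sum(y) / len(y)
    sse = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    sst = sum((yi - y_bar) ** 2 for yi in y)
    return 1.0 - sse / sst

def spline_r2(x, y, knots):
    """Local statistic: R² from a natural cubic spline fit of y on x."""
    return r_squared([natural_spline_row(v, knots) for v in x], y)

# Made-up U-shaped (non-monotone) relationship, for illustration only.
x = [i / 20 for i in range(21)]
y = [(v - 0.5) ** 2 for v in x]
r2_spline = spline_r2(x, y, knots=[0.0, 0.5, 1.0])
r2_linear = r_squared([[1.0, v] for v in x], y)
```

On this symmetric example the straight-line R² is essentially zero while the spline R² is close to one, which is the kind of non-monotone association a correlation-based local statistic would miss.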
The proposed framework captures both linear and non-linear association between gene expression level and a phenotypic endpoint and thus can be viewed as extending the current GSEA/SAFE methodology. The framework for combining results from multiple GSEA/SAFE analyses may be used to address practical inferential problems. Our methods can be applied more generally to microarray data with continuous phenotypes and multi-level designs, or to the meta-analysis of multiple microarray data sets.
Collaborative research in sequence analysis
We have collaborated with several investigators in the intramural research program on three major questions in DNA sequence analysis: 1) Where are the putative binding sites in one or several promoter sequences? 2) What motifs are enriched in a set of loci from ChIP experiments? 3) What motifs are enriched in a set of genes that are differentially expressed in one experimental condition compared to another? We have developed and implemented computational tools for addressing these three questions.