The Analysis of Signal Elements in Promoter Sequences.

$264,808 | Z01 | FY2007 | LM | NIH

National Library Of Medicine

Linked publications & trials

Paper 17068091 Paper 16961919 Paper 15961489 Paper 14973045 Paper 14963262 Paper 14704356

Abstract

The signal elements in promoter sequences are not well characterized. We developed statistical tests to find nucleotide words (generally of length 8) whose occurrences are localized relative to transcription start sites (TSSs). These words served as "seeds" that were expanded into position-specific scoring matrices (PSSMs) characterizing systems of co-regulated genes. To this end, Dr. Marino-Ramirez collected a database of about 4700 sequences around the TSSs of human genes, and later increased its size by about a factor of 2. The database was exceptionally well characterized and ideal for our statistical study. We used a Poisson scan statistic to determine whether occurrences of a given 8-letter DNA word cluster unusually relative to the TSS. The same statistic also identified clusters of significant words. About 80 of these words occurred in two or three clusters. By validating our results with microarray data and gene ontology information, we were able to show that the same 8-letter word can have two different biological functions, depending on its position with respect to the TSS. Although this kind of positional dependency is a known phenomenon, our study showed that it is widespread in the human genome. We have developed a database of positionally significant clusters and a Gibbs sampling program, A-GLAM, to further our exploration of transcriptional regulatory elements using anchored alignments. A-GLAM now also includes a post-processing step that finds multiple instances of a transcriptional binding element in a single sequence. We have also evaluated (favorably) Bayesian sampling methods that incorporate positional information into A-GLAM's analysis.
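
The idea behind a Poisson scan statistic for positional clustering can be illustrated with a minimal sketch. This is not the project's code: the function names, the window width, and the conservative union bound used in place of an exact scan-statistic distribution are all assumptions made for illustration. It takes the positions of one word's occurrences relative to the TSS, finds the densest window, and bounds the probability of seeing a window that dense if occurrences followed a uniform Poisson process.

```python
import math

def poisson_tail(k, mean):
    """P(X >= k) for X ~ Poisson(mean), via the complement of the CDF."""
    if k <= 0:
        return 1.0
    cdf = sum(math.exp(-mean) * mean**i / math.factorial(i) for i in range(k))
    return max(0.0, 1.0 - cdf)

def max_window_count(positions, width):
    """Largest number of positions inside any window [p, p + width)."""
    pts = sorted(positions)
    best, start = 0, 0
    for end, p in enumerate(pts):
        # Advance the left edge until the window spans less than `width`.
        while p - pts[start] >= width:
            start += 1
        best = max(best, end - start + 1)
    return best

def scan_p_value_bound(positions, span, width):
    """Conservative upper bound on the scan-statistic p-value.

    Assumption (illustrative only): under the null, occurrences form a
    uniform Poisson process over `span` bases with rate len(positions)/span.
    Any window of length `width` lies inside one of about span/width grid
    windows of length 2*width, so a union bound over those grid windows
    bounds the chance that some window holds k or more occurrences.
    """
    n = len(positions)
    k = max_window_count(positions, width)
    mean_double_window = 2.0 * n * width / span
    n_grid_windows = math.ceil(span / width)
    return min(1.0, n_grid_windows * poisson_tail(k, mean_double_window))

# Hypothetical occurrences of one 8-letter word, in bases relative to the TSS,
# across promoter sequences covering -500..+500 (span of 1000 bases).
clustered = [-50, -48, -47, -45, -44, -43, 100, 300, -400, 250]
spread = list(range(-450, 500, 100))

p_clustered = scan_p_value_bound(clustered, span=1000, width=10)
p_spread = scan_p_value_bound(spread, span=1000, width=10)
```

In this toy input the six occurrences between -50 and -43 yield a very small bound, while the evenly spread occurrences yield a bound of 1. A real analysis would also need to correct for testing many words and would likely use a sharper approximation to the scan-statistic distribution.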
