Computational Analysis of Short Repetitive Motifs in DNA Sequences

$74,000 | R03 | FY2008 | LM | NIH

University Of Texas Arlington, Arlington TX

Investigators

Linked publications & trials

Paper 20865483 Paper 20223737 Paper 20209015 Paper 18048317

Abstract

DESCRIPTION (provided by applicant): Computational analysis of various aspects of gene regulation, transcription factor binding in particular, is an important and well-known problem. Adequately addressed, it would greatly improve our understanding of diseases and help with the development of treatments. However, despite intensive efforts and the application of sophisticated models, the identification of binding motifs in DNA sequences remains elusive. The raw sequence likely carries only a part of the regulatory signal, and that signal is often too short and subtle to be detected even by the most sensitive algorithms. Many software tools developed for this purpose exploit the clustering and over-representation of motifs in promoter regions, sometimes combining this method with other experimental or phylogenetic information. However, the fact that many short sequences appear to be over-represented in any segment of DNA, at least in comparison with a completely random model, impedes reliable discovery.

This proposal seeks support to develop new software and apply it to the identification, visualization and analysis of repeated short (approximately 5-25 bases) degenerate motifs, in short (a few hundred bases) and long (entire chromosomes) DNA sequences. We intend to use this software on the human and other genomes, as well as on sequences of interest to our collaborators in biology and chemistry, in an attempt to systematically characterize short over-represented sequences. We shall determine which of these motifs correspond to experimentally confirmed transcription factor binding consensus sequences, study their phylogenetic conservation and investigate their possible association with repeat families. Special attention will be paid to the upstream sequences of genes, and tools will be developed for a genome-wide search for related motif layouts.
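The notion of over-representation "in comparison with a completely random model" can be illustrated with a minimal sketch. It is not the applicants' software: the function name `kmer_enrichment` is hypothetical, and the background model is assumed to be independent bases drawn at the sequence's own base frequencies, which is only one possible reading of "completely random".

```python
from collections import Counter
from math import prod


def kmer_enrichment(seq, k):
    """Map each observed k-mer to (observed count, expected count).

    The expected count assumes independent, identically distributed bases
    with the frequencies measured in `seq` itself (an assumption of this
    sketch, not a model stated in the abstract).
    """
    seq = seq.upper()
    n_windows = len(seq) - k + 1
    if n_windows <= 0:
        return {}
    # Background probability of each base, estimated from the sequence.
    base_freq = {b: c / len(seq) for b, c in Counter(seq).items()}
    # Count every overlapping window of length k.
    observed = Counter(seq[i:i + k] for i in range(n_windows))
    result = {}
    for kmer, obs in observed.items():
        p = prod(base_freq[b] for b in kmer)  # probability of this exact word
        result[kmer] = (obs, p * n_windows)   # expected number of windows
    return result


if __name__ == "__main__":
    table = kmer_enrichment("ACGTACGTACGT", 4)
    for kmer, (obs, exp) in sorted(table.items()):
        print(kmer, obs, round(exp, 4))
```

Even in this toy input every 4-mer appears far more often than the random model predicts, which mirrors the difficulty the abstract describes: a naive background makes many short words look significant.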
Our software will be based on an adaptation of classic string processing algorithms to address the inexact matches in a novel way, by combining the seed elements into statistically significant degenerate motifs. In addition to performing analysis with our collaborators, we will place the programs in the public domain, along with the other tools which we have already developed and published, inviting other investigators to use them on their own data.
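The abstract does not specify how seed elements are combined, so the following is only a rough illustration of what a degenerate motif is, not the proposed algorithm. It assumes equal-length seeds over A, C, G, T, collapses them column by column into standard IUPAC ambiguity codes, and counts matches of the result. The names `merge_seeds` and `count_occurrences` are hypothetical, and no statistical significance test is performed.

```python
# Standard IUPAC nucleotide ambiguity codes, keyed by the set of bases allowed.
IUPAC = {
    frozenset("A"): "A", frozenset("C"): "C",
    frozenset("G"): "G", frozenset("T"): "T",
    frozenset("AG"): "R", frozenset("CT"): "Y",
    frozenset("CG"): "S", frozenset("AT"): "W",
    frozenset("GT"): "K", frozenset("AC"): "M",
    frozenset("CGT"): "B", frozenset("AGT"): "D",
    frozenset("ACT"): "H", frozenset("ACG"): "V",
    frozenset("ACGT"): "N",
}
# Reverse lookup: ambiguity code -> set of bases it stands for.
EXPAND = {code: bases for bases, code in IUPAC.items()}


def merge_seeds(seeds):
    """Collapse equal-length seeds into a single IUPAC consensus string."""
    seeds = [s.upper() for s in seeds]
    if not seeds or len({len(s) for s in seeds}) != 1:
        raise ValueError("seeds must be non-empty and of equal length")
    # Each column of the aligned seeds becomes one ambiguity code.
    return "".join(IUPAC[frozenset(col)] for col in zip(*seeds))


def count_occurrences(motif, seq):
    """Count overlapping windows of `seq` that match the degenerate motif."""
    seq = seq.upper()
    k = len(motif)
    return sum(
        all(seq[i + j] in EXPAND[code] for j, code in enumerate(motif))
        for i in range(len(seq) - k + 1)
    )


if __name__ == "__main__":
    motif = merge_seeds(["TATAAA", "TATATA"])
    print(motif, count_occurrences(motif, "GGTATAAAGGTATATACC"))
```

Merging by column is the simplest possible choice and can over-generalize: two seeds that differ at several positions yield a motif that also matches words neither seed contained, which is why a significance criterion such as the one the abstract mentions would be needed in practice.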
