Bayesian models and Monte Carlo strategies in identifying protein or DNA sequence motifs

$160,246 | FY2006 | MPS | NSF

Purdue University, West Lafayette IN

Investigators

Abstract

This project pursues new probability models and Monte Carlo strategies for detecting functionally relevant motifs in protein and DNA sequences. Protein motifs are local segments (20-50 amino acids) that are critical for protein structure and function. DNA motifs are specific sequence elements (6-20 nucleotides) in the genome that are bound by transcription factors. Despite many advances, current computational tools for aligning sequence motifs often fail to converge and produce too many false positives. The investigator proposes: 1) to develop Bayesian models for protein sequence motifs that combine distributions of amino acids with distributions of sequence-derived secondary and tertiary characteristics; 2) to develop a parallel tempering procedure that runs multiple Markov chains in parallel to improve the convergence of motif alignment algorithms; 3) to develop new probability models and statistical methods that describe modules of transcription factor binding sites (TFBSs) and combine genomic sequence information with gene expression information.

The rapid progress of the Human Genome Project and the development of biotechnologies have created a rich source of sequence data. Making sense of these data, however, is an ongoing challenge that depends in part on efficient computational and statistical approaches. This proposal focuses on the computational detection of functionally relevant sequence motifs. Knowledge of protein motifs indicates protein function, and knowledge of DNA motifs provides information for controlling gene expression. The proposed statistical and computational approaches will help to interpret and model the biochemical processes that regulate gene activity and the responses of cells to diverse stimuli. The eventual applications could lead to a better understanding of diseases and to advances in drug development.
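The parallel tempering idea in aim 2 can be illustrated with a minimal sketch. This is not the investigators' motif alignment algorithm: it assumes a toy one-dimensional bimodal target in place of a motif alignment posterior, and the function names, temperature ladder, and step size are hypothetical choices made only for illustration. Several Metropolis chains run at different temperatures, and occasional state swaps between adjacent temperatures let the cold chain escape a local mode.

```python
import math
import random


def log_target(x):
    """Toy stand-in for a posterior: unnormalized mixture of N(-4, 1) and N(4, 1)."""
    a = -0.5 * (x + 4.0) ** 2
    b = -0.5 * (x - 4.0) ** 2
    m = max(a, b)
    return m + math.log(math.exp(a - m) + math.exp(b - m))


def parallel_tempering(n_iter=20000, temps=(1.0, 2.0, 4.0, 8.0), step=1.0, seed=0):
    """Return the samples of the T=1 chain (hypothetical helper, illustration only)."""
    rng = random.Random(seed)
    states = [0.0 for _ in temps]
    samples = []
    for _ in range(n_iter):
        # Random-walk Metropolis update in each chain, targeting pi(x)^(1/T).
        for i, t in enumerate(temps):
            proposal = states[i] + rng.gauss(0.0, step)
            log_ratio = (log_target(proposal) - log_target(states[i])) / t
            # 1 - random() lies in (0, 1], so the logarithm is always defined.
            if math.log(1.0 - rng.random()) < log_ratio:
                states[i] = proposal
        # Propose swapping the states of one random pair of adjacent temperatures.
        i = rng.randrange(len(temps) - 1)
        log_swap = (1.0 / temps[i] - 1.0 / temps[i + 1]) * (
            log_target(states[i + 1]) - log_target(states[i])
        )
        if math.log(1.0 - rng.random()) < log_swap:
            states[i], states[i + 1] = states[i + 1], states[i]
        samples.append(states[0])
    return samples
```

Calling `parallel_tempering()` returns the cold-chain samples. In this toy setting a single Metropolis chain with the same step size rarely crosses between the two modes, whereas the hot chains cross easily and pass those states down the temperature ladder through swaps.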
