Bayesian Statistics and Algorithms for Homology Modeling

$252,000 | R01 | FY2003 | HG | NIH

Institute For Cancer Research, Philadelphia PA

Investigators

Linked publications & trials

Paper 19603484 Paper 18989261 Paper 17483505 Paper 17080462 Paper 16187359 Paper 16187346 Paper 16187342 Paper 15980589 Paper 15797916 Paper 15152092 Paper 12930999 Paper 12912846 Paper 12717019 Paper 12163064

Abstract

DESCRIPTION (provided by applicant): To improve human health, a goal of the human genome project is to translate the genome sequence into an understanding of human biology. An important step in this process is knowledge of the structure of human proteins and of the effects of sequence polymorphisms on structure and function. Currently, the structures of only 1000 human proteins are known, but the structures of up to about one third of human proteins can be modeled based on the structures of homologous proteins in the Protein Data Bank. This fraction will increase rapidly due to structural genomics efforts. Unfortunately, general principles of what works in homology modeling and what does not have remained elusive. There are several reasons for this: 1) insufficient benchmarking of most prediction methods; 2) reliance on out-of-date statistical analysis of protein structures, performed without modern methods of statistics; 3) most modeling methods assume a relatively high level of sequence identity (>35 percent) between the template structure and the sequence to be modeled, whereas most proteins of unknown structure are only distantly related to proteins of known structure. The PI proposes benchmarking, new statistical analysis, and new algorithms for each of the three major aspects of homology modeling: alignment, building backbone coordinates for insertion-deletion regions, and side-chain placement. The primary tools will be Bayesian statistical analysis, including hierarchical models and non-parametric methods based on the Dirichlet process. The increase in size of the sequence and structure databases makes the new statistical analysis timely, both because of the increased power the new data provide and because of the numerous applications afforded by more sequences and structures.
