Statistics of Sequence Comparison

$192,751 | ZIA | FY2017 | LM | NIH

National Library Of Medicine

Investigators

Linked publications & trials

Paper 30596639 Paper 29336305 Paper 28771374 Paper 28002465 Paper 27192614 Paper 25294922 Paper 23294268 Paper 21998158 Paper 21702692 Paper 21702690 Paper 21128852 Paper 20657661 Paper 17068079 Paper 15509610 Paper 14663142 Paper 11139604

Abstract

The current direction of this project, in collaboration with Dr. Andrew Neuwald of the Institute for Genome Sciences and Department of Biochemistry & Molecular Biology at the University of Maryland School of Medicine, continued throughout this year. A previous focus had been the development of an improved method for multiple alignment that could identify the common elements shared by large and diverse protein superfamilies. A central aim this year was to extend this method to a hierarchical multiple alignment model. Such a model is based on the fact that large protein superfamilies have frequently diversified to fulfill distinct functional roles within different subfamilies. Each subfamily has distinct structural constraints, which yield distinct amino acid frequency vectors at particular positions characteristic of that subfamily. Although the amino acids at different positions may be independent within a subfamily, the changes in frequency vectors across multiple positions characteristic of each subfamily yield the appearance of correlation between positions when a simple, non-hierarchical model of a superfamily is constructed. Earlier approaches have modeled these apparent correlations directly, using pairwise coupling terms, but we model them by constructing an explicit hierarchical model, with individual sequences assigned to distinct nodes within the hierarchy. We apply the Minimum Description Length principle to ensure that the hierarchical models we construct do not overfit the data and have statistical support. We completed the development of a hierarchical multiple alignment program and applied it to the analysis of N-acetyltransferases. Based upon statistical correlations, this approach identified a number of subfamilies, characterized by protein positions with distinctive amino acid usage, which suggested specific, previously uncharacterized biological mechanisms. A paper describing this work was published.
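
The point about apparent correlation can be illustrated with a small numerical sketch. It is not the project's alignment program. It assumes a hypothetical two-letter alphabet ("D" and "K"), two alignment positions, two equally weighted subfamilies, and made-up frequency vectors. Within each subfamily the two positions are independent by construction, yet the pooled (non-hierarchical) distribution shows nonzero mutual information between them.

```python
import math
from itertools import product

def joint_from_independent(p1, p2):
    """Joint distribution over two positions that are independent of each other."""
    return {(a, b): p1[a] * p2[b] for a, b in product(p1, p2)}

def mix(joints, weights):
    """Pool several subfamily joint distributions into one superfamily distribution."""
    return {k: sum(w * j[k] for j, w in zip(joints, weights)) for k in joints[0]}

def mutual_information(joint):
    """Mutual information (in bits) between the two positions of a joint distribution."""
    px, py = {}, {}
    for (a, b), p in joint.items():
        px[a] = px.get(a, 0.0) + p
        py[b] = py.get(b, 0.0) + p
    return sum(
        p * math.log2(p / (px[a] * py[b]))
        for (a, b), p in joint.items()
        if p > 0
    )

# Hypothetical frequency vectors (assumed values, for illustration only).
# Subfamily 1 favours "D" at both positions; subfamily 2 favours "K" at both.
sub1 = joint_from_independent({"D": 0.9, "K": 0.1}, {"D": 0.9, "K": 0.1})
sub2 = joint_from_independent({"D": 0.1, "K": 0.9}, {"D": 0.1, "K": 0.9})
pooled = mix([sub1, sub2], [0.5, 0.5])

mi_sub1 = mutual_information(sub1)      # essentially zero: positions are independent
mi_sub2 = mutual_information(sub2)      # essentially zero: positions are independent
mi_pooled = mutual_information(pooled)  # about 0.32 bits: correlation appears on pooling
```

In this toy setting, assigning each sequence to its subfamily removes the correlation entirely, which is the intuition behind modeling it with a hierarchy rather than with pairwise coupling terms.
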
Another aim of this project, launched last year, was significantly advanced. The hierarchical models constructed by our approach include the explicit description of a set of distinguishing positions characteristic of each node in the hierarchy. When mapped onto available three-dimensional structures, these positions often cluster together in space, and can aid in the development of specific hypotheses for the biological mechanisms underlying the diversification of protein subfamilies. We developed appropriate measures for the clustering of distinguishing positions, and derived methods to assess their statistical significance. A paper describing this work is in press. Work continues on extending the clustering measures to allow them to capture more biologically relevant information.
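
One generic way to ask whether a set of distinguishing positions clusters in space is a permutation test, sketched below. The abstract does not specify the clustering measures or significance methods the project actually uses, so everything here is an assumption: the statistic (mean pairwise distance), the null model (random position sets of the same size), the function names, and the toy coordinates are all hypothetical.

```python
import math
import random
from itertools import combinations

def mean_pairwise_distance(coords, positions):
    """Average Euclidean distance over all pairs of the given positions."""
    pairs = list(combinations(positions, 2))
    return sum(math.dist(coords[i], coords[j]) for i, j in pairs) / len(pairs)

def clustering_p_value(coords, distinguishing, n_perm=2000, seed=0):
    """Permutation p-value: how often does a random position set of the same
    size have a mean pairwise distance at least as small as the observed one?"""
    rng = random.Random(seed)
    observed = mean_pairwise_distance(coords, distinguishing)
    all_positions = list(coords)
    k = len(distinguishing)
    hits = 0
    for _ in range(n_perm):
        sample = rng.sample(all_positions, k)
        if mean_pairwise_distance(coords, sample) <= observed:
            hits += 1
    # Add-one correction keeps the estimated p-value strictly positive.
    return observed, (hits + 1) / (n_perm + 1)

# Toy structure (assumed): 30 residues spaced one unit apart along a line.
coords = {i: (float(i), 0.0, 0.0) for i in range(30)}
# Hypothetical distinguishing positions that sit next to each other.
distinguishing = [10, 11, 12, 13]

observed, p_value = clustering_p_value(coords, distinguishing)
```

With real data, `coords` would hold residue coordinates taken from a solved structure (for example, C-alpha atoms), and a small `p_value` would suggest that the positions lie closer together than random sets of the same size.
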
