Statistics of Sequence Comparison

$235,300 | ZIA | FY2019 | LM | NIH

National Library of Medicine

Investigators

Linked publications & trials

Paper 30596639 Paper 29336305 Paper 28771374 Paper 28002465 Paper 27192614 Paper 25294922 Paper 23294268 Paper 21998158 Paper 21702692 Paper 21702690 Paper 21128852 Paper 20657661 Paper 17068079 Paper 15509610 Paper 14663142 Paper 11139604

Abstract

This project continued in its current direction throughout the year, in collaboration with Dr. Andrew Neuwald of the Institute for Genome Sciences and the Department of Biochemistry & Molecular Biology at the University of Maryland School of Medicine. Previous focuses had been the development of an improved method for multiple alignment that could identify the common elements shared by large and diverse protein superfamilies, and the extension of this method to a hierarchical multiple alignment model. Such a model is based on the fact that large protein superfamilies have frequently diversified to fulfill distinct functional roles within different subfamilies. Each subfamily has distinct structural constraints, which yield distinct amino acid frequency vectors at particular positions characteristic of that subfamily. Although the amino acids at different positions may be independent within a subfamily, the changes in frequency vectors across multiple positions characteristic of each subfamily yield the appearance of correlation between positions when a simple, non-hierarchical model of a superfamily is constructed. Earlier approaches have modeled these apparent correlations directly, using pairwise coupling terms, but we model them by constructing an explicit hierarchical model, with individual sequences assigned to distinct nodes within the hierarchy. We applied the Minimum Description Length principle to ensure that the hierarchical models we construct do not overfit the data but have statistical support. This year the central focus of this project was the statistical assessment of the three-dimensional clustering of distinguished positions, identified as characteristic of various nodes in a hierarchy. Our approach, called Initial Cluster Analysis (ICA), seeks to determine whether a set of distinguished elements within a linear array is clustered significantly near the start of the array and, if so, what the most significant initial cluster of these elements is.
Abstractly, given a linear array of length L containing D '1's (the distinguished elements) and L-D '0's, ICA considers a generative model in which the '1's occur with particular and differing probabilities before and after a cut point X in the array. For any particular X it is relatively easy to calculate a likelihood Like(X) of the array of data, and one may optimize Like(X) by simply evaluating it for all possible X. However, the values of Like(X) for nearby values of X are highly correlated, to a degree described by a calculable density of independent trials, Rho(X). Because Rho(X) is not constant but grows approximately as the reciprocal of X's distance from 0 or L, simply optimizing Like(X) inherently favors, a priori, small or large values of X. Therefore, if one's application suggests no such bias, optimizing Like(X)/Rho(X) rather than Like(X) for a given array of '0's and '1's may be a better strategy; we refer to this approach as using flattened priors. ICA estimates the effective total number of independent trials implicit in either optimization, and uses this estimate in calculating a p-value for the optimal X. This provides a mathematically principled way to define an optimal initial cluster of distinguished elements, balancing the claims of very short and dense clusters against those of longer but sparser clusters. We published ICA in the Journal of Computational Biology. To analyze real proteins using ICA, we ordered the residues within a protein by their physical distance from a point of reference, and used our previously developed hierarchical analysis to define a set of distinguished residues characteristic of a protein family or subfamily. ICA then allows us to find sets of distinguished residues that are significantly clustered in three dimensions.
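The cut-point scan described above can be illustrated with a minimal Python sketch. It is not the published ICA implementation: it orders residues by distance from a reference point, builds the array of '0's and '1's, and scores each cut X by a simple two-rate Bernoulli log-likelihood ratio. It omits Rho(X), the flattened priors, and the p-value calculation, whose formulas are not given in this abstract. The function names, the toy coordinates, and the restriction to cuts whose prefix is denser than the remainder are all assumptions made for illustration.

```python
import math


def order_by_distance(coords, reference):
    """Return residue indices sorted by Euclidean distance from `reference`."""
    return sorted(range(len(coords)), key=lambda i: math.dist(coords[i], reference))


def max_loglik(ones, n):
    """Log-likelihood of `ones` '1's among `n` positions at the best-fitting rate."""
    if ones == 0 or ones == n:
        return 0.0  # a rate of 0 or 1 fits perfectly (also covers n == 0)
    p = ones / n
    return ones * math.log(p) + (n - ones) * math.log(1.0 - p)


def best_initial_cut(array):
    """Scan every cut X with 1 <= X < L and return (X, log-likelihood ratio).

    The ratio compares a model with separate rates before and after X
    against a single rate for the whole array. Returns (None, 0.0) when
    no cut has a prefix denser than the remainder.
    """
    L, D = len(array), sum(array)
    null = max_loglik(D, L)
    best_x, best_llr = None, 0.0
    ones_before = 0
    for x in range(1, L):
        ones_before += array[x - 1]
        ones_after = D - ones_before
        # Assumption: an "initial cluster" requires a denser prefix.
        if ones_before * (L - x) <= ones_after * x:
            continue
        llr = max_loglik(ones_before, x) + max_loglik(ones_after, L - x) - null
        if llr > best_llr:
            best_x, best_llr = x, llr
    return best_x, best_llr


if __name__ == "__main__":
    # Hypothetical toy data: 12 residues on a line, 4 of them distinguished.
    coords = [(float(i), 0.0, 0.0) for i in range(1, 13)]
    distinguished = {0, 1, 2, 4}
    order = order_by_distance(coords, (0.0, 0.0, 0.0))
    array = [1 if i in distinguished else 0 for i in order]
    print(best_initial_cut(array))
```

Because this sketch optimizes the raw likelihood, it inherits the bias toward very small or very large X noted above; dividing by Rho(X) would be the step that corrects for it.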
Applying this approach to N-acetyltransferases, P-loop GTPases, RNA helicases, synaptojanin-superfamily phosphatases and nucleases, and thymine/uracil DNA glycosylases yielded results congruent with biochemical understanding of these proteins, and also revealed striking sequence-structural features overlooked by other methods. This work was published in eLife. We initiated work on a new project to summarize and analyze the constraints on protein sequence and structure that may be derived from large multiple sequence alignments. For a particular protein, these constraints include those on amino acid usage at particular positions due to the protein's subfamily function, as well as those characteristic of the family and superfamily of which the protein is a member. Additional constraints, which may be derived from direct coupling analysis (DCA), are due to internal or heterodimeric pairwise interactions between different protein positions. The integrated analysis of these various constraints can suggest new lines of experimentation.
