High-throughput phylogenomic analysis of animal proteins

$365,223 | R01 | FY2005 | HG | NIH

University Of California Berkeley, Berkeley CA

Investigators

Linked publications & trials

Paper 20080507 Paper 19558703 Paper 19443452 Paper 19435885 Paper 18428751 Paper 17708678 Paper 17488835 Paper 17288570 Paper 15759638 Paper 14734307

Abstract

DESCRIPTION (provided by applicant): With the completion of the sequencing of the first multicellular eukaryotic genome, Caenorhabditis elegans, in 1998, the Drosophila melanogaster genome in 2000, the human genome in 2001, and the pending completion of the mouse genome, investigators in animal genomics are facing new challenges in high-throughput analysis of the proteins encoded by these genomes. Biologists increasingly rely on computational methods for protein function prediction, both for first-pass annotation and to prioritize wet-bench experiments. However, most of these methods do not provide sufficient information to enable informed prediction of specific protein function, and some result in systematic error, particularly those that predict function by homology based on simple pairwise sequence comparison. It has become clear that phylogenomic analysis (function inference based on phylogenetic analysis of a protein in the context of its family members) is critical for accurate functional annotation. While phylogenomic analysis has been applied to a number of protein families, a large-scale phylogenomic analysis of proteins in animal genomes has not yet been made available to scientists in the public sector. The work outlined in this proposal is designed to address this need and to be complementary to existing tools. All proteins from animal genomes will be clustered into families based on global sequence similarity, and homologs will be gathered from other organisms. For each group, a multiple sequence alignment, phylogenetic tree, and subfamily classifications will be produced. Hidden Markov models will be generated to provide high-throughput classification ability, one for each protein family and one for each subfamily identified.
A web server will be created to enable investigators in both the private and public sectors to submit sequences for classification against these hidden Markov models, and a graphical user interface will display the correlation of changes in protein sequence with changes in structure and function.
