Finding Protein Sequence Motifs--Methods and Applications
National Library of Medicine
Abstract
The rapid accumulation of genome sequences and protein structures during the last decade has been paralleled by major advances in sequence database search methods. The powerful Position-Specific Iterated BLAST (PSI-BLAST) method developed at the NCBI forms the basis of our work on protein motif analysis. In addition, hidden Markov models (HMMs), profile-against-profile comparison as implemented in the HHsearch method, protein structure comparison, homology modeling of protein structure and genome context analysis were applied extensively and increasingly. Furthermore, custom libraries of protein domain profiles and computational pipelines for the identification of novel domains have been developed and applied. Lately, these methods for protein motif search are being complemented by deep learning methods.
During the year under review, we continued and expanded our investigation of protein domains, in particular those encoded in the genomes of viruses of prokaryotes and eukaryotes and in the genomes of Asgard archaea, the closest archaeal relatives of eukaryotes. The enormous diversity of viruses is far from being completely understood, and numerous protein domains, particularly those involved in virus-host interactions, remain to be studied. During the last year, we thoroughly explored the proteins encoded in the genomes of bacteriophages assembled from metagenomic sequences, including crAss-like phages, the most abundant human-associated viruses, and identified a variety of domains not previously detected in viruses. In addition, we performed a comprehensive analysis of the proteins encoded in the genomes of orthopoxviruses, a genus of large animal viruses that includes smallpox virus, and identified the domain composition of several uncharacterized virus proteins, leading to testable functional predictions.
In collaboration with the laboratory of Dr. Feng Zhang of the Broad Institute of MIT and Harvard, we studied human proteins containing various derivatives of the capsid proteins of retroviruses and retrotransposons. Eukaryotic genomes contain numerous domesticated genes derived from integrating viruses and mobile genetic elements. Among these are homologs of the capsid protein (known as Gag) of long terminal repeat (LTR) retrotransposons and retroviruses. We identified several mammalian Gag homologs that form virus-like particles and one LTR retrotransposon homolog, PEG10, that preferentially binds its own messenger RNA (mRNA) and facilitates its vesicular secretion. It was shown that the mRNA cargo of PEG10 can be reprogrammed by flanking genes of interest with Peg10's untranslated regions.
Responding to the challenges posed by the COVID-19 pandemic, we studied the functional interaction between the domains of the RNA-dependent RNA polymerase (RdRp) of severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2). The catalytic subunit of the SARS-CoV-2 RdRp, Nsp12, has a unique nidovirus RdRp-associated nucleotidyltransferase (NiRAN) domain that transfers nucleoside monophosphates to the Nsp9 protein and the nascent RNA. The NiRAN and RdRp modules form a dynamic interface distant from their catalytic sites, and both activities are essential for viral replication. We report that Nsp12 that is codon-optimized for pause-free translation in bacterial cells exists in an inactive state in which NiRAN-RdRp interactions are broken, whereas translation by slow ribosomes and incubation with the accessory Nsp7/8 subunits or nucleoside triphosphates (NTPs) partially rescue RdRp activity. This work shows that adenosine and remdesivir triphosphates promote the synthesis of A-less RNAs, as does ppGpp, while amino acid substitutions at the NiRAN-RdRp interface augment activation, suggesting that ligand binding to the NiRAN catalytic site modulates RdRp activity. The existence of allosterically linked nucleotidyltransferase sites that utilize the same substrates has important implications for understanding the mechanism of SARS-CoV-2 replication and for the design of its inhibitors.
During the last year, we also performed a comprehensive analysis of the genomes and proteins of Asgard archaea, the diversity of which was greatly expanded in our collaboration with the laboratory of Dr. Meng Li of Shenzhen University. Our protein domain analysis of 162 Asgard genomes resulted in a major expansion of the set of eukaryotic signature proteins. The Asgard eukaryotic signature proteins show variable phyletic distributions and domain architectures, which suggests dynamic evolution through horizontal gene transfer, gene loss, gene duplication and domain shuffling. The phylogenomics of the Asgard archaea points to the accumulation of the components of the mobile archaeal 'eukaryome' in the archaeal ancestor of eukaryotes (within or outside Asgard) through extensive horizontal gene transfer.
In summary, over the year in review, our research on protein domains led to a substantial increase in the repertoire of domains known to be encoded by viruses of prokaryotes and eukaryotes, and to insights into fundamental problems of evolutionary biology, including the origin of eukaryotes. We also performed a study that may help the design of inhibitors of the SARS-CoV-2 RNA polymerase.
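The profile-based motif search named at the start of the abstract can be illustrated with a minimal sketch. This is not PSI-BLAST, HHsearch, or the project's pipeline: it only shows the core idea of a position-specific scoring matrix (PSSM), assuming gap-free motif instances of equal length, a uniform amino acid background, and a flat pseudocount. The function names (`build_pssm`, `scan`), the toy motif instances, and the query sequence are all hypothetical and chosen only for illustration.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_pssm(aligned, pseudocount=1.0):
    """Build a log-odds PSSM from equal-length, gap-free motif instances.

    Assumes a uniform background (1/20 per residue), which is a
    simplification made for this sketch.
    """
    width = len(aligned[0])
    if any(len(seq) != width for seq in aligned):
        raise ValueError("all motif instances must have the same length")
    background = 1.0 / len(AMINO_ACIDS)
    total = len(aligned) + pseudocount * len(AMINO_ACIDS)
    pssm = []
    for col in range(width):
        counts = Counter(seq[col] for seq in aligned)
        # log2 of (smoothed column frequency / background frequency)
        pssm.append({
            aa: math.log2((counts[aa] + pseudocount) / total / background)
            for aa in AMINO_ACIDS
        })
    return pssm


def scan(sequence, pssm):
    """Return (score, start) of the highest-scoring window, or None."""
    width = len(pssm)
    best = None
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        # Residues outside the 20 standard amino acids score 0.0
        score = sum(pssm[i].get(aa, 0.0) for i, aa in enumerate(window))
        if best is None or score > best[0]:
            best = (score, start)
    return best


# Hypothetical motif instances and query sequence (made-up toy data)
instances = ["GAGKSGKT", "GPSGSGKS", "GESGAGKS", "GPPGTGKT"]
pssm = build_pssm(instances)
best_score, best_start = scan("MKLVAAGPSGSGKSTLLRA", pssm)
```

For this toy input the best window starts at index 6, where the query contains one of the motif instances. Production tools differ from this sketch in important ways; PSI-BLAST, for example, rebuilds its profile iteratively from database hits and weights the sequences that contribute to it.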