Finding Protein Sequence Motifs--Methods And Applications
National Library Of Medicine
Abstract
The rapid accumulation of genome sequences and protein structures during the last decade has been paralleled by major advances in sequence database search methods and in protein structure prediction. The powerful Position-Specific Iterated BLAST (PSI-BLAST) method developed at the NCBI forms the basis of our work on protein motif analysis. In addition, we extensively and increasingly applied hidden Markov models (HMMs), protein profile-against-profile comparison as implemented in HHsearch, protein structure modeling with rapidly progressing AI-based tools, protein structure comparison methods, and genome context analysis. We also developed and applied custom libraries of protein domain profiles and computational pipelines for identifying novel domains.
During the year under review, we continued our investigation of protein domains, particularly those encoded in the genomes of viruses of prokaryotes and eukaryotes and those involved in the defense of bacteria against viruses. We also developed a novel methodology for modeling protein fold evolution from random polypeptide sequences using a combination of ESMFold and AlphaFold3. The scope of our studies on protein domains and motifs was substantially expanded through the extensive use of AI-based methods for structure analysis.
In several studies under this project, we expanded our ongoing exploration of the structure, function, and evolution of prokaryotic antivirus defense mechanisms, in particular CRISPR systems, which are the source of powerful tools for genome engineering. Despite ongoing efforts to study CRISPR systems, the evolutionary origins of the reprogrammable RNA-guided mechanisms remain poorly understood.
In collaboration with the laboratory of Professor Feng Zhang (Broad Institute of MIT and Harvard), we employed an integrated sequence/structure evolutionary tracing approach to identify the ancestors of the RNA-targeting CRISPR-Cas13 system. We found that Cas13 likely evolved from AbiF, which is encoded by an abortive infection-linked gene that is stably associated with a conserved non-coding RNA (ncRNA). We further characterized a miniature Cas13, classified as Cas13e, which appears to be an evolutionary intermediate between AbiF and other known Cas13s. Despite this relationship, we showed that the functions of AbiF and Cas13e differ substantially: whereas Cas13e is an RNA-guided RNA-targeting system, AbiF is a toxin-antitoxin (TA) system with an RNA antitoxin. As part of this work, the structure of AbiF was solved using cryo-electron microscopy, revealing basic structural alterations that set Cas13s apart from AbiF. Finally, we mapped the key structural changes that enabled a non-guided TA system to evolve into an RNA-guided CRISPR system.
In another collaboration with Professor Feng Zhang's laboratory, we identified a novel class of RNA-guided proteins in phages and bacteria. RNA-guided systems provide remarkable versatility, enabling diverse biological functions. Through iterative structural and sequence homology-based mining, starting with the guide RNA-interaction domain of Cas9, a family of RNA-guided DNA-targeting proteins encoded in phage and bacterial genomes was discovered. Each of the identified systems consists of a tandem interspaced guide RNA (TIGR) array and a TIGR-associated (Tas) protein containing a nucleolar protein (Nop) domain, sometimes fused to an HNH (TasH) or RuvC (TasR) nuclease domain. TIGR arrays were shown to be processed into 36-nucleotide RNAs (tigRNAs) that direct sequence-specific DNA binding through a tandem-spacer targeting mechanism. TasR can be reprogrammed for precise DNA cleavage, including in human cells. The structure of TasR reveals striking similarities to box C/D small nucleolar ribonucleoproteins and IS110 RNA-guided transposases, providing insights into the evolution of diverse RNA-guided systems.
The origin and evolution of protein folds are among the most challenging, long-standing problems in biology. We developed the Protein Fold Evolution Simulator (PFES), a computational approach that simulates the evolution of globular folds from random amino acid sequences in atomistic detail. PFES introduces random mutations in a population of protein sequences, evaluates the effect of the mutations on protein structure, and selects a new set of proteins for further evolution. Iterating this process allows tracking the evolutionary trajectory of a protein fold as it changes under selective pressure for fold stability, interaction with other proteins, or other features shaping the fitness landscape. We employed PFES to show how stable, globular protein folds could evolve from random amino acid sequences as monomers or in complexes with other proteins. The simulations reproduce the evolution of many simple folds of natural proteins as well as the emergence of distinct folds not known to exist in nature. We show that the evolution of small globular protein folds from random sequences takes, on average, 1.15 to 3 amino acid replacements per site, depending on the population size, with some simulations yielding stable folds after as few as 0.2 replacements per site. These values are lower than the characteristic numbers of replacements in conserved proteins during the time since the Last Universal Common Ancestor, suggesting that simple protein folds can evolve from random sequences relatively easily and quickly. PFES tracks the complete evolutionary history of each simulation and can be used to test hypotheses on protein fold evolution.
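The mutate, evaluate, and select cycle described for PFES can be illustrated with a minimal sketch. This is not the PFES implementation. It assumes a placeholder fitness function (`toy_fitness`, a hypothetical score that rewards an alternating hydrophobic/polar pattern) in place of structure prediction with ESMFold or AlphaFold3, and it uses simple truncation selection, which may differ from the selection scheme used in PFES. All function names, parameters, and defaults are illustrative assumptions.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
HYDROPHOBIC = set("AVILMFWC")


def toy_fitness(seq):
    """Placeholder score in [0, 1]: fraction of positions that follow an
    alternating hydrophobic/polar pattern. This is an assumption made for
    illustration; it is NOT a structure-based stability estimate."""
    hits = sum((aa in HYDROPHOBIC) == (i % 2 == 0) for i, aa in enumerate(seq))
    return hits / len(seq)


def mutate(seq, rng):
    """Replace one randomly chosen residue with a different amino acid."""
    pos = rng.randrange(len(seq))
    new_aa = rng.choice([a for a in AMINO_ACIDS if a != seq[pos]])
    return seq[:pos] + new_aa + seq[pos + 1:]


def simulate(length=30, pop_size=20, generations=200, seed=0):
    """Evolve a population of random sequences under the toy fitness.

    Each individual is a (sequence, replacements) pair, where replacements
    counts the mutations accumulated along that individual's lineage.
    Returns the final population and a per-generation history of
    (best fitness, replacements per site of the best individual).
    """
    rng = random.Random(seed)
    population = [
        ("".join(rng.choice(AMINO_ACIDS) for _ in range(length)), 0)
        for _ in range(pop_size)
    ]
    history = []
    for _ in range(generations):
        # Mutation: every parent produces one single-site mutant.
        offspring = [(mutate(seq, rng), n + 1) for seq, n in population]
        # Evaluation and selection: keep the fittest pop_size individuals
        # from parents plus offspring (truncation selection).
        pool = population + offspring
        pool.sort(key=lambda ind: toy_fitness(ind[0]), reverse=True)
        population = pool[:pop_size]
        best_seq, best_subs = population[0]
        history.append((toy_fitness(best_seq), best_subs / length))
    return population, history


if __name__ == "__main__":
    final_pop, hist = simulate()
    best_fitness, subs_per_site = hist[-1]
    print(f"best fitness: {best_fitness:.2f}")
    print(f"replacements per site (best lineage): {subs_per_site:.2f}")
```

Because parents compete with their offspring, the best fitness in this sketch never decreases from one generation to the next. The replacements-per-site value here simply counts accepted mutations along the best lineage divided by sequence length, including any reversions, so it is only loosely analogous to the quantity reported above.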