Finding Protein Sequence Motifs--Methods And Applications

$426,695 | ZIA | FY2025 | LM | NIH

National Library of Medicine

Abstract

The rapid accumulation of genome sequences and protein structures during the last decade has been paralleled by major advances in sequence database search methods and in protein structure prediction. The Position-Specific Iterated BLAST (PSI-BLAST) method developed at the NCBI forms the basis of our work on protein motif analysis. In addition, we extensively and increasingly applied hidden Markov models (HMMs), profile-against-profile comparison as implemented in HHsearch, protein structure modeling with rapidly progressing AI-based tools, protein structure comparison methods, and genome context analysis. We also developed and applied custom libraries of protein domain profiles and computational pipelines for identifying novel domains.
During the year under review, we continued our investigation of protein domains, particularly those encoded in the genomes of viruses of prokaryotes and eukaryotes and those involved in the defense of bacteria against viruses. We also developed a novel methodology for modeling protein fold evolution from random polypeptide sequences using a combination of ESMFold and AlphaFold3. The extensive use of AI-based methods for structure analysis substantially expanded the scope of our studies on protein domains and motifs.
In several studies under this project, we expanded our ongoing exploration of the structure, function, and evolution of prokaryotic antivirus defense mechanisms, in particular CRISPR systems, which are the source of powerful tools for genome engineering. Despite ongoing efforts to study CRISPR systems, the evolutionary origins of the reprogrammable RNA-guided mechanisms remain poorly understood.
In collaboration with the laboratory of Professor Feng Zhang (Broad Institute of MIT and Harvard), we employed an integrated sequence/structure evolutionary tracing approach to identify the ancestors of the RNA-targeting CRISPR-Cas13 system. We found that Cas13 likely evolved from AbiF, which is encoded by an abortive infection-linked gene that is stably associated with a conserved non-coding RNA (ncRNA). We further characterized a miniature Cas13, classified as Cas13e, which appears to be an evolutionary intermediate between AbiF and other known Cas13s. Despite this relationship, we showed that the functions of AbiF and Cas13e differ substantially: whereas Cas13e is an RNA-guided RNA-targeting system, AbiF is a toxin-antitoxin (TA) system with an RNA antitoxin. As part of this work, the structure of AbiF was solved using cryo-electron microscopy, revealing basic structural alterations that set Cas13s apart from AbiF. Finally, we mapped the key structural changes that enabled a non-guided TA system to evolve into an RNA-guided CRISPR system.
In another collaboration with Professor Feng Zhang's laboratory, we identified a novel class of RNA-guided proteins in phages and bacteria. RNA-guided systems provide remarkable versatility, enabling diverse biological functions. Iterative structure- and sequence-based homology mining, starting from the guide RNA-interaction domain of Cas9, uncovered a family of RNA-guided DNA-targeting proteins encoded in phage and bacterial genomes. Each of the identified systems consists of a tandem interspaced guide RNA (TIGR) array and a TIGR-associated (Tas) protein containing a nucleolar protein (Nop) domain, sometimes fused to an HNH (TasH) or RuvC (TasR) nuclease domain. TIGR arrays were shown to be processed into 36-nucleotide RNAs (tigRNAs) that direct sequence-specific DNA binding through a tandem-spacer targeting mechanism. TasR can be reprogrammed for precise DNA cleavage, including in human cells. The structure of TasR reveals striking similarities to box C/D small nucleolar ribonucleoproteins and IS110 RNA-guided transposases, providing insights into the evolution of diverse RNA-guided systems.
The origin and evolution of protein folds are among the most challenging, long-standing problems in biology. We developed Protein Fold Evolution Simulator (PFES), a computational approach that simulates the evolution of globular folds from random amino acid sequences in atomistic detail. PFES introduces random mutations in a population of protein sequences, evaluates the effect of the mutations on protein structure, and selects a new set of proteins for further evolution. Iterating this process allows tracking the evolutionary trajectory of a protein fold as it changes under selective pressure for fold stability, interaction with other proteins, or other features shaping the fitness landscape. We employed PFES to show how stable, globular protein folds could evolve from random amino acid sequences as monomers or in complexes with other proteins. The simulations reproduce the evolution of many simple folds of natural proteins as well as the emergence of distinct folds not known to exist in nature. We show that the evolution of small globular protein folds from random sequences takes, on average, 1.15 to 3 amino acid replacements per site, depending on the population size, with some simulations yielding stable folds after as few as 0.2 replacements per site. These values are lower than the characteristic numbers of replacements accumulated in conserved proteins since the Last Universal Common Ancestor, suggesting that simple protein folds can evolve from random sequences relatively easily and quickly. PFES tracks the complete evolutionary history of each simulation and can be used to test hypotheses on protein fold evolution.
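The mutate, evaluate, and select cycle described for PFES can be illustrated with a minimal toy sketch. This is not the PFES implementation: the fitness function below is a hypothetical stand-in (agreement with an alternating hydrophobic/polar pattern) for the structure-based scoring that the abstract attributes to AI-based structure prediction, and the names `toy_fitness`, `mutate`, and `evolve`, the truncation selection scheme, and all parameter values are assumptions made for illustration only. The sketch also records replacements per site along the best lineage, the quantity in which the abstract reports its results.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
HYDROPHOBIC = set("AVILMFWC")  # assumed grouping, for the toy score only


def toy_fitness(seq):
    """Hypothetical stand-in for a structure-based score (not a fold predictor).

    Returns the fraction of positions that match an alternating
    hydrophobic/polar pattern, a value between 0 and 1.
    """
    hits = sum((aa in HYDROPHOBIC) == (i % 2 == 0) for i, aa in enumerate(seq))
    return hits / len(seq)


def mutate(seq, rng):
    """Replace one randomly chosen residue with a different amino acid."""
    pos = rng.randrange(len(seq))
    new_aa = rng.choice([aa for aa in AMINO_ACIDS if aa != seq[pos]])
    return seq[:pos] + new_aa + seq[pos + 1:]


def evolve(length=30, pop_size=20, generations=200, seed=0):
    """Run a toy mutate/evaluate/select loop starting from random sequences.

    Each individual is (sequence, replacements accumulated in its lineage).
    Returns the final population and a per-generation history of
    (best fitness, replacements per site of the best individual).
    """
    rng = random.Random(seed)
    population = [
        ("".join(rng.choice(AMINO_ACIDS) for _ in range(length)), 0)
        for _ in range(pop_size)
    ]
    history = []
    for _ in range(generations):
        # Mutate: every parent produces one offspring with a single replacement.
        offspring = [(mutate(seq, rng), n + 1) for seq, n in population]
        # Evaluate and select: keep the top-scoring individuals among
        # parents and offspring (truncation selection, an assumed scheme).
        pool = sorted(population + offspring,
                      key=lambda ind: toy_fitness(ind[0]), reverse=True)
        population = pool[:pop_size]
        best_seq, best_n = population[0]
        history.append((toy_fitness(best_seq), best_n / length))
    return population, history


if __name__ == "__main__":
    final_population, history = evolve()
    best_fitness, replacements_per_site = history[-1]
    print(f"best toy fitness: {best_fitness:.2f}")
    print(f"replacements per site in best lineage: {replacements_per_site:.2f}")
```

Because parents compete with their offspring, the best score in this sketch never decreases between generations. Replacing `toy_fitness` with a score derived from a predicted structure would bring the loop closer in spirit to the approach described above, at far greater computational cost.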
