Comparative Analysis Of Completely Sequenced Genomes
National Library Of Medicine
Abstract
The rapidly growing database of completely and nearly completely sequenced genomes of bacteria, archaea, eukaryotes and viruses (millions of genomes already available and many more in progress) creates both extensive new opportunities and major new challenges for genome research. During the year in review, we performed a variety of studies that took advantage of the genomic information to establish fundamental principles of genome evolution and to investigate the evolution of biologically and medically important groups of organisms.
Mining rapidly growing databases of metagenomes and metatranscriptomes has the potential to uncover a broad variety of novel functional systems as well as novel groups of viruses and prokaryotes. In collaboration with the laboratory of Professor Feng Zhang (Broad Institute of MIT and Harvard), we discovered and analyzed a novel RNA-guided functional system in bacteria and phages. RNA-guided systems provide remarkable versatility, enabling diverse biological functions. Through iterative structural and sequence homology-based mining of extensive metagenomic data, starting with the guide RNA-interaction domain of Cas9, we identified a family of RNA-guided DNA-targeting proteins in phages and parasitic bacteria. Each of these systems consists of a tandem interspaced guide RNA (TIGR) array and a TIGR-associated (Tas) protein containing a nucleolar protein (Nop) domain, sometimes fused to HNH (TasH)- or RuvC (TasR)-nuclease domains. The TIGR arrays were shown to be processed into 36-nucleotide RNAs (tigRNAs) that direct sequence-specific DNA binding through a tandem-spacer targeting mechanism. TasR can be reprogrammed for precise DNA cleavage, including in human cells. The structure of TasR determined by cryo-electron microscopy reveals striking similarities to box C/D small nucleolar ribonucleoproteins and IS110 RNA-guided transposases, providing insights into the evolution of diverse RNA-guided systems.
In a collaborative study with the laboratory of Dr. Gisela Storz (National Institute of Child Health and Human Development, NIH), we studied bacterial microproteomes. Microproteins encoded by small open reading frames comprise the "dark matter" of proteomes. Although microproteins have been detected in diverse organisms from all three domains of life, many more remain to be identified, and only a few have been functionally characterized. In this comprehensive study of intergenic small open reading frames (ismORFs, 15-70 codons) in 5,668 bacterial genomes of the family Enterobacteriaceae, we identified 67,297 clusters of ismORFs subject to purifying selection. Expression of tagged Escherichia coli microproteins was detected for 11 of the 16 tested, validating the predictions. Although the ismORFs mainly code for hydrophobic, potentially transmembrane, unstructured, or minimally structured microproteins, some globular folds, oligomeric structures, and possible interactions with proteins encoded by neighboring genes were predicted. Complete information on the predicted microprotein families, including evidence of transcription and translation, and structure predictions are made available as an easily searchable resource for investigation of microprotein functions and evolution.
Another study was dedicated to the identification of bacterial operons that were captured by eukaryotic organisms via horizontal gene transfer. In prokaryotes, functionally linked genes are typically clustered into operons, which are transcribed into a single mRNA, providing for the coregulation of the production of the respective proteins, whereas eukaryotes generally lack operons. We explored the possibility that some prokaryotic operons persist in eukaryotic genomes after horizontal gene transfer (HGT) from bacteria.
Extensive comparative analysis of prokaryote and eukaryote genomes revealed 33 gene pairs originating from bacterial operons, mostly encoding enzymes of the same metabolic pathways, and represented in distinct clades of fungi or amoebozoa. This amount of HGT is about an order of magnitude lower than that observed for the respective individual genes. These operon fragments appear to be relatively recent acquisitions, as indicated by their narrow phylogenetic spread and low intron density. In 20 of the 33 horizontally acquired operonic gene pairs, the genes are fused in the respective group of eukaryotes so that the encoded proteins become domains of a multifunctional protein, ensuring coregulation and correct stoichiometry. We hypothesized that bacterial operons acquired via HGT initially persist in eukaryotic genomes under a neutral evolution regime and subsequently are either disrupted by genome rearrangement or undergo gene fusion, which is then maintained by selection.
Evolution of bacterial and archaeal genomes is highly dynamic, including extensive gene gain via horizontal gene transfer (HGT) and gene loss as well as different types of genome rearrangements, such as inversions and translocations, so that gene order is not highly conserved even among closely related organisms. We sought to quantify the contributions of different genome dynamics processes to the evolution of the gene order in prokaryote genomes, relying on a recently developed, simple stochastic model of genome rearrangement through single gene translocations ("jump" model). The jump model was completely solved analytically in our previous work and provides the exact distribution of syntenic gene block lengths (SBL) in compared genomes based on gene translocations alone.
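To make the notion of a syntenic gene block length concrete, the following minimal Python sketch shows one way an empirical SBL distribution might be extracted from two gene orders. It is an illustration under stated assumptions, not the pipeline used in the study: the function name `syntenic_block_lengths`, the input format (each genome as a list of unique ortholog identifiers on a linear chromosome), and the block definition (a maximal run of genes that are adjacent in both genomes, in the same or in inverted order, with strand ignored) are all hypothetical choices made here.

```python
from collections import Counter


def syntenic_block_lengths(genome_a, genome_b):
    """Lengths of maximal collinear gene runs shared by two gene orders.

    Assumptions (not taken from the source): genomes are linear lists of
    unique ortholog IDs, strand is ignored, and a gene missing from
    genome_b terminates the current block.
    """
    pos_b = {gene: i for i, gene in enumerate(genome_b)}
    lengths = []
    run = 0        # length of the block currently being extended
    direction = 0  # +1 or -1 once the block orientation is known
    prev = None    # position in genome_b of the previous gene of genome_a

    for gene in genome_a:
        cur = pos_b.get(gene)
        if cur is None:
            # Gene absent from genome_b: close any open block.
            if run:
                lengths.append(run)
            run, direction, prev = 0, 0, None
            continue
        step = cur - prev if prev is not None else 0
        if run and step in (1, -1) and direction in (0, step):
            # Neighbors in both genomes, consistent orientation: extend.
            run += 1
            direction = step
        else:
            # Adjacency is broken: close the block and start a new one.
            if run:
                lengths.append(run)
            run, direction = 1, 0
        prev = cur

    if run:
        lengths.append(run)
    return lengths


if __name__ == "__main__":
    reference = list("ABCDEFGH")
    jumped = list("ABCFDEGH")  # gene F translocated to a new position
    blocks = syntenic_block_lengths(reference, jumped)
    print(blocks)
    print(sorted(Counter(blocks).items()))  # (block length, count) pairs
```

In this toy case a single translocation of gene F splits one block of eight genes into four shorter blocks, and a lost gene would split a block in the same way. A histogram of such lengths over many genome pairs is the kind of empirical distribution that could be set against a model prediction.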
Comparing the SBL distribution predicted by the jump model with the distributions empirically observed for multiple groups of closely related bacterial and archaeal genomes, we obtained robust estimates of the genome rearrangement to gene flux (gain and loss) ratio. In most groups of bacteria and archaea, this ratio was found to be on the order of 0.1, indicating that the loss of synteny in the evolution of bacteria and archaea is driven primarily by gene gain and loss rather than by gene translocation.
An important part of this project is the study of the evolution of cancer genomes. Tumor evolution is shaped by selective pressures imposed by physiological factors as the tumor naturally progresses to colonize local and distant tissues, as well as by therapy. However, the distinction between these two types of pressures and their impact on tumor evolution remain elusive, mainly due to extensive intra-tumor heterogeneity. To disentangle the effects of these selective pressures, we analyzed data from diverse cohorts of patients with both treated and untreated cancers. We found that, despite the wide variation across patients, the selection strength on tumor genomes in individual patients is stable and largely unaffected by tumor progression in the primary setting, with some cancer-specific signatures detectable in the progression to metastases. However, we identified a nearly universal shift toward neutral evolution in tumors that resist treatment and demonstrated that this regime is associated with worse prognosis. We validated these findings on both published and original datasets. We suggest that monitoring the selection regime during cancer treatment can assist clinical decision-making in many cases.