Genome Assemblies, Analyses, and Comparisons

$276,669 | ZIA | FY2023 | LM | NIH

National Library of Medicine

Abstract

Development of PM4NGS, a project management framework for NGS data analysis: Next-generation sequencing (NGS) data analysis has advanced the design, implementation, and execution of many complex computational biology pipelines. For computational biologists, pipelines are multi-step methods that should follow the FAIR (Findability, Accessibility, Interoperability, and Reusability) data principles and guarantee reproducibility, portability, and scalability. Workflow languages and managers, Docker containers, and scientific computational notebooks have been adopted by the scientific community to improve the reproducibility, portability, maintainability, and shareability of computational pipelines. Following these principles, our group has developed PM4NGS [6], a project management framework for NGS data analysis. The framework comprises the automatic creation of a standard organizational structure of directories and files; bioinformatics tool management, using Docker/Biocontainers or Conda/Bioconda; data analysis pipelines in the Common Workflow Language (CWL) format; and pre-configured Jupyter notebooks with minimal Python code. The framework was designed as a fully interactive tool for data analysis on personal laptops or workstations. It can also be used as an educational tool to train new bioinformaticians in organizing an NGS data analysis project while giving them a detailed view of the pipeline components. PM4NGS currently includes four NGS data analysis workflows as templates: differential gene expression and GO enrichment analysis from RNA-Seq data; differential binding analysis from ChIP-Seq data; DNA motif binding detection from ChIP-exo data; and transcriptome assembly, including annotation and submission, for unannotated organisms. These templates can be reused or modified to create new computational biology workflows. The framework aims to reduce the gap between researchers in experimental laboratories, who produce NGS data, and the workflows used to analyze those data.
PM4NGS fully manages the complexity of working with multiple directories, data files, and programs on the Linux command line, allowing researchers to focus on interpreting results.
De novo transcriptome assembly, annotation, and submission for unannotated organisms: We have developed a transcriptome assembly, annotation, and submission workflow for unannotated organisms. This workflow was implemented as a PM4NGS-based template and designed to run on the Google Cloud Platform (GCP). Users can run the PM4NGS Jupyter notebooks on their personal laptops or workstations and submit the more compute-intensive jobs to GCP. As part of the development, a suitability study was published to demonstrate the benefits of using a public cloud provider for computational biology experiments [7]. We demonstrate that public cloud providers are a practical, low-cost alternative for running advanced computational biology experiments. Using our cloud recipes, the BLAST alignments required to annotate a transcriptome with 500,000 transcripts can be processed in less than 2 hours, at a computing cost of about $200 to $250. This workflow was used to assemble, annotate, and submit the Opuntia streptacantha transcriptome from BioProject PRJNA320545 (a collaboration with scientists at Universidad Autónoma de San Luis Potosí, Mexico). The workflow uses Trinity to assemble raw RNA-Seq reads into transcripts. The transcripts are clustered to create Trinity genes. Homologous sequences are identified to annotate the transcripts with GO terms, enzyme names, and conserved domains. The functional annotation of the Opuntia streptacantha transcriptome is published, together with additional information about the assembly and a differential gene expression analysis of two experimental conditions, at https://www.ncbi.nlm.nih.gov/research/nopaldb/.
Detection and removal of foreign contamination in RNA-Seq samples: Our transcriptome assembly and annotation pipeline includes a workflow to detect and remove foreign RNA contamination from the input samples. Contamination in RNA-Seq data has led to misleading conclusions in multiple studies. It is most troublesome when the target organism has no reference genome or annotation in public databases. We have developed GTax, a taxonomically structured database of genomic sequences that can be used with BLAST for taxonomic classification and contamination filtering. This approach efficiently detects and eliminates contaminant reads in RNA-Seq data. GTax genomic sequences were extracted from the NCBI Genome database using NCBI Datasets. The database includes a subset of the latest assemblies of a collection of reference genomes. Sequences were filtered by RefSeq accession prefix to reduce the database size and exclude possibly contaminated sequences. The sequences were organized into 19 mutually exclusive, hierarchical taxonomic groups. For example, taxa in the Viridiplantae kingdom are divided into three GTax groups: the Liliopsida group, which includes all monocotyledon sequences; the Eudicotyledons group, which includes all dicotyledon sequences; and the Viridiplantae group, into which all other taxa in the Viridiplantae kingdom are placed. The same principle is applied to the Chordata phylum and all taxonomy groups from Neoteleostei to Sarcopterygii. Finally, all remaining eukaryotic taxa are placed in the Eukaryota taxonomy group. This taxonomically structured division of the genomic sequences in GTax keeps phylogenetically closely related species in the same taxonomy group and greatly reduces the size of the searchable BLAST database. The Sauropsida group, the largest, contains 1,073 sequences and 46,172,754,879 total bases, yet is only 6.84% of the NT database. The current version of the GTax sequences represents 72.18% of the NT database.
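The "mutually exclusive and hierarchical" grouping described above can be illustrated with a minimal Python sketch. It is not the GTax implementation: the function name `assign_group`, the four-group subset, and the simplified lineages are assumptions made for illustration only. The idea shown is that a taxon is assigned to the most specific group that appears in its lineage, so each sequence lands in exactly one group.

```python
from typing import Optional, Sequence

# Illustrative subset of the 19 GTax groups, using the names given in the text.
GROUPS = {"Liliopsida", "Eudicotyledons", "Viridiplantae", "Eukaryota"}


def assign_group(lineage: Sequence[str]) -> Optional[str]:
    """Return the most specific group found in a root-to-leaf lineage.

    Walking the lineage from leaf to root guarantees that a narrower group
    (e.g. Liliopsida) wins over the broader groups that contain it
    (Viridiplantae, Eukaryota), which makes the groups mutually exclusive.
    """
    for name in reversed(lineage):
        if name in GROUPS:
            return name
    return None  # lineage falls outside the illustrative subset


# Hypothetical, simplified lineages (root first, species last).
cactus = ["Eukaryota", "Viridiplantae", "Eudicotyledons", "Opuntia streptacantha"]
moss = ["Eukaryota", "Viridiplantae", "Bryophyta", "Physcomitrium patens"]
yeast = ["Eukaryota", "Fungi", "Saccharomyces cerevisiae"]

for lineage in (cactus, moss, yeast):
    print(lineage[-1], "->", assign_group(lineage))
```

Under these assumptions the cactus falls in the Eudicotyledons group, the moss in the catch-all Viridiplantae group, and the yeast in the Eukaryota group.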
Our decontamination approach starts by screening the RNA-Seq reads (using BLAST) against the taxonomy group of the target organism. In these cases, we can screen millions of RNA-Seq reads against less than 6% of the NT database. Unidentified reads are then screened against the remaining GTax taxonomy groups. Reads that match the taxonomy group of the target organism are labeled as correct. Reads that match no group remain labeled as unidentified.
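The two-stage screening described above can be sketched as follows. This is only an illustration of the control flow, not the authors' pipeline: BLAST is not run here, the hit tables are passed in as plain Python collections, and the function name `classify_reads`, the `foreign:` label, and the example group name are hypothetical. The text names only the "correct" and "unidentified" labels; the label for reads that match another group is an assumption.

```python
from typing import Dict, Iterable, List, Mapping, Set


def classify_reads(
    read_ids: Iterable[str],
    target_hits: Set[str],
    other_group_hits: Mapping[str, str],
) -> Dict[str, str]:
    """Label reads using a two-stage screen.

    target_hits: read IDs with a hit in the target organism's taxonomy group
        (stage 1, the small database).
    other_group_hits: read ID -> name of another taxonomy group that the read
        matched (stage 2, consulted only for reads that missed in stage 1).
    """
    labels: Dict[str, str] = {}
    unresolved: List[str] = []

    # Stage 1: screen every read against the target taxonomy group.
    for read_id in read_ids:
        if read_id in target_hits:
            labels[read_id] = "correct"
        else:
            unresolved.append(read_id)

    # Stage 2: screen only the unresolved reads against the remaining groups.
    for read_id in unresolved:
        group = other_group_hits.get(read_id)
        # "foreign:<group>" is an assumed label, not taken from the text.
        labels[read_id] = f"foreign:{group}" if group else "unidentified"

    return labels


# Hypothetical hit tables for three reads.
labels = classify_reads(
    ["read1", "read2", "read3"],
    target_hits={"read1"},
    other_group_hits={"read2": "Bacteria"},
)
print(labels)
```

Because most reads in a clean sample are expected to resolve in stage 1, the larger stage 2 search runs on a small fraction of the input, which is where the reduction in search cost comes from.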
