Genome Assemblies, Analyses, and Comparisons
National Library Of Medicine
Abstract
PM4NGS: Next-Generation Sequencing (NGS) data analysis has advanced the design, implementation, and execution of many complex computational biology pipelines. For computational biologists, pipelines are multi-step methods (1) that should follow the FAIR (Findability, Accessibility, Interoperability, and Reusability) Data Principles (2) and guarantee reproducibility, portability, and scalability (3). The scientific community has adopted workflow languages and managers, Docker containers, and scientific computational notebooks to improve the reproducibility, portability, maintainability, and shareability of computational pipelines. Following these principles, our group has developed PM4NGS (4), a project management framework for next-generation sequencing data analysis. The framework has four components: automatic creation of a standard organizational structure of directories and files; bioinformatics tool management using Docker/Biocontainers (5) or Conda/Bioconda (6); data analysis pipelines in Common Workflow Language (CWL) (7) format; and pre-configured Jupyter notebooks with minimal Python code. The framework was designed as a fully interactive tool for data analysis on personal laptops or workstations. It can also be used as an educational tool to train new bioinformaticians in organizing an NGS data analysis project, because it exposes a detailed view of the pipeline components. PM4NGS currently includes four NGS data analysis workflows as templates: 1) differential gene expression and GO enrichment analysis from RNA-Seq data, 2) differential binding analysis from ChIP-Seq data, 3) DNA motif binding detection from ChIP-exo data, and 4) transcriptome assembly, annotation, and submission for unannotated organisms. Anyone can reuse or modify these templates to create new computational biology workflows.
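The idea of generating a standard project structure automatically can be illustrated with a minimal sketch. The directory names and the `create_project` helper below are hypothetical, chosen only for illustration; they are not taken from PM4NGS and may differ from the layout it actually generates.

```python
from pathlib import Path
import tempfile

# Hypothetical layout: these directory names are illustrative assumptions,
# not necessarily the ones PM4NGS generates.
PROJECT_DIRS = ("data", "notebooks", "results", "src", "config")


def create_project(root: Path, name: str) -> Path:
    """Create a project folder with a fixed set of subdirectories."""
    project = root / name
    for sub in PROJECT_DIRS:
        (project / sub).mkdir(parents=True, exist_ok=True)
    # A README stub records the project name at the top level.
    (project / "README.md").write_text(f"# {name}\n", encoding="utf-8")
    return project


if __name__ == "__main__":
    # Build the layout in a throwaway directory and list what was created.
    with tempfile.TemporaryDirectory() as tmp:
        project = create_project(Path(tmp), "rnaseq-demo")
        print(sorted(p.name for p in project.iterdir()))
```

Fixing the layout in one place means every project starts from the same structure, which is the property the framework relies on for reproducibility.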
Transcriptome assembly, annotation and submission for unannotated organisms: The NIH Science and Technology Research Infrastructure for Discovery, Experimentation, and Sustainability (STRIDES) initiative provides NIH-funded researchers cost-effective access to industry-leading commercial cloud providers, such as Amazon Web Services (AWS; Seattle, WA, USA) and Google Cloud Platform (GCP; Mountain View, CA, USA). These cloud providers represent an alternative for the execution of large computational biology experiments such as transcriptome assembly and annotation, which are complex analytical processes that require the integration of multiple biological databases and several advanced computational tools. Our group has developed a transcriptome assembly, annotation, and submission workflow for unannotated organisms. This workflow was implemented as a PM4NGS-based template and designed to run on GCP. Users can run the PM4NGS Jupyter notebooks on their personal laptops or workstations and submit the computing jobs to GCP. As part of the development, a suitability study was published to demonstrate the benefits of using a public cloud provider for computational biology experiments (8). The study demonstrated that public cloud providers are a practical, low-cost alternative for the execution of advanced computational biology experiments. Using our cloud recipes, the BLAST alignments required to annotate a transcriptome with 500,000 transcripts can be processed in less than 2 hours with a compute cost of about 200-250 USD. This workflow was used to assemble, annotate, and submit the Opuntia streptacantha transcriptome from the BioProject PRJNA320545. The workflow uses Trinity (9) to assemble raw RNA-Seq reads into transcripts. Those transcripts are then clustered to create Trinity genes. Close protein homologs are identified using BLASTP searches, and the hits are used to annotate the transcripts with GO terms, enzyme names, and conserved domains.
Table 1 shows the functional annotation for the Opuntia streptacantha transcriptome. The GenBank record is publicly available at: https://www.ncbi.nlm.nih.gov/nuccore/GISG00000000.3

Table 1: Summary of functional annotation of Trinity genes and transcripts of Opuntia streptacantha.

                 Total      BLASTp     BLAST GO   Enzyme     CDD
Trinity genes    129,026    22,342     20,744     8,145      27,173
Transcripts      266,658    94,740     88,988     35,017     102,015
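To put the counts in Table 1 in proportion, the short sketch below computes the share of Trinity genes and transcripts in each annotation category. It assumes that each column counts entries with at least one annotation of that type and that "Total" is the denominator; the abstract does not state this explicitly. The `annotation_fractions` helper is illustrative only.

```python
# Counts copied from Table 1 (Opuntia streptacantha).
TABLE_1 = {
    "Trinity genes": {
        "Total": 129_026, "BLASTp": 22_342, "BLAST GO": 20_744,
        "Enzyme": 8_145, "CDD": 27_173,
    },
    "Transcripts": {
        "Total": 266_658, "BLASTp": 94_740, "BLAST GO": 88_988,
        "Enzyme": 35_017, "CDD": 102_015,
    },
}


def annotation_fractions(row: dict) -> dict:
    """Return each annotation count as a fraction of the row total.

    Assumption: every column is a subset count of "Total".
    """
    total = row["Total"]
    return {key: count / total for key, count in row.items() if key != "Total"}


if __name__ == "__main__":
    for label, row in TABLE_1.items():
        parts = ", ".join(
            f"{key}: {frac:.1%}" for key, frac in annotation_fractions(row).items()
        )
        print(f"{label}: {parts}")
```

Under that reading, about 17% of Trinity genes and about 36% of transcripts have a BLASTp hit.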