Algorithms and Software for the Assembly of Metagenomic Data

$2,764,928 | R01 | FY2025 | AI | NIH

Univ Of Maryland, College Park, College Park MD

Abstract

Project Summary: The importance of the microbial communities inhabiting the human body cannot be overstated, and metagenomic sequencing is currently one of the primary ways in which these communities are analyzed. While targeted assays exist, it is now widely accepted that analysis of the entire complement of genes found within the microbiome (which can only be obtained through metagenomic sequencing) is critical for understanding the role these microbes play in health and disease. An important analysis is metagenomic assembly, the process used to stitch together sequencing reads into contiguous DNA sequences (contigs) and metagenome-assembled genomes (MAGs). The reconstruction of full genomes can enable analyses of the strain structure of communities, the association of antibiotic resistance or virulence genes with specific members of the community, and more. Thus, there is strong interest in the broader microbiome community in methods able to extract high-quality MAGs from metagenomic data. New technologies have made it easier to construct MAGs; however, significant challenges remain, particularly for low-abundance organisms and when handling genomic variation between bacterial strains. This proposal will tackle these challenges as follows. New assembly merging/reconciliation techniques will allow the integration of data across different sequencing technologies and metagenomic samples, thereby improving the recovery of the genomes of low-abundance organisms. Such organisms are believed to play an important role in defining community structure and in human health. A valuable tool for mining microbiome data sets is the identification of genomic segments shared across multiple samples, which allows these segments to be associated with clinical parameters and distinguished from transient members of the microbiota.
This project will develop new techniques for the discovery of conserved sequences, relying on statistical analyses of k-mers and graph analyses to improve accuracy while maintaining computational speed. The project will also address the fact that the genomes found in the samples may harbor a different combination of genes from the pangenome of a species than is represented in any one of the reference genomes. The project proposes a new approach that uses pangenome graphs to guide the assembly process, enabling it to automatically "mix and match" the genomic content of the reference collection while also improving efficiency by removing redundancy. Finally, this project proposes a novel graph-based paradigm for the discovery of bacteriophages, viruses that infect bacteria and play an important role in defining the structure of microbial communities. Phages are also considered a promising avenue for antibacterial therapies as rates of antimicrobial resistance increase. The approaches developed in this project will be implemented in usable, open-source software tools, allowing both academic and industrial reuse, and the data generated will be made publicly available.
