Phylogenetic and computational methods for accurate and efficient analyses of large-scale metagenomics datasets
University Of California Berkeley, Berkeley CA
Abstract
The overall goal of this project is to use approaches from statistics and computer science to solve significant challenges in the analysis of metabarcode and metagenomics data. Metagenomics, the study of the combined genomes of organisms present in a single community, is an emerging, highly interdisciplinary field that combines genomics, bioinformatics, and systems biology, among other areas. Metagenomics has many applications to public health, especially in the areas of pathogen detection, human microbiome analysis, and biodiversity monitoring. The larger objective of this proposal is to leverage the open source software tronko, a fast approximate likelihood phylogenetic placement method that I developed for taxonomic classification, which is the first phylogenetic placement method that truly enables the use of large-scale reference databases and next generation sequencing data as queries. Tronko will be used to solve fundamental problems in analyses of metabarcode and metagenomic data and to develop an application for analyzing severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) sequences that will greatly enhance the utility of environmental monitoring of SARS-CoV-2. The specific aims of this proposal are (1) to solve an important theoretical problem by applying a rigorous species delineation to assignment, (2) to apply tronko to solve an important practical problem of estimating the composition of SARS-CoV-2 lineages in wastewater surveillance samples, and (3) to develop a rapid custom reference database builder for analyzing metabarcode and metagenomics data. For Aim 1, different phylogenetic groups have different variability in different parts of the tree; therefore, I plan to use Bayesian methods to estimate effective population sizes locally to establish appropriate cut-off thresholds for species assignments in different parts of the phylogeny.
Current methods use arbitrary thresholds for delineation of taxonomic groups, and this method would provide an elegant solution to a long-standing limitation in species classification. For Aim 2, SARS-CoV-2 monitoring of wastewater is an effective strategy for early detection of outbreaks. I plan to build a pipeline, and subsequently a web portal for researchers, that uses tronko to first detect the virus within a wastewater sample and then uses an expectation-maximization algorithm to estimate the proportions of viral strains. This aim would greatly aid public health researchers in assessing and managing the pandemic, since no established methods are currently available for this type of analysis. For Aim 3, current custom reference database builders require weeks, if not months, of consecutive computational time in addition to access to a large amount of data storage. I propose to build a method that can be completed within a day. The method will perform in silico amplification with the primers and subsequently use the amplified fragments in a k-mer-based approach for identifying relevant sequences within a nucleotide database, and it can be used both across a network connection and with a local database. Execution of these aims will solve important theoretical, practical, and computational problems in the field of metagenomics.
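The expectation-maximization step mentioned for Aim 2 can be illustrated with a minimal sketch. It is not tronko's implementation or the proposed pipeline. It assumes a hypothetical input, `likelihoods`, in which entry `[i][k]` is the probability of read `i` given lineage `k`, and it estimates only the mixing proportions of the lineages. The function name and parameters are illustrative.

```python
def em_mixture_proportions(likelihoods, n_iter=200, tol=1e-9):
    """Estimate lineage proportions with a basic EM loop.

    likelihoods[i][k] is assumed to be P(read i | lineage k);
    this input format is a hypothetical choice for illustration.
    """
    n_lineages = len(likelihoods[0])
    props = [1.0 / n_lineages] * n_lineages  # start from uniform proportions

    for _ in range(n_iter):
        totals = [0.0] * n_lineages
        # E-step: share each read among lineages in proportion to
        # (current proportion) x (likelihood of the read under that lineage).
        for row in likelihoods:
            weighted = [p * l for p, l in zip(props, row)]
            norm = sum(weighted)
            if norm == 0.0:
                continue  # read is incompatible with every lineage; skip it
            for k, w in enumerate(weighted):
                totals[k] += w / norm

        used = sum(totals)
        if used == 0.0:
            raise ValueError("no read is compatible with any lineage")

        # M-step: new proportions are the average responsibilities.
        new_props = [t / used for t in totals]
        change = max(abs(a - b) for a, b in zip(new_props, props))
        props = new_props
        if change < tol:
            break

    return props


# Usage: three reads match only lineage 0, one read matches only lineage 1.
reads = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
estimated = em_mixture_proportions(reads)
```

In practice the per-read likelihoods would come from the placement step, and reads that are ambiguous between lineages are where the iterative reweighting matters most.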