ATD: A Mixture Modeling Framework for Statistical Identification of Multiple Genomes in a Metagenomics Sample

$725,009 | FY2010 | MPS | NSF

Northwestern University, Evanston IL

Investigators

Hongmei Jiang (contact), Lingling An, Simon M Lin

Abstract

Next-generation sequencing technology has enabled the rapid sequencing of mixed genomes sampled directly from the environment, an approach that has recently emerged as metagenomics. By direct sequencing, researchers can study organisms that are difficult or impossible to culture in the laboratory. Given the sequence data from a metagenomic sample, the basic questions to be addressed include "what species or genomes are there?", "what are their relative abundances?", and "how many more species will be detected if more sequence reads are obtained?" The investigators propose to incorporate computational and statistical thinking in modeling metagenomics sequencing data and in estimating the multiple genomes and their relative abundances within a metagenomics sample. In particular, the investigator and her colleagues propose to combine clustering methods with a mixture modeling framework to estimate the multiple genomes and their relative abundances, based on the hits obtained by aligning the sequence reads to known reference sequences. This mixture modeling framework can be further extended to include fine-tuning parameters such as position-specific sequencing errors. Conventional statistical and computational methods and algorithms for computing the point estimate and confidence interval of the species richness in the sample will be evaluated for metagenomics data, and new methods will be developed if necessary. Metagenomics provides a tool to study genetic material recovered directly from a natural community (such as soil or seawater) or a host-associated community (such as the human gut). Identifying the multiple genomes in a single data set is a challenging problem, particularly when the species are represented at vastly different abundances.
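As a rough illustration of the mixture-modeling idea, the sketch below estimates relative abundances with a basic expectation-maximization (EM) loop. It is not the investigators' method: the function name `estimate_abundances` and the input layout are hypothetical, the per-read likelihoods are assumed to be already available (for example, derived from alignment hits), and clustering and position-specific sequencing errors are left out.

```python
def estimate_abundances(likelihoods, n_iter=200, tol=1e-10):
    """Estimate mixing proportions of candidate genomes with EM.

    likelihoods[i][k] is an assumed, precomputed value for
    P(read i | genome k); how it is derived from alignment hits is
    outside the scope of this sketch.
    """
    n_genomes = len(likelihoods[0])
    # Start from equal abundances for every candidate genome.
    pi = [1.0 / n_genomes] * n_genomes

    for _ in range(n_iter):
        # E-step: accumulate each genome's share of every read.
        counts = [0.0] * n_genomes
        for row in likelihoods:
            weighted = [p * l for p, l in zip(pi, row)]
            total = sum(weighted)
            if total == 0.0:
                continue  # read is incompatible with every genome; skip it
            for k, w in enumerate(weighted):
                counts[k] += w / total

        assigned = sum(counts)
        if assigned == 0.0:
            raise ValueError("no read is compatible with any genome")

        # M-step: new proportions are the normalized expected counts.
        new_pi = [c / assigned for c in counts]
        delta = max(abs(a - b) for a, b in zip(pi, new_pi))
        pi = new_pi
        if delta < tol:
            break
    return pi


# Toy input: three reads hit only genome 0, one read hits only genome 1.
reads = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
print(estimate_abundances(reads))
```

When every read maps unambiguously, the estimate reduces to the fraction of reads per genome. The EM iterations matter only for reads that hit more than one reference, which are split in proportion to the current abundance estimates.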
The algorithms and methods developed in this project can be applied to metagenomics studies in different fields, including human health, the environment, agriculture, and the identification of viruses in biological threats and infectious diseases. The statistical models and computational algorithms will be integrated into open-source R software and made publicly available so that other researchers can analyze their own metagenomics data. A postdoctoral fellow will be trained as part of this project.
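The species-richness question raised in the abstract can be illustrated with the bias-corrected Chao1 estimator, one widely used conventional point estimate. The record does not say which estimators the investigators evaluate, so this choice is an assumption. The function name `chao1` and the example counts are hypothetical, and the sketch gives a point estimate only, not a confidence interval.

```python
def chao1(abundance_counts):
    """Bias-corrected Chao1 estimate of total species richness.

    abundance_counts holds the number of reads assigned to each
    observed species; zero entries are ignored.
    """
    counts = [c for c in abundance_counts if c > 0]
    s_obs = len(counts)                      # species actually observed
    f1 = sum(1 for c in counts if c == 1)    # singletons
    f2 = sum(1 for c in counts if c == 2)    # doubletons
    # Many singletons relative to doubletons suggest unseen species.
    return s_obs + f1 * (f1 - 1) / (2.0 * (f2 + 1))


# Toy example: 7 observed species, 3 singletons, 2 doubletons.
print(chao1([10, 5, 2, 2, 1, 1, 1]))  # 8.0
```

The gap between the estimate and the observed count (here, one species) is a rough indication of how many more species further sequencing might reveal.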
