Software and Server for Taxonomic Binning of Metagenomic Sequences
J. Craig Venter Institute, Inc., La Jolla CA
Abstract
The J. Craig Venter Institute is awarded a grant to develop MGTAXA, freely available software and a Web server for taxonomic classification of metagenomic sequences with machine learning techniques. This project will build three major components: 1) a toolbox for reliable assignment of species composition to large collections of unassembled environmental sequencing data, with automated and regular updates of databases and models; 2) a public Web server with a high-performance computational back-end that will let a wide community of biologists build classification models specific to their metagenomic samples; 3) an online instructional environment where students and educators will interactively combine several machine learning algorithms into graphically represented pipelines, apply them to sequences from annotated genomes, and contribute to a reusable repository of exercises and small research projects. The tools developed by this project will help both individual biologists and experienced bioinformatics teams analyze their metagenomic data for the discovery of novel genes, proteins, and metabolic pathways in microorganisms that cannot be grown under laboratory conditions. This basic research into our living environment will ultimately benefit the public by providing a necessary foundation for applied areas of study such as alternative energy sources and new medicines. The first task in any metagenomic study is to determine which species or higher taxonomic units are present in the sample and to bin individual sequences into these units. The novel methodology of this project will require neither existing homology to known sequences nor a preliminary assembly of individual fragments into longer segments. It will also free its users from the complexity of data management and installation that is beyond the abilities of smaller research groups.
The free interactive online learning interface will provide both hands-on experience and a curriculum development tool for students and teachers from colleges and high schools, regardless of their geographical location. Source code of the tools developed by this project will be available at the open source development site SourceForge (http://sourceforge.net/projects/mgtaxa/). Web services will be available through the JCVI Web site (http://www.jcvi.org/) and the TeraGrid (http://www.teragrid.org). Certain tools will be submitted for inclusion in the existing bioinformatics services Galaxy (http://galaxy.psu.edu) and CAMERA (http://camera.calit2.net).