Standards and Cyberinfrastructure that Enable "Big-Data" Driven Discovery for Tree Crop Research

$2,983,307 | FY2016 | BIO | NSF

Washington State University, Pullman WA

Investigators

Doreen Main (contact), Albert G Abbott, Jill L Wegrzyn, Margaret E Staton, Sook Jung

Abstract

Trees are fundamental for life, providing essential oxygen, carbon remediation, habitat, lumber, shelter, energy, food and recreation. Contributing over $130 billion per year, tree crops are important to the U.S. economy and are the economic backbone of many rural areas. Like all crops, they face increasing challenges from abiotic and biotic stresses, including rapid climate change and disease. Providing access to high-quality genotypic, phenotypic and environmental data and data-mining tools from a common, resource-efficient platform will enable interrogation of these data for basic and applied research in ways not currently available. This project will create a model "ecosystem" of community databases that can inter-communicate and provide big data analysis tools utilizing common controlled vocabularies. The significant investment in big data generation, cyberinfrastructure, and comprehensive semantic ontologies by federal agencies will be leveraged by this project to bring richly annotated datasets and enhanced computing capabilities to individual scientists. Adoption of these new capabilities will be promoted through online educational modules for "guided" workflow analysis and ontological curation that train scientists to effectively query existing data, upload new data, assign metadata, and perform custom analyses. It is anticipated that the outcomes of this project will accelerate both basic discovery and improvement of important agronomic and silvic traits in tree crops. It is also hoped that this project will help raise public awareness of the critical importance of healthy trees to a productive, sustainable planet and the U.S. economy, and promote stewardship of these critical resources.
Connecting high-quality, curated phenotypic and genotypic data with geo-location and environmental data will enable fundamental questions in tree biology to be addressed. Providing access to these integrated datasets, and to the tools to interrogate them in a fully targeted manner, is best achieved through community databases where the crop curation expertise resides. The use of standard ontologies, cross-site querying functionality and web-services-driven interoperability with other databases and resources will expand the utility of data from community databases in an unprecedented way. Tripal is an open-source, customizable, scalable, modular database platform designed to address the constraints and resource inefficiencies of legacy database systems. This project will both leverage and coordinate funded efforts to enhance or update tree crop databases (Genome Database for Rosaceae, Citrus Genome Database, TreeGenes and Hardwood Genomics Web) to Tripal, supporting cross-site communication, adoption of existing standards, and "big data" integration and analysis.
