
BD Spokes: PLANNING: MIDWEST: Cyberinfrastructure to Enhance Data Quality and Support Reproducible Results in Sensor Originated Big Data

$99,938 | FY2016 | CSE | NSF

Purdue University, West Lafayette IN

Investigators

Abstract

Today, scientists in many biophysical and agricultural sciences use sensors to collect a wide range of data, including weather, plant pathogens, energy distribution, and water consumption, and apply data management techniques to derive additional knowledge. A critical requirement for using such data is ensuring that the data are free of errors, accurate, complete, and up to date. Low-quality data may be caused by a variety of real-world issues, including faulty equipment, software faults, or adverse environmental conditions. However, manually validating data quality is not viable for very large sensor systems that generate high volumes of data, and delays in validation may result in losing data that cannot be regenerated because temporal conditions have changed. In scientific research today, it is also critical that research processes be reproducible in order to validate research results and detect scientific fraud. Research reproducibility is particularly challenging for experiments that include sensors because re-creating the physical conditions under which data were captured is often difficult, if possible at all. This project aims to create and foster a multidisciplinary community focused on data quality and the reproducibility of research results for sensor-based experiments. Two workshops will be organized to define the foundational concepts and requirements for data quality and research reproducibility, and to identify related requirements for the development of suitable cyberinfrastructures (CI). Initial approaches will be tested within the CRIS system, developed at Purdue University, as a scalable CI, and advances will be demonstrated in an integrated pilot system and tested at the Purdue University ACRE research and education facility. The project will result in an understanding of the requirements that a CI must address in order to support data quality and the reproducibility of research results, even when sensors are part of the CI.
These requirements will lead to scientific advances in several areas, including data quality for big data; data quality assessment for sensor-originated data; management techniques for sensor-based experiments; extended metadata for research results reproducibility; and scientific workflow systems. The project will also advance the use of CI in agricultural applications with respect to the use of sensors and the quality of data. The project will have a broader impact on any data-intensive application, especially applications characterized by large volumes of interrelated data, as data interrelationships offer interesting opportunities for data quality. The project will contribute to the discussion on research results reproducibility by demonstrating CI tools that can support reproducibility.

View original record on NSF Award Search →