BIGDATA: F: DKM: Addressing the two V's of Veracity and Variety in Big Data

$1,000,000 | FY2014 | CSE | NSF

University of Notre Dame, Notre Dame, IN

Investigators

Nitesh Chawla (contact), Dong Wang, Thanuka L Wickramarathne

Abstract

Data of questionable quality have had significant negative economic and social impacts on organizations, leading to cost overruns, lost revenue, and decreased efficiency. Issues of data reliability, credibility, and provenance become even more daunting when dealing with a variety of data, especially data that are not collected directly by an organization but come from third-party sources such as social media, data brokers, and crowdsourcing. To address these issues, this project aims to develop a Data Valuation Engine (DVE) that solves the critical problem of data reliability, credibility, and provenance, and provides accountability and quality processes right from data acquisition. The DVE leverages and innovates on techniques in estimation theory, data fusion, and machine learning to fill a critical gap in data accountability and quality, thereby providing a transformative step in countering the ubiquitous data quality issues found in almost every application domain, from business to environment to health to national security. The DVE will be integrated into the Hadoop ecosystem, will be agnostic to the data source, application, and analytics, and will be provided as a hosted solution to the community. Users will interact with the DVE by providing the data sources and relevant data necessary to solve a problem. The DVE will be developed in a largely application-independent manner. The key challenges in developing this engine include: (i) How to generate data quality indication labels that score data sources and data content on factors such as reliability, credibility, uncertainty, and confidence? (ii) How to integrate data from various sources that carry different labeled scores? (iii) How to robustly evaluate the proposed engine across a broad spectrum of applications that serve as proxies for a variety of real-world scenarios?
The research plan has been designed to address the above challenges synergistically, with a robust evaluation plan. Given the generality of the proposed methods, models, and system, the project will potentially impact a variety of applications in science, engineering, and social science, and have broad environmental, economic, and health benefits. The PIs will release open source software and applicable data. The PIs will also provide a hosted DVE platform for a broad user and participant base. This project is also providing students with greater exposure to the areas of big data analytics, cloud computing, data fusion, and data mining, both in courses and in research experiences.
