SHF: Small: Program Analysis for Data Science

$499,671 | FY2019 | CSE | NSF

Northeastern University, Boston MA

Investigators

Abstract

Data Science is a discipline that combines computing and statistics with the aim of turning raw data into insights. Data analysis is typically performed by composing a series of discrete tools and libraries into a data-analysis pipeline. Given the increased velocity that these tools afford researchers, one open question is: can we trust the claimed results? As new studies are produced at an increasingly high rate, and the number of data-science practitioners keeps growing, it is unclear whether the community has the resources to validate even a fraction of the research in print. The novelty of this project is to take some of the early steps towards the goal of trustworthy data analysis. The project's impact is to increase trust in the results of data analysis performed with technologies such as R and Spark. The project will also impact the broader community of R and Spark developers by offering analysis tools that can be used widely within the community. The human impact of this project is in the recruitment and retention of minority students in data-analysis research, and in helping them prepare for careers in STEM.

The first contribution of this project is to curate a corpus of data-analysis pipelines. This corpus will give researchers a window into the activities performed by practitioners and will, in and of itself, be a valuable addition to the general understanding of data analytics. The second contribution of this project will be a set of dynamic- and static-analysis tools for finding faults in data-analysis pipelines. Dynamic analysis will be used to gather behavioral data about the programs and libraries, as well as to catch latent bugs. Static-analysis techniques will be used to find coding idioms that are potentially buggy. One of the technical challenges to be solved is how to analyze incomplete code. This challenge arises because the languages used to write data-analysis code are often dynamic and can load new code at any time.

This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.
