III: Large: Collaborative Research: Analysis Engineering for Robust End-to-End Data Science
Carnegie Mellon University, Pittsburgh PA
Investigators
Abstract
From poor statistical practices leading to retractions of scientific "discoveries" to low-level spreadsheet errors subverting high-stakes analyses, failures of data analysis can have catastrophic consequences. The rapid growth of data science practice in the last decade has led to large collaborative efforts to develop new data processing, machine learning, and analytics tools that put more advanced data analysis into the hands of a wider audience of practitioners, from students to scientists to designers. The dominant tool for data science is code, where cutting-edge algorithms can be applied from existing libraries. However, although this democratization of data science has lowered the barrier to using advanced methods, safely using these tools under sound statistical practice remains as difficult as ever. To facilitate more robust data science, this project investigates models and tools for analysis engineering by data scientists who write programs. The focus is on the complete end-to-end process of data analysis performed with code: the iterative, and often exploratory, steps that analysts go through to turn data into results. This project will contribute insights into and characterizations of analytic work, novel methods for capturing and analyzing data science activities, and new programming tools and visualization methods for authoring and validating analyses. If successful, this project will augment people's ability to conduct and assess data analyses, promoting more robust results and reducing the gap between novice and expert analysts. The findings and tools from the project will be incorporated into educational efforts, including classroom teaching and tutorials, and made available as open-source software integrated into popular analytical environments (e.g., Jupyter). Data analysis is a central activity in scientific research, yet it is too often conducted in an undisciplined fashion.
This project treats the entire analytic process as its central phenomenon of study. The project will employ mixed methods to study and characterize common analysis practices and pitfalls, including direct observations of data analysts, large-scale analysis of computational notebooks, and instrumentation of analytic programming environments like JupyterLab. The project will contribute new methods for specifying and safeguarding analyses, including domain-specific languages and program synthesis methods to guide users to preferred next steps. It will also explore "multiverse" workflows to manage and assess a diversity of analysis decisions. Analogues of debugging and testing tools will be developed to flag problems and perform error analysis, while the capture and visualization of analytic provenance will aid reproducibility, verification, and collaborative review. The work will be evaluated through controlled studies, classroom use, and open-source deployment for wide-scale field use. This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.