CRII: CHS: Improving Data Exploration by Mining Analyst Behavior

$175,000 · FY2019 · CSE · NSF

Cornell University, Ithaca NY

Investigators

Abstract

Data analysts must explore increasingly large amounts of information. A variety of tools to visualize, filter, and model data help make it more manageable, yet these tools bring with them a risk of errors of interpretation, omission, or cognitive bias. This project develops techniques for making data analysis tools more resistant to such errors. The team will record and store logs of the processes many different analysts use to explore a variety of datasets. From these logs of analyst behavior, the team will create models that identify both potentially advantageous and risky working strategies. They will use these models, along with tools designed to detect biases, to build new data analysis tools. The new tools will shape the presentation of data to users and suggest exploration choices to help counter errors and biases. For example, if past successful analysts, when faced with a similar set of data, performed a particular sequence of operations, the tool might suggest that the current user pursue similar procedures. The tools and work will be deployed publicly, benefiting individuals who may be less familiar with data science.

The ultimate goal of this research program is to construct data analytics technology that learns from experts' work in order to improve the experience of novice data scientists. In this project, the team will build a dataset of analyst behavior using crowdsourcing platforms. To identify commonalities among many different analysts and datasets, they will develop a vocabulary (or set of abstractions) of analysis actions grounded in existing literature on analytics tool design. These actions will be encapsulated as states in a probabilistic graphical model of the sequence of actions taken during analysis. The team will explore how different behavior abstractions influence the fitting of the probabilistic model. The team will create algorithms to match new analyst sessions to the existing models and identify potential intervention (or coaching) points during the data exploration. The project will incorporate these interventions into existing data analysis tools as a proof of concept, exploring in laboratory studies how individuals respond to different intervention strategies. In addition to developing specific models and tools around data analytics workflow, this project sets up a broader research agenda of mining interaction logs for deeper insights into analyst cognition and working strategies.

This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.
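As an illustration of the modeling idea described in the abstract, the sketch below treats abstracted analysis actions as states of a first-order Markov chain, which is one of the simplest probabilistic graphical models over action sequences. It is not the project's actual method: the action vocabulary, the function names, and the choice of a Markov chain with add-one smoothing are all assumptions made for the example. It shows how logged sessions could be used to fit transition probabilities, score how well a new session matches the fitted model, and suggest a likely next action.

```python
import math

# Hypothetical action vocabulary; the project's real abstractions are not
# specified in the abstract.
ACTIONS = ["filter", "visualize", "model", "inspect"]


def fit_transition_model(sessions, actions=ACTIONS, alpha=1.0):
    """Estimate P(next action | previous action) from logged sessions.

    alpha is an additive (Laplace) smoothing constant so that transitions
    never seen in the logs still receive a small nonzero probability.
    """
    counts = {a: {b: 0.0 for b in actions} for a in actions}
    for session in sessions:
        for prev, nxt in zip(session, session[1:]):
            counts[prev][nxt] += 1.0
    model = {}
    for a in actions:
        total = sum(counts[a].values()) + alpha * len(actions)
        model[a] = {b: (counts[a][b] + alpha) / total for b in actions}
    return model


def session_log_likelihood(model, session):
    """Score how well a new session matches the fitted model."""
    return sum(math.log(model[p][n]) for p, n in zip(session, session[1:]))


def suggest_next(model, last_action):
    """Return the most probable next action after last_action."""
    row = model[last_action]
    return max(row, key=row.get)


# Toy logs standing in for crowdsourced analyst sessions (made up for
# this example, not real data).
logged_sessions = [
    ["filter", "visualize", "model"],
    ["filter", "visualize", "inspect"],
    ["filter", "visualize", "model"],
]
model = fit_transition_model(logged_sessions)
typical = session_log_likelihood(model, ["filter", "visualize", "model"])
unusual = session_log_likelihood(model, ["model", "filter", "inspect"])
```

In this toy setup, a session that follows the logged pattern scores higher than one that does not, and a low score could flag a point where coaching might help. A richer model, such as one with hidden states, would follow the same fit, score, and suggest pattern.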
