III: Medium: Bias Tracking and Reduction Methods for High-Dimensional Exploratory Visual Analysis and Selection

$1,097,598 | FY2017 | CSE | NSF

University of North Carolina at Chapel Hill, Chapel Hill, NC

Abstract

Exploratory visualization and analysis of large and complex datasets is growing increasingly common across a range of domains. For example, online companies track users to learn about their products, computer security logs capture detailed traces of network activity, and health care systems capture detailed longitudinal records for their patients. In all of these fields, large and complex data repositories are being created with the goal of supporting data-driven, evidence-based decision making. However, today's visualization tools -- a critical part of an analyst's toolbox -- are often overwhelmed when applied to high-dimensional datasets (i.e., datasets with large numbers of variables). Real-world datasets often have many thousands of variables, a stark contrast to the much smaller number of dimensions supported by most visualizations. This gap in dimensionality puts the validity of any analysis at great risk of bias, potentially leading to serious, hidden errors. This research project will develop a new approach to high-dimensional exploratory visualization that helps detect and reduce selection bias and other problems with data interpretation. The project's results, including open-source software, will be broadly applicable across domains. In addition, the project will be evaluated with users in a health outcomes research setting, which offers significant potential to improve health care around the world.

This project develops a set of Contextual Visualization Methods for exploratory data analysis, which are designed to support the discovery of more robust and generalizable insights from high-dimensional data. These methods are built upon a recognition that the very summarization that makes many visual methods effective also inherently obscures aspects of a high-dimensional dataset that may be critical to accurate interpretation of a user's visual findings. More specifically, the subset of data (comprising both dimensions and records) that is actively accounted for within a visualization -- the data focus -- must be interpreted within the context of the many dimensions and data records that have been omitted or are not clearly represented within a visualization -- the data context. The methods that this project develops, therefore, are designed to (1) explicitly model and analyze the data context, and (2) convey the relationship between the data focus and the data context in order to better inform users about hidden problems such as confounding variables and selection bias. The primary technical contributions of the project include (1) inline replication for visual validation; (2) baselined selection methods for high-dimensional visualization; and (3) interactive rebalancing for representative visualization. In addition, open-source software will be developed and evaluated with real-world data and practitioners. The products of this research project -- including new methods, software products, and evaluation results -- will be disseminated through a project website (https://vaclab.web.unc.edu/contextual-visualization/).
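
The relationship between the data focus and the data context can be illustrated with a small sketch. It is not the project's software or method: the function names, the toy records, and the choice of total variation distance as the comparison measure are all assumptions made for illustration. The idea is that when a user selects records using a focus dimension, each omitted (context) dimension is checked for how far its distribution in the selection has drifted from the full dataset, which is one simple way to surface possible selection bias.

```python
from collections import Counter

def category_shares(records, dim):
    """Return the share of records falling in each category of `dim`."""
    counts = Counter(r[dim] for r in records)
    total = sum(counts.values())
    return {k: v / total for k, v in counts.items()}

def context_drift(records, selected, focus_dims):
    """Score each context (non-focus) dimension by how far the selected
    subset's distribution is from the full dataset's (the baseline).

    Illustrative measure: total variation distance, where 0 means
    identical category shares and 1 means no overlap at all.
    """
    context_dims = [d for d in records[0] if d not in focus_dims]
    drift = {}
    for dim in context_dims:
        base = category_shares(records, dim)
        sub = category_shares(selected, dim)
        cats = set(base) | set(sub)
        drift[dim] = 0.5 * sum(
            abs(base.get(c, 0.0) - sub.get(c, 0.0)) for c in cats
        )
    # Largest drift first, so the most skewed context dimension leads.
    return dict(sorted(drift.items(), key=lambda kv: kv[1], reverse=True))

# Hypothetical toy data: "outcome" is the only dimension shown on screen.
records = [
    {"outcome": "readmitted", "sex": "F", "insured": "no"},
    {"outcome": "readmitted", "sex": "M", "insured": "no"},
    {"outcome": "ok", "sex": "F", "insured": "yes"},
    {"outcome": "ok", "sex": "M", "insured": "yes"},
]
selected = [r for r in records if r["outcome"] == "readmitted"]
drift = context_drift(records, selected, focus_dims={"outcome"})
print(drift)
```

In this toy data the selection is balanced on `sex` but skewed entirely toward uninsured patients, so `insured` ranks first. A visualization tool could use such a ranking to warn the user that a hidden variable may explain what the focus view appears to show.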
