SHF: Small: Collaborative Research: Programming Tools for Adaptive Data Analysis

$152,696 · FY2020 · CSE · NSF

Trustees Of Boston University, Boston

Abstract

False discovery, or overfitting, occurs when an empirical researcher draws from a dataset a conclusion that does not generalize to new data. Although there are many statistical methods for preventing false discovery, most are designed for static data analysis, where a dataset is used only once. Modern data analysis, however, is adaptive: the same datasets are often reused for multiple studies by multiple researchers. Statisticians have identified adaptivity as one cause of non-reproducible research, and this project's broader significance is to begin addressing this problem. Specifically, this project will build a prototype programming tool for preventing false discovery arising from adaptive data analysis. The intellectual merit is to incorporate and extend recent theoretical advances on this problem into a programming framework that allows researchers to analyze datasets adaptively with robust guarantees against overfitting. The project builds on a surprising recent connection between false discovery and differential privacy, a robust statistical guarantee that emerged to protect the privacy of sensitive data. This line of work shows that when data is analyzed in a differentially private way, false discoveries cannot occur. Differential privacy is also programmable: complex differentially private algorithms can be built from simple components, which makes it an ideal programming framework for adaptive data analysis. This project is extending existing differentially private programming frameworks to adaptive data analysis. The PIs are also developing new algorithmic and programming-language tools for adaptive data analysis and incorporating them into the first prototype system for this application.
