CAREER: Integrating Optimal Design and Inference for Modern Observational Studies

$437,422 | FY2022 | SBE | NSF

University Of California-Berkeley, Berkeley CA

Abstract

This research project will develop new methods for inference about causal relationships in large administrative datasets. These datasets are an increasingly important source of evidence about causal effects in health services, public policy, and the social sciences. Procedures for measuring the causal effect of a treatment of interest emphasize creating similar subgroups of individuals, one receiving treatment and another receiving control. In practice, however, this process cannot achieve perfect similarity in large administrative datasets, especially when the units in the study exhibit structure over time or space. The methods to be developed will produce confidence intervals and hypothesis tests that account explicitly for imperfect design and for structure across units. Careful analysis of these methods will lead to valuable guidance on how to make initial designs less imperfect. The resulting tools will pair effectively with modern machine learning methods in a modular framework and will be immediately applicable to large-scale studies of health and educational outcomes. Supported educational activities will engage undergraduate researchers from diverse backgrounds, prepare graduate students for future roles as faculty mentors, and produce pedagogical materials valuable for training undergraduate students in the principles of causal inference. Open-source software will also be developed.

This project will focus on integrating design and inference for two widely used causal inference designs that attempt to create credible comparisons from initially different treatment and control samples: matching, which groups similar treated and control individuals together into small homogeneous matched sets; and weighting, which constructs weights for study units with the goal of downplaying dissimilarities and emphasizing similarities between the groups. Four specific tasks will be undertaken. First, existing methods of permutation inference for matched designs, which reshuffle labels for treated and control units within matched groups to construct hypothesis tests, will be transformed by allowing permutation probabilities to vary according to the degree of remaining discrepancy in individual matched sets. The resulting method, which will be easy to implement using estimated probabilities of treatment, permits sensitivity analysis for unobserved variables and suggests a new method of choosing an initial match that effectively manages tradeoffs between similarity on probability of treatment and similarity on outcome risk. Second, new algorithms will be designed to efficiently sample from conditional permutation distributions that respect design constraints in matching, including optimal pairing on estimated probabilities of treatment and constrained imbalance on multiple variables. These tools will lead to reduced bias and improved precision by accounting for aspects of the study's design. Third, tools for matched permutation inference will be extended to clustered observational studies with treatment given at both cluster and individual levels and possible spillover effects. An accompanying sensitivity analysis that addresses unobserved variables at both individual and cluster levels will also be constructed. Finally, a new measure for large-sample performance of weighting methods in the presence of unobserved variables will be constructed, quantifying the impact of design choices on robustness to bias from unobserved variables.

This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.
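As an illustration of the first task only, the idea of letting within-pair permutation probabilities depend on estimated probabilities of treatment can be sketched for matched pairs as follows. This is a minimal sketch under stated assumptions, not the project's software: it assumes exactly one treated unit per pair, independent treatment assignment across units, and a sharp null hypothesis of no effect. The function names `pair_treatment_probs` and `weighted_pair_permutation_test` are hypothetical. Under those assumptions, if the two units in a pair have estimated treatment probabilities e1 and e2, the chance that the first unit is the treated one is e1(1 - e2) / [e1(1 - e2) + e2(1 - e1)], which equals 1/2 only when the pair is perfectly matched on that probability.

```python
import numpy as np


def pair_treatment_probs(e_first, e_second):
    """P(first unit is the treated one | exactly one unit in the pair is treated).

    Assumes independent assignment with estimated treatment probabilities
    e_first and e_second (illustrative assumption, not from the abstract).
    """
    e_first = np.asarray(e_first, dtype=float)
    e_second = np.asarray(e_second, dtype=float)
    num = e_first * (1.0 - e_second)
    return num / (num + e_second * (1.0 - e_first))


def weighted_pair_permutation_test(y_first, y_second, first_treated, p_first,
                                   n_draws=10000, seed=0):
    """One-sided Monte Carlo permutation test with unequal label-swap probabilities.

    The statistic is the mean treated-minus-control difference across pairs.
    Under the sharp null of no effect, outcomes are fixed and only the
    treatment labels within each pair are redrawn.
    """
    rng = np.random.default_rng(seed)
    diff = np.asarray(y_first, dtype=float) - np.asarray(y_second, dtype=float)
    first_treated = np.asarray(first_treated, dtype=bool)
    p_first = np.asarray(p_first, dtype=float)

    # Observed statistic: sign is +1 where the first unit was actually treated.
    observed = np.mean(np.where(first_treated, 1.0, -1.0) * diff)

    # Redraw which unit is "treated" in each pair with probability p_first.
    draws = rng.random((n_draws, diff.size)) < p_first
    null_stats = (np.where(draws, 1.0, -1.0) * diff).mean(axis=1)

    # Add-one correction keeps the Monte Carlo p-value strictly positive.
    p_value = (1 + np.sum(null_stats >= observed)) / (n_draws + 1)
    return observed, p_value


if __name__ == "__main__":
    # Toy inputs invented purely for illustration.
    p = pair_treatment_probs([0.6, 0.5, 0.3], [0.4, 0.5, 0.5])
    stat, pval = weighted_pair_permutation_test(
        [3.0, 5.0, 4.0], [1.0, 2.0, 6.0], [True, True, False], p)
    print(stat, pval)
```

Setting every entry of `p_first` to 0.5 recovers the usual uniform permutation test for matched pairs, so the sketch differs from the standard procedure only in how the labels are redrawn.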
