Statistical Methods for Analyzing Complex Structured and Count Data
University of Washington, Seattle, WA
Investigators
Abstract
How can treatment/intervention effects in a complex social environment be measured? How can single-cell data be used to diagnose complex diseases such as autism spectrum disorder? This project aims to address these questions by developing statistics and machine learning methods that enable robust, interpretable, efficient, and fast analysis of big datasets routinely produced in biology, neuroscience, social sciences, politics, and epidemiology. The project encompasses two main tracks: (1) structured analysis for large datasets, for which the goal is to devise methods that can make efficient use of the intrinsic structure of the possibly very high-dimensional data without having to estimate the structure first; and (2) analysis of large count datasets, for which the goal is to design robust nonparametric models and algorithms that can handle complex, likely heterogeneous, count data. The investigator also plans to mentor and support graduate and undergraduate students majoring in statistics and related fields and broaden the participation of underrepresented minority students.

This project will advance the current state of knowledge in big structured and count data analyses by putting forward two main tracks of studies. The first is centered on random graph-based statistical inference through nearest neighbors (NN) or minimum spanning tree. Two main working examples in this track are NN matching for inferring the average treatment effect and graph-based correlation coefficients to infer marginal and conditional dependence strength. The investigator aims to revise and generalize these two families of methods to boost their efficiency while maintaining their robustness and computational speed. The second is centered on nonparametric univariate or multivariate Poisson mixture models. The investigator aims to bridge heterogeneous count-valued mixtures to nonparametric models (e.g., fully nonparametric, shape-constrained, nonnegative matrix factorization-based, etc.) under the umbrella of heterogeneous mixture model-based inference. The investigator will explore and settle several theory, method, computation, and application questions in the two tracks. Some preliminary results made in the first track have already stimulated new work in the causal inference community, and the results produced from the second track are expected to help with the early diagnosis of autism.

This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.
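For readers unfamiliar with the first track, the NN matching idea named in the abstract can be illustrated with a minimal sketch. This is the textbook one-nearest-neighbor matching estimator (with replacement, Euclidean distance), not the revised or generalized estimators the project proposes; the function name `nn_matching_ate` and its arguments are hypothetical and chosen only for illustration.

```python
import numpy as np

def nn_matching_ate(X, y, treated):
    """Textbook 1-NN matching estimate of the average treatment effect.

    X       : (n, p) array of covariates
    y       : (n,) array of observed outcomes
    treated : (n,) boolean array, True for treated units
    Illustrative sketch only: each unit's missing potential outcome is
    imputed by the outcome of its nearest neighbor in the opposite arm.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    treated = np.asarray(treated, dtype=bool)

    imputed = np.empty_like(y)
    for arm in (True, False):
        idx = np.flatnonzero(treated == arm)    # units needing a counterfactual
        pool = np.flatnonzero(treated != arm)   # candidate matches, opposite arm
        # Pairwise Euclidean distances between the two groups.
        diffs = X[idx][:, None, :] - X[pool][None, :, :]
        dist = np.linalg.norm(diffs, axis=2)
        imputed[idx] = y[pool[dist.argmin(axis=1)]]

    y1 = np.where(treated, y, imputed)  # outcome under treatment
    y0 = np.where(treated, imputed, y)  # outcome under control
    return float(np.mean(y1 - y0))

# Toy data (made up for illustration): two treated and two control units.
X_toy = [[0.0], [1.0], [0.1], [1.1]]
y_toy = [3.0, 5.0, 1.0, 2.0]
t_toy = [True, True, False, False]
ate_toy = nn_matching_ate(X_toy, y_toy, t_toy)
```

The brute-force distance matrix keeps the sketch short but costs O(n^2) memory; a tree-based neighbor search would be the usual choice for the large datasets the abstract has in mind.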