Collaborative Research: Identifying Structure in Social Data Models using Markov Chain Monte Carlo Algorithms

$162,497 | FY2010 | SBE | NSF

Washington University, Saint Louis MO

Investigators

Abstract

Social science data are often difficult to analyze for reasons that affect other fields less severely. One problem that traditional statistical models handle especially poorly is deliberately withheld information that correlates strongly with the phenomena of interest. Such information can be thought of as unobserved clustering in the data. This project will substantially improve the current state of model-based clustering algorithms using Generalized Linear Mixed Dirichlet Models (GLMDM). The investigators' key objectives are to: (1) better understand the unobserved clustering effects that are pervasive in social science datasets, notably in empirical studies of terrorism; (2) adapt GLMDM algorithms to identify substantively meaningful clusters through posterior probabilities that use covariate information; (3) develop an algorithmic approach that incorporates variable selection within clusters directly into a general clustering model; (4) speed up simultaneous clustering and variable selection through parallelization; and (5) distribute this technology as an easy-to-use R package for general use.

This project will establish a new approach for using Bayesian nonparametric methods to produce clustering based on posterior probabilities. The development of nonparametric clustering algorithms is expected to substantially improve the current state of data clustering. The algorithmic developments, which will be disseminated widely, can be applied in any scientific field and will contribute to the statistical literature on Markov chain Monte Carlo. The new approach will be applied to the empirical study of terrorism. The project will also aid the intellectual development of students and a postdoctoral researcher, who will benefit from its interdisciplinary focus.
