Statistical Foundations of Model-Based Variable Clustering

$250,000 | FY2017 | MPS | NSF

Cornell University, Ithaca NY

Investigators

Florentina Bunea (contact), Marten H Wegkamp

Abstract

Variable clustering is a cornerstone of many areas, including genetics, neuroscience, sociology, and macroeconomics. In neuroscience, it aids in finding new functionally connected areas. In genetics, it helps advance the discovery of genes with under-explored or unknown functions. In macroeconomics, it can assist with the creation of new economic indices. Despite its widespread importance and potential impact, this problem has not received a systematic methodological and theoretical treatment in the literature. Although clustering algorithms abound and have a very long history, assessing the validity of their output is somewhat arbitrary. This project puts forward a probabilistic, model-based approach, which will enable the development of a unified framework for principled statistical variable clustering. Specifically, the project will introduce and investigate classes of latent variable models for overlapping and non-overlapping variable clustering. The focal points are: (I) The introduction of identifiable latent variable models for clustering. This will provide well-defined targets for estimation and will facilitate the scientific interpretation of the clusters. (II) The development of polynomial-time algorithms tailored to these models. (III) The creation of a unifying framework for the theoretical analysis of clustering algorithms, with emphasis on minimax optimality and high-dimensional inference. (IV) The study of the impact of model-based clustering algorithms on downstream analyses, with emphasis on graphical models, regression, and classification.
