Adaptive Thresholding for Hierarchical Clustering of Variables, with Connections to Scan Statistics

$150,000 | FY2016 | MPS | NSF

Carnegie Mellon University, Pittsburgh PA

Investigators

Abstract

In modern data analysis with large data sets, a common goal is to detect groups of variables that exhibit similar behavior. This task is usually referred to as clustering. In genetics and proteomics, for instance, clustering can reveal structures of scientific interest, such as potential biological pathways. Beyond detecting scientifically relevant structure in the data, clustering can also be used to simplify data representations and analysis. One of the most widely used approaches to clustering is hierarchical clustering. In hierarchical clustering, a measure of similarity, such as correlation, is computed between each pair of variables, and then similar groups of variables are repeatedly merged. This leads to a fundamental question: how much grouping should be done? The proposed research consists of several projects aimed at developing broadly applicable methods for determining the appropriate amount of clustering, based on the degree of similarity present in the data. The procedures will also provide statistical guarantees on the meaning of the resulting groups.

This proposal aims to develop practical procedures for adaptive thresholding of hierarchical clustering dendrograms, when applied to pairwise similarities of variables. These procedures will be connected to inferential guarantees about the false cluster error rate of the resulting clustering. The results will target a range of common linkages and variable similarity measures. The PI will also demonstrate these procedures in a modern genetics application. To support these procedures, new theory will be developed describing the large order statistics of variable similarity measures, including new asymptotic bounds on their joint distributions and new finite-sample bounds on their maxima. The techniques proposed here will also have application to other threshold-based procedures in statistics; in particular, connections may be made between the proposed work and adaptive thresholding procedures for scan statistics.
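
For illustration only, the merging procedure described in the abstract can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the adaptive method proposed in the award: the similarity measure is assumed to be absolute Pearson correlation, the linkage is assumed to be average linkage, and the cut is a fixed, hand-picked threshold of 0.5. The function names `pearson` and `cluster_variables` and the simulated data are hypothetical.

```python
import math
import random


def pearson(x, y):
    """Sample Pearson correlation between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def cluster_variables(columns, threshold):
    """Average-linkage agglomerative clustering of variables.

    columns   -- list of equal-length lists, one list per variable
    threshold -- ASSUMPTION: a fixed cut; merging stops once the best
                 average |correlation| between two clusters drops below it.
                 The award proposes choosing this adaptively instead.
    """
    p = len(columns)
    # Pairwise similarity between variables: absolute correlation.
    sim = [[abs(pearson(columns[i], columns[j])) for j in range(p)]
           for i in range(p)]
    clusters = [[i] for i in range(p)]  # start with every variable alone
    while len(clusters) > 1:
        best, pair = -1.0, None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Average linkage: mean similarity over all cross pairs.
                total = sum(sim[i][j] for i in clusters[a] for j in clusters[b])
                s = total / (len(clusters[a]) * len(clusters[b]))
                if s > best:
                    best, pair = s, (a, b)
        if best < threshold:
            break  # remaining groups are too dissimilar to merge
        a, b = pair
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return [sorted(c) for c in clusters]


# Simulated data (hypothetical): variables 0-2 share one latent signal,
# variables 3-4 share another, each observed with independent noise.
random.seed(0)
n = 200
z1 = [random.gauss(0, 1) for _ in range(n)]
z2 = [random.gauss(0, 1) for _ in range(n)]
columns = [[z + random.gauss(0, 0.3) for z in z1] for _ in range(3)]
columns += [[z + random.gauss(0, 0.3) for z in z2] for _ in range(2)]

clusters = cluster_variables(columns, threshold=0.5)
print(clusters)
```

The fixed value 0.5 is exactly the kind of arbitrary choice the abstract questions: a threshold set too high leaves related variables unmerged, while one set too low merges unrelated groups.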
