Hierarchical Clustering With Confidence
By: Di Wu, Jacob Bien, Snigdha Panigrahi
Potential Business Impact:
Makes computer-generated data groupings more trustworthy and reliable.
Agglomerative hierarchical clustering is one of the most widely used approaches for exploring how observations in a dataset relate to each other. However, its greedy nature makes it highly sensitive to small perturbations in the data, often producing different clustering results and making it difficult to separate genuine structure from spurious patterns. In this paper, we show how randomizing hierarchical clustering can be useful not just for measuring stability but also for designing valid hypothesis testing procedures based on the clustering results. We propose a simple randomization scheme together with a method for constructing a valid p-value at each node of the hierarchical clustering dendrogram that quantifies the evidence against performing the greedy merge. Our test controls the Type I error rate, works with any hierarchical linkage without case-specific derivations, and, in simulations, is substantially more powerful than existing selective inference approaches. To demonstrate the practical utility of our p-values, we develop an adaptive $\alpha$-spending procedure that estimates the number of clusters, with a probabilistic guarantee against overestimation. Experiments on simulated and real data show that this estimate yields powerful clustering and can be used, for example, to assess clustering stability across multiple runs of the randomized algorithm.
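To make the shape of this workflow concrete, here is a minimal, hypothetical Python sketch: compute a Monte Carlo p-value for each merge of the dendrogram, then spend an alpha budget from the top of the tree down to estimate the number of clusters. Everything in it is an illustrative assumption, not the paper's method: the column-permutation null, the budget-halving rule, and the helper names (merge_pvalues, estimate_k_alpha_spending) merely stand in for the paper's randomization scheme, selective p-values, and adaptive alpha-spending procedure, and carry none of its guarantees.

    # Illustrative sketch only -- not the paper's test.
    import numpy as np
    from scipy.cluster.hierarchy import linkage

    rng = np.random.default_rng(0)

    def null_merge_heights(X, n_draws=200, method="average"):
        """Merge heights after independently permuting each feature column,
        a crude no-cluster null that keeps marginals but destroys structure."""
        n, d = X.shape
        heights = np.empty((n_draws, n - 1))
        for b in range(n_draws):
            Xb = np.column_stack([rng.permutation(X[:, j]) for j in range(d)])
            heights[b] = linkage(Xb, method=method)[:, 2]
        return heights

    def merge_pvalues(X, n_draws=200, method="average"):
        """Hypothetical Monte Carlo p-values, one per merge: the fraction of
        null runs whose i-th merge is at least as tall as the observed i-th
        merge (add-one corrected). A small value flags an unusually tall
        merge, i.e. evidence against greedily joining the two clusters."""
        obs = linkage(X, method=method)[:, 2]
        ref = null_merge_heights(X, n_draws, method)
        return (1 + (ref >= obs).sum(axis=0)) / (1 + n_draws)

    def estimate_k_alpha_spending(pvals, alpha=0.05):
        """Toy alpha-spending walk, top merge first: keep splitting while the
        p-value stays below the remaining budget, spending half the budget at
        each accepted split. (A linear walk, not a full tree traversal.)"""
        k, budget = 1, alpha
        for p in pvals[::-1]:
            spend = budget / 2.0
            if p >= spend:
                break
            k += 1
            budget -= spend
        return k

    # Usage on toy two-cluster data: prints the per-merge p-values for the
    # three topmost merges and the resulting cluster-count estimate.
    X = np.vstack([rng.normal(0.0, 1.0, (30, 2)),
                   rng.normal(5.0, 1.0, (30, 2))])
    pv = merge_pvalues(X)
    print(pv[-3:], estimate_k_alpha_spending(pv))

The permutation null is the crude part of this sketch: the paper's contribution is precisely a randomization scheme that yields valid, linkage-agnostic p-values without such ad hoc reference distributions.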
Similar Papers
Hierarchical Linkage Clustering Beyond Binary Trees and Ultrametrics
Machine Learning (CS)
Finds hidden groups in data, even when none truly exist.
Reclustering: A New Method to Test the Appropriate Level of Clustering
Methodology
Finds the best way to group data for analysis.
Learning-Augmented Hierarchical Clustering
Data Structures and Algorithms
Helps group similar things by asking smart questions.