Benchmarking of Clustering Validity Measures Revisited
By: Connor Simpson, Ricardo J. G. B. Campello, Elizabeth Stojanovski
Potential Business Impact:
Finds the best groups in data.
Validation plays a crucial role in the clustering process. Many internal validity indexes exist for determining the best clustering solution(s) from a given collection of candidates, e.g., as produced by different algorithms or by different hyper-parameter settings of the same algorithm. In this study, we present a comprehensive benchmark of 26 internal validity indexes, including both highly popular classic indexes and more recently developed ones. We adopt an enhanced revision of the methodology presented in Vendramin et al. (2010), developed here to address several shortcomings of that earlier work. The new approach consists of three complementary, custom-tailored evaluation sub-methodologies, each designed to assess specific aspects of an index's behaviour while avoiding potential biases of the other sub-methodologies. Each sub-methodology features two complementary measures of performance, alongside mechanisms that allow for an in-depth investigation of more complex behaviours of the internal validity indexes under study. Additionally, a new collection of 16177 datasets has been produced and paired with eight widely used clustering algorithms, giving a wider scope of applicability and representing more diverse clustering scenarios.
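To make the selection task concrete, the following is a minimal sketch of how an internal validity index can rank a collection of candidate partitions without using ground-truth labels. It uses the mean silhouette width purely as a familiar illustration; the abstract does not list which 26 indexes are benchmarked, and the toy points, the candidate labelings, and the helper name `silhouette` are all assumptions made for this example rather than anything taken from the paper.

```python
import math


def silhouette(points, labels):
    """Mean silhouette width of a partition, using Euclidean distance.

    Illustrative helper (not from the paper). Singleton clusters
    contribute a silhouette of 0, following the usual convention.
    """
    clusters = {}
    for idx, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(idx)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")

    total = 0.0
    for i, p in enumerate(points):
        own = clusters[labels[i]]
        if len(own) == 1:
            continue  # singleton cluster: contributes 0
        # a: mean distance to the other members of the point's own cluster
        a = sum(math.dist(p, points[j]) for j in own if j != i) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster
        b = min(
            sum(math.dist(p, points[j]) for j in members) / len(members)
            for lab, members in clusters.items()
            if lab != labels[i]
        )
        denom = max(a, b)
        if denom > 0:
            total += (b - a) / denom
    return total / len(points)


# Toy data (assumed): three well-separated groups of three points each.
points = [
    (0, 0), (0, 1), (1, 0),
    (10, 10), (10, 11), (11, 10),
    (20, 0), (20, 1), (21, 0),
]

# Hypothetical candidate partitions, e.g. from different algorithms or settings.
candidates = {
    "k=2": [0, 0, 0, 1, 1, 1, 1, 1, 1],
    "k=3": [0, 0, 0, 1, 1, 1, 2, 2, 2],
    "k=4": [0, 0, 3, 1, 1, 1, 2, 2, 2],
}

scores = {name: silhouette(points, labels) for name, labels in candidates.items()}
best = max(scores, key=scores.get)
print(best, {name: round(s, 3) for name, s in scores.items()})
```

On this toy data the three-cluster candidate scores highest, matching the way the points were constructed. A benchmark such as the one described above asks how reliably each index makes this kind of choice across many datasets and algorithms.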