Improving clustering quality evaluation in noisy Gaussian mixtures
By: Renato Cordeiro de Amorim, Vladimir Makarenkov
Potential Business Impact:
Makes automatic data groupings more accurate, even with noisy data.
Clustering is a well-established technique in machine learning and data analysis, widely used across various domains. Cluster validity indices, such as the Average Silhouette Width, Calinski-Harabasz, and Davies-Bouldin indices, play a crucial role in assessing clustering quality when external ground truth labels are unavailable. However, these measures can be distorted by irrelevant or noisy features, potentially leading to unreliable evaluations in high-dimensional or noisy data sets. We introduce a theoretically grounded Feature Importance Rescaling (FIR) method that enhances the quality of clustering validation by adjusting feature contributions based on their dispersion. It attenuates noise features, clarifies cluster compactness and separation, and thereby aligns clustering validation more closely with the ground truth. Through extensive experiments on synthetic data sets under different configurations, we demonstrate that FIR consistently improves the correlation between the values of cluster validity indices and the ground truth, particularly in settings with noisy or irrelevant features. The results show that FIR increases the robustness of clustering evaluation, reduces variability in performance across different data sets, and remains effective even when clusters exhibit significant overlap. These findings highlight the potential of FIR as a valuable enhancement of clustering validation, making it a practical tool for unsupervised learning tasks where labelled data is unavailable.
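The core idea described in the abstract, downweighting features by their dispersion before computing a validity index, can be sketched as follows. This is a minimal illustration, not the paper's actual FIR formula: the specific weighting used here (each feature's between-cluster share of total variance) and all variable names are assumptions for demonstration purposes.

```python
# Hypothetical sketch of a dispersion-based feature-rescaling step before
# computing a cluster validity index; the paper's exact FIR scheme may differ.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)

# Two well-separated Gaussian clusters in 2 informative features,
# plus 3 pure-noise features that dilute the raw silhouette score.
informative = np.vstack([
    rng.normal(0.0, 0.5, size=(100, 2)),
    rng.normal(4.0, 0.5, size=(100, 2)),
])
noise = rng.normal(0.0, 2.0, size=(200, 3))
X = np.hstack([informative, noise])

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# Assumed weighting: a feature's weight is the fraction of its total variance
# explained by the clustering (between-cluster share). Noise features, whose
# dispersion is unrelated to the clusters, get weights near zero.
within = np.zeros(X.shape[1])
for k in np.unique(labels):
    Xk = X[labels == k]
    within += ((Xk - Xk.mean(axis=0)) ** 2).sum(axis=0)
total = ((X - X.mean(axis=0)) ** 2).sum(axis=0)
weights = 1.0 - within / total  # between-cluster share, in [0, 1]
weights /= weights.sum()

raw = silhouette_score(X, labels)
rescaled = silhouette_score(X * np.sqrt(weights), labels)
print(f"silhouette raw={raw:.3f}  rescaled={rescaled:.3f}")
```

On this synthetic example the rescaled silhouette is higher than the raw one, mirroring the abstract's claim that attenuating noise features makes the index reflect the true cluster structure more faithfully.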
Similar Papers
A Fast Iterative Robust Principal Component Analysis Method
Computational Engineering, Finance, and Science
Cleans messy data to find true patterns.
Technical note on Fisher Information for Robust Federated Cross-Validation
Machine Learning (CS)
Fixes AI learning when data is spread out.
On the double robustness of Conditional Feature Importance
Statistics Theory
Finds the most important clues in data.