High-Dimensional BWDM: A Robust Nonparametric Clustering Validation Index for Large-Scale Data
By: Mohammed Baragilly, Hend Gabr
Potential Business Impact:
Finds the best number of groups in messy, big data.
Determining the appropriate number of clusters in unsupervised learning is a central problem in statistics and data science. Traditional validity indices such as Calinski-Harabasz, Silhouette, and Davies-Bouldin depend on centroid-based distances and therefore degrade in high-dimensional or contaminated data. This paper proposes a new robust, nonparametric clustering validation framework, the High-Dimensional Between-Within Distance Median (HD-BWDM), which extends the recently introduced BWDM criterion to high-dimensional spaces. HD-BWDM integrates random projection and principal component analysis to mitigate the curse of dimensionality, and applies trimmed clustering and medoid-based distances to ensure robustness against outliers. We derive theoretical results showing consistency and convergence under Johnson-Lindenstrauss embeddings. Extensive simulations demonstrate that HD-BWDM remains stable and interpretable under high-dimensional projections and contamination, providing a robust alternative to traditional centroid-based validation criteria. The proposed method offers a theoretically grounded, computationally efficient stopping rule for nonparametric clustering in modern high-dimensional applications.
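For illustration only, here is a minimal sketch of how an HD-BWDM-style score could be computed. The details below are assumptions, not taken from the paper: the projection uses Gaussian random projection in the Johnson-Lindenstrauss spirit, the partition comes from plain k-means rather than the authors' trimmed nonparametric clustering, the trimming rule simply drops the farthest fraction of points in each cluster, and the score is the ratio of the median medoid-to-medoid distance to the median within-cluster distance to the medoid. The function name hd_bwdm_index and its parameters are hypothetical.

```python
# Hypothetical sketch of an HD-BWDM-style index; not the authors' implementation.
import numpy as np
from scipy.spatial.distance import cdist
from sklearn.cluster import KMeans
from sklearn.random_projection import GaussianRandomProjection


def hd_bwdm_index(X, k, trim=0.05, proj_dim=None, seed=0):
    """Median between-cluster / median within-cluster distance ratio (k >= 2)."""
    # Optional dimension reduction via Gaussian random projection (JL-style).
    if proj_dim is not None and proj_dim < X.shape[1]:
        X = GaussianRandomProjection(n_components=proj_dim,
                                     random_state=seed).fit_transform(X)

    # Stand-in partition; the paper's method uses trimmed, nonparametric clustering.
    labels = KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(X)

    medoids, within = [], []
    for c in range(k):
        pts = X[labels == c]
        D = cdist(pts, pts)                      # pairwise distances within cluster c
        medoid = pts[D.sum(axis=1).argmin()]     # point minimizing total distance
        d = np.linalg.norm(pts - medoid, axis=1)
        d = np.sort(d)[: max(1, int((1 - trim) * len(d)))]  # trim farthest points
        medoids.append(medoid)
        within.extend(d)

    medoids = np.asarray(medoids)
    between = cdist(medoids, medoids)[np.triu_indices(k, 1)]  # medoid-to-medoid
    return np.median(between) / np.median(within)
```

In use, one would scan a range of candidate k values (say 2 through 10) and keep the k that maximizes the score; whether the paper selects the number of clusters by maximizing its criterion is likewise an assumption of this sketch.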
Similar Papers
A novel k-means clustering approach using two distance measures for Gaussian data
Machine Learning (CS)
Finds hidden patterns in messy information better.
DOD: Detection of outliers in high dimensional data with distance of distances
Methodology
Finds strange data points in complex information.
Confidence Sets for Multidimensional Scaling
Statistics Theory
Finds hidden patterns in messy data.