An Observation on Lloyd's k-Means Algorithm in High Dimensions
By: David Silva-Sánchez, Roy R. Lederman
Potential Business Impact:
Explains why a widely used data-grouping method gets stuck when data is very noisy and high-dimensional.
Clustering and estimating cluster means are core problems in statistics and machine learning, with k-means and Expectation Maximization (EM) being two widely used algorithms. In this work, we provide a theoretical explanation for the failure of k-means in high-dimensional settings with high noise and limited sample sizes, using a simple Gaussian Mixture Model (GMM). We identify regimes where, with high probability, almost every partition of the data becomes a fixed point of the k-means algorithm. This study is motivated by challenges in the analysis of more complex cases, such as masked GMMs, and those arising from applications in Cryo-Electron Microscopy.
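As a rough illustration of the phenomenon described above, the following Python sketch simulates a balanced two-component spherical GMM and checks how often a uniformly random partition already survives one Lloyd iteration unchanged. The dimension, noise level, and sample size below are illustrative assumptions chosen to make the effect visible, not the precise regimes characterized in the paper.

import numpy as np

rng = np.random.default_rng(0)

# Illustrative parameters (assumptions, not the paper's exact regime):
# high dimension d, large noise sigma, small sample size n.
d, n, sigma = 5000, 40, 5.0
mu0 = np.zeros(d)
mu1 = np.ones(d) / np.sqrt(d)  # unit separation between the two means

# Draw a balanced two-component spherical GMM sample.
z = rng.integers(0, 2, size=n)
X = np.where(z[:, None] == 0, mu0, mu1) + sigma * rng.normal(size=(n, d))

def lloyd_step(X, labels, k=2):
    # One Lloyd iteration: recompute cluster means, then reassign
    # every point to its nearest mean.
    means = np.stack([X[labels == j].mean(axis=0) for j in range(k)])
    d2 = ((X[:, None, :] - means[None, :, :]) ** 2).sum(axis=2)
    return d2.argmin(axis=1)

# Count how many uniformly random partitions are fixed points of Lloyd's update.
trials, fixed = 200, 0
for _ in range(trials):
    labels = rng.integers(0, 2, size=n)
    if 0 < labels.sum() < n and np.array_equal(lloyd_step(X, labels), labels):
        fixed += 1
print(f"{fixed}/{trials} random partitions were fixed points")

In this toy setting, the mean of each cluster contains the very point being reassigned, and that self-term (of order sigma^2 d divided by the cluster size) can dominate the signal and cross-noise terms, so the reassignment step leaves the labels unchanged. This is only a heuristic intuition consistent with the abstract, not a restatement of the paper's analysis.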
Similar Papers
LocalKMeans: Convergence of Lloyd's Algorithm with Distributed Local Iterations
Machine Learning (Stat)
Makes computer learning faster on many machines.
Scalable k-Means Clustering for Large k via Seeded Approximate Nearest-Neighbor Search
Machine Learning (CS)
Finds patterns in huge groups of data faster.
Modified K-means Algorithm with Local Optimality Guarantees
Machine Learning (CS)
Makes computer groups more accurate and reliable.