A novel k-means clustering approach using two distance measures for Gaussian data
By: Naitik Gada
Potential Business Impact:
Finds hidden patterns in messy data more reliably.
Clustering algorithms have long been a topic of research, representing one of the more popular branches of unsupervised learning. Since clustering analysis is one of the best ways to find clarity and structure within raw data, this paper explores a novel approach to k-means clustering. We present a k-means clustering algorithm that uses both the within-cluster distance (WCD) and the inter-cluster distance (ICD) as its distance metric to partition the data into k clusters, with k pre-determined by the Calinski-Harabasz criterion, in order to produce a more robust clustering output. The idea behind this approach is that incorporating both measures makes the convergence of the data into their clusters more stable and robust. We run the algorithm on synthetically generated data as well as on benchmark data sets obtained from the UCI repository. The results show that the convergence of the data into their respective clusters is more accurate when both the WCD and ICD metrics are used. The algorithm also assigns outliers to their true clusters more often than the traditional k-means method. We also identify several interesting research topics that reveal themselves as we answer the questions we initially set out to address.
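To make the idea concrete, here is a minimal Python sketch of the two-step procedure the abstract describes: choose k with the Calinski-Harabasz criterion, then run a k-means variant whose assignment step trades off WCD against ICD. The alpha weighting, the specific way WCD and ICD are combined, and the helper names (choose_k, wcd_icd_kmeans) are assumptions made for illustration; the paper's actual formulation may differ.

import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import calinski_harabasz_score

def choose_k(X, k_range=range(2, 11)):
    # Pick the k that maximizes the Calinski-Harabasz score of plain k-means.
    best_k, best_score = None, -np.inf
    for k in k_range:
        labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
        score = calinski_harabasz_score(X, labels)
        if score > best_score:
            best_k, best_score = k, score
    return best_k

def wcd_icd_kmeans(X, k, alpha=0.5, n_iter=100, seed=0):
    # Assumed objective: assign each point to the centroid minimizing
    # alpha * WCD - (1 - alpha) * ICD, where WCD is the distance to the
    # candidate centroid and ICD is the mean distance to the other centroids.
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Pairwise point-to-centroid distances, shape (n, k).
        d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        # Mean distance from each point to the other k-1 centroids.
        icd = (d.sum(axis=1, keepdims=True) - d) / (k - 1)
        labels = np.argmin(alpha * d - (1 - alpha) * icd, axis=1)
        # Standard centroid update; keep the old centroid if a cluster empties.
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

X, _ = make_blobs(n_samples=500, centers=4, random_state=0)
k = choose_k(X)
labels, centroids = wcd_icd_kmeans(X, k)

Penalizing distance to the assigned centroid while rewarding distance to the other centroids is one plausible way to combine the two measures; points near cluster boundaries, including outliers, are pulled toward the centroid they are most distinctly separated from.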
Similar Papers
Beyond I-Con: Exploring New Dimension of Distance Measures in Representation Learning
Machine Learning (CS)
Finds better ways for computers to learn.
Clustering Approaches for Mixed-Type Data: A Comparative Study
Machine Learning (Stat)
Finds patterns in mixed-type data.
High-Dimensional BWDM: A Robust Nonparametric Clustering Validation Index for Large-Scale Data
Machine Learning (Stat)
Finds best groups in messy, big data.