Kernel Representation and Similarity Measure for Incomplete Data
By: Yang Cao, Sikun Yang, Kai He and more
Potential Business Impact:
Finds patterns in messy, missing information.
Measuring similarity between incomplete data points is a fundamental challenge in web mining, recommendation systems, and user behavior analysis. Traditional approaches either discard incomplete data or perform imputation as a preprocessing step, leading to information loss and biased similarity estimates. This paper presents the proximity kernel, a new similarity measure that computes similarity between incomplete data points directly in kernel feature space, without explicit imputation in the original space. The proposed method combines data-dependent binning with proximity assignment to project data into a high-dimensional sparse representation that adapts to local density variations. To handle missing values, we propose a cascading fallback strategy that estimates missing feature distributions. We evaluate the proposed kernel representation on clustering tasks across 12 real-world incomplete datasets, demonstrating superior performance compared to existing methods while maintaining linear time complexity. The code is available at https://anonymous.4open.science/r/proximity-kernel-2289.
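The abstract gives no implementation details, so the following is only a minimal sketch of the general idea and not the authors' algorithm. It assumes per-feature quantile bins as the "data-dependent binning", a one-hot bin indicator as the sparse representation, and a single-level fallback that spreads a missing value over the empirical bin frequencies of that feature (the paper's cascading fallback is not specified here). The names `fit_bins`, `embed`, and `sketch_kernel` are hypothetical, and every feature is assumed to have at least one observed value.

```python
import numpy as np

def fit_bins(X, n_bins=4):
    """Per-feature quantile edges from observed (non-NaN) values.

    Assumption: quantile bins stand in for data-dependent binning,
    so bins are narrower where observed values are dense.
    """
    qs = np.linspace(0.0, 1.0, n_bins + 1)[1:-1]  # interior quantiles only
    return [np.quantile(col[~np.isnan(col)], qs) for col in X.T]

def embed(X, edges, n_bins=4):
    """Map each row to a (d * n_bins)-dimensional vector, one block per feature."""
    n, d = X.shape
    Z = np.zeros((n, d * n_bins))
    for j, e in enumerate(edges):
        col = X[:, j]
        obs = ~np.isnan(col)
        idx = np.searchsorted(e, col[obs], side="right")  # bin index in 0..n_bins-1
        block = Z[:, j * n_bins:(j + 1) * n_bins]         # view into Z
        block[np.where(obs)[0], idx] = 1.0                # observed: one-hot bin
        # Assumed fallback: a missing entry gets the feature's bin frequencies.
        block[~obs] = np.bincount(idx, minlength=n_bins) / idx.size
    return Z

def sketch_kernel(X, n_bins=4):
    """Similarity = average per-feature bin agreement, in [0, 1]."""
    Z = embed(X, fit_bins(X, n_bins), n_bins)
    return Z @ Z.T / X.shape[1]

X = np.array([[1.0, 10.0],
              [2.0, np.nan],   # second feature missing
              [3.0, 30.0],
              [4.0, 40.0]])
K = sketch_kernel(X, n_bins=2)
```

In this sketch no value is ever imputed in the original space: the row with a missing entry is compared through a distribution over bins, and building the embedding takes one pass over the data per feature.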
Similar Papers
A PCA-based Data Prediction Method
Machine Learning (CS)
Fills in missing numbers in data sets.
An Interdisciplinary and Cross-Task Review on Missing Data Imputation
Machine Learning (Stat)
Fixes broken data for better computer decisions.
Clustering Approaches for Mixed-Type Data: A Comparative Study
Machine Learning (Stat)
Finds patterns in mixed-type data.