Why Can't I See My Clusters? A Precision-Recall Approach to Dimensionality Reduction Validation
By: Diede P. M. van der Hoorn, Alessio Arleo, Fernando V. Paulovich
Potential Business Impact:
Helps diagnose why expected clusters are missing from data visualizations, so tuning takes less time.
Dimensionality Reduction (DR) is widely used for visualizing high-dimensional data, often with the goal of revealing expected cluster structure. However, such a structure may not always appear in the projections. Existing DR quality metrics assess projection reliability (to some extent) or cluster structure quality, but do not explain why expected structures are missing. Visual Analytics solutions can help, but are often time-consuming due to the large hyperparameter space. This paper addresses this problem by leveraging a recent framework that divides the DR process into two phases: a relationship phase, where similarity relationships are modeled, and a mapping phase, where the data is projected accordingly. We introduce two supervised metrics, precision and recall, to evaluate the relationship phase. These metrics quantify how well the modeled relationships align with an expected cluster structure based on some set of labels representing this structure. We illustrate their application using t-SNE and UMAP, and validate the approach through various usage scenarios. Our approach can guide hyperparameter tuning, uncover projection artifacts, and determine if the expected structure is captured in the relationships, making the DR process faster and more reliable.
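To make the idea of scoring the relationship phase against labels concrete, here is a minimal sketch. The abstract does not give the paper's exact definitions, so everything below is an assumption for illustration: relationships are approximated by a plain k-nearest-neighbour graph (not the actual t-SNE or UMAP similarity models), precision is taken as the share of a point's modeled neighbours that carry its label, and recall as the share of its same-label points that appear among those neighbours. The function names `knn_relationships` and `relationship_precision_recall` are hypothetical.

```python
import numpy as np


def knn_relationships(X, k):
    """Assumed stand-in for the relationship phase: R[i, j] is True
    when j is one of the k nearest neighbours of i (Euclidean)."""
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)  # a point is never its own neighbour
    idx = np.argsort(d, axis=1)[:, :k]
    R = np.zeros(d.shape, dtype=bool)
    R[np.arange(len(X))[:, None], idx] = True
    return R


def relationship_precision_recall(R, labels):
    """Per-point precision and recall of modeled relationships
    against an expected cluster structure given by labels
    (illustrative definitions, not taken from the paper)."""
    labels = np.asarray(labels)
    same = labels[:, None] == labels[None, :]
    np.fill_diagonal(same, False)  # ignore self-pairs
    hits = (R & same).sum(axis=1)  # neighbours that share the label
    n_rel = R.sum(axis=1)          # modeled neighbours per point
    n_same = same.sum(axis=1)      # same-label points per point
    zeros = np.zeros(len(labels))
    precision = np.divide(hits, n_rel, out=zeros.copy(), where=n_rel > 0)
    recall = np.divide(hits, n_same, out=zeros.copy(), where=n_same > 0)
    return precision, recall


# Toy data: two well-separated clusters of 20 points each.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.1, (20, 5)), rng.normal(10, 0.1, (20, 5))])
labels = np.array([0] * 20 + [1] * 20)

R = knn_relationships(X, k=5)
precision, recall = relationship_precision_recall(R, labels)
print(precision.mean(), recall.mean())
```

Under these assumed definitions, high precision with low recall suggests the neighbourhoods are clean but too small to cover each cluster, which is the kind of signal that could steer a neighbourhood-size hyperparameter such as perplexity or `n_neighbors`.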
Similar Papers
Metric Design != Metric Behavior: Improving Metric Selection for the Unbiased Evaluation of Dimensionality Reduction
Machine Learning (CS)
Cleans up how we check computer data pictures.
Unveiling High-dimensional Backstage: A Survey for Reliable Visual Analytics with Dimensionality Reduction
Human-Computer Interaction
Helps people trust computer pictures of data.
Mind the Gaps: Measuring Visual Artifacts in Dimensionality Reduction
Machine Learning (CS)
Shows if data pictures are misleading.