Score: 1

On the (In)Significance of Feature Selection in High-Dimensional Datasets

Published: August 5, 2025 | arXiv ID: 2508.03593v2

By: Bhavesh Neekhra, Debayan Gupta, Partha Pratim Chakrabarti

Potential Business Impact:

Random feature subsets often predict as well as carefully selected ones.

Feature selection (FS) is widely assumed to improve predictive performance and to identify meaningful features in high-dimensional datasets. Surprisingly, small random subsets of features (0.02-1% of the total) match or exceed the predictive performance of both the full feature set and FS-selected features on 28 of 30 diverse datasets (microarray, bulk and single-cell RNA-Seq, mass spectrometry, imaging, etc.). In short, any arbitrary set of features performs about as well as any other, with surprisingly low variance in results; so how can a particular set of selected features be "important" if it performs no better than an arbitrary one? These results challenge the assumption that computationally selected features reliably capture meaningful signal, underscoring the need for rigorous validation before interpreting selected features as actionable, particularly in computational genomics.
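The comparison at the heart of the paper can be sketched in a few lines: score a classifier on all features, on features chosen by a standard FS method, and on repeated tiny random subsets. The snippet below is a minimal illustration using synthetic data and scikit-learn; the dataset, classifier, subset size, and FS method are assumptions for the sketch, not the authors' exact protocol.

```python
# Minimal sketch of the paper's central comparison, assuming a generic
# scikit-learn workflow. The synthetic dataset, classifier, subset size,
# and FS method are illustrative placeholders, not the authors' setup.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)

# Synthetic stand-in for a high-dimensional dataset (e.g., a microarray
# with many more features than samples).
X, y = make_classification(n_samples=200, n_features=2000,
                           n_informative=20, random_state=0)

clf = RandomForestClassifier(n_estimators=100, random_state=0)

def cv_accuracy(features):
    """Mean 5-fold cross-validated accuracy on the given feature matrix."""
    return cross_val_score(clf, features, y, cv=5).mean()

# Baseline 1: the full feature set.
full_score = cv_accuracy(X)

# Baseline 2: a standard FS method (univariate ANOVA F-test), keeping
# k = 20 features (1% of the total). Note: fitting FS on the full data
# before cross-validation leaks labels and, if anything, flatters FS.
X_fs = SelectKBest(f_classif, k=20).fit_transform(X, y)
fs_score = cv_accuracy(X_fs)

# The paper's probe: equally small *random* feature subsets, repeated
# to estimate the variance across arbitrary subsets.
random_scores = [
    cv_accuracy(X[:, rng.choice(X.shape[1], size=20, replace=False)])
    for _ in range(10)
]

print(f"full features ({X.shape[1]}): {full_score:.3f}")
print(f"FS (SelectKBest, k=20):  {fs_score:.3f}")
print(f"random subsets (k=20):   {np.mean(random_scores):.3f} "
      f"+/- {np.std(random_scores):.3f}")
```

If the paper's finding holds on a given dataset, the random-subset scores should cluster tightly around the FS score; a large gap in favor of FS would instead suggest the selected features carry genuine dataset-specific signal.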

Country of Origin
🇮🇳 India

Repos / Data Links

Page Count
47 pages

Category
Computer Science: Machine Learning (CS)