On the (In)Significance of Feature Selection in High-Dimensional Datasets
By: Bhavesh Neekhra, Debayan Gupta, Partha Pratim Chakrabarti
Potential Business Impact:
Randomly chosen features often predict as well as carefully selected ones.
Feature selection (FS) is assumed to improve predictive performance and to identify meaningful features in high-dimensional datasets. Surprisingly, small random subsets of features (0.02-1% of all features) match or outperform both the full feature set and FS-selected features in predictive performance on 28 of 30 diverse datasets (microarray, bulk and single-cell RNA-Seq, mass spectrometry, imaging, etc.). In short, any arbitrary set of features is as good as any other (with surprisingly low variance in results) - so how can a particular set of selected features be "important" if it performs no better than an arbitrary set? These results challenge the assumption that computationally selected features reliably capture meaningful signals, and they underscore the need for rigorous validation before interpreting selected features as actionable, particularly in computational genomics.
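The core comparison is simple to reproduce in outline: draw a tiny random subset of feature columns, train a classifier, and compare cross-validated accuracy against the full feature set. The sketch below illustrates that idea only; it is not the authors' pipeline, and the synthetic dataset, logistic-regression classifier, 1% subset fraction, and number of repeats are all assumptions chosen for illustration.

```python
# Minimal sketch (not the paper's actual experiments): compare cross-validated
# accuracy of (a) all features vs (b) tiny random feature subsets, over several seeds.
# The dataset, classifier, and subset fraction below are illustrative assumptions.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)

# Synthetic high-dimensional data standing in for, e.g., a gene-expression matrix.
X, y = make_classification(n_samples=200, n_features=5000, n_informative=50,
                           random_state=0)

clf = LogisticRegression(max_iter=2000)

# Baseline: all features.
full_acc = cross_val_score(clf, X, y, cv=5).mean()

# Random subsets of ~1% of the features (within the 0.02-1% range reported above).
subset_size = max(1, int(0.01 * X.shape[1]))
random_accs = []
for _ in range(10):
    cols = rng.choice(X.shape[1], size=subset_size, replace=False)
    random_accs.append(cross_val_score(clf, X[:, cols], y, cv=5).mean())

print(f"All {X.shape[1]} features:          {full_acc:.3f}")
print(f"Random {subset_size}-feature subsets: "
      f"{np.mean(random_accs):.3f} +/- {np.std(random_accs):.3f}")
```

In a real replication one would substitute an actual high-dimensional dataset and add an FS baseline (e.g. a filter or wrapper method) to complete the three-way comparison described above.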
Similar Papers
A Comparative Study of Feature Selection in Tsetlin Machines
Machine Learning (CS)
Helps computers understand important data patterns better.
Improving statistical learning methods via features selection without replacement sampling and random projection
Quantitative Methods
Finds cancer genes better for new treatments.