🤖 AI Summary
Problem: The necessity and efficacy of feature selection (FS) in high-dimensional gene expression classification remain widely assumed but insufficiently validated empirically. Many computational FS studies select genes without experimental verification, raising methodological concerns.
Method: We propose a hypothesis-testing framework to systematically compare the classification performance of small, randomly sampled feature subsets (0.02%–1% of total features) against those selected by classical FS algorithms and the full feature set.
Contribution/Results: On most benchmark gene expression datasets, classifiers trained on random feature subsets achieve accuracy comparable to that of full-feature models, and comparable to or even exceeding that of FS-based models. These findings challenge the prevailing assumption that FS inherently improves predictive performance. They expose potential methodological risks in computation-driven gene selection practices and provide empirical evidence urging critical reevaluation of FS conventions in biomedical feature engineering.
📝 Abstract
Extensive research has been done on feature selection (FS) algorithms for high-dimensional datasets, aiming to improve model performance, reduce computational cost, and identify features of interest. To validate the performance of features selected by FS algorithms, we compare them against randomly selected features, which serve as the null hypothesis. Our results show that FS on high-dimensional datasets (in particular gene expression) in classification tasks is not useful. We find that (1) models trained on small subsets (0.02%–1% of all features) of randomly selected features almost always perform comparably to those trained on all features, and (2) a "typical"-sized random subset provides comparable or superior performance to that of top-k features selected in various published studies. Thus, our work challenges many feature selection results on high-dimensional datasets, particularly in computational genomics. It raises serious concerns about studies that propose drug design or targeted interventions based on computationally selected genes without further validation in a wet lab.
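The comparison described above (all features vs. FS-selected top-k vs. random subsets of the same size) can be illustrated with a minimal, standard-library-only sketch. This is not the authors' code or experimental setup: the synthetic data generator, the nearest-centroid classifier, the mean-difference filter standing in for a "classical FS algorithm", and every function name and parameter below are assumptions chosen only to keep the example self-contained. With k = 20 of 2000 features, the random subset covers 1% of all features, the upper end of the range quoted in the abstract.

```python
import random
import statistics


def make_data(n_samples=120, n_features=2000, n_informative=400, shift=0.5, seed=0):
    """Synthetic two-class data with many weakly informative features.

    Assumption for illustration only; real gene expression data will differ.
    """
    rng = random.Random(seed)
    X, y = [], []
    for i in range(n_samples):
        label = i % 2  # alternate labels so any contiguous split has both classes
        row = [rng.gauss(shift * label if j < n_informative else 0.0, 1.0)
               for j in range(n_features)]
        X.append(row)
        y.append(label)
    return X, y


def centroid(rows, cols):
    """Per-feature mean of `rows`, restricted to the feature indices in `cols`."""
    return [statistics.fmean(r[j] for r in rows) for j in cols]


def class_centroids(X, y, cols):
    c0 = centroid([x for x, t in zip(X, y) if t == 0], cols)
    c1 = centroid([x for x, t in zip(X, y) if t == 1], cols)
    return c0, c1


def nearest_centroid_accuracy(X_tr, y_tr, X_te, y_te, cols):
    """Fit a nearest-centroid classifier on `cols` and return test accuracy."""
    c0, c1 = class_centroids(X_tr, y_tr, cols)
    correct = 0
    for x, t in zip(X_te, y_te):
        d0 = sum((x[j] - a) ** 2 for j, a in zip(cols, c0))
        d1 = sum((x[j] - b) ** 2 for j, b in zip(cols, c1))
        correct += int((1 if d1 < d0 else 0) == t)
    return correct / len(y_te)


def top_k_by_mean_difference(X_tr, y_tr, k):
    """Simple filter-style FS stand-in: rank features by |mean_0 - mean_1|.

    Scores use training data only, so the test set does not leak into selection.
    """
    all_cols = list(range(len(X_tr[0])))
    c0, c1 = class_centroids(X_tr, y_tr, all_cols)
    scores = [abs(a - b) for a, b in zip(c0, c1)]
    return sorted(all_cols, key=lambda j: scores[j], reverse=True)[:k]


def compare(k=20, n_random_draws=10, seed=0):
    """Compare all features, top-k selected features, and random k-subsets."""
    X, y = make_data(seed=seed)
    split = int(0.7 * len(y))
    X_tr, y_tr, X_te, y_te = X[:split], y[:split], X[split:], y[split:]
    n_features = len(X[0])
    rng = random.Random(seed + 1)

    random_accs = []
    for _ in range(n_random_draws):
        cols = rng.sample(range(n_features), k)  # the random-subset baseline
        random_accs.append(nearest_centroid_accuracy(X_tr, y_tr, X_te, y_te, cols))

    return {
        "all_features": nearest_centroid_accuracy(
            X_tr, y_tr, X_te, y_te, list(range(n_features))),
        "top_k": nearest_centroid_accuracy(
            X_tr, y_tr, X_te, y_te, top_k_by_mean_difference(X_tr, y_tr, k)),
        "random_subsets": random_accs,
        "random_mean": statistics.fmean(random_accs),
    }


if __name__ == "__main__":
    print(compare())
```

Drawing several random subsets rather than one gives a spread of baseline accuracies, so a selected top-k set can be judged against that spread instead of against a single lucky or unlucky draw. Results on this synthetic data say nothing about real benchmarks; the sketch only shows the shape of the comparison.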