🤖 AI Summary
This study addresses three core challenges in pan-cancer classification: high-dimensional transcriptomic data with redundant features, limited sample sizes, and the difficulty of distinguishing 12 closely related cancer types (e.g., lung adenocarcinoma vs. squamous cell carcinoma; clear-cell vs. papillary renal carcinoma). We propose a multi-view, vertical-block, iterative Boruta feature selection framework paired with two ensemble classifiers built on LR, SVM, and XGBoost: one combines the base models by majority voting, the other by probability averaging. Partitioning features by type before selection lets the method combine gene-level biological structure with the robustness of ensemble learning. Evaluated with 10-fold cross-validation on TCGA’s 33-cancer classification task, it achieves 97.11% accuracy and an AUC of 0.9996; for the 12 hard-to-separate types, it correctly identifies over 90% of samples, outperforming state-of-the-art approaches. To our knowledge, this is the first work to introduce vertical blocking into the Boruta framework, improving both interpretability and predictive performance.
📝 Abstract
Accurately identifying cancer samples is crucial for precise diagnosis and effective patient treatment. Traditional methods falter on high-dimensional data with a high feature-to-sample ratio, both of which are central challenges in classifying cancer samples. This study develops a novel feature selection framework designed specifically for transcriptome data and proposes two ensemble classifiers. For feature selection, we partition the transcriptome dataset vertically by feature type, apply the Boruta feature selection process to each partition, combine the results, and apply Boruta again to the combined result. We repeat this process with different Boruta parameters to prepare the final feature set. Finally, we construct two ensemble ML models from LR, SVM, and XGBoost classifiers, one using max voting and the other using probability averaging. We use 10-fold cross-validation to ensure robust and reliable classification performance. With 97.11% accuracy and an AUC of 0.9996, our approach outperforms existing state-of-the-art methods in classifying 33 types of cancer. A set of 12 cancer types is traditionally difficult to tell apart because of similarity in tissue of origin. Our method correctly identifies over 90% of samples from these 12 types, outperforming all known methods in the existing literature. Gene set enrichment analysis shows that the features selected by our framework are enriched in pathways closely related to cancer. In summary, the framework selects features closely related to cancer development and thereby identifies different types of cancer samples with higher accuracy.
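The pipeline described above can be illustrated with a minimal sketch. It is not the authors' implementation and rests on several assumptions: `shadow_select` is a simplified Boruta-style filter (shuffled "shadow" features plus a hit-rate threshold, without Boruta's statistical test), `GradientBoostingClassifier` stands in for XGBoost so that only scikit-learn is needed, and the block names, function names, and all parameter values are hypothetical. The repetition over different Boruta parameters is left out because the abstract does not say how those runs are merged.

```python
import numpy as np
from sklearn.ensemble import (GradientBoostingClassifier,
                              RandomForestClassifier, VotingClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC


def shadow_select(X, y, n_rounds=5, min_hit_rate=0.8, seed=0):
    """Simplified Boruta-style filter (assumption: no statistical test).

    Returns the column indices of X whose random-forest importance beats
    the best shuffled "shadow" column in at least min_hit_rate of rounds.
    """
    rng = np.random.default_rng(seed)
    n_features = X.shape[1]
    hits = np.zeros(n_features, dtype=int)
    for r in range(n_rounds):
        # Shadow features: every column shuffled independently, which
        # breaks any relationship with y while keeping the distribution.
        shadows = np.column_stack(
            [rng.permutation(X[:, j]) for j in range(n_features)])
        forest = RandomForestClassifier(n_estimators=100, random_state=seed + r)
        forest.fit(np.hstack([X, shadows]), y)
        importances = forest.feature_importances_
        hits += importances[:n_features] > importances[n_features:].max()
    return np.flatnonzero(hits >= min_hit_rate * n_rounds)


def vertical_block_select(X, y, blocks, **kwargs):
    """Select within each column block, pool the survivors, select again."""
    pooled = []
    for cols in blocks.values():
        cols = np.asarray(cols)
        pooled.extend(cols[shadow_select(X[:, cols], y, **kwargs)])
    pooled = np.array(sorted(pooled), dtype=int)
    if pooled.size == 0:
        return pooled
    return pooled[shadow_select(X[:, pooled], y, **kwargs)]


def build_ensembles(seed=0):
    """Return (max-voting, probability-averaging) ensembles."""
    def base_models():
        return [
            ("lr", make_pipeline(StandardScaler(),
                                 LogisticRegression(max_iter=1000))),
            ("svm", make_pipeline(StandardScaler(),
                                  SVC(probability=True, random_state=seed))),
            # Stand-in for XGBoost (assumption) to avoid an extra dependency.
            ("gb", GradientBoostingClassifier(random_state=seed)),
        ]
    hard = VotingClassifier(base_models(), voting="hard")  # max voting
    soft = VotingClassifier(base_models(), voting="soft")  # averaged probabilities
    return hard, soft


if __name__ == "__main__":
    from sklearn.model_selection import cross_val_score

    # Toy data: only columns 0 and 10 carry signal; the rest is noise.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 20))
    y = (X[:, 0] + X[:, 10] > 0).astype(int)
    blocks = {"block_a": range(0, 10), "block_b": range(10, 20)}  # hypothetical
    selected = vertical_block_select(X, y, blocks)
    print("selected columns:", selected)
    for name, model in zip(("max voting", "probability averaging"),
                           build_ensembles()):
        scores = cross_val_score(model, X[:, selected], y, cv=10)
        print(name, "mean 10-fold accuracy:", scores.mean())
```

In this sketch each block is filtered on its own, so a feature only competes against shadows drawn from its own block; the second pass over the pooled survivors then removes features that looked useful only within their block. For real transcriptome data the blocks would correspond to the feature types used for the vertical partition.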