🤖 AI Summary
This study addresses the challenge of poor response to conventional therapies in non-small cell lung cancer (NSCLC) due to tumor heterogeneity by proposing an interpretable predictive framework that integrates multi-omics data with machine learning. Leveraging patient-specific molecular profiles, the approach employs an XGBoost regression model, tuned via random search and cross-validation, to predict drug sensitivity (LN-IC50). Key predictive features are identified using SHAP values, and biological interpretability is enhanced through semantic explanations generated by the large language model DeepSeek. The model is reported to predict drug response well while uncovering genes and pathways associated with it, improving both the interpretability and the clinical credibility of the results.
📝 Abstract
Lung cancer is a disease in which malignant cells grow abnormally and spread uncontrollably in the lungs. Common treatment strategies such as surgery, chemotherapy, and radiation are often suboptimal because of the heterogeneous nature of cancer. In personalized medicine, treatments are tailored to the individual's genetic information and lifestyle. In addition, AI-based deep learning methods can analyze large datasets to find early signs of cancer, identify tumor types, and estimate treatment prospects. This paper focuses on developing personalized treatment plans from patient-specific data, primarily the genetic profile. Multi-omics data from the Genomics of Drug Sensitivity in Cancer (GDSC) resource are used with machine learning techniques to build a predictive model. The target variable, LN-IC50 (the natural logarithm of the half-maximal inhibitory concentration), indicates how sensitive or resistant a sample is to a drug, with lower values indicating greater sensitivity. An XGBoost regressor is used to predict drug response from molecular and cellular features extracted from cancer datasets. Cross-validation and randomized search are used for hyperparameter tuning to optimize the model's predictive performance. For explanation, SHAP (SHapley Additive exPlanations) is used; SHAP values measure each feature's impact on an individual prediction. Furthermore, feature relationships are interpreted using DeepSeek, a large language model, which is used to check the biological validity of the features. DeepSeek provides contextual explanations of the most important genes and pathways alongside the top-ranked SHAP features, supporting the credibility of the model's predictions.
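To make the statement that SHAP values measure each feature's impact on an individual prediction concrete, the following is a minimal sketch of an exact Shapley value computation in plain Python. It is not the paper's pipeline: the function `toy_ln_ic50`, its three features, its coefficients, and the all-zero baseline are hypothetical stand-ins for a trained regressor and its reference input. The SHAP library itself relies on much faster algorithms for tree ensembles such as XGBoost; brute-force enumeration like this is only practical for a handful of features.

```python
from itertools import combinations
from math import factorial


def shapley_values(predict, x, baseline):
    """Exact Shapley values for one sample x relative to a baseline input.

    A feature outside the current coalition takes its baseline value.
    """
    n = len(x)
    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(n):
            # Shapley weight for coalitions of this size (excluding feature i).
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                without_i = [x[j] if j in subset else baseline[j] for j in range(n)]
                with_i = list(without_i)
                with_i[i] = x[i]
                # Marginal contribution of feature i to this coalition.
                phi[i] += weight * (predict(with_i) - predict(without_i))
    return phi


def toy_ln_ic50(features):
    """Hypothetical model: two expression levels and a mutation flag."""
    expr_a, expr_b, mutated = features
    return 2.0 - 1.5 * expr_a + 0.5 * expr_b - 1.0 * expr_a * mutated


sample = [1.0, 2.0, 1.0]      # assumed feature values for one cell line
baseline = [0.0, 0.0, 0.0]    # assumed reference input

phi = shapley_values(toy_ln_ic50, sample, baseline)
for name, value in zip(["expr_a", "expr_b", "mutated"], phi):
    print(f"{name}: {value:+.3f}")
```

The attributions sum to the difference between the prediction for the sample and the prediction for the baseline, which is the additivity property that lets per-feature contributions be read as a decomposition of a single predicted LN-IC50. In this toy model a negative attribution pushes the prediction toward lower LN-IC50, that is, toward greater predicted sensitivity.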