🤖 AI Summary
The Cox proportional hazards (PH) model assumes both linear covariate effects and proportional hazards, which limits its applicability when effects are nonlinear or hazard ratios vary over time. Method: We systematically evaluate machine learning (ML) and deep learning (DL) survival models against penalized Cox regression under non-PH and nonlinear scenarios. We benchmark eight models, six of them nonlinear and four of those also non-PH; show that Harrell's C-index can give a misleading picture of discrimination when the PH assumption is violated; propose evaluating models jointly with Antolini's C-index and the Brier score; recommend selecting a model according to sample size, nonlinearity, and degree of PH violation; and run the benchmark on three synthetic and three real datasets using our open-source, reproducible toolkit Survhive. Contribution/Results: Cox regression often performs satisfactorily, but under non-PH or nonlinear conditions ML/DL models can perform better. Pairing Antolini's C-index with the Brier score exposes models that discriminate well but are poorly calibrated, a limitation that single-metric evaluation would miss.
📝 Abstract
Survival analysis often relies on Cox models, which assume both linearity and proportional hazards (PH). This study evaluates machine and deep learning methods that relax these constraints, comparing their performance with that of penalized Cox models on a benchmark of three synthetic and three real datasets. In total, eight models were tested, including six non-linear models, four of which were also non-PH. Although Cox regression often yielded satisfactory performance, we showed the conditions under which machine and deep learning models can perform better. Indeed, the performance of these methods has often been underestimated because Harrell's concordance index (C-index) was used in place of more appropriate scores such as Antolini's concordance index, which generalizes the C-index to cases where the PH assumption does not hold. In addition, since models with a high C-index are occasionally badly calibrated, combining Antolini's C-index with the Brier score is useful for assessing the overall performance of a survival method. Results on our benchmark data showed that survival prediction should be approached by testing different methods and selecting the most appropriate one according to sample size, non-linearity and non-PH conditions. To make these tests easy to reproduce on our benchmark data, code and documentation are freely available at https://github.com/compbiomed-unito/survhive.
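The difference between the two concordance indices can be illustrated with a toy example. The sketch below is not taken from the survhive code base; the function names (`harrell_c`, `antolini_c`, `brier_at`), the three-subject data set, and the choice of "1 minus survival at the last grid time" as the single risk score are all assumptions made for illustration. It uses plain NumPy, has no censoring, and omits the inverse-probability-of-censoring weights that a Brier score on censored data would need.

```python
import numpy as np

def harrell_c(time, event, risk):
    """Harrell's C: ranks subjects by ONE time-independent risk score."""
    num = den = 0.0
    for i in range(len(time)):
        if not event[i]:
            continue  # a pair is comparable only if the earlier subject had an event
        for j in range(len(time)):
            if time[i] < time[j]:
                den += 1
                if risk[i] > risk[j]:
                    num += 1
                elif risk[i] == risk[j]:
                    num += 0.5
    return num / den

def antolini_c(time, event, surv, grid):
    """Antolini's C: compares predicted survival AT the earlier event time."""
    num = den = 0.0
    for i in range(len(time)):
        if not event[i]:
            continue
        # index of the last grid point at or before subject i's event time
        k = np.searchsorted(grid, time[i], side="right") - 1
        for j in range(len(time)):
            if time[i] < time[j]:
                den += 1
                if surv[i, k] < surv[j, k]:
                    num += 1
                elif surv[i, k] == surv[j, k]:
                    num += 0.5
    return num / den

def brier_at(t, time, surv, grid):
    """Brier score at time t, valid here only because no subject is censored."""
    k = np.searchsorted(grid, t, side="right") - 1
    alive = (time > t).astype(float)
    return float(np.mean((alive - surv[:, k]) ** 2))

# Hypothetical predictions: one survival curve per subject on a shared time grid.
# Subjects 0 and 1 have crossing curves, which cannot happen under PH.
grid = np.array([1.0, 2.0, 3.0, 4.0])
surv = np.array([
    [0.50, 0.45, 0.40, 0.35],  # subject 0: high early risk, then flat
    [0.90, 0.80, 0.20, 0.10],  # subject 1: low early risk, steep late drop
    [0.80, 0.60, 0.50, 0.30],  # subject 2
])
time = np.array([1.0, 3.0, 4.0])
event = np.array([1, 1, 1])

# Assumed single score for Harrell's C: 1 - S(t) at the last grid time.
risk = 1.0 - surv[:, -1]

print("Harrell C :", harrell_c(time, event, risk))
print("Antolini C:", antolini_c(time, event, surv, grid))
print("Brier(t=2):", brier_at(2.0, time, surv, grid))
```

In this toy setting the predicted curves order every comparable pair correctly at the relevant event time, so Antolini's index is 1, while collapsing each curve to a single score misranks two of the three pairs and Harrell's index drops to 1/3. The Brier score is reported alongside because concordance depends only on the ordering of predictions and says nothing about how close the predicted probabilities are to the observed outcomes.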