🤖 AI Summary
In population-based cohort studies, right-censored survival data are often subject to left truncation and length-biased sampling. Conventional survival analysis methods that ignore the truncation-time distribution then suffer from selection bias, biased variable selection, and suboptimal prediction accuracy. To address this, we propose a survival tree and forest methodology grounded in a conditional inference framework. Our approach employs an unbiased score function derived from the full likelihood for robust variable selection and incorporates truncation-time distribution information to construct two complementary survival function estimators, thereby enhancing predictive accuracy. We further introduce a permutation-based test statistic to ensure reliable statistical inference and support both nonparametric and semiparametric modeling. Simulation studies and real-data analysis of lung cancer cohorts demonstrate that our method outperforms existing left-truncation–aware approaches in terms of tree structure identification, variable selection accuracy, and survival prediction performance.
📝 Abstract
Left-truncated survival data commonly arise in prevalent cohort studies, where only individuals who have experienced disease onset and survived until enrollment are included in the study. When the onset process follows a stationary Poisson process, the resulting data are length-biased. This sampling mechanism induces a selection bias toward individuals with longer survival times, so nonparametric and semiparametric methods for traditional survival data are not directly applicable. While tree-based methods developed for left-truncated data can be applied, they may be inefficient for length-biased data because they do not account for the distribution of truncation times. To address this, we propose new survival trees and forests for length-biased right-censored data within the conditional inference framework. Our approach uses a score function derived from the full likelihood to construct permutation test statistics for unbiased variable selection. For survival prediction, we consider two estimators of the unbiased survival function, differing in statistical efficiency and computational complexity. These elements enhance efficiency in tree construction and improve accuracy of survival prediction in ensemble settings. Simulation studies demonstrate efficiency gains in both tree recovery and survival prediction, often exceeding the gains from ensembling alone. We further illustrate the utility of the proposed methods using lung cancer data from the Cancer Public Library Database.
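
The length-bias mechanism described above can be illustrated with a small simulation. This is a minimal sketch and not the authors' method: the function name `simulate_prevalent_cohort`, the exponential survival distribution, the enrollment time `tau`, and all parameter values are assumptions chosen for illustration, and right censoring after enrollment is omitted for brevity.

```python
import random


def simulate_prevalent_cohort(n_onsets=200_000, tau=50.0, rate=1.0, seed=0):
    """Simulate a prevalent cohort enrolled at calendar time `tau`.

    Assumptions (illustrative only):
      - onset times are uniform on [0, tau], which is how the event times of
        a stationary Poisson process are distributed given their count;
      - survival times from onset are exponential with the given rate;
      - a subject is enrolled only if still alive at time `tau`.
    Returns a list of (truncation_time, survival_time) pairs.
    """
    rng = random.Random(seed)
    cohort = []
    for _ in range(n_onsets):
        onset = rng.uniform(0.0, tau)
        survival = rng.expovariate(rate)
        truncation = tau - onset  # time from onset to enrollment
        if survival > truncation:  # left truncation: must survive to enrollment
            cohort.append((truncation, survival))
    return cohort


if __name__ == "__main__":
    rate = 1.0
    cohort = simulate_prevalent_cohort(rate=rate)
    sampled_mean = sum(t for _, t in cohort) / len(cohort)
    print(f"enrolled subjects:          {len(cohort)}")
    print(f"population mean survival:   {1.0 / rate:.3f}")
    print(f"mean survival among enrolled: {sampled_mean:.3f}")
```

Under these assumptions the enrolled subjects' survival times follow the length-biased density t f(t) / μ, where f is the population density and μ its mean. For an exponential distribution this gives a mean of roughly twice the population mean, which is the selection bias toward longer survival that naive estimators inherit.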