🤖 AI Summary
This work proposes a weighted sum-of-trees model for clustered data. Existing mixed-model extensions of decision trees and random forests predict for out-of-sample groups using only the global fixed effects, which amounts to assuming a single outcome model shared by all groups and ignores inter-group heterogeneity. The proposed approach learns a dedicated decision tree for each group in the training data and combines the trees' predictions with weights, so that predictions for an out-of-sample group align more closely with the most similar training groups. Because each group has its own tree, tree structures and variable importances can be compared directly across groups. In simulations, the model outperforms standard decision trees and random forests in a variety of settings, and it is demonstrated on the sarcoma cohort of The Cancer Genome Atlas, where patient samples are grouped by sarcoma subtype.
📝 Abstract
Clustered data, which arise when observations are nested within groups, are common in clinical, educational, and social science research. Traditionally, a linear mixed model, which includes random effects to account for within-group correlation, would be used to model the observed data and make predictions on unseen data. Some work has extended the mixed model approach beyond linear regression to more complex, non-parametric models such as decision trees and random forests. However, existing methods can only use the global fixed effects when predicting for out-of-sample groups, effectively assuming that all clusters share a common outcome model. We propose a lightweight sum-of-trees model in which we learn a decision tree for each group in the sample. We combine the predictions from these trees using weights so that out-of-sample group predictions are more closely aligned with the most similar groups in the training data. This strategy also allows for inference on the similarity across groups in the outcome prediction model, as the unique tree structures and variable importances for each group can be directly compared. We show that our model outperforms traditional decision trees and random forests in a variety of simulation settings. Finally, we showcase our method on real-world data from the sarcoma cohort of The Cancer Genome Atlas, where patient samples are grouped by sarcoma subtype.
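The weighted combination of per-group trees described above can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not say how the trees are grown or how the weights are computed, so the sketch assumes depth-one regression trees (stumps) on a single covariate and weights proportional to exp(-distance) between group covariate means. All function names and the `bandwidth` parameter are hypothetical.

```python
import math
from statistics import mean

def fit_stump(xs, ys):
    """Fit a depth-one regression tree on one feature by minimizing squared error."""
    best = None
    values = sorted(set(xs))
    for lo, hi in zip(values, values[1:]):
        t = (lo + hi) / 2  # candidate split halfway between adjacent values
        left = [y for x, y in zip(xs, ys) if x <= t]
        right = [y for x, y in zip(xs, ys) if x > t]
        ml, mr = mean(left), mean(right)
        sse = sum((y - ml) ** 2 for y in left) + sum((y - mr) ** 2 for y in right)
        if best is None or sse < best[0]:
            best = (sse, t, ml, mr)
    if best is None:  # constant feature: always predict the group mean
        m = mean(ys)
        return (float("inf"), m, m)
    return best[1:]  # (threshold, left mean, right mean)

def predict_stump(stump, x):
    t, ml, mr = stump
    return ml if x <= t else mr

def fit_groups(data):
    """Learn one stump per training group and store the group's covariate mean."""
    return {g: {"stump": fit_stump(xs, ys), "center": mean(xs)}
            for g, (xs, ys) in data.items()}

def group_weights(models, new_xs, bandwidth=1.0):
    """ASSUMPTION: similarity = exp(-|difference in covariate means| / bandwidth)."""
    c = mean(new_xs)
    raw = {g: math.exp(-abs(c - m["center"]) / bandwidth) for g, m in models.items()}
    total = sum(raw.values())
    return {g: r / total for g, r in raw.items()}

def predict(models, weights, x):
    """Weighted sum of the per-group tree predictions."""
    return sum(weights[g] * predict_stump(models[g]["stump"], x) for g in models)

# Toy data: two training groups with different outcome models.
train = {
    "A": ([0, 1, 2, 3], [0, 0, 10, 10]),
    "B": ([10, 11, 12, 13], [5, 5, -5, -5]),
}
models = fit_groups(train)

# An out-of-sample group whose covariates resemble group A.
new_xs = [0.5, 2.5]
weights = group_weights(models, new_xs)
pred = predict(models, weights, 2.5)
print(weights, pred)
```

In this toy setup the new group's covariates sit close to group A, so A's tree dominates the weighted prediction. A single global tree would instead have to average over both groups' outcome models.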