Correcting Split Selection in Online Decision Trees via Anytime-Valid Inference

📅 2026-05-29

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses a limitation of existing online decision trees such as Hoeffding Trees: their analyses rely on fixed-sample concentration inequalities, yet split decisions are made with data-dependent stopping rules. This mismatch invalidates the stated guarantees and can drive the probability of an incorrect split to one. The paper proposes a split-selection mechanism based on anytime-valid inference that controls the risk of false splits under arbitrary data streams, including non-stationary ones. It also guarantees a finite commitment time when a candidate split has a predictive advantage and, under stationary i.i.d. data, a risk that decreases monotonically and strictly improves at every split. Empirically, the method produces substantially smaller trees and improves performance on non-stationary streams, both as a standalone learner and as the base learner within Adaptive Random Forests.
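
For orientation, the gap between a fixed-sample guarantee and an anytime-valid one can be written as follows. This is a generic textbook formulation, not the paper's notation: the symbols $\bar{X}_n$ (running mean of a bounded statistic with range $R$), $\mu$ (its true mean), and $u_n$ (some time-uniform radius) are assumptions introduced only for illustration.

```latex
\begin{align*}
% Fixed-sample Hoeffding bound: holds for a single n chosen in advance.
\Pr\!\left( \bar{X}_n - \mu \ge \epsilon_n \right) &\le \delta,
  \qquad \epsilon_n = \sqrt{\frac{R^2 \ln(1/\delta)}{2n}}, \\
% Anytime-valid bound: holds simultaneously for all n, hence at any stopping time.
\Pr\!\left( \exists\, n \ge 1 : \bar{X}_n - \mu \ge u_n \right) &\le \delta .
\end{align*}
```

A rule that re-checks the first inequality after every sample and stops at the first violation is not covered by the first bound; only a bound of the second kind covers it.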

📝 Abstract

Bagging-based ensembles, most notably Adaptive Random Forests, are among the strongest performers for learning from data streams. A common denominator across these methods is their reliance on Hoeffding Trees as base learners, which grow decision trees incrementally by testing whether a candidate split is significantly better than its alternatives using concentration inequalities. Despite their empirical success, existing variants lack valid statistical guarantees. Current analyses rely on fixed-sample concentration bounds, while split decisions are made using data-dependent stopping rules, which invalidates their guarantees and can drive the probability of incorrect splits to one. We introduce a principled alternative based on anytime-valid inference. Our method provides: (i) anytime-valid control of false splits under arbitrary data streams, including non-stationary settings; (ii) finite commitment time under a predictive advantage; and (iii) risk that is monotone decreasing and strictly improves at every split, under stationary i.i.d. data. Empirically, we evaluate both standalone trees and their use within Adaptive Random Forests on non-stationary streams. Our method improves performance while producing substantially smaller trees.
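
The failure mode described in the abstract can be illustrated with a small simulation. The sketch below is not the paper's algorithm: it compares a Hoeffding radius that is re-checked after every sample against a simple betting-style test supermartingale, a standard anytime-valid construction chosen here only for illustration. The function names, the bet size `lam = 0.2`, the ±1 "advantage" stream, and the horizon are all assumptions.

```python
import numpy as np


def first_crossing(x, delta, lam=0.2):
    """Return (hoeffding_stop, anytime_stop): the first sample count at which
    each rule declares "split A beats split B", or None if it never does.

    x holds hypothetical per-sample advantages of A over B, each in [-1, 1].
    """
    n = np.arange(1, len(x) + 1)

    # Fixed-sample Hoeffding radius for range R = 2, re-checked after every
    # sample. The bound is only valid for a single n fixed in advance.
    eps = np.sqrt(2.0 * np.log(1.0 / delta) / n)
    hoeffding_hits = np.flatnonzero(np.cumsum(x) / n > eps)

    # Betting-style test supermartingale M_n = prod(1 + lam * x_i), tracked in
    # log space. If the conditional mean of each x_i is <= 0 (A is not better),
    # Ville's inequality gives P(exists n: M_n >= 1/delta) <= delta.
    log_wealth = np.cumsum(np.log1p(lam * x))
    anytime_hits = np.flatnonzero(log_wealth >= np.log(1.0 / delta))

    def first(hits):
        return int(hits[0]) + 1 if hits.size else None

    return first(hoeffding_hits), first(anytime_hits)


def split_rates(p_plus, delta=0.05, horizon=5000, trials=2000, seed=0):
    """Fraction of simulated streams on which each rule ever commits to A."""
    rng = np.random.default_rng(seed)
    hoeffding = anytime = 0
    for _ in range(trials):
        # Advantage is +1 with probability p_plus, otherwise -1.
        x = np.where(rng.random(horizon) < p_plus, 1.0, -1.0)
        h, a = first_crossing(x, delta)
        hoeffding += h is not None
        anytime += a is not None
    return hoeffding / trials, anytime / trials


# Null stream: A and B are equally good, so every commitment is a false split.
null_hoeffding, null_anytime = split_rates(p_plus=0.5)
# Alternative stream: A has a genuine advantage (mean +0.2 per sample).
alt_hoeffding, alt_anytime = split_rates(p_plus=0.6)

print(f"false-split rate, Hoeffding re-checked: {null_hoeffding:.3f}")
print(f"false-split rate, anytime-valid:        {null_anytime:.3f}")
print(f"commit rate with real advantage:        {alt_anytime:.3f}")
```

On the null stream, the re-checked Hoeffding rule should exceed its nominal level `delta`, and by the law of the iterated logarithm its false-split rate keeps growing as the horizon lengthens. The supermartingale rule stays near or below `delta` regardless of when monitoring stops, while still committing when one split has a real advantage.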

Problem

Research questions and friction points this paper is trying to address.

online decision trees

split selection

statistical guarantees

anytime-valid inference

data streams

Innovation

Methods, ideas, or system contributions that make the work stand out.

anytime-valid inference

online decision trees

Hoeffding Trees