🤖 AI Summary
Decision tree learning involves a trade-off between interpretability and efficiency: greedy algorithms (e.g., CART) are computationally cheap but yield suboptimal solutions, while globally optimal methods achieve lower training loss at a prohibitive computational cost that limits them to shallow trees or low-dimensional feature spaces. This paper proposes Dynamic Programming Decision Trees (DPDT), a framework that formulates decision tree learning as a Markov Decision Process (MDP). DPDT combines heuristic pruning of the split space with dynamic programming to directly minimize a regularized training loss. Crucially, DPDT provably minimizes this loss at least as well as CART, while requiring orders of magnitude fewer operations than state-of-the-art optimal solvers. Empirically, DPDT achieves near-optimal training loss across multiple benchmark datasets and shows statistically significant improvements in generalization accuracy, consistently outperforming both CART and existing optimal tree methods.
📝 Abstract
In supervised learning, decision trees are valued for their interpretability and performance. While greedy decision tree algorithms like CART remain widely used due to their computational efficiency, they often produce suboptimal solutions with respect to a regularized training loss. Conversely, optimal decision tree methods can find better solutions but are computationally intensive and typically limited to shallow trees or binary features. We present Dynamic Programming Decision Trees (DPDT), a framework that bridges the gap between greedy and optimal approaches. DPDT relies on a Markov Decision Process formulation combined with heuristic split generation to construct near-optimal decision trees with significantly reduced computational complexity. Our approach dynamically limits the set of admissible splits at each node while directly optimizing the tree's regularized training loss. Theoretical analysis demonstrates that DPDT can minimize regularized training losses at least as well as CART. Our empirical study on multiple datasets shows that DPDT achieves near-optimal loss with orders of magnitude fewer operations than existing optimal solvers. More importantly, extensive benchmarking suggests statistically significant improvements of DPDT over both CART and optimal decision trees in terms of generalization to unseen data. We demonstrate DPDT's practicality through applications to boosting, where it consistently outperforms baselines. Our framework provides a promising direction for developing efficient, near-optimal decision tree algorithms that scale to practical applications.
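To make the core idea concrete, here is a minimal toy sketch (not the authors' code) of the recipe the abstract describes: restrict each node to a small, heuristically generated candidate-split set, then use memoized dynamic programming to pick the subtree minimizing a regularized loss (misclassification error plus a per-split penalty `lam`). All names (`dpdt_sketch`, `candidate_splits`) and the choice of midpoint thresholds are illustrative assumptions, not the paper's actual API or heuristic.

```python
def candidate_splits(X, idx, k=2):
    # Heuristic pruning of the split space (illustrative): for each feature,
    # keep at most k midpoint thresholds between consecutive observed values.
    splits = []
    for f in range(len(X[0])):
        vals = sorted({X[i][f] for i in idx})
        mids = [(a + b) / 2 for a, b in zip(vals, vals[1:])]
        splits.extend((f, t) for t in mids[:k])
    return splits

def leaf_error(y, idx):
    # Misclassification error of the best constant (majority-class) leaf.
    counts = {}
    for i in idx:
        counts[y[i]] = counts.get(y[i], 0) + 1
    return len(idx) - max(counts.values(), default=0)

def dpdt_sketch(X, y, max_depth, lam):
    # Dynamic programming over (subset-of-data, remaining-depth) states,
    # mirroring an MDP whose actions are "make a leaf" or "apply a split".
    memo = {}

    def solve(idx, depth):
        key = (idx, depth)
        if key in memo:
            return memo[key]
        best = (leaf_error(y, idx), None)  # cost and split of best option so far
        if depth > 0 and idx:
            for f, t in candidate_splits(X, idx):
                left = tuple(i for i in idx if X[i][f] <= t)
                right = tuple(i for i in idx if X[i][f] > t)
                if not left or not right:
                    continue
                cl, _ = solve(left, depth - 1)
                cr, _ = solve(right, depth - 1)
                cost = cl + cr + lam  # regularized loss: errors + penalty per split
                if cost < best[0]:
                    best = (cost, (f, t))
        memo[key] = best
        return best

    return solve(tuple(range(len(y))), max_depth)
```

On a XOR-style dataset where any single split is useless, a greedy criterion stalls, but the DP still finds the depth-2 tree with zero error, illustrating why lookahead beats purely greedy splitting; the paper's contribution is making this search tractable by keeping the per-node candidate set small.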