Decision Tree Induction Through LLMs via Semantically-Aware Evolution

📅 2025-03-18
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Decision tree induction faces two key challenges: greedy algorithms often converge to suboptimal solutions, while exact optimization methods suffer from high computational cost and limited applicability. To address these issues, this paper proposes LLEGO, a genetic programming framework enhanced by large language models (LLMs). Its core innovation is a pair of semantically aware, LLM-driven genetic operators, fitness-guided crossover and diversity-guided mutation, which inject domain knowledge and semantic priors into the evolutionary process and balance exploration against exploitation. Through structured natural-language prompts, the LLM acts as a conditional generative model that proposes new conditional rules and refines tree structures. Evaluated across multiple benchmark datasets, LLEGO consistently evolves trees with better generalization than state-of-the-art decision tree learners, while searching substantially more efficiently than conventional genetic programming approaches.

📝 Abstract
Decision trees are a crucial class of models offering robust predictive performance and inherent interpretability across various domains, including healthcare, finance, and logistics. However, current tree induction methods often face limitations such as suboptimal solutions from greedy methods or prohibitive computational costs and limited applicability of exact optimization approaches. To address these challenges, we propose an evolutionary optimization method for decision tree induction based on genetic programming (GP). Our key innovation is the integration of semantic priors and domain-specific knowledge about the search space into the optimization algorithm. To this end, we introduce $\texttt{LLEGO}$, a framework that incorporates semantic priors into genetic search operators through the use of Large Language Models (LLMs), thereby enhancing search efficiency and targeting regions of the search space that yield decision trees with superior generalization performance. This is operationalized through novel genetic operators that work with structured natural language prompts, effectively utilizing LLMs as conditional generative models and sources of semantic knowledge. Specifically, we introduce $\textit{fitness-guided}$ crossover to exploit high-performing regions, and $\textit{diversity-guided}$ mutation for efficient global exploration of the search space. These operators are controlled by corresponding hyperparameters that enable a more nuanced balance between exploration and exploitation across the search space. Empirically, we demonstrate across various benchmarks that $\texttt{LLEGO}$ evolves superior-performing trees compared to existing tree induction methods, and exhibits significantly more efficient search performance compared to conventional GP approaches.
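The abstract's "structured natural language prompts" can be illustrated with a minimal sketch: a decision tree is flattened into IF-THEN rules and packed, together with parent fitnesses, into a fitness-guided crossover prompt. All names here (`Node`, `to_rules`, `crossover_prompt`) and the prompt wording are illustrative assumptions, not the paper's actual implementation.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Node:
    # Internal nodes carry a split; leaves carry only a label.
    feature: Optional[str] = None
    threshold: Optional[float] = None
    left: Optional["Node"] = None
    right: Optional["Node"] = None
    label: Optional[str] = None

def to_rules(node, path=()):
    """Flatten a tree into natural-language IF-THEN rules (one per leaf)."""
    if node.label is not None:
        cond = " AND ".join(path) or "TRUE"
        return [f"IF {cond} THEN predict {node.label}"]
    rules = to_rules(node.left, path + (f"{node.feature} <= {node.threshold}",))
    rules += to_rules(node.right, path + (f"{node.feature} > {node.threshold}",))
    return rules

def crossover_prompt(parents):
    """Build a fitness-guided crossover prompt.

    parents: list of (tree, fitness) pairs; fitter parents are listed first
    so the LLM can condition its generation on performance.
    """
    parts = ["Combine the following decision trees into a new tree.",
             "Prefer splits from higher-fitness parents."]
    for i, (tree, fit) in enumerate(sorted(parents, key=lambda p: -p[1])):
        parts.append(f"Parent {i + 1} (fitness={fit:.3f}):")
        parts.extend("  " + r for r in to_rules(tree))
    return "\n".join(parts)
```

The returned string would be sent to an LLM, whose reply would be parsed back into a tree; both of those steps are omitted here.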
Problem

Research questions and friction points this paper is trying to address.

Overcome limitations of greedy and exact optimization methods in decision tree induction.
Integrate semantic priors and domain knowledge into genetic programming for better decision trees.
Enhance search efficiency and generalization using LLMs in evolutionary optimization.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Integrates semantic priors via LLMs
Uses fitness-guided crossover for optimization
Employs diversity-guided mutation for exploration
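Taken together, the points above amount to a GP loop whose crossover and mutation operators are delegated to an LLM. The sketch below shows that loop shape under heavy assumptions: candidates are abstracted to lists of (feature, threshold) splits, the fitness function is a toy surrogate, and the LLM operators are replaced with deterministic stubs so the loop runs without any API. None of these names come from the paper.

```python
import random

# Toy surrogate fitness: distance of thresholds to hidden optima.
# In LLEGO this would be the validation performance of the actual tree.
OPTIMA = {"age": 40.0, "bmi": 25.0}

def fitness(cand):
    return -sum(abs(t - OPTIMA[f]) for f, t in cand)

def llm_crossover(parents):
    """Stub for fitness-guided crossover: blend thresholds, weighted toward
    the fitter parent (a real system would prompt an LLM with both parents
    and their fitnesses instead)."""
    best = max(parents, key=fitness)
    other = min(parents, key=fitness)
    return [(f, 0.7 * t_best + 0.3 * t_other)
            for (f, t_best), (_, t_other) in zip(best, other)]

def llm_mutate(cand, rng):
    """Stub for diversity-guided mutation: jitter one threshold."""
    i = rng.randrange(len(cand))
    f, t = cand[i]
    out = list(cand)
    out[i] = (f, t + rng.uniform(-5, 5))
    return out

def evolve(generations=30, pop_size=8, seed=0):
    rng = random.Random(seed)
    pop = [[("age", rng.uniform(0, 80)), ("bmi", rng.uniform(10, 40))]
           for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        child = llm_mutate(llm_crossover(pop[:2]), rng)
        pop[-1] = child  # replace the weakest member; elites survive
    return max(pop, key=fitness)
```

Because elites are never replaced, best fitness is non-decreasing across generations; the exploration/exploitation balance the paper attributes to operator hyperparameters would live inside the two stubbed operators.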
Tennison Liu
University of Cambridge
Machine Learning · Brain-Machine Interfaces · Neural Prostheses
Nicolas Huynh
DAMTP, University of Cambridge
M. van der Schaar
DAMTP, University of Cambridge