🤖 AI Summary
To address two key bottlenecks in applying large language models (LLMs) to tabular data—(i) poor generalizability and privacy risks of existing serialization methods, and (ii) high fine-tuning costs and context-length limitations of in-context learning—we propose DeLTa, the first framework that synergistically integrates LLMs with decision trees via logical rules as an intermediary. DeLTa eliminates table-to-text serialization, avoids LLM fine-tuning or in-context learning, and achieves end-to-end error correction by leveraging LLMs to rewrite and calibrate decision tree rules. Its core innovations are a rule-level error calibration mechanism and a zero-serialization modeling paradigm. Evaluated across diverse tabular benchmarks, DeLTa achieves state-of-the-art performance, significantly improving accuracy, robustness, and full-data learning capability while inherently mitigating privacy leakage risks.
📝 Abstract
Tabular data play a vital role in diverse real-world fields, including healthcare, finance, etc. With the recent success of Large Language Models (LLMs), early explorations of extending LLMs to the domain of tabular data have emerged. Most of these LLM-based methods first serialize tabular data into natural language descriptions, and then fine-tune LLMs or infer directly on the serialized data. However, these methods suffer from two key inherent issues: (i) data perspective: existing data serialization methods lack universal applicability for structured tabular data, and may pose privacy risks through direct textual exposure; and (ii) model perspective: LLM fine-tuning methods struggle with tabular data, and the scalability of in-context learning is bottlenecked by input length constraints (making it suitable only for few-shot learning). This work explores a novel direction of integrating LLMs with tabular data through logical decision tree rules as intermediaries, and proposes DeLTa, a decision tree enhancer with LLM-derived rules for tabular prediction. DeLTa avoids tabular data serialization, and can be applied in the full-data learning setting without LLM fine-tuning. Specifically, we leverage the reasoning ability of LLMs to redesign an improved rule from a given set of decision tree rules. Furthermore, we provide a calibration method for the original decision trees via the new LLM-generated rule, which approximates an error correction vector that steers the original decision tree predictions in an error-reducing direction. Finally, extensive experiments on diverse tabular benchmarks show that our method achieves state-of-the-art performance.
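The pipeline the abstract describes can be sketched in three steps: extract decision-tree rules as text, hand them to an LLM for rule redesign, and calibrate the original tree's predictions with an approximate error-correction vector. Below is a minimal illustrative sketch, not the authors' implementation: the `llm_rewrite` function is a hypothetical placeholder for the actual LLM call, and the calibration step shown (shifting class probabilities by the mean residual) is only one simple way to realize an "error-reducing" correction.

```python
# Hedged sketch of a DeLTa-style pipeline. Assumptions (not from the paper):
# the llm_rewrite stub, the feature names f0..f3, and the mean-residual
# calibration with a fixed step size of 0.1.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)
tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)

# Step 1: rules as plain text -- note the data itself is never serialized,
# only the learned logical rules.
rules = export_text(tree, feature_names=[f"f{i}" for i in range(X.shape[1])])

def llm_rewrite(rule_text: str) -> str:
    """Hypothetical placeholder for the LLM call that redesigns an improved rule."""
    return rule_text  # a real system would prompt an LLM with rule_text here

new_rule = llm_rewrite(rules)

# Step 2: approximate an error-correction vector from the tree's residuals
# and nudge the original predictions in the error-reducing direction.
proba = tree.predict_proba(X)                  # (n_samples, n_classes)
onehot = np.eye(proba.shape[1])[y]             # one-hot ground truth
error_vec = (onehot - proba).mean(axis=0)      # approximate error-correction vector
calibrated = proba + 0.1 * error_vec           # small step toward reducing errors
pred = calibrated.argmax(axis=1)
```

In the actual framework, the correction would be driven by the LLM-generated rule rather than by ground-truth residuals; this stub only illustrates the shape of the calibration.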