XGenBoost: Synthesizing Small and Large Tabular Datasets with XGBoost

📅 2026-03-06
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the challenge of efficiently generating mixed-type tabular data by proposing XGenBoost, a framework that adapts XGBoost for generative modeling. The approach comprises two models: an XGBoost-driven denoising diffusion implicit model (DDIM) tailored for small datasets and a hierarchical autoregressive model designed for large-scale data. Key innovations include a joint Gaussian–multinomial diffusion mechanism that operates without one-hot encoding, dequantization based on empirical quantile functions to capture the non-continuous nature of real-world numeric columns, and hierarchical classifiers that preserve the ordinal structure of numerical features. Evaluated on two benchmark suites, one of smaller and one of larger datasets, XGenBoost consistently outperforms existing neural and tree-based generative models in generation quality while substantially reducing training cost.
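The tree-ensemble-as-denoiser idea can be sketched on a toy one-dimensional column. This is a minimal illustration under stated assumptions, not the paper's implementation: scikit-learn's `GradientBoostingRegressor` stands in for XGBoost, the data is a single Gaussian feature, and the joint Gaussian–multinomial handling of categorical columns is omitted.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)
x0 = rng.normal(2.0, 0.5, size=2000)       # toy numeric "column", mean 2.0

# Standard variance-preserving noise schedule.
T = 50
betas = np.linspace(1e-4, 0.2, T)
alphas_bar = np.cumprod(1.0 - betas)

# Train the tree ensemble to predict the injected noise eps from (x_t, t);
# this plays the role of the DDIM's score estimator (stand-in for XGBoost).
t = rng.integers(0, T, size=x0.size)
eps = rng.normal(size=x0.size)
xt = np.sqrt(alphas_bar[t]) * x0 + np.sqrt(1.0 - alphas_bar[t]) * eps
model = GradientBoostingRegressor(n_estimators=200, max_depth=3, random_state=0)
model.fit(np.column_stack([xt, t]), eps)

# Deterministic DDIM reverse pass (eta = 0), from pure noise back to data.
x = rng.normal(size=500)
for step in range(T - 1, 0, -1):
    e = model.predict(np.column_stack([x, np.full(x.size, step)]))
    x0_hat = (x - np.sqrt(1.0 - alphas_bar[step]) * e) / np.sqrt(alphas_bar[step])
    x = np.sqrt(alphas_bar[step - 1]) * x0_hat + np.sqrt(1.0 - alphas_bar[step - 1]) * e

# x should now roughly follow the training distribution N(2.0, 0.5).
```

With η = 0 the reverse pass is deterministic, so sample diversity comes only from the initial noise; per the abstract, the small-data model follows this DDIM formulation with XGBoost as score estimator.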

📝 Abstract
Tree ensembles such as XGBoost are often preferred for discriminative tasks in mixed-type tabular data, due to their inductive biases, minimal hyperparameter tuning, and training efficiency. We argue that these qualities, when leveraged correctly, can make for better generative models as well. As such, we present XGenBoost, a pair of generative models based on XGBoost: i) a Denoising Diffusion Implicit Model (DDIM) with XGBoost as score estimator, suited for smaller datasets, and ii) a hierarchical autoregressive model whose conditionals are learned via XGBoost classifiers, suited for large-scale tabular synthesis. The architectures follow from the natural constraints imposed by tree-based learners, e.g., in the diffusion model, combining Gaussian and multinomial diffusion to leverage native categorical splits and avoid one-hot encoding while accurately modelling mixed data types. In the autoregressive model, we use a fixed-order factorization, a hierarchical classifier to impose ordinal inductive biases when modelling numerical features, and dequantization based on empirical quantile functions to model the non-continuous nature of most real-world tabular datasets. Through two benchmarks, one containing smaller and the other larger datasets, we show that our proposed architectures outperform previous neural- and tree-based generative models for mixed-type tabular synthesis at lower training cost.
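The empirical-quantile dequantization mentioned in the abstract can be sketched in a few lines. This is an illustrative reconstruction, not the paper's code: each repeated value in a discrete numeric column is spread uniformly over its band of the empirical CDF, and applying the empirical quantile function inverts the map exactly.

```python
import numpy as np

rng = np.random.default_rng(0)
col = rng.choice([0.0, 1.0, 1.0, 5.0], size=1000)  # non-continuous column

values, counts = np.unique(col, return_counts=True)
cdf_hi = np.cumsum(counts) / counts.sum()          # CDF just after each value
cdf_lo = cdf_hi - counts / counts.sum()            # CDF just before each value

# Dequantize: replace each value with a uniform draw from its CDF band,
# giving a continuous surrogate variable in (0, 1).
idx = np.searchsorted(values, col)
u = rng.uniform(cdf_lo[idx], cdf_hi[idx])

def empirical_quantile(u, values, cdf_hi):
    """Invert: the smallest value whose CDF band covers u."""
    return values[np.searchsorted(cdf_hi, u, side="left")]

recovered = empirical_quantile(u, values, cdf_hi)
assert np.array_equal(recovered, col)              # round trip is lossless
```

Because the quantile function is the exact inverse of the band assignment, a generative model can be trained on the continuous `u` while the synthesized samples snap back onto the column's true discrete support.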
Problem

Research questions and friction points this paper is trying to address.

tabular data synthesis
mixed-type data
generative modeling
XGBoost
tree-based models
Innovation

Methods, ideas, or system contributions that make the work stand out.

XGBoost
tabular data synthesis
denoising diffusion
autoregressive modeling
mixed-type data
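The hierarchical, ordinal decomposition behind the autoregressive model (tagged above) can be sketched as follows. The helper names are illustrative, and the per-level binary XGBoost classifiers are omitted: the point is that a bin index becomes a root-to-leaf path of "above or below the midpoint?" decisions, so classification errors land in nearby bins.

```python
import numpy as np

def bin_to_path(b, depth):
    """Most-significant bit first: bit k halves the remaining value range."""
    return [(b >> (depth - 1 - k)) & 1 for k in range(depth)]

def path_to_bin(path):
    b = 0
    for bit in path:
        b = (b << 1) | bit
    return b

depth = 4                       # 16 ordinal bins
bins = np.arange(2 ** depth)
paths = [bin_to_path(int(b), depth) for b in bins]

# Round trip is exact, and the paths are lexicographically ordered, which
# is what gives the decomposition its ordinal inductive bias.
assert all(path_to_bin(p) == b for p, b in zip(paths, bins))
assert paths == sorted(paths)
```

In the full model, one binary classifier per level would predict each bit conditioned on the bits above it, replacing a single flat multi-class head that would treat adjacent bins as unrelated labels.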