Better by Default: Strong Pre-Tuned MLPs and Boosted Trees on Tabular Data

📅 2024-07-05
🏛️ arXiv.org
📈 Citations: 4
Influential: 1
🤖 AI Summary
Existing deep learning models for medium-to-large tabular classification and regression, such as standard MLPs, often suffer from suboptimal accuracy, slow inference, or heavy dependence on labor-intensive hyperparameter tuning. To address this, the authors propose RealMLP, an efficient MLP architecture designed for tabular data, coupled with a meta-learned default hyperparameter configuration derived from 118 diverse datasets, significantly improving the accuracy-latency trade-off. They further introduce a parameter-free ensemble of GBDTs and RealMLP whose accuracy is competitive with state-of-the-art GBDTs while offering faster inference. On a disjoint benchmark of 90 datasets the ensemble matches top GBDT performance, and on the Grinsztajn benchmark it sets a new state of the art, outperforming advanced methods, including TabR, under default settings. The core contributions are threefold: (i) a tabular-optimized MLP architecture, (ii) a meta-learning paradigm for robust default hyperparameters, and (iii) a parameter-free, SOTA-level hybrid model.

📝 Abstract
For classification and regression on tabular data, the dominance of gradient-boosted decision trees (GBDTs) has recently been challenged by often much slower deep learning methods with extensive hyperparameter tuning. We address this discrepancy by introducing (a) RealMLP, an improved multilayer perceptron (MLP), and (b) strong meta-tuned default parameters for GBDTs and RealMLP. We tune RealMLP and the default parameters on a meta-train benchmark with 118 datasets and compare them to hyperparameter-optimized versions on a disjoint meta-test benchmark with 90 datasets, as well as the GBDT-friendly benchmark by Grinsztajn et al. (2022). Our benchmark results on medium-to-large tabular datasets (1K–500K samples) show that RealMLP offers a favorable time-accuracy tradeoff compared to other neural baselines and is competitive with GBDTs in terms of benchmark scores. Moreover, a combination of RealMLP and GBDTs with improved default parameters can achieve excellent results without hyperparameter tuning. Finally, we demonstrate that some of RealMLP's improvements can also considerably improve the performance of TabR with default parameters.
Problem

Research questions and friction points this paper is trying to address.

Large-scale Tabular Data
Deep Learning vs GBDTs
Parameter Tuning
Innovation

Methods, ideas, or system contributions that make the work stand out.

RealMLP
Parameter Configuration
Tabular Data Performance
David Holzmüller
SIERRA Team, Inria Paris, École Normale Supérieure, PSL University
Léo Grinsztajn
SODA Team, Inria Saclay
Ingo Steinwart
University of Stuttgart
Statistical Learning Theory · Kernel Methods · Cluster Analysis · Support Vector Machines · Neural Networks