Real-TabPFN: Improving Tabular Foundation Models via Continued Pre-training With Real-World Data

📅 2025-07-05
📈 Citations: 0
Influential: 0
🤖 AI Summary
Tabular foundation models such as TabPFN are pre-trained solely on synthetic data, which leaves a domain gap when they are applied to real-world small datasets. This paper proposes a continued pre-training paradigm grounded in high-quality real-world tabular data: rather than relying on broad but noisy public sources, the authors curate a collection of large, high signal-to-noise-ratio real-world tabular datasets and perform targeted continued pre-training within the model's probabilistic framework. The method requires no architectural modifications and no changes to downstream fine-tuning procedures; a lightweight continued pre-training phase alone substantially improves adaptability. Evaluated on 29 heterogeneous small datasets from the OpenML AutoML Benchmark, the approach achieves an average accuracy gain of 3.2% over the baseline, with notable improvements in cross-domain generalization and few-shot performance.

📝 Abstract
Foundation models for tabular data, like TabPFN, achieve strong performance on small datasets when pre-trained solely on synthetic data. We show that this performance can be significantly boosted by a targeted continued pre-training phase. Specifically, we demonstrate that leveraging a small, curated collection of large, real-world datasets for continued pre-training yields superior downstream predictive accuracy compared to using broader, potentially noisier corpora like CommonCrawl or GitTables. Our resulting model, Real-TabPFN, achieves substantial performance gains on 29 datasets from the OpenML AutoML Benchmark.
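The core recipe described above — pre-train on synthetic data, then simply continue training on curated real-world data — can be illustrated with a minimal sketch. This is hypothetical code, not the authors' implementation: a toy 1-D logistic-regression "model" stands in for TabPFN, `make_data` and `train` are invented helpers, and the learning rates are arbitrary. The point it demonstrates is that continued pre-training changes only the data fed to an unmodified training loop, typically at a reduced learning rate so knowledge from the first phase is retained.

```python
# Hypothetical sketch of continued pre-training (NOT the Real-TabPFN code):
# phase 1 trains on synthetic data, phase 2 continues on curated real data
# with a smaller learning rate. No architectural change between phases.
import math
import random

random.seed(0)

def make_data(n, w_true, noise):
    """Generate 1-D logistic-regression samples for a given true weight."""
    data = []
    for _ in range(n):
        x = random.uniform(-2.0, 2.0)
        p = 1.0 / (1.0 + math.exp(-w_true * x))
        y = 1 if random.random() < min(max(p + random.uniform(-noise, noise), 0.0), 1.0) else 0
        data.append((x, y))
    return data

def train(w, data, lr, epochs):
    """Plain SGD on the logistic loss; returns the updated weight."""
    for _ in range(epochs):
        for x, y in data:
            p = 1.0 / (1.0 + math.exp(-w * x))
            w += lr * (y - p) * x  # gradient ascent on log-likelihood
    return w

# Phase 1: pre-train on cheap, plentiful synthetic data (slightly off-domain).
synthetic = make_data(500, w_true=1.0, noise=0.3)
w = train(0.0, synthetic, lr=0.1, epochs=3)

# Phase 2: continued pre-training on curated real-world data,
# with a reduced learning rate so phase-1 knowledge is retained.
real = make_data(200, w_true=1.5, noise=0.1)
w = train(w, real, lr=0.02, epochs=3)
```

The same loop runs in both phases; only the dataset and learning rate differ, mirroring the paper's claim that no changes to the architecture or downstream procedures are needed.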
Problem

Research questions and friction points this paper is trying to address.

Improving tabular foundation models with real-world data
Boosting performance via targeted continued pre-training
Enhancing predictive accuracy on diverse OpenML datasets
Innovation

Methods, ideas, or system contributions that make the work stand out.

Continued pre-training with real-world data
Leveraging curated large real-world datasets
Superior predictive accuracy on benchmarks