🤖 AI Summary
This study critically examines the generalization capabilities of current table language models (TLMs), arguing that their reported high performance may stem from data contamination and familiarity with task formats rather than genuine reasoning ability. Through a systematic reproduction of Tabula-8B’s results across 165 datasets in the UniPredict benchmark, complemented by data contamination detection, ablation studies, and instruction tuning without tabular inputs, the work reveals that TLMs exhibit minimal substantive generalization on most classification tasks. Specifically, median performance gains over majority-class baselines are near zero for both binary and multiclass classification; 92.2% of standard classification performance can be recovered through fine-tuning without any tabular data; and 71.3% of the interquartile performance gap on quartile classification is attributable to format familiarity. These findings call for a fundamental re-evaluation of prevailing TLM evaluation paradigms.
📝 Abstract
Tabular Language Models (TLMs) have been claimed to achieve emergent generalization for tabular prediction. We conduct a systematic re-evaluation of Tabula-8B as a representative TLM, using 165 datasets from the UniPredict benchmark. Our investigation reveals three findings. First, binary and categorical classification achieve near-zero median lift over majority-class baselines, and strong aggregate performance is driven entirely by quartile classification tasks. Second, top-performing datasets exhibit pervasive contamination, including complete train-test overlap and task-level leakage that evades standard deduplication. Third, instruction tuning without tabular exposure recovers 92.2% of standard classification performance, and on quartile classification, format familiarity closes 71.3% of the gap, with the residual attributable to contaminated datasets. These findings suggest the claimed generalization likely reflects evaluation artifacts rather than learned tabular reasoning. We conclude with recommendations for strengthening TLM evaluation.
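Two of the quantities above (exact train-test overlap and lift over a majority-class baseline) can be checked with a few lines of code. The sketch below is illustrative only, not the paper's actual methodology: the helper names (`contamination_rate`, `majority_class_lift`) and the row-hashing scheme are assumptions, and exact-match hashing will miss the task-level leakage the abstract notes evades standard deduplication.

```python
import hashlib

import pandas as pd


def _row_hash(row: tuple) -> str:
    # Canonicalize the row as a delimited string before hashing,
    # so identical rows in different splits collide exactly.
    return hashlib.sha256("|".join(map(str, row)).encode()).hexdigest()


def contamination_rate(train: pd.DataFrame, test: pd.DataFrame) -> float:
    """Fraction of test rows that appear verbatim in the training split."""
    train_hashes = {_row_hash(r) for r in train.itertuples(index=False, name=None)}
    hits = sum(
        _row_hash(r) in train_hashes
        for r in test.itertuples(index=False, name=None)
    )
    return hits / len(test)


def majority_class_lift(model_accuracy: float, y_test: pd.Series) -> float:
    """Accuracy lift over always predicting the most frequent test class."""
    baseline = y_test.value_counts(normalize=True).iloc[0]
    return model_accuracy - baseline
```

For example, a model at 80% accuracy on a test set where 75% of labels share one class has a lift of only 0.05, which is the kind of near-zero margin the first finding describes.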