🤖 AI Summary
This study critically examines the generalization capabilities of current table language models (TLMs), arguing that their reported high performance may stem from data contamination and familiarity with task formats rather than genuine reasoning ability. Through a systematic reproduction of Tabula-8B’s results across 165 datasets in the UniPredict benchmark, complemented by data contamination detection, ablation studies, and instruction tuning without tabular inputs, the work reveals that TLMs exhibit minimal substantive generalization on most classification tasks. Specifically, median performance gains over majority-class baselines are near zero for both binary and multiclass classification; 92.2% of standard classification performance can be recovered through fine-tuning without any tabular data; and 71.3% of the interquartile performance gap on quartile classification is attributable to format familiarity. These findings call for a fundamental re-evaluation of prevailing TLM evaluation paradigms.
📝 Abstract
Tabular Language Models (TLMs) have been claimed to achieve emergent generalization for tabular prediction. We conduct a systematic re-evaluation of Tabula-8B as a representative TLM, using 165 datasets from the UniPredict benchmark. Our investigation reveals three findings. First, binary and categorical classification achieve near-zero median lift over majority-class baselines, and strong aggregate performance is driven entirely by quartile classification tasks. Second, top-performing datasets exhibit pervasive contamination, including complete train-test overlap and task-level leakage that evades standard deduplication. Third, instruction tuning without tabular exposure recovers 92.2% of standard classification performance, and on quartile classification, format familiarity closes 71.3% of the gap, with the residual attributable to contaminated datasets. These findings suggest the claimed generalization likely reflects evaluation artifacts rather than learned tabular reasoning. We conclude with recommendations for strengthening TLM evaluation.
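Two of the quantities above (exact train-test overlap and lift over a majority-class baseline) can be checked with a few lines of code. The sketch below is illustrative only, not the paper's actual methodology: the helper names (`contamination_rate`, `majority_class_lift`) and the row-hashing scheme are assumptions, and exact-match hashing will miss the task-level leakage the abstract notes evades standard deduplication.

```python
import hashlib

import pandas as pd


def _row_hash(row: tuple) -> str:
    # Canonicalize the row as a delimited string before hashing,
    # so identical rows in different splits collide exactly.
    return hashlib.sha256("|".join(map(str, row)).encode()).hexdigest()


def contamination_rate(train: pd.DataFrame, test: pd.DataFrame) -> float:
    """Fraction of test rows that appear verbatim in the training split."""
    train_hashes = {_row_hash(r) for r in train.itertuples(index=False, name=None)}
    hits = sum(
        _row_hash(r) in train_hashes
        for r in test.itertuples(index=False, name=None)
    )
    return hits / len(test)


def majority_class_lift(model_accuracy: float, y_test: pd.Series) -> float:
    """Accuracy lift over always predicting the most frequent test class."""
    baseline = y_test.value_counts(normalize=True).iloc[0]
    return model_accuracy - baseline
```

For example, a model at 80% accuracy on a test set where 75% of labels share one class has a lift of only 0.05, which is the kind of near-zero margin the first finding describes.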