TabDistill: Distilling Transformers into Neural Nets for Few-Shot Tabular Classification

📅 2025-11-07
📈 Citations: 0
Influential: 0
🤖 AI Summary
Transformer models, while effective for few-shot tabular classification, suffer from excessive parameter counts and high computational overhead; lightweight alternatives, in contrast, lack sufficient representational capacity. Method: We propose a knowledge distillation framework that transfers semantic representations and domain adaptation capabilities—learned by a pretrained Transformer under few-shot conditions—into a significantly smaller neural network. Distillation is guided solely by soft labels, requiring no additional labeled data. Contribution/Results: To our knowledge, this is the first work to successfully distill Transformer knowledge into simple architectures for few-shot tabular tasks. Experiments across multiple benchmarks demonstrate that the distilled models consistently outperform XGBoost, logistic regression, and baseline MLPs—and in several cases even surpass the original Transformer—achieving superior accuracy with drastically reduced parameter counts. This approach effectively resolves the long-standing trade-off between model performance and parameter efficiency in few-shot tabular learning.
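The distillation objective described above, matching a small student network to the teacher's soft predicted class distribution with no extra labeled data, can be sketched as follows. This is a minimal illustrative example, not the paper's actual implementation: the "teacher" here is a stand-in random linear scorer (in TabDistill it would be a pretrained transformer), and all names and hyperparameters are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

# Few-shot tabular inputs (16 rows, 8 features) and teacher soft labels.
# The teacher is a placeholder linear model; the paper uses a pretrained
# transformer's predicted class probabilities instead.
X = rng.normal(size=(16, 8))
W_teacher = rng.normal(size=(8, 2))
soft_labels = softmax(X @ W_teacher)

# Tiny student MLP: 8 -> 4 -> 2, far fewer parameters than the teacher.
W1 = rng.normal(scale=0.1, size=(8, 4)); b1 = np.zeros(4)
W2 = rng.normal(scale=0.1, size=(4, 2)); b2 = np.zeros(2)

def forward(X):
    h = np.tanh(X @ W1 + b1)
    return h, softmax(h @ W2 + b2)

def kl_loss(p, q):
    # KL(teacher || student): the standard soft-label distillation objective.
    return np.mean(np.sum(p * (np.log(p + 1e-9) - np.log(q + 1e-9)), axis=1))

lr = 0.5
losses = []
for _ in range(300):
    h, q = forward(X)
    losses.append(kl_loss(soft_labels, q))
    # d(KL)/d(logits) = (student probs - teacher probs), averaged over rows.
    g_logits = (q - soft_labels) / len(X)
    gW2 = h.T @ g_logits; gb2 = g_logits.sum(0)
    g_h = g_logits @ W2.T * (1 - h ** 2)   # tanh backprop
    gW1 = X.T @ g_h; gb1 = g_h.sum(0)
    W2 -= lr * gW2; b2 -= lr * gb2
    W1 -= lr * gW1; b1 -= lr * gb1

print(f"KL before: {losses[0]:.3f}, after: {losses[-1]:.3f}")
```

Because the loss is computed purely against the teacher's soft outputs, no ground-truth labels beyond the few-shot inputs are needed, which is what makes the strategy viable in the few-shot regime.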

📝 Abstract
Transformer-based models have shown promising performance on tabular data compared to their classical counterparts such as neural networks and Gradient Boosted Decision Trees (GBDTs) in scenarios with limited training data. They utilize their pre-trained knowledge to adapt to new domains, achieving commendable performance with only a few training examples, also called the few-shot regime. However, the performance gain in the few-shot regime comes at the expense of significantly increased complexity and number of parameters. To circumvent this trade-off, we introduce TabDistill, a new strategy to distill the pre-trained knowledge in complex transformer-based models into simpler neural networks for effectively classifying tabular data. Our framework yields the best of both worlds: being parameter-efficient while performing well with limited training data. The distilled neural networks surpass classical baselines such as regular neural networks, XGBoost and logistic regression under equal training data, and in some cases, even the original transformer-based models that they were distilled from.
Problem

Research questions and friction points this paper is trying to address.

How to distill the pre-trained knowledge of complex transformers into simpler neural networks for tabular data
How to achieve parameter efficiency without sacrificing few-shot classification performance
Whether distilled models can surpass classical baselines under limited training data
Innovation

Methods, ideas, or system contributions that make the work stand out.

Distilling transformers into neural networks
Parameter-efficient few-shot tabular classification
Surpassing original transformers and classical baselines