AI Summary
This work addresses the scarcity of high-quality human-annotated data for Table Question Answering (TQA). We propose the first LLM self-improvement framework for TQA based on synthetically generated data. Methodologically, we model chain-of-thought reasoning as a discrete state sequence, introduce a state-level scoring mechanism and process-aware contrastive sampling, and apply lightweight preference learning via a PPO variant for reinforcement fine-tuning. Using only 8,000 self-generated preference pairs, our approach achieves up to +5.0% accuracy gain on in-domain test sets and +2.4% improvement in out-of-domain generalization. It attains 5× faster inference than current SOTA models while matching the performance of significantly larger systems. Our core contribution is the first efficient, low-overhead, process-aware self-improvement paradigm for TQA, uniquely balancing generalizability, inference efficiency, and scalability.
Abstract
Improving large language models (LLMs) with self-generated data has demonstrated success in tasks such as mathematical reasoning and code generation. Yet, this direction remains unexplored for table question answering (TQA), where a system answers questions based on tabular data. Addressing this gap matters because effective self-improvement can boost TQA performance without requiring costly manual annotation. In this work, we propose PPT, a Process-based Preference learning framework for TQA. It decomposes reasoning chains into discrete states, assigns a score to each state, and samples contrastive steps for preference learning. Experimental results show that PPT improves TQA models by up to 5% on in-domain datasets and 2.4% on out-of-domain datasets, using only 8,000 preference pairs. Furthermore, the resulting models achieve results competitive with more complex and larger state-of-the-art TQA systems, while being five times more efficient during inference.
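To make the pipeline concrete, here is a minimal sketch of how state decomposition and process-aware contrastive sampling could fit together. All names (`State`, `sample_preference_pairs`), the margin threshold, and the prefix-matching heuristic are illustrative assumptions, not the paper's actual implementation: the idea is that two sampled reasoning chains sharing a prefix but diverging at some state yield a (chosen, rejected) pair when their state-level scores differ enough.

```python
# Hypothetical sketch: build preference pairs from scored reasoning states.
# Assumes chains are already decomposed into states and each state carries a
# score (e.g., from an answer-consistency or outcome-based check).
from dataclasses import dataclass

@dataclass
class State:
    text: str      # one discrete reasoning step
    score: float   # state-level score in [0, 1]

def sample_preference_pairs(chains, margin=0.5):
    """For every ordered pair of chains, find the first state where they
    diverge after a shared prefix; if the score gap exceeds `margin`,
    emit a (shared_prefix, chosen_step, rejected_step) triple."""
    pairs = []
    for chain_a in chains:
        for chain_b in chains:
            # length of the longest shared prefix of identical steps
            k = 0
            while (k < min(len(chain_a), len(chain_b))
                   and chain_a[k].text == chain_b[k].text):
                k += 1
            # need a nonempty prefix and a divergent state in both chains
            if k == 0 or k >= len(chain_a) or k >= len(chain_b):
                continue
            better, worse = chain_a[k], chain_b[k]
            if better.score - worse.score >= margin:
                prefix = " ".join(s.text for s in chain_a[:k])
                pairs.append((prefix, better.text, worse.text))
    return pairs
```

Such triples could then feed any pairwise preference objective; the summary above states the actual training uses a PPO variant for lightweight reinforcement fine-tuning.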