TableEval: A Real-World Benchmark for Complex, Multilingual, and Multi-Structured Table Question Answering

📅 2025-06-04
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing TableQA benchmarks are largely confined to monolingual, flat-table settings and suffer from data leakage and poor real-world applicability. To address these limitations, we propose TableEval, a realistic, multilingual (Simplified Chinese, Traditional Chinese, English), and multi-structured (concise, hierarchical, nested) TableQA benchmark. It spans four high-impact domains (government, finance, academia, and industry) and draws exclusively on recent authentic documents, all manually verified. We also introduce SEAT, a fine-grained evaluation framework that enables automated scoring via sub-question-level semantic alignment, achieving a 0.92 Pearson correlation with human annotations. Extensive experiments expose critical bottlenecks in state-of-the-art LLMs across cross-lingual transfer, complex structural parsing, and multi-hop reasoning. The TableEval dataset is publicly released.

📝 Abstract
LLMs have shown impressive progress in natural language processing. However, they still face significant challenges in TableQA, where real-world complexities such as diverse table structures, multilingual data, and domain-specific reasoning are crucial. Existing TableQA benchmarks are often limited by their focus on simple flat tables and suffer from data leakage. Furthermore, most benchmarks are monolingual and fail to capture the cross-lingual and cross-domain variability in practical applications. To address these limitations, we introduce TableEval, a new benchmark designed to evaluate LLMs on realistic TableQA tasks. Specifically, TableEval includes tables with various structures (such as concise, hierarchical, and nested tables) collected from four domains (including government, finance, academia, and industry reports). Besides, TableEval features cross-lingual scenarios with tables in Simplified Chinese, Traditional Chinese, and English. To minimize the risk of data leakage, we collect all data from recent real-world documents. Considering that existing TableQA metrics fail to capture semantic accuracy, we further propose SEAT, a new evaluation framework that assesses the alignment between model responses and reference answers at the sub-question level. Experimental results have shown that SEAT achieves high agreement with human judgment. Extensive experiments on TableEval reveal critical gaps in the ability of state-of-the-art LLMs to handle these complex, real-world TableQA tasks, offering insights for future improvements. We make our dataset available here: https://github.com/wenge-research/TableEval.
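As the abstract describes, SEAT assesses alignment between a model response and the reference answer at the sub-question level. A minimal sketch of that scoring scheme, using string similarity as a stand-in for the paper's semantic alignment (the function name, splitting rule, and threshold are all illustrative assumptions):

```python
from difflib import SequenceMatcher

def seat_score(response: str, sub_answers: list[str], threshold: float = 0.6) -> float:
    """Fraction of reference sub-answers matched somewhere in the response.

    SEAT itself performs semantic alignment (e.g., via an LLM judge);
    SequenceMatcher string similarity is only a placeholder here, and
    the 0.6 threshold is an illustrative choice.
    """
    parts = [p.strip() for p in response.split(";")]
    hits = 0
    for ref in sub_answers:
        best = max(SequenceMatcher(None, ref.lower(), p.lower()).ratio() for p in parts)
        if best >= threshold:
            hits += 1
    return hits / len(sub_answers)
```

Scoring each sub-answer independently gives partial credit for partially correct multi-hop answers, which exact-match TableQA metrics cannot capture.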
Problem

Research questions and friction points this paper is trying to address.

Evaluating LLMs on complex, multilingual TableQA tasks
Addressing data leakage in existing TableQA benchmarks
Proposing semantic accuracy metrics for TableQA evaluation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Diverse table structures from multiple domains
Cross-lingual tables in Simplified Chinese, Traditional Chinese, and English
SEAT framework for semantic accuracy evaluation
Junnan Zhu
Institute of Automation, Chinese Academy of Sciences
Natural Language Processing
Jingyi Wang
Beijing Wenge Technology Co., Ltd., State Key Laboratory of Multimodal Artificial Intelligence Systems, Institute of Automation, CAS, Beijing, China
Bohan Yu
School of Advanced Interdisciplinary Sciences, University of Chinese Academy of Sciences, China
Xiaoyu Wu
Central University of Finance and Economics
development economics, labor economics, health economics
Junbo Li
University of Texas at Austin
agentic reasoning LLM, reinforcement learning
Lei Wang
Beijing Wenge Technology Co., Ltd., State Key Laboratory of Multimodal Artificial Intelligence Systems, Institute of Automation, CAS, Beijing, China
Nan Xu
Beijing Wenge Technology Co., Ltd., State Key Laboratory of Multimodal Artificial Intelligence Systems, Institute of Automation, CAS, Beijing, China