T2R-bench: A Benchmark for Generating Article-Level Reports from Real World Industrial Tables

📅 2025-08-27
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing research on generating article-level reports from industrial tabular data faces two critical bottlenecks: (1) heterogeneous table structures severely degrade generation quality, and (2) a lack of realistic, scenario-oriented evaluation benchmarks. This paper formally defines the “Table-to-Report Generation” task and introduces T2R-bench—the first bilingual, industry-scale benchmark—covering 19 domains and four table types, constructed from real-world data with rigorous human annotation. We propose a multidimensional evaluation framework and systematically assess 25 state-of-the-art large language models. Experimental results reveal that even the top-performing model (DeepSeek-R1) achieves only a composite score of 62.71, highlighting fundamental limitations in current approaches. Our work establishes an authoritative benchmark, a standardized evaluation paradigm, and a diagnostic toolkit for capability analysis, thereby advancing practical research in tabular understanding and report generation.

📝 Abstract
Extensive research has been conducted to explore the capabilities of large language models (LLMs) in table reasoning. However, the essential task of transforming table information into reports remains a significant challenge for industrial applications. This task is plagued by two critical issues: 1) the complexity and diversity of tables lead to suboptimal reasoning outcomes; and 2) existing table benchmarks lack the capacity to adequately assess the practical application of this task. To fill this gap, we propose the table-to-report task and construct a bilingual benchmark named T2R-bench, where key information flows from the tables to the reports. The benchmark comprises 457 industrial tables, all derived from real-world scenarios and encompassing 19 industry domains as well as 4 types of industrial tables. Furthermore, we propose evaluation criteria to fairly measure the quality of report generation. Experiments on 25 widely used LLMs reveal that even state-of-the-art models like DeepSeek-R1 achieve only a 62.71 overall score, indicating that LLMs still have room for improvement on T2R-bench. Source code and data will be available after acceptance.
Problem

Research questions and friction points this paper is trying to address.

Transforming industrial tables into article-level reports
Addressing complexity and diversity in table reasoning
Assessing practical application of table-to-report generation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Bilingual benchmark T2R-bench for table-to-report
Real-world industrial tables across 19 domains
Proposed evaluation criteria for report generation quality
👥 Authors
Jie Zhang, Institute of Artificial Intelligence (TeleAI), China Telecom
Changzai Pan, Institute of Artificial Intelligence (TeleAI), China Telecom
Kaiwen Wei, Chongqing University
Sishi Xiong, Institute of Artificial Intelligence (TeleAI), China Telecom
Yu Zhao, Institute of Artificial Intelligence (TeleAI), China Telecom
Xiangyu Li, Institute of Artificial Intelligence (TeleAI), China Telecom
Jiaxin Peng, Institute of Artificial Intelligence (TeleAI), China Telecom
Xiaoyan Gu, Institute of Artificial Intelligence (TeleAI), China Telecom
Jian Yang, Beihang University
Wenhan Chang, Institute of Artificial Intelligence (TeleAI), China Telecom
Zhenhe Wu, Beihang University
Jiang Zhong, Chongqing University
Shuangyong Song, Institute of Artificial Intelligence (TeleAI), China Telecom
Yongxiang Li
Xuelong Li, Institute of Artificial Intelligence (TeleAI), China Telecom