🤖 AI Summary
Existing research on generating article-level reports from industrial tabular data faces two critical bottlenecks: (1) heterogeneous table structures severely degrade generation quality, and (2) a lack of realistic, scenario-oriented evaluation benchmarks. This paper formally defines the “Table-to-Report Generation” task and introduces T2R-bench—the first bilingual, industry-scale benchmark—covering 19 domains and four table types, constructed from real-world data with rigorous human annotation. We propose a multidimensional evaluation framework and systematically assess 25 state-of-the-art large language models. Experimental results reveal that even the top-performing model (DeepSeek-R1) achieves only a composite score of 62.71, highlighting fundamental limitations in current approaches. Our work establishes an authoritative benchmark, a standardized evaluation paradigm, and a diagnostic toolkit for capability analysis, thereby advancing practical research in tabular understanding and report generation.
📝 Abstract
Extensive research has explored the capabilities of large language models (LLMs) in table reasoning. However, the essential task of transforming table information into reports remains a significant challenge for industrial applications. This task is plagued by two critical issues: 1) the complexity and diversity of tables lead to suboptimal reasoning outcomes; and 2) existing table benchmarks lack the capacity to adequately assess the practical application of this task. To fill this gap, we propose the table-to-report task and construct a bilingual benchmark named T2R-bench, in which the key information flows from the tables to the reports. The benchmark comprises 457 industrial tables, all derived from real-world scenarios, spanning 19 industry domains and 4 types of industrial tables. Furthermore, we propose evaluation criteria to fairly measure the quality of report generation. Experiments on 25 widely used LLMs reveal that even state-of-the-art models such as DeepSeek-R1 achieve an overall score of only 62.71, indicating that LLMs still have room for improvement on T2R-bench. Source code and data will be available after acceptance.