🤖 AI Summary
Existing research on generating article-level reports from industrial tabular data faces two critical bottlenecks: (1) heterogeneous table structures severely degrade generation quality, and (2) a lack of realistic, scenario-oriented evaluation benchmarks. This paper formally defines the “Table-to-Report Generation” task and introduces T2R-bench—the first bilingual, industry-scale benchmark—covering 19 domains and four table types, constructed from real-world data with rigorous human annotation. We propose a multidimensional evaluation framework and systematically assess 25 state-of-the-art large language models. Experimental results reveal that even the top-performing model (DeepSeek-R1) achieves only a composite score of 62.71, highlighting fundamental limitations in current approaches. Our work establishes an authoritative benchmark, a standardized evaluation paradigm, and a diagnostic toolkit for capability analysis, thereby advancing practical research in tabular understanding and report generation.
📝 Abstract
Extensive research has explored the capabilities of large language models (LLMs) in table reasoning. However, the essential task of transforming table information into reports remains a significant challenge for industrial applications. This task is plagued by two critical issues: 1) the complexity and diversity of tables lead to suboptimal reasoning outcomes; and 2) existing table benchmarks lack the capacity to adequately assess the practical application of this task. To fill this gap, we propose the table-to-report task and construct a bilingual benchmark named T2R-bench, in which the key information flows from the tables to the reports. The benchmark comprises 457 industrial tables, all derived from real-world scenarios, spanning 19 industry domains and 4 types of industrial tables. Furthermore, we propose evaluation criteria to fairly measure the quality of report generation. Experiments on 25 widely used LLMs reveal that even state-of-the-art models such as DeepSeek-R1 achieve an overall score of only 62.71, indicating that LLMs still have room for improvement on T2R-bench. Source code and data will be available after acceptance.