🤖 AI Summary
Existing long-document evaluation benchmarks suffer from limited page-scale coverage, inadequate assessment of layout-aware element locating, and insufficient support for complex numerical reasoning. To address these gaps, we introduce LongDocURL, a comprehensive multimodal benchmark for long-document understanding built around a three-dimensional "understanding–reasoning–locating" evaluation framework. It comprises 20 diverse subtasks and 2,325 high-quality QA pairs constructed from more than 33,000 pages of real-world documents. A semi-automated data curation pipeline integrates structured document parsing, vision-language alignment verification, and a unified cross-model evaluation protocol. Extensive evaluation of open- and closed-source large vision-language models across 26 configurations reveals critical bottlenecks in cross-page locating and multi-step numerical reasoning, establishing LongDocURL as a rigorous, multidimensional diagnostic benchmark for long-document multimodal understanding.
📝 Abstract
Large vision-language models (LVLMs) have remarkably improved document understanding capabilities, enabling the handling of complex document elements, longer contexts, and a wider range of tasks. However, existing document understanding benchmarks are limited to a small number of pages and fail to provide a comprehensive analysis of layout element locating. In this paper, we first define three primary task categories: Long Document Understanding, Numerical Reasoning, and Cross-element Locating, and then propose LongDocURL, a comprehensive benchmark integrating the three primary tasks above and comprising 20 sub-tasks categorized by primary task and type of answer evidence. Furthermore, we develop a semi-automated construction pipeline and collect 2,325 high-quality question-answering pairs covering more than 33,000 pages of documents, significantly surpassing existing benchmarks in scale. Finally, we conduct comprehensive evaluation experiments on both open-source and closed-source models across 26 different configurations, revealing critical performance gaps in this field.
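To make the evaluation setup concrete, below is a minimal sketch of how model answers might be scored against QA pairs grouped by primary task (Understanding, Reasoning, Locating). The JSONL schema and field names (`id`, `task`, `answer`) are hypothetical illustrations, not the official LongDocURL format, and exact-match scoring stands in for whatever metric the benchmark actually uses.

```python
# Hypothetical sketch: exact-match accuracy per primary task for a
# LongDocURL-style QA file. Field names are assumptions, not the real schema.
import json
from collections import defaultdict


def normalize(text: str) -> str:
    """Lower-case and collapse whitespace so trivially different answers still match."""
    return " ".join(text.lower().split())


def score_predictions(qa_path: str, predictions: dict) -> dict:
    """Return exact-match accuracy keyed by primary task (e.g. Understanding/Reasoning/Locating)."""
    correct, total = defaultdict(int), defaultdict(int)
    with open(qa_path, encoding="utf-8") as f:
        for line in f:
            item = json.loads(line)  # assumed fields: "id", "task", "answer"
            task = item["task"]
            total[task] += 1
            pred = predictions.get(item["id"], "")
            if normalize(pred) == normalize(item["answer"]):
                correct[task] += 1
    return {task: correct[task] / total[task] for task in total}


if __name__ == "__main__":
    # Toy usage with in-memory predictions; a real run would load model outputs
    # and point at the benchmark's QA file.
    demo_preds = {"q1": "Table 3, page 12", "q2": "42"}
    # print(score_predictions("longdocurl_qa.jsonl", demo_preds))
```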