LongDocURL: a Comprehensive Multimodal Long Document Benchmark Integrating Understanding, Reasoning, and Locating

📅 2024-12-24
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing long-document evaluation benchmarks suffer from limited page-scale coverage, inadequate assessment of layout-aware element localization, and insufficient support for complex numerical reasoning. To address these gaps, we introduce LongDocURL, a comprehensive multimodal benchmark for long-document understanding built on a three-dimensional evaluation framework spanning understanding, reasoning, and locating. The benchmark comprises 20 diverse subtasks and 2,325 high-quality QA pairs constructed from more than 33,000 pages of real-world documents. We propose a semi-automated data curation pipeline that integrates structured document parsing, vision-language alignment verification, and a unified cross-model evaluation protocol. Extensive evaluation across 26 configurations of open- and closed-source large vision-language models reveals critical bottlenecks in cross-page locating and multi-step numerical reasoning. LongDocURL thus provides a multidimensional diagnostic benchmark for long-document multimodal understanding, enabling rigorous model assessment and targeted advancement.
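
As a rough illustration of the semi-automated curation pipeline the summary describes (parsing, QA generation, alignment verification, human review), here is a minimal Python sketch. All stage names, signatures, and data shapes are assumptions made for illustration; they do not reflect the authors' actual implementation.

```python
from typing import Callable

def parse_document(pdf_path: str) -> list[dict]:
    """Stage 1: structured parsing of the PDF into per-page layout elements."""
    raise NotImplementedError  # a real pipeline would call a layout parser here

def generate_qa(pages: list[dict]) -> list[dict]:
    """Stage 2: draft candidate QA pairs from the parsed elements."""
    raise NotImplementedError  # e.g., prompt a language model over element groups

def verify_alignment(qa: dict, pages: list[dict]) -> bool:
    """Stage 3: check that the answer is grounded in the cited page evidence."""
    raise NotImplementedError  # e.g., cross-check with a vision-language model

def curate(pdf_path: str, human_review: Callable[[dict], bool]) -> list[dict]:
    """Run the automated stages, then keep only human-approved QA pairs."""
    pages = parse_document(pdf_path)
    candidates = [qa for qa in generate_qa(pages) if verify_alignment(qa, pages)]
    return [qa for qa in candidates if human_review(qa)]
```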

📝 Abstract
Large vision-language models (LVLMs) have remarkably improved document understanding capabilities, enabling the handling of complex document elements, longer contexts, and a wider range of tasks. However, existing document understanding benchmarks have been limited to a small number of pages and fail to provide a comprehensive analysis of layout element locating. In this paper, we first define three primary task categories: Long Document Understanding, numerical Reasoning, and cross-element Locating, and then propose a comprehensive benchmark, LongDocURL, integrating the three primary tasks above and comprising 20 sub-tasks categorized by primary task and answer evidence. Furthermore, we develop a semi-automated construction pipeline and collect 2,325 high-quality question-answering pairs, covering more than 33,000 pages of documents, significantly outperforming existing benchmarks. Subsequently, we conduct comprehensive evaluation experiments on both open-source and closed-source models across 26 different configurations, revealing critical performance gaps in this field.
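
To make the three-way task split concrete, the sketch below shows a minimal per-task scoring loop over benchmark-style QA pairs. The QASample fields and the exact-match metric are illustrative assumptions, not LongDocURL's actual data schema or evaluation protocol.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Callable

@dataclass
class QASample:
    doc_id: str    # identifier of the source document
    task: str      # "understanding", "reasoning", or "locating"
    question: str
    answer: str    # gold answer string

def exact_match(prediction: str, gold: str) -> bool:
    """Naive normalized exact match; real protocols often use softer scoring."""
    return prediction.strip().lower() == gold.strip().lower()

def evaluate(samples: list[QASample], predict: Callable[[str], str]) -> dict[str, float]:
    """Return accuracy per primary task category."""
    correct: dict[str, int] = defaultdict(int)
    total: dict[str, int] = defaultdict(int)
    for s in samples:
        total[s.task] += 1
        if exact_match(predict(s.question), s.answer):
            correct[s.task] += 1
    return {task: correct[task] / total[task] for task in total}
```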
Problem

Research questions and friction points this paper is trying to address.

Long Document Understanding
Visual Language Model
Document Element Localization
Innovation

Methods, ideas, or system contributions that make the work stand out.

LongDocURL
Visual Language Model Evaluation
Long Document Understanding
Chao Deng
MAIS, Institute of Automation of Chinese Academy of Sciences; Alibaba Group
Jiale Yuan
School of Artificial Intelligence, University of Chinese Academy of Sciences; Alibaba Group
Pi Bu
School of Artificial Intelligence, University of Chinese Academy of Sciences; Alibaba Group
Peijie Wang
Institute of Automation, Chinese Academy of Sciences
Multimodal LLMs; math reasoning
Zhong-Zhi Li
MAIS, Institute of Automation of Chinese Academy of Sciences
Jian Xu
MAIS, Institute of Automation of Chinese Academy of Sciences
Xiao-Hui Li
Huawei; The Hong Kong University of Science and Technology
Multimodal Large Language Models; explainable artificial intelligence; Physics
Yuan Gao
School of Artificial Intelligence, University of Chinese Academy of Sciences; Alibaba Group
Jun Song
Shenzhen University
nanophotonics
Bo Zheng
School of Artificial Intelligence, University of Chinese Academy of Sciences; Alibaba Group
Cheng-Lin Liu
MAIS, Institute of Automation of Chinese Academy of Sciences; Alibaba Group