🤖 AI Summary
Existing long-document evaluation benchmarks suffer from limited page-scale coverage, inadequate assessment of layout-aware element locating, and insufficient support for complex numerical reasoning. To address these gaps, we introduce LongDocURL, a comprehensive multimodal benchmark for long-document understanding built around a three-dimensional "understanding–reasoning–locating" evaluation framework. It comprises 20 diverse subtasks and 2,325 high-quality QA pairs constructed from more than 33,000 pages of real-world documents. A semi-automated data curation pipeline integrates structured document parsing, vision-language alignment verification, and a unified cross-model evaluation protocol. Extensive evaluation of open- and closed-source large vision-language models across 26 configurations reveals critical bottlenecks in cross-page locating and multi-step numerical reasoning, establishing LongDocURL as a rigorous, multidimensional diagnostic benchmark for long-document multimodal understanding.
📝 Abstract
Large vision-language models (LVLMs) have remarkably improved document understanding capabilities, enabling the handling of complex document elements, longer contexts, and a wider range of tasks. However, existing document understanding benchmarks are limited to a small number of pages and fail to provide a comprehensive analysis of layout element locating. In this paper, we first define three primary task categories: Long Document Understanding, Numerical Reasoning, and Cross-element Locating, and then propose LongDocURL, a comprehensive benchmark integrating the three primary tasks above and comprising 20 sub-tasks categorized by primary task and type of answer evidence. Furthermore, we develop a semi-automated construction pipeline and collect 2,325 high-quality question-answering pairs covering more than 33,000 pages of documents, significantly surpassing existing benchmarks in scale. Finally, we conduct comprehensive evaluation experiments on both open-source and closed-source models across 26 different configurations, revealing critical performance gaps in this field.
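To make the evaluation setup concrete, below is a minimal sketch of how model answers might be scored against QA pairs grouped by primary task (Understanding, Reasoning, Locating). The JSONL schema and field names (`id`, `task`, `answer`) are hypothetical illustrations, not the official LongDocURL format, and exact-match scoring stands in for whatever metric the benchmark actually uses.

```python
# Hypothetical sketch: exact-match accuracy per primary task for a
# LongDocURL-style QA file. Field names are assumptions, not the real schema.
import json
from collections import defaultdict


def normalize(text: str) -> str:
    """Lower-case and collapse whitespace so trivially different answers still match."""
    return " ".join(text.lower().split())


def score_predictions(qa_path: str, predictions: dict) -> dict:
    """Return exact-match accuracy keyed by primary task (e.g. Understanding/Reasoning/Locating)."""
    correct, total = defaultdict(int), defaultdict(int)
    with open(qa_path, encoding="utf-8") as f:
        for line in f:
            item = json.loads(line)  # assumed fields: "id", "task", "answer"
            task = item["task"]
            total[task] += 1
            pred = predictions.get(item["id"], "")
            if normalize(pred) == normalize(item["answer"]):
                correct[task] += 1
    return {task: correct[task] / total[task] for task in total}


if __name__ == "__main__":
    # Toy usage with in-memory predictions; a real run would load model outputs
    # and point at the benchmark's QA file.
    demo_preds = {"q1": "Table 3, page 12", "q2": "42"}
    # print(score_predictions("longdocurl_qa.jsonl", demo_preds))
```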