DocPTBench: Benchmarking End-to-End Photographed Document Parsing and Translation

📅 2025-11-23
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing multimodal large language models (MLLMs) show promise for end-to-end document parsing and translation, yet mainstream benchmarks (e.g., OmniDocBench, DITrans) focus solely on scanned or born-digital documents, neglecting critical real-world challenges in camera-captured documents, such as geometric distortion and illumination variation. Method: the authors introduce the first comprehensive benchmark for real-world photographed documents, comprising over 1,300 high-resolution, cross-domain, manually captured samples, covering eight translation scenarios and providing human-verified ground-truth annotations for both parsing and translation. Contribution/Results: a systematic evaluation reveals substantial performance degradation under imaging impairments: mainstream MLLMs suffer an average 18% drop in parsing accuracy and a 12% decline in translation quality, while specialized document parsing models degrade by an average of 25%. This work fills a critical gap in robustness evaluation for photographed documents and establishes a new standard and diagnostic tool for multimodal document understanding.

📝 Abstract
The advent of Multimodal Large Language Models (MLLMs) has unlocked the potential for end-to-end document parsing and translation. However, prevailing benchmarks such as OmniDocBench and DITrans are dominated by pristine scanned or digital-born documents, and thus fail to adequately represent the intricate challenges of real-world capture conditions, such as geometric distortions and photometric variations. To fill this gap, we introduce DocPTBench, a comprehensive benchmark specifically designed for Photographed Document Parsing and Translation. DocPTBench comprises over 1,300 high-resolution photographed documents from multiple domains, includes eight translation scenarios, and provides meticulously human-verified annotations for both parsing and translation. Our experiments demonstrate that transitioning from digital-born to photographed documents results in a substantial performance decline: popular MLLMs exhibit an average accuracy drop of 18% in end-to-end parsing and 12% in translation, while specialized document parsing models show a significant average decrease of 25%. This substantial performance gap underscores the unique challenges posed by documents captured in real-world conditions and reveals the limited robustness of existing models. Dataset and code are available at https://github.com/Topdu/DocPTBench.
Problem

Research questions and friction points this paper is trying to address.

Benchmarking photographed document parsing under real-world capture conditions
Evaluating translation performance degradation in photographed documents
Addressing geometric and photometric distortions in document analysis
Innovation

Methods, ideas, or system contributions that make the work stand out.

Introduces photographed document benchmark with real-world challenges
Includes over 1,300 high-resolution photographed documents from multiple domains
Provides human-verified parsing and translation annotations
Yongkun Du
Fudan University
Computer Vision, OCR
Pinxuan Chen
College of Computer Science and Artificial Intelligence, Fudan University, China
Xuye Ying
College of Computer Science and Artificial Intelligence, Fudan University, China
Zhineng Chen
Institute of Trustworthy Embodied AI, Fudan University
Computer Vision, OCR, Multimedia Analysis