🤖 AI Summary
Existing document parsing and OCR benchmarks struggle to measure models’ true capabilities on expert-level complex documents containing content such as chemical formulas, musical scores, and cross-page tables. To address this gap, this work introduces Dr. DocBench, the first domain-expert-oriented, difficulty-aware document parsing benchmark. Constructed from a multilingual book corpus, it uses a parser-failure-driven sampling strategy to curate 4,514 challenging pages, annotated with 65k fine-grained labels covering layout, reading order, hierarchical structure, and domain-specific content across 52 disciplines. Experiments show substantial performance degradation for state-of-the-art document parsing systems and general-purpose vision-language models on this benchmark, exposing their limitations in professional content understanding, modeling of intricate structures, and cross-page contextual reasoning, and confirming that Dr. DocBench is both challenging and useful as a diagnostic testbed.
📝 Abstract
Document parsing and recognition are fundamental capabilities for vision-language models (VLMs) and document processing systems. However, existing Optical Character Recognition (OCR) and document parsing benchmarks are increasingly limited in coverage and difficulty: many focus on common document genres or uniformly sampled pages where modern parsers already perform strongly, while offering limited annotation for expert-domain structures such as chemical formulas, music notation, complex tables, and cross-page layouts. We introduce Dr. DocBench, a difficulty-aware benchmark for expert-level document parsing. Built from a large-scale multilingual book corpus, Dr. DocBench spans 52 BISAC subject domains and selects challenging documents through parser-failure-based sampling, targeting cases where multiple state-of-the-art systems struggle. It contains 4,514 annotated pages from long documents averaging around 100 pages, with 65k high-quality page- and block-level annotations for layout, reading order, hierarchical relations, and domain-specific visual content. Evaluations of pipeline-based parsers and general-purpose VLMs show that strong performance on existing benchmarks does not transfer to expert-level document parsing on Dr. DocBench. Our analysis reveals substantial failures across subjects, content types, and structural attributes, highlighting Dr. DocBench as a comprehensive testbed for diagnosing and advancing document intelligence.
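The abstract does not say how parser failures are scored or combined, so the following is only a minimal sketch of what parser-failure-based sampling could look like. It assumes each page already has a quality score in [0, 1] from several parsers, and it keeps pages where at least a given number of parsers fall below a threshold. The names `PageScores` and `select_hard_pages`, the threshold `0.6`, and the `min_failures` rule are all hypothetical illustrations, not the authors' procedure.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class PageScores:
    """Hypothetical record: one page and its per-parser quality scores."""
    page_id: str
    scores: dict[str, float]  # parser name -> score in [0, 1], higher is better


def select_hard_pages(
    pages: list[PageScores],
    fail_threshold: float = 0.6,  # assumed cutoff, not from the paper
    min_failures: int = 2,        # assumed: "multiple" parsers must struggle
) -> list[str]:
    """Return IDs of pages on which at least `min_failures` parsers score
    strictly below `fail_threshold`, preserving input order."""
    hard = []
    for page in pages:
        failures = sum(1 for s in page.scores.values() if s < fail_threshold)
        if failures >= min_failures:
            hard.append(page.page_id)
    return hard


# Toy example with made-up scores for three hypothetical parsers.
example_pages = [
    PageScores("p1", {"parser_a": 0.90, "parser_b": 0.95, "parser_c": 0.80}),
    PageScores("p2", {"parser_a": 0.30, "parser_b": 0.50, "parser_c": 0.90}),
    PageScores("p3", {"parser_a": 0.50, "parser_b": 0.70, "parser_c": 0.90}),
]
hard_ids = select_hard_pages(example_pages)
```

In this toy input only `p2` is kept: two parsers fall below the cutoff there, whereas `p3` has a single failure and `p1` has none. Requiring agreement among several parsers is one way to avoid selecting pages that are hard for only one system's idiosyncrasies.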