Dr. DocBench: A Comprehensive Benchmark for Expert-Level and Difficult Document Parsing

📅 2026-05-31
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing document parsing and OCR benchmarks struggle to evaluate models' true capabilities on expert-level complex documents, such as those containing chemical formulas, musical scores, and cross-page tables. To address this gap, this work introduces Dr. DocBench, the first domain-expert-oriented, difficulty-aware document parsing benchmark. Constructed from multilingual book corpora, it employs a parser-failure-driven sampling strategy to curate 4,514 challenging pages, annotated with 65k fine-grained labels covering layout, reading order, hierarchical structure, and domain-specific content across 52 disciplines. Experiments reveal substantial performance degradation among state-of-the-art document parsing systems and general-purpose vision-language models on this benchmark, highlighting their limitations in professional content understanding, modeling of intricate structures, and cross-page contextual reasoning. These results indicate that Dr. DocBench is both challenging and useful as a diagnostic benchmark.
📝 Abstract
Document parsing and recognition are fundamental capabilities for vision-language models (VLMs) and document processing systems. However, existing Optical Character Recognition (OCR) and document parsing benchmarks are increasingly limited in coverage and difficulty: many focus on common document genres or uniformly sampled pages where modern parsers already perform strongly, while offering limited annotation for expert-domain structures such as chemical formulas, music notation, complex tables, and cross-page layouts. We introduce Dr. DocBench, a difficulty-aware benchmark for expert-level document parsing. Built from a large-scale multilingual book corpus, Dr. DocBench spans 52 BISAC subject domains and selects challenging documents through parser-failure-based sampling, targeting cases where multiple state-of-the-art systems struggle. It contains 4,514 annotated pages from long documents averaging around 100 pages, with 65k high-quality page- and block-level annotations for layout, reading order, hierarchical relations, and domain-specific visual content. Evaluations of pipeline-based parsers and general-purpose VLMs show that strong performance on existing benchmarks does not transfer to expert-level document parsing. Our analysis reveals substantial failures across subjects, content types, and structural attributes, highlighting Dr. DocBench as a comprehensive testbed for diagnosing and advancing document intelligence.
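The abstract does not spell out how parser-failure-based sampling is implemented, so the following is only a minimal sketch of the general idea: keep the pages on which several parsers score poorly. The `PageScores` type, the `select_hard_pages` function, the parser names, the score scale, and both thresholds are hypothetical choices for illustration, not the authors' procedure.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PageScores:
    """Scores in [0, 1] that each parser achieved on one page (hypothetical metric)."""
    page_id: str
    scores: dict[str, float]  # parser name -> quality score


def select_hard_pages(pages, fail_below=0.6, min_failing=2):
    """Return ids of pages where at least `min_failing` parsers score below `fail_below`.

    Both thresholds are illustrative assumptions. Results are ordered
    hardest first, i.e. by ascending mean score across parsers.
    """
    hard = []
    for page in pages:
        if not page.scores:
            continue  # no parser output to judge this page by
        failing = sum(1 for s in page.scores.values() if s < fail_below)
        if failing >= min_failing:
            mean = sum(page.scores.values()) / len(page.scores)
            hard.append((mean, page.page_id))
    return [page_id for _, page_id in sorted(hard)]


# Toy data: three pages scored by three made-up parsers.
pages = [
    PageScores("p1", {"parser_a": 0.95, "parser_b": 0.91, "parser_c": 0.88}),
    PageScores("p2", {"parser_a": 0.40, "parser_b": 0.55, "parser_c": 0.82}),
    PageScores("p3", {"parser_a": 0.20, "parser_b": 0.10, "parser_c": 0.30}),
]
print(select_hard_pages(pages))  # ['p3', 'p2']
```

Requiring agreement among several parsers, rather than failure of a single one, is one way to target pages that are hard in general instead of pages that expose one system's quirks.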
Problem

Research questions and friction points this paper is trying to address.

document parsing
expert-level documents
benchmark
complex layouts
domain-specific content
Innovation

Methods, ideas, or system contributions that make the work stand out.

document parsing
expert-level benchmark
parser-failure-based sampling
domain-specific visual content
vision-language models
🔎 Similar Papers
Minglai Yang
CS Undergraduate student, University of Arizona
Natural Language Processing, Large Language Models, Machine Learning

Xinyan Velocity Yu
University of Southern California

Pengyuan Li
IBM Research

Xinyu Guo
Samsung Research America
AI, computer vision, machine learning, medical image analysis

Zhenting Qi
Harvard University
Natural Language Processing, Deep Learning, Machine Learning

Konwoo Kim
Stanford University

Longtian Ye
2077AI

Xiaolong Luo
Harvard University

Jinhe Bi
LMU Munich
Efficient AI, M/LLM

Henry Zhang
UC Berkeley

Haris Riaz
PhD Student, University of Arizona
large language models, applied machine learning, artificial intelligence

Xuan Zhang
2077AI

Yunze Xiao
Language Technology Institute, Carnegie Mellon University
Natural Language Processing, Computational Social Science, Anthropomorphism

Bangya Liu
2077AI

Tom Tang
2077AI

Yunfei Zhao
Peking University
intelligent program, code generation, code representation

Qunshu Lin
Co-Founder of Abaka.AI
Data-Centric AI

Zihan Wang
2077AI

Minghao Liu
2077AI

Michael Lingzhi Li
Assistant Professor, Harvard Business School
Integer Optimization, Causal Inference, Precision Medicine, Machine Learning, AI for Healthcare

Yilun Du
Harvard University
Artificial Intelligence, Machine Learning, Robotics, Computer Vision

Jesse Thomason
Assistant Professor, University of Southern California
Natural Language Processing, Artificial Intelligence, Robotics

Rogerio Feris
Research Manager, MIT-IBM Watson AI Lab
Computer Vision, Machine Learning, Artificial Intelligence

Alex Pentland
MIT

Zexue He
University of California, San Diego
Trustworthy NLP, LLM