RealDocBench: A Benchmark for Field-Level QA and Layout Understanding on Real-World Regulated Documents

📅 2026-06-05

🤖 AI Summary

Existing document parsing benchmarks rely mostly on synthetic or clean documents and report a single OCR or Markdown similarity score, which does not capture whether a parser extracts the specific fields that real-world applications need. This work proposes RealDocBench, a two-track benchmark built from real regulated documents: a QA track that assesses content accuracy through field-level question answering, and a layout track that measures structural understanding against COCO-style bounding box annotations. The QA track pairs each question with a typed answer dictionary (gold_dict), and the layout track is scored with a Hungarian matcher that includes adjacency-aware split/merge recovery; per-page cost and latency are reported alongside accuracy. An evaluation of 18 systems, spanning commercial parsing APIs, general-purpose VLMs, and open-source OCR models, reveals a wide spread in field-level accuracy, sharp cost/latency trade-offs, and a persistently hard medical sub-domain. The authors release the datasets, parser adapters, and evaluation harness to enable fine-grained, reproducible comparison of parsing systems.
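
To make the layout-track idea concrete, the sketch below matches predicted boxes to gold boxes one-to-one by IoU. It is an illustration under stated assumptions, not the benchmark's scorer: the function names, the 0.5 IoU threshold, and the rule that classes must agree are all hypothetical, the class labels in the usage note are placeholders rather than the paper's nine-class taxonomy, and the adjacency-aware split/merge recovery described above is not reproduced. For self-containment, a brute-force search over assignments stands in for the Hungarian algorithm; it finds the same optimal one-to-one assignment but is only practical for a handful of boxes.

```python
from itertools import permutations


def iou(a, b):
    """Intersection-over-union of two COCO-style boxes [x, y, width, height]."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    inter_w = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    inter_h = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    inter = inter_w * inter_h
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0


def match_boxes(gold, pred, iou_threshold=0.5):
    """One-to-one matching that maximizes total IoU.

    gold, pred: lists of (class_name, [x, y, w, h]).
    Returns a list of (gold_index, pred_index, iou) for pairs that share a
    class and reach iou_threshold (both rules are assumptions of this sketch).
    """
    n, m = len(gold), len(pred)

    def pair_score(gi, pi):
        # Boxes of different classes are never worth matching.
        if gold[gi][0] != pred[pi][0]:
            return 0.0
        return iou(gold[gi][1], pred[pi][1])

    best_total, best_pairs = -1.0, []
    # Enumerate every injective assignment of the smaller side into the larger.
    if n <= m:
        candidates = (list(zip(range(n), perm)) for perm in permutations(range(m), n))
    else:
        candidates = (list(zip(perm, range(m))) for perm in permutations(range(n), m))
    for pairs in candidates:
        total = sum(pair_score(gi, pi) for gi, pi in pairs)
        if total > best_total:
            best_total, best_pairs = total, pairs

    matches = [(gi, pi, pair_score(gi, pi)) for gi, pi in best_pairs]
    return sorted(m_ for m_ in matches if m_[2] >= iou_threshold)
```

For example, with gold boxes `[("table", [0, 0, 100, 50]), ("text", [0, 60, 100, 20])]` and predictions `[("text", [0, 61, 100, 20]), ("table", [0, 0, 100, 40]), ("figure", [200, 200, 10, 10])]`, the two gold boxes pair with the first two predictions and the stray figure box is left unmatched, which would count against precision.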

📝 Abstract

Document parsing systems are increasingly deployed in high-stakes, regulated workflows such as mortgage underwriting, financial reporting, supply-chain logistics, and clinical records. Yet most public benchmarks evaluate parsers on clean academic layouts or synthetic prose, and report a single OCR or markdown-level similarity score. Such documents and metrics correlate poorly with what downstream agents actually need: the correct value for a specific field on a messy real-world page. We introduce RealDocBench, a two-track benchmark built from real regulated documents. The QA track contains 1,356 field-level questions over 581 documents spanning four domains, where each question is paired with a typed gold_dict of key-to-value answers and parsers are scored on both per-field and strict per-question accuracy. The layout track contains 1,500 human-verified page images annotated with COCO-style bounding boxes under a nine-class public taxonomy, scored with a Hungarian matcher that includes adjacency-aware split/merge recovery. We evaluate eighteen systems, spanning commercial parsing APIs, general-purpose VLMs, and open-source OCR models, under a uniform extraction-and-scoring protocol, and report accuracy alongside per-page cost and cache-busted latency. RealDocBench exposes a wide performance spread that single-number benchmarks hide, a persistently hard medical sub-domain, and sharp cost/latency trade-offs across operating points. We release the datasets, parser adapters, and evaluation harness to support reproducible, field-level comparison of document parsing systems.
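
The difference between per-field and strict per-question accuracy on the QA track can be illustrated with a small sketch. Only the name gold_dict comes from the abstract; the function names, the string normalization, and the example field keys are hypothetical, and the benchmark's actual type-aware comparison rules are not specified here.

```python
def normalize(value):
    """Hypothetical normalizer: case-fold strings and collapse whitespace."""
    if isinstance(value, str):
        return " ".join(value.split()).casefold()
    return value


def score_question(gold_dict, pred_dict):
    """Return (fields_correct, fields_total, strict_correct) for one question."""
    correct = sum(
        1
        for key, gold_value in gold_dict.items()
        if key in pred_dict and normalize(pred_dict[key]) == normalize(gold_value)
    )
    total = len(gold_dict)
    # Strict credit requires every gold field to be answered correctly.
    return correct, total, correct == total


def score_track(examples):
    """Aggregate both metrics over an iterable of (gold_dict, pred_dict) pairs."""
    field_hits = field_total = strict_hits = n_questions = 0
    for gold_dict, pred_dict in examples:
        correct, total, strict = score_question(gold_dict, pred_dict)
        field_hits += correct
        field_total += total
        strict_hits += int(strict)
        n_questions += 1
    return {
        "per_field_accuracy": field_hits / field_total if field_total else 0.0,
        "strict_question_accuracy": strict_hits / n_questions if n_questions else 0.0,
    }
```

With one question answered fully and another where one of two fields is wrong, three of four fields are correct (per-field accuracy 0.75) but only one of two questions earns strict credit (0.5). This gap is why reporting both numbers is more informative than either alone.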

Problem

Research questions and friction points this paper is trying to address.

document parsing

field-level QA

layout understanding

regulated documents

real-world documents

Innovation

Methods, ideas, or system contributions that make the work stand out.

field-level QA

layout understanding

document parsing benchmark