BoundingDocs: a Unified Dataset for Document Question Answering with Spatial Annotations

📅 2025-01-06
🤖 AI Summary
To address limitations in information retrieval and spatial localization within document intelligence, this paper introduces BoundingDocs, a unified dataset for document question answering. The dataset combines several public, multi-source Document AI datasets and reformulates traditional information extraction tasks into a question-answering format, with each answer annotated by its bounding box coordinates in the original document image. Methodologically, it combines OCR-based text extraction, spatial coordinate annotation, bounding-box-guided prompting, and evaluation of open-weight large language models (e.g., Llama, Qwen). Key contributions include: (1) a unified question-answering resource built from multi-source Document AI data with spatial annotations; (2) a reformulation of existing tasks such as information extraction as QA, making the data suitable for training and evaluating large language models; and (3) an empirical comparison of prompting techniques, with and without bounding box information, to identify the most effective approaches for document comprehension.
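
As a rough illustration of the reformulation described above, the sketch below turns one information-extraction annotation into a question-answer record that keeps the answer's bounding box. The `Field` class, the `field_to_qa` function, the question template, and the 0-1000 coordinate normalization are all assumptions made for this example; the dataset's actual schema and question wording may differ.

```python
from dataclasses import dataclass


@dataclass
class Field:
    """Hypothetical IE annotation: a labeled value and its location on the page."""
    label: str
    value: str
    bbox: tuple  # (x0, y0, x1, y1) in pixels; this corner convention is an assumption


def field_to_qa(field, page_width, page_height):
    """Rewrite a key-value annotation as a QA record with a normalized bounding box."""
    x0, y0, x1, y1 = field.bbox
    return {
        # Template-based question; the wording is illustrative only.
        "question": f"What is the {field.label.replace('_', ' ')}?",
        "answer": field.value,
        # Scale to a 0-1000 grid so boxes are comparable across page sizes (assumed convention).
        "bbox": [
            round(1000 * x0 / page_width),
            round(1000 * y0 / page_height),
            round(1000 * x1 / page_width),
            round(1000 * y1 / page_height),
        ],
    }


record = field_to_qa(Field("invoice_date", "2024-03-01", (100, 50, 300, 80)), 1000, 500)
```

Normalizing the coordinates is one possible design choice: it lets a model or metric treat boxes from differently sized scans uniformly, at the cost of losing the original pixel values unless they are stored separately.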

📝 Abstract
We present a unified dataset for document Question-Answering (QA), obtained by combining several public datasets related to Document AI and visually rich document understanding (VRDU). Our main contribution is twofold: on the one hand, we reformulate existing Document AI tasks, such as Information Extraction (IE), into a Question-Answering task, making the dataset a suitable resource for training and evaluating Large Language Models; on the other hand, we release the OCR of all the documents and include the exact position of the answer to be found in the document image as a bounding box. Using this dataset, we explore the impact of different prompting techniques (that might include bounding box information) on the performance of open-weight models, identifying the most effective approaches for document comprehension.
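
To make the idea of prompts that "might include bounding box information" concrete, here is a minimal sketch that serializes OCR words into a text prompt, with or without their coordinates. The `build_prompt` function and the prompt layout are hypothetical and are not taken from the paper; they only show one plausible way the two prompt variants could differ.

```python
def build_prompt(words, question, include_boxes=True):
    """Build a plain-text QA prompt from OCR output.

    words: list of (text, (x0, y0, x1, y1)) pairs, assumed to be in reading order.
    include_boxes: when True, append each word's box so the model sees layout cues.
    """
    lines = []
    for text, (x0, y0, x1, y1) in words:
        if include_boxes:
            lines.append(f"{text} [{x0}, {y0}, {x1}, {y1}]")
        else:
            lines.append(text)
    context = "\n".join(lines)
    return f"Document:\n{context}\n\nQuestion: {question}\nAnswer:"


ocr_words = [("Total", (10, 20, 60, 35)), ("42.00", (70, 20, 120, 35))]
with_boxes = build_prompt(ocr_words, "What is the total?")
text_only = build_prompt(ocr_words, "What is the total?", include_boxes=False)
```

Comparing a model's answers on `with_boxes` and `text_only` prompts for the same question is the kind of ablation the abstract alludes to, though the exact prompt formats evaluated by the authors are not specified here.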
Problem

Research questions and friction points this paper is trying to address.

Document Question Answering
Visual Document Understanding
Information Retrieval
Innovation

Methods, ideas, or system contributions that make the work stand out.

BoundingDocs
Document Intelligence
Visual Document Understanding