AI Summary
To address limitations in information retrieval and spatial localization within document intelligence, this paper introduces the first unified Visual Document Question Answering (VDQA) dataset. The dataset integrates multi-source, real-world document images and systematically reformulates traditional information extraction tasks into a question-answering format, with each answer annotated with its precise bounding-box coordinates in the original image. Methodologically, it combines OCR-based text extraction, spatial coordinate annotation, bounding-box-guided prompt engineering, and multimodal fine-tuning and inference with open-weight large language models (e.g., Llama, Qwen). Key contributions include: (1) the first unified modeling framework for multi-source Document AI data with spatially aware annotations; (2) establishing the transferability of the QA paradigm to spatial localization tasks; and (3) empirical validation that bounding-box-guided prompting significantly improves localization accuracy and generalization across multiple document QA benchmarks.
Abstract
We present a unified dataset for document Question Answering (QA), obtained by combining several public datasets related to Document AI and visually rich document understanding (VRDU). Our contribution is twofold: on the one hand, we reformulate existing Document AI tasks, such as Information Extraction (IE), as Question-Answering tasks, making the dataset a suitable resource for training and evaluating Large Language Models; on the other hand, we release the OCR of all the documents and include the exact position of each answer in the document image as a bounding box. Using this dataset, we explore the impact of different prompting techniques (which may include bounding-box information) on the performance of open-weight models, identifying the most effective approaches for document comprehension.
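To make the two ideas in the abstract concrete, here is a minimal sketch of (a) turning an IE field/value pair into a QA example carrying the answer's bounding box, and (b) building a prompt that optionally includes that bounding box as guidance. All names, the record schema, and the prompt wording are illustrative assumptions, not the paper's actual release format or prompting templates.

```python
# Hypothetical schema: names, field layout, and prompt wording are
# illustrative only, not the dataset's actual release format.

def ie_to_qa(field_name, field_value, bbox):
    """Reformulate an IE (field, value) pair as a QA example.

    bbox is [x_min, y_min, x_max, y_max] in original-image pixel
    coordinates (an assumed convention for this sketch).
    """
    return {
        "question": f"What is the {field_name} in this document?",
        "answer": field_value,
        "bbox": bbox,
    }

def build_prompt(ocr_text, question, bbox=None):
    """Compose a text prompt; optionally add bounding-box guidance."""
    prompt = f"Document OCR:\n{ocr_text}\n\nQuestion: {question}\n"
    if bbox is not None:
        # Bounding-box-guided variant: point the model at the answer region.
        prompt += f"Hint: the answer is located inside the region {bbox}.\n"
    prompt += "Answer:"
    return prompt

example = ie_to_qa("invoice total", "$1,250.00", [412, 880, 530, 910])
prompt = build_prompt("INVOICE ... TOTAL $1,250.00",
                      example["question"], example["bbox"])
```

Comparing model outputs with and without the `bbox` hint is one simple way to measure the effect of bounding-box-guided prompting that the abstract describes.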