AI Summary
To address limitations in information retrieval and spatial localization within document intelligence, this paper introduces the first unified Visual Document Question Answering (VDQA) dataset. The dataset integrates multi-source, real-world document images and systematically reformulates traditional information extraction tasks into a question-answering format, with each answer annotated with its precise bounding-box coordinates in the original image. Methodologically, it combines OCR-based text extraction, spatial coordinate annotation, bounding-box-guided prompt engineering, and multimodal fine-tuning and inference with open-weight large language models (e.g., Llama, Qwen). Key contributions include: (1) the first unified modeling framework for multi-source Document AI data with spatially aware annotations; (2) establishing the transferability of the QA paradigm to spatial localization tasks; and (3) empirical validation that bounding-box-guided prompting significantly improves localization accuracy and generalization across multiple document QA benchmarks.
Abstract
We present a unified dataset for document Question Answering (QA), obtained by combining several public datasets related to Document AI and visually rich document understanding (VRDU). Our contribution is twofold: on the one hand, we reformulate existing Document AI tasks, such as Information Extraction (IE), as Question-Answering tasks, making the dataset a suitable resource for training and evaluating Large Language Models; on the other hand, we release the OCR of all the documents and include the exact position of each answer in the document image as a bounding box. Using this dataset, we explore the impact of different prompting techniques (which may include bounding-box information) on the performance of open-weight models, identifying the most effective approaches for document comprehension.
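To make the two ideas in the abstract concrete, here is a minimal sketch of (a) turning an IE field/value pair into a QA example carrying the answer's bounding box, and (b) building a prompt that optionally includes that bounding box as guidance. All names, the record schema, and the prompt wording are illustrative assumptions, not the paper's actual release format or prompting templates.

```python
# Hypothetical schema: names, field layout, and prompt wording are
# illustrative only, not the dataset's actual release format.

def ie_to_qa(field_name, field_value, bbox):
    """Reformulate an IE (field, value) pair as a QA example.

    bbox is [x_min, y_min, x_max, y_max] in original-image pixel
    coordinates (an assumed convention for this sketch).
    """
    return {
        "question": f"What is the {field_name} in this document?",
        "answer": field_value,
        "bbox": bbox,
    }

def build_prompt(ocr_text, question, bbox=None):
    """Compose a text prompt; optionally add bounding-box guidance."""
    prompt = f"Document OCR:\n{ocr_text}\n\nQuestion: {question}\n"
    if bbox is not None:
        # Bounding-box-guided variant: point the model at the answer region.
        prompt += f"Hint: the answer is located inside the region {bbox}.\n"
    prompt += "Answer:"
    return prompt

example = ie_to_qa("invoice total", "$1,250.00", [412, 880, 530, 910])
prompt = build_prompt("INVOICE ... TOTAL $1,250.00",
                      example["question"], example["bbox"])
```

Comparing model outputs with and without the `bbox` hint is one simple way to measure the effect of bounding-box-guided prompting that the abstract describes.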