🤖 AI Summary
Current vision-language models (VLMs) frequently exhibit the “correct answer, incorrect localization” phenomenon in document question answering, undermining spatial interpretability and hindering real-world deployment. To address this, we propose DocExplainerV0—a plug-and-play, decoupled bounding-box prediction module that enables precise answer localization without fine-tuning proprietary VLMs. Our method explicitly separates text generation from spatial grounding, enabling independent optimization of linguistic accuracy and geometric precision. We further introduce the first standardized evaluation framework jointly assessing both textual correctness and localization fidelity. Experiments demonstrate that DocExplainerV0 substantially improves localization consistency (+32.7% mAP) and exposes systematic spatial reasoning deficiencies across mainstream VLMs. The benchmark dataset and evaluation protocol are publicly released, establishing a new paradigm and reliable metric for interpretability research in document-level VLMs.
📝 Abstract
Vision-Language Models (VLMs) have shown strong capabilities in document understanding, particularly in identifying and extracting textual information from complex documents. Despite this, accurately localizing answers within documents remains a major challenge, limiting both interpretability and real-world applicability. To address this, we introduce DocExplainerV0, a plug-and-play bounding-box prediction module that decouples answer generation from spatial localization. This design makes it applicable to existing VLMs, including proprietary systems where fine-tuning is not feasible. Through systematic evaluation, we provide quantitative insights into the gap between textual accuracy and spatial grounding, showing that correct answers often lack reliable localization. Our standardized framework highlights these shortcomings and establishes a benchmark for future research toward more interpretable and robust VLMs for document information extraction.
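The evaluation idea described above, scoring textual correctness and spatial grounding jointly, can be sketched as follows. This is a minimal illustration, not the paper's actual protocol: it assumes an ANLS-style normalized-edit-distance score for answer text and an IoU threshold for the predicted box, and all function names are hypothetical.

```python
def anls(pred: str, gold: str, tau: float = 0.5) -> float:
    """ANLS-style text score: 1 - normalized edit distance, zeroed below tau.
    (Illustrative stand-in for whatever text metric the benchmark uses.)"""
    pred, gold = pred.strip().lower(), gold.strip().lower()
    if not pred and not gold:
        return 1.0
    m, n = len(pred), len(gold)
    d = list(range(n + 1))           # classic edit-distance DP, one row at a time
    for i in range(1, m + 1):
        prev, d[0] = d[0], i
        for j in range(1, n + 1):
            prev, d[j] = d[j], min(d[j] + 1, d[j - 1] + 1,
                                   prev + (pred[i - 1] != gold[j - 1]))
    sim = 1.0 - d[n] / max(m, n)
    return sim if sim >= tau else 0.0

def iou(a, b) -> float:
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union else 0.0

def grounded_correct(pred, gold, pred_box, gold_box, iou_thr=0.5) -> bool:
    """Joint criterion: the answer text must match AND its box must localize.
    A VLM that answers correctly but points at the wrong region fails here,
    which is exactly the 'correct answer, incorrect localization' gap."""
    return anls(pred, gold) > 0.0 and iou(pred_box, gold_box) >= iou_thr
```

Separating the two scores like this also mirrors the decoupled design: the text metric evaluates the frozen VLM's generation, while the IoU term evaluates the bounding-box module independently.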