🤖 AI Summary
Current vision-language models (VLMs) frequently exhibit the “correct answer, incorrect localization” phenomenon in document question answering, undermining spatial interpretability and hindering real-world deployment. To address this, we propose DocExplainerV0—a plug-and-play, decoupled bounding-box prediction module that enables precise answer localization without fine-tuning proprietary VLMs. Our method explicitly separates text generation from spatial grounding, enabling independent optimization of linguistic accuracy and geometric precision. We further introduce the first standardized evaluation framework jointly assessing both textual correctness and localization fidelity. Experiments demonstrate that DocExplainerV0 substantially improves localization consistency (+32.7% mAP) and exposes systematic spatial reasoning deficiencies across mainstream VLMs. The benchmark dataset and evaluation protocol are publicly released, establishing a new paradigm and reliable metric for interpretability research in document-level VLMs.
📝 Abstract
Vision-Language Models (VLMs) have shown strong capabilities in document understanding, particularly in identifying and extracting textual information from complex documents. Despite this, accurately localizing answers within documents remains a major challenge, limiting both interpretability and real-world applicability. To address this, we introduce DocExplainerV0, a plug-and-play bounding-box prediction module that decouples answer generation from spatial localization. This design makes it applicable to existing VLMs, including proprietary systems where fine-tuning is not feasible. Through systematic evaluation, we provide quantitative insights into the gap between textual accuracy and spatial grounding, showing that correct answers often lack reliable localization. Our standardized framework highlights these shortcomings and establishes a benchmark for future research toward more interpretable and robust VLMs for document information extraction.
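The evaluation idea described above, scoring textual correctness and spatial grounding jointly, can be sketched as follows. This is a minimal illustration, not the paper's actual protocol: it assumes an ANLS-style normalized-edit-distance score for answer text and an IoU threshold for the predicted box, and all function names are hypothetical.

```python
def anls(pred: str, gold: str, tau: float = 0.5) -> float:
    """ANLS-style text score: 1 - normalized edit distance, zeroed below tau.
    (Illustrative stand-in for whatever text metric the benchmark uses.)"""
    pred, gold = pred.strip().lower(), gold.strip().lower()
    if not pred and not gold:
        return 1.0
    m, n = len(pred), len(gold)
    d = list(range(n + 1))           # classic edit-distance DP, one row at a time
    for i in range(1, m + 1):
        prev, d[0] = d[0], i
        for j in range(1, n + 1):
            prev, d[j] = d[j], min(d[j] + 1, d[j - 1] + 1,
                                   prev + (pred[i - 1] != gold[j - 1]))
    sim = 1.0 - d[n] / max(m, n)
    return sim if sim >= tau else 0.0

def iou(a, b) -> float:
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union else 0.0

def grounded_correct(pred, gold, pred_box, gold_box, iou_thr=0.5) -> bool:
    """Joint criterion: the answer text must match AND its box must localize.
    A VLM that answers correctly but points at the wrong region fails here,
    which is exactly the 'correct answer, incorrect localization' gap."""
    return anls(pred, gold) > 0.0 and iou(pred_box, gold_box) >= iou_thr
```

Separating the two scores like this also mirrors the decoupled design: the text metric evaluates the frozen VLM's generation, while the IoU term evaluates the bounding-box module independently.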