Describe Anything Model for Visual Question Answering on Text-rich Images

πŸ“… 2025-07-16
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ“„ PDF
πŸ€– AI Summary
Existing visual question answering (VQA) methods struggle with fine-grained extraction of, and reasoning over, localized text in text-dense images. Method: We propose DAM-QA, the first framework to integrate the Describe Anything Model (DAM) into text-rich VQA—generating descriptions from multiple regional views of an image without additional localization supervision and aggregating the per-view answers through a fusion mechanism. Contributions/Results: (1) a framework and tailored evaluation protocol for harnessing region-aware descriptive models in text-rich VQA; (2) an evidence-driven multi-region answer aggregation mechanism. DAM-QA consistently outperforms the baseline DAM across six VQA benchmarks, with a gain of over 7 points on DocVQA. With fewer parameters, it achieves the best overall performance among region-aware models, significantly narrowing the gap with strong generalist VLMs.

πŸ“ Abstract
Recent progress has been made in region-aware vision-language modeling, particularly with the emergence of the Describe Anything Model (DAM). DAM is capable of generating detailed descriptions of any specific image areas or objects without the need for additional localized image-text alignment supervision. We hypothesize that such region-level descriptive capability is beneficial for the task of Visual Question Answering (VQA), especially in challenging scenarios involving images with dense text. In such settings, the fine-grained extraction of textual information is crucial to producing correct answers. Motivated by this, we introduce DAM-QA, a framework with a tailored evaluation protocol, developed to investigate and harness the region-aware capabilities from DAM for the text-rich VQA problem that requires reasoning over text-based information within images. DAM-QA incorporates a mechanism that aggregates answers from multiple regional views of image content, enabling more effective identification of evidence that may be tied to text-related elements. Experiments on six VQA benchmarks show that our approach consistently outperforms the baseline DAM, with a notable 7+ point gain on DocVQA. DAM-QA also achieves the best overall performance among region-aware models with fewer parameters, significantly narrowing the gap with strong generalist VLMs. These results highlight the potential of DAM-like models for text-rich and broader VQA tasks when paired with efficient usage and integration strategies. Our code is publicly available at https://github.com/Linvyl/DAM-QA.git.
Problem

Research questions and friction points this paper is trying to address.

Enhancing Visual Question Answering on text-rich images
Leveraging region-aware descriptions for fine-grained text extraction
Improving answer accuracy in dense text image scenarios
Innovation

Methods, ideas, or system contributions that make the work stand out.

Integration of the region-aware Describe Anything Model (DAM) into text-rich VQA
Multi-region answer aggregation mechanism
Tailored evaluation protocol for text-rich VQA
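The multi-region answer aggregation idea above can be sketched as follows. This is a minimal illustration, not the paper's implementation: `answer_fn` is a hypothetical stand-in for the DAM-based question-answering call, and the window size, stride, full-image weighting, and the "unanswerable" sentinel handling are all assumptions of this sketch.

```python
# Hedged sketch of multi-region answer aggregation for text-rich VQA:
# query a region-aware model on the full image and on overlapping
# sliding-window crops, then vote over the per-view answers.
from collections import Counter
from typing import Callable, List, Tuple

Box = Tuple[int, int, int, int]  # (x0, y0, x1, y1) in pixels

def sliding_windows(width: int, height: int, win: int, stride: int) -> List[Box]:
    """Overlapping square windows covering the image."""
    boxes = []
    for y in range(0, max(height - win, 0) + 1, stride):
        for x in range(0, max(width - win, 0) + 1, stride):
            boxes.append((x, y, min(x + win, width), min(y + win, height)))
    return boxes

def aggregate_answers(
    answer_fn: Callable[[Box], str],  # hypothetical DAM-QA call per region
    width: int,
    height: int,
    win: int = 336,
    stride: int = 168,
    full_image_weight: float = 2.0,  # assumed weighting, not from the paper
) -> str:
    """Weighted majority vote over per-view answers; an 'unanswerable'
    response is discarded unless no view produces anything else."""
    votes: Counter = Counter()
    votes[answer_fn((0, 0, width, height))] += full_image_weight
    for box in sliding_windows(width, height, win, stride):
        votes[answer_fn(box)] += 1.0
    votes.pop("unanswerable", None)
    if not votes:
        return "unanswerable"
    return votes.most_common(1)[0][0]
```

The voting step lets a single window that actually contains the relevant text override the many windows that see none of it, which is the intuition behind aggregating evidence from regional views.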
πŸ”Ž Similar Papers
No similar papers found.