DeepScan: A Training-Free Framework for Visually Grounded Reasoning in Large Vision-Language Models

📅 2026-03-04
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work addresses the challenge that existing large vision-language models (LVLMs) struggle to accurately localize critical visual evidence and to perform fine-grained reasoning under noisy conditions. The authors propose a training-free, bottom-up framework that hierarchically scans multi-scale visual cues and integrates a refocusing strategy, which coordinates LVLMs with vision experts, together with a hybrid evidence memory mechanism. This enables hierarchical exploration and multi-granularity fusion of visual evidence. Notably, the method achieves significant gains in robustness and interpretability on complex scenes without any model fine-tuning: it attains 90.6% accuracy on the V* benchmark with Qwen2.5-VL-7B and yields consistent improvements across diverse LVLM architectures and scales.
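The summary mentions a hybrid evidence memory that fuses multi-granularity evidence, but the page includes no code. The sketch below is a minimal, hypothetical Python illustration of what such a memory could look like; the names (EvidenceItem, HybridEvidenceMemory), the fields, and the prompt format are assumptions for illustration, not the authors' implementation.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class EvidenceItem:
    """One piece of localized visual evidence (hypothetical fields)."""
    bbox: Tuple[int, int, int, int]   # (x1, y1, x2, y2) in image pixels
    scale: int                        # scan level at which the cue was found
    description: str                  # caption of the crop produced by the LVLM
    score: float                      # confidence assigned to this cue


@dataclass
class HybridEvidenceMemory:
    """Keeps a coarse global view alongside fine-grained local cues."""
    global_view: str = ""                        # caption of the full image
    local_items: List[EvidenceItem] = field(default_factory=list)

    def add(self, item: EvidenceItem) -> None:
        self.local_items.append(item)

    def top_k(self, k: int = 3) -> List[EvidenceItem]:
        """Return the k most confident local cues for the final reasoning step."""
        return sorted(self.local_items, key=lambda e: e.score, reverse=True)[:k]

    def as_prompt(self, k: int = 3) -> str:
        """Serialize the fused multi-granularity evidence into a text prompt."""
        lines = [f"Global view: {self.global_view}"]
        for i, e in enumerate(self.top_k(k), 1):
            lines.append(f"Evidence {i} (bbox={e.bbox}, scale={e.scale}): {e.description}")
        return "\n".join(lines)
```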

📝 Abstract
Humans can robustly localize visual evidence and provide grounded answers even in noisy environments by identifying critical cues and then relating them to the full context in a bottom-up manner. Inspired by this, we propose DeepScan, a training-free framework that combines Hierarchical Scanning, Refocusing, and Evidence-Enhanced Reasoning for visually grounded reasoning in Large Vision-Language Models (LVLMs). Unlike existing methods that pursue one-shot localization of complete evidence, Hierarchical Scanning performs local cue exploration and multi-scale evidence extraction to recover evidence in a bottom-up manner, effectively mitigating the impacts of distractive context. Refocusing then optimizes the localized evidence view through collaboration of LVLMs and visual experts. Finally, Evidence-Enhanced Reasoning aggregates multi-granular views via a hybrid evidence memory and yields accurate and interpretable answers. Experimental results demonstrate that DeepScan significantly boosts LVLMs in diverse visual tasks, especially in fine-grained visual understanding. It achieves 90.6% overall accuracy on V* when integrated with Qwen2.5-VL-7B. Moreover, DeepScan provides consistent improvements for LVLMs across various architectures and model scales without additional adaptation cost.
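The abstract names three stages: Hierarchical Scanning, Refocusing, and Evidence-Enhanced Reasoning. As a reading aid only, here is a hedged, training-free sketch of how those stages could compose; the callables propose_cues, refine_view, describe, and answer stand in for LVLM and vision-expert calls and are hypothetical placeholders, not the authors' API.

```python
from typing import Callable, List, Tuple

BBox = Tuple[int, int, int, int]


def deepscan_answer(
    image,
    question: str,
    propose_cues: Callable[[object, str, int], List[BBox]],    # LVLM: local cue proposals at a scale
    refine_view: Callable[[object, BBox], BBox],                # vision expert: tighten / refocus a crop
    describe: Callable[[object, BBox, str], Tuple[str, float]], # LVLM: caption a crop + confidence
    answer: Callable[[object, str, str], str],                  # LVLM: answer from fused evidence prompt
    num_scales: int = 3,
    top_k: int = 3,
) -> str:
    """Training-free loop: scan coarse-to-fine, refocus each cue, reason over fused evidence."""
    memory: List[Tuple[str, float, BBox]] = []

    # 1) Hierarchical Scanning: explore local cues at multiple scales, bottom-up.
    for scale in range(num_scales):
        for bbox in propose_cues(image, question, scale):
            # 2) Refocusing: let a vision expert adjust the localized evidence view.
            focused = refine_view(image, bbox)
            caption, score = describe(image, focused, question)
            memory.append((caption, score, focused))

    # 3) Evidence-Enhanced Reasoning: fuse the most confident multi-granularity views.
    memory.sort(key=lambda item: item[1], reverse=True)
    evidence_prompt = "\n".join(
        f"Evidence {i + 1} (bbox={bbox}): {caption}"
        for i, (caption, _, bbox) in enumerate(memory[:top_k])
    )
    return answer(image, question, evidence_prompt)
```

Passing the model calls in as callables keeps the sketch architecture-agnostic, which mirrors the abstract's claim that the framework improves LVLMs across architectures and scales without additional adaptation.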
Problem

Research questions and friction points this paper is trying to address.

visually grounded reasoning
large vision-language models
evidence localization
distractive context
fine-grained visual understanding
Innovation

Methods, ideas, or system contributions that make the work stand out.

training-free
hierarchical scanning
visually grounded reasoning
evidence-enhanced reasoning
large vision-language models
Authors
Yangfu Li, East China Normal University (Deep learning)
Hongjian Zhan, East China Normal University
Jiawei Chen, East China Normal University
Yuning Gong, Sichuan University
Qi Liu, East China Normal University
Yue Lu, East China Normal University (Pattern Recognition; Document Image Analysis and Recognition)