🤖 AI Summary
This work addresses the challenge that existing large vision-language models (LVLMs) struggle to accurately localize critical visual evidence and to perform fine-grained reasoning under noisy conditions. The authors propose a training-free, bottom-up framework that hierarchically scans multi-scale visual cues and integrates a refocusing strategy, which coordinates LVLMs with vision experts, together with a hybrid evidence memory mechanism. This enables hierarchical exploration and multi-granularity fusion of visual evidence. Notably, the method achieves significant improvements in robustness and interpretability on complex scenes without any model fine-tuning, attaining 90.6% accuracy on the V* benchmark with Qwen2.5-VL-7B and delivering consistent gains across diverse LVLM architectures and scales.
📝 Abstract
Humans can robustly localize visual evidence and provide grounded answers even in noisy environments by identifying critical cues and then relating them to the full context in a bottom-up manner. Inspired by this, we propose DeepScan, a training-free framework that combines Hierarchical Scanning, Refocusing, and Evidence-Enhanced Reasoning for visually grounded reasoning in Large Vision-Language Models (LVLMs). Unlike existing methods that pursue one-shot localization of complete evidence, Hierarchical Scanning performs local cue exploration and multi-scale evidence extraction to recover evidence bottom-up, effectively mitigating the impact of distracting context. Refocusing then optimizes the localized evidence view through collaboration between LVLMs and visual experts. Finally, Evidence-Enhanced Reasoning aggregates multi-granular views via a hybrid evidence memory and yields accurate, interpretable answers. Experimental results demonstrate that DeepScan significantly boosts LVLMs on diverse visual tasks, especially fine-grained visual understanding: it achieves 90.6% overall accuracy on V* when integrated with Qwen2.5-VL-7B. Moreover, DeepScan provides consistent improvements for LVLMs across architectures and model scales without additional adaptation cost.
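The abstract describes a three-stage, training-free pipeline: multi-scale cue scanning, refocusing of localized evidence, and answer generation over a hybrid evidence memory. The sketch below shows how such a pipeline *might* be orchestrated; every function, class, and data field here is a hypothetical placeholder for illustration, not the paper's actual implementation, and the LVLM/vision-expert calls are stubbed out.

```python
# Hypothetical sketch of a DeepScan-style pipeline (assumptions only;
# the real system's interfaces are not specified in the abstract).
from dataclasses import dataclass, field

@dataclass
class Evidence:
    region: tuple      # assumed (x, y, w, h) crop in image coordinates
    scale: int         # pyramid level at which the cue was found
    description: str   # caption the LVLM would produce for the cue

@dataclass
class EvidenceMemory:
    """Hybrid memory holding multi-granularity evidence views."""
    items: list = field(default_factory=list)

    def add(self, ev: Evidence) -> None:
        self.items.append(ev)

    def fused_context(self) -> str:
        # Coarse-to-fine ordering: global scales first, then local crops.
        ordered = sorted(self.items, key=lambda e: e.scale)
        return " | ".join(e.description for e in ordered)

def hierarchical_scan(image, scales=(1, 2, 4)):
    """Stage 1 (stub): bottom-up, multi-scale cue exploration."""
    cues = []
    for s in scales:
        # A real system would query the LVLM per sub-window here.
        cues.append(Evidence(region=(0, 0, 100 // s, 100 // s),
                             scale=s, description=f"cue@scale{s}"))
    return cues

def refocus(ev: Evidence) -> Evidence:
    """Stage 2 (stub): refine the evidence view with vision experts."""
    ev.description += " (refined)"
    return ev

def evidence_enhanced_answer(memory: EvidenceMemory, question: str) -> str:
    """Stage 3 (stub): answer conditioned on the fused multi-granular context."""
    return f"Answer to '{question}' using [{memory.fused_context()}]"

def deepscan(image, question):
    memory = EvidenceMemory()
    for cue in hierarchical_scan(image):
        memory.add(refocus(cue))
    return evidence_enhanced_answer(memory, question)
```

The key structural point the sketch captures is that evidence is accumulated incrementally across scales rather than localized in one shot, and the final answer is conditioned on the fused memory rather than on any single view.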