🤖 AI Summary
This work addresses the challenge that existing large vision-language models (LVLMs) struggle to accurately localize critical visual evidence and to perform fine-grained reasoning under noisy conditions. The authors propose a training-free, bottom-up framework that hierarchically scans multi-scale visual cues and integrates a refocusing strategy, which coordinates LVLMs with vision experts, together with a hybrid evidence memory mechanism. This enables hierarchical exploration and multi-granularity fusion of visual evidence. Notably, the method achieves significant improvements in robustness and interpretability on complex scenes without any model fine-tuning, attaining 90.6% accuracy on the V* benchmark with Qwen2.5-VL-7B and delivering consistent gains across diverse LVLM architectures and scales.
📝 Abstract
Humans can robustly localize visual evidence and provide grounded answers even in noisy environments by identifying critical cues and then relating them to the full context in a bottom-up manner. Inspired by this, we propose DeepScan, a training-free framework that combines Hierarchical Scanning, Refocusing, and Evidence-Enhanced Reasoning for visually grounded reasoning in Large Vision-Language Models (LVLMs). Unlike existing methods that pursue one-shot localization of complete evidence, Hierarchical Scanning performs local cue exploration and multi-scale evidence extraction to recover evidence bottom-up, effectively mitigating the impact of distracting context. Refocusing then optimizes the localized evidence view through collaboration between LVLMs and visual experts. Finally, Evidence-Enhanced Reasoning aggregates multi-granular views via a hybrid evidence memory and yields accurate, interpretable answers. Experimental results demonstrate that DeepScan significantly boosts LVLMs on diverse visual tasks, especially fine-grained visual understanding: it achieves 90.6% overall accuracy on V* when integrated with Qwen2.5-VL-7B. Moreover, DeepScan provides consistent improvements for LVLMs across architectures and model scales without additional adaptation cost.
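The abstract describes a three-stage, training-free pipeline: multi-scale cue scanning, refocusing of localized evidence, and answer generation over a hybrid evidence memory. The sketch below shows how such a pipeline *might* be orchestrated; every function, class, and data field here is a hypothetical placeholder for illustration, not the paper's actual implementation, and the LVLM/vision-expert calls are stubbed out.

```python
# Hypothetical sketch of a DeepScan-style pipeline (assumptions only;
# the real system's interfaces are not specified in the abstract).
from dataclasses import dataclass, field

@dataclass
class Evidence:
    region: tuple      # assumed (x, y, w, h) crop in image coordinates
    scale: int         # pyramid level at which the cue was found
    description: str   # caption the LVLM would produce for the cue

@dataclass
class EvidenceMemory:
    """Hybrid memory holding multi-granularity evidence views."""
    items: list = field(default_factory=list)

    def add(self, ev: Evidence) -> None:
        self.items.append(ev)

    def fused_context(self) -> str:
        # Coarse-to-fine ordering: global scales first, then local crops.
        ordered = sorted(self.items, key=lambda e: e.scale)
        return " | ".join(e.description for e in ordered)

def hierarchical_scan(image, scales=(1, 2, 4)):
    """Stage 1 (stub): bottom-up, multi-scale cue exploration."""
    cues = []
    for s in scales:
        # A real system would query the LVLM per sub-window here.
        cues.append(Evidence(region=(0, 0, 100 // s, 100 // s),
                             scale=s, description=f"cue@scale{s}"))
    return cues

def refocus(ev: Evidence) -> Evidence:
    """Stage 2 (stub): refine the evidence view with vision experts."""
    ev.description += " (refined)"
    return ev

def evidence_enhanced_answer(memory: EvidenceMemory, question: str) -> str:
    """Stage 3 (stub): answer conditioned on the fused multi-granular context."""
    return f"Answer to '{question}' using [{memory.fused_context()}]"

def deepscan(image, question):
    memory = EvidenceMemory()
    for cue in hierarchical_scan(image):
        memory.add(refocus(cue))
    return evidence_enhanced_answer(memory, question)
```

The key structural point the sketch captures is that evidence is accumulated incrementally across scales rather than localized in one shot, and the final answer is conditioned on the fused memory rather than on any single view.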