Look Again, Think Slowly: Enhancing Visual Reflection in Vision-Language Models

πŸ“… 2025-09-15
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ“„ PDF
πŸ€– AI Summary
Visual reasoning models (VRMs) often exhibit diminished visual attention during long-answer generation, leading to inadequate visual reflection capability. To address this, we propose Reflection-V, the first framework to systematically enhance VRMs' visual reflection ability. Our approach comprises three core components: (i) an agent-collaborative mechanism integrating vision-language models and reasoning-oriented large language models to construct high-quality, vision-centric reasoning data; (ii) a vision-attention-map-based reward model that explicitly enforces sustained reliance on visual inputs during reinforcement learning; and (iii) a cold-start data construction strategy to alleviate annotation bottlenecks. Evaluated across multiple visual reasoning benchmarks, Reflection-V achieves significant improvements over state-of-the-art methods, boosting visual attention stability by 23.6% while concurrently enhancing reasoning consistency and generalization.
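The attention-decay observation underlying the paper can be made concrete with a small sketch: given a row-stochastic attention matrix over the context (e.g. averaged over heads and layers) and a mask marking image-token positions, the fraction of each generated token's attention that lands on visual tokens measures how visual grounding evolves during generation. This is a minimal illustration of the diagnostic, not the paper's exact measurement; the function name and shapes are assumptions.

```python
import numpy as np

def visual_attention_mass(attn, visual_mask):
    """Fraction of each generated token's attention mass on visual tokens.

    attn: (num_generated_tokens, context_len) attention weights, each row
          summing to 1 (e.g. averaged over heads/layers).
    visual_mask: (context_len,) boolean, True at image-token positions.
    Returns a (num_generated_tokens,) array; a downward trend over the
    generation axis indicates diminishing reliance on the image.
    """
    return attn[:, visual_mask].sum(axis=1)

# Toy check with random normalized attention rows.
rng = np.random.default_rng(0)
attn = rng.random((5, 8))
attn /= attn.sum(axis=1, keepdims=True)   # make rows sum to 1
mask = np.zeros(8, dtype=bool)
mask[:3] = True                           # first 3 context tokens are visual
mass = visual_attention_mass(attn, mask)
```

In practice the attention tensors would come from the VLM itself (e.g. a decoder's cross- or self-attention over image patch tokens), with one mass value per decoding step.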

πŸ“ Abstract
Recent advances in text-only "slow-thinking" reasoning have prompted efforts to transfer this capability to vision-language models (VLMs) for training visual reasoning models (VRMs). However, such transfer faces critical challenges: effective "slow thinking" in VRMs requires visual reflection, the ability to check the reasoning process against visual information. Through quantitative analysis, we observe that current VRMs exhibit limited visual reflection, as their attention to visual information diminishes rapidly with longer generated responses. To address this challenge, we propose a new VRM, Reflection-V, which enhances visual reflection through reasoning data construction for cold-start training and reward design for reinforcement learning (RL). First, we construct vision-centered reasoning data by leveraging an agent that interacts between VLMs and reasoning LLMs, enabling cold-start learning of visual reflection patterns. Second, a visual-attention-based reward model is employed during RL to encourage reasoning grounded in visual information. As a result, Reflection-V demonstrates significant improvements across multiple visual reasoning benchmarks. Furthermore, Reflection-V maintains a stronger and more consistent reliance on visual information during visual reasoning, indicating effective enhancement of visual reflection capabilities.
Problem

Research questions and friction points this paper is trying to address.

Enhancing visual reflection in vision-language models
Addressing diminished visual attention in longer responses
Improving visual reasoning through reinforcement learning rewards
Innovation

Methods, ideas, or system contributions that make the work stand out.

Vision-centered reasoning data construction
Visual attention reward model design
Cold-start learning for visual reflection
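The attention-based reward idea in the bullets above can be sketched as a scalar that combines task correctness with a bonus for sustained attention to visual tokens during generation. This is a hedged illustration of the concept only; the function, the threshold `tau`, and the weight `alpha` are hypothetical and not the paper's actual formulation.

```python
import numpy as np

def visual_reward(attn, visual_mask, task_correct, alpha=0.5, tau=0.2):
    """Illustrative RL reward: task reward plus a sustained-visual-attention bonus.

    attn: (num_generated_tokens, context_len) row-stochastic attention weights.
    visual_mask: (context_len,) boolean, True at image-token positions.
    task_correct: whether the final answer was correct.
    alpha, tau: illustrative hyperparameters (bonus weight, attention threshold).
    """
    mass = attn[:, visual_mask].sum(axis=1)   # per-step visual attention mass
    sustained = float((mass > tau).mean())    # fraction of steps above threshold
    return float(task_correct) + alpha * sustained

# Toy rollout: attention drifts away from the single visual token over 3 steps.
attn = np.array([[0.50, 0.30, 0.20],
                 [0.10, 0.40, 0.50],
                 [0.05, 0.45, 0.50]])
mask = np.array([True, False, False])
r = visual_reward(attn, mask, task_correct=True)
```

Only the first step clears the threshold here, so the bonus is alpha * 1/3 on top of the task reward of 1. In an RL loop such a reward would discourage rollouts whose attention to the image collapses early.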
πŸ”Ž Similar Papers
No similar papers found.
Pu Jian
Institute of Automation, Chinese Academy of Sciences
Multimodal, Machine Learning, NLP
Junhong Wu
PhD student, Institute of Automation, Chinese Academy of Sciences
Natural language processing, lifelong learning
Wei Sun
Institute of Automation, Chinese Academy of Sciences; School of Artificial Intelligence, University of Chinese Academy of Sciences
Chen Wang
Institute of Automation, Chinese Academy of Sciences; School of Artificial Intelligence, University of Chinese Academy of Sciences
Shuo Ren
Institute of Automation, Chinese Academy of Sciences
Jiajun Zhang
Institute of Automation, Chinese Academy of Sciences
Natural Language Processing, Large Language Models, Multimodal Information Processing