Thinking Without Images: Internalizing Visual Manipulation with On-Policy Self-Distillation

📅 2026-06-07

🤖 AI Summary

Existing “Thinking with Images” approaches rely on frequent calls to external vision tools to gather local evidence, which produces long reasoning traces that are susceptible to noise. This work proposes an “imaginative reasoning” mechanism that internalizes visual reasoning as a tool-free simulation: the model decides for itself where to look and predicts the fine-grained visual cues that closer inspection would reveal, without invoking any tool. To realize this, the paper introduces the Imagine-OPD framework, an on-policy self-distillation scheme in which a teacher that receives privileged zoomed views derived from annotated regions supervises the model's own imagination trajectories, with no external teacher or high-quality imagination demonstrations required. Experiments show that the method achieves the best average performance among the compared models on vision-centric benchmarks while significantly reducing inference overhead relative to “Thinking with Images” methods.

📝 Abstract

“Thinking with Images” has emerged as an effective paradigm for fine-grained visual reasoning: by explicitly zooming into relevant regions and reasoning over crops, models can access local evidence that is difficult to recover from a single global image. However, this benefit comes with redundant tool invocations and longer inference traces. Moreover, when such behaviors are learned mainly from outcome reward, the resulting intermediate crops or visual cues can be noisy or fail to faithfully capture task-relevant visual evidence. In this work, we ask whether the reasoning benefits of “Thinking with Images” can be internalized through Thinking with Imagination: an internal process that decides where to look and imagines what visual cues closer inspection would reveal without actually invoking tools. We propose Imagine-OPD, an on-policy self-distillation framework in which a teacher plays the role of a “Thinking with Images” reasoner during training: it receives privileged zoomed evidence views derived from annotated regions, and supervises the model's own imagination reasoning trajectories. Imagine-OPD does not require an external teacher or high-quality imagination demonstrations. Experiments on vision-centric benchmarks show that Imagine-OPD achieves the best average performance among compared models while significantly reducing inference overhead compared with “Thinking with Images” methods.
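
The training signal described in the abstract can be illustrated with a toy sketch. It assumes that teacher and student are the same model scored on a trajectory sampled by the student, and that the loss is a per-token KL divergence from the student distribution to the teacher distribution. The abstract does not state the actual objective, so the choice of divergence, the function names (`log_softmax`, `token_kl`, `self_distillation_loss`), and the toy logits are all illustrative assumptions, not the authors' implementation.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over one token's vocabulary logits."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


def token_kl(student_logits, teacher_logits):
    """KL(student || teacher) at a single token position (assumed objective)."""
    log_s = log_softmax(student_logits)
    log_t = log_softmax(teacher_logits)
    return sum(math.exp(ls) * (ls - lt) for ls, lt in zip(log_s, log_t))


def self_distillation_loss(student_steps, teacher_steps):
    """Mean per-token KL over one trajectory sampled from the student.

    student_steps: logits from the model given only the global image.
    teacher_steps: logits from the same model given, in addition, the
                   privileged zoomed view, evaluated on the SAME
                   student-sampled token prefix (the on-policy part).
    """
    if len(student_steps) != len(teacher_steps):
        raise ValueError("student and teacher must score the same trajectory")
    kls = [token_kl(s, t) for s, t in zip(student_steps, teacher_steps)]
    return sum(kls) / len(kls)


# Toy trajectory of three tokens over a four-word vocabulary (made-up numbers).
student = [[2.0, 0.5, 0.1, -1.0], [0.3, 0.3, 0.3, 0.3], [1.0, -0.5, 0.0, 2.0]]
teacher = [[2.5, 0.0, 0.0, -1.5], [0.0, 1.5, 0.0, 0.0], [1.0, -0.5, 0.0, 2.0]]

print(self_distillation_loss(student, teacher))
```

In this toy setup the third token contributes nothing to the loss because both distributions agree there, while the second token, where the privileged view makes the teacher far more confident than the student, contributes the most. At inference time only the student conditioning would be used, which is where the claimed reduction in tool calls would come from.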

Problem

Research questions and friction points this paper is trying to address.

visual reasoning

tool invocation

imagination

self-distillation

fine-grained perception

Innovation

Methods, ideas, or system contributions that make the work stand out.

Thinking with Imagination

On-Policy Self-Distillation

Visual Reasoning