Visual Planning: Let's Think Only with Images

📅 2025-05-16
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Current large language models (LLMs) and their multimodal extensions (MLLMs) rely on textual representations for spatial and geometric reasoning, which limits expressivity and structural abstraction in these settings. To address this, the authors propose Visual Planning, a paradigm that removes linguistic mediation entirely: reasoning and decision-making proceed directly over sequences of images, which serve as both input and output. To realize this, they introduce VPRL (Visual Planning via Reinforcement Learning), a framework that applies GRPO-based post-training to large vision models for step-by-step visual planning. Evaluated on visual navigation benchmarks, including FrozenLake, Maze, and MiniBehavior, the approach consistently outperforms all text-based planning baselines, demonstrating the feasibility and advantage of purely visual pathways for spatial reasoning tasks.
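As a loose illustration of what "planning as an image sequence" means, consider a toy FrozenLake-style grid in which each "image" is reduced to an agent coordinate, and a plan is the sequence of successive states rather than a textual action list. The greedy rollout below is a sketch under that simplification, not the paper's method; all names are hypothetical.

```python
# Toy stand-in for visual planning on a FrozenLake-style grid.
# Each state (a coordinate) stands in for a rendered image; the
# "plan" is the ordered sequence of states, with no text involved.
MOVES = [(-1, 0), (1, 0), (0, -1), (0, 1)]  # up, down, left, right

def greedy_plan(start, goal, size=4, max_steps=16):
    """Roll out a state sequence by greedily reducing Manhattan
    distance to the goal; returns the visited states in order."""
    state, traj = start, [start]
    for _ in range(max_steps):
        if state == goal:
            break
        # enumerate legal successor states on the grid
        candidates = [
            (state[0] + dr, state[1] + dc)
            for dr, dc in MOVES
            if 0 <= state[0] + dr < size and 0 <= state[1] + dc < size
        ]
        # pick the successor closest to the goal
        state = min(candidates, key=lambda s: abs(s[0] - goal[0]) + abs(s[1] - goal[1]))
        traj.append(state)
    return traj
```

On an obstacle-free grid this greedy rollout always reaches the goal; with holes or walls (as in the actual benchmarks) a learned policy is needed, which is where VPRL's reinforcement learning comes in.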

📝 Abstract
Recent advancements in Large Language Models (LLMs) and their multimodal extensions (MLLMs) have substantially enhanced machine reasoning across diverse tasks. However, these models predominantly rely on pure text as the medium for both expressing and structuring reasoning, even when visual information is present. In this work, we argue that language may not always be the most natural or effective modality for reasoning, particularly in tasks involving spatial and geometrical information. Motivated by this, we propose a new paradigm, Visual Planning, which enables planning through purely visual representations, independent of text. In this paradigm, planning is executed via sequences of images that encode step-by-step inference in the visual domain, akin to how humans sketch or visualize future actions. We introduce a novel reinforcement learning framework, Visual Planning via Reinforcement Learning (VPRL), empowered by GRPO for post-training large vision models, leading to substantial improvements in planning in a selection of representative visual navigation tasks, FrozenLake, Maze, and MiniBehavior. Our visual planning paradigm outperforms all other planning variants that conduct reasoning in the text-only space. Our results establish Visual Planning as a viable and promising alternative to language-based reasoning, opening new avenues for tasks that benefit from intuitive, image-based inference.
Problem

Research questions and friction points this paper is trying to address.

Text is not always a natural or effective medium for reasoning about spatial and geometrical information
Whether planning can be carried out through purely visual representations, without text as an intermediary
Whether purely visual planning can outperform text-based reasoning on visual navigation tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Visual Planning: planning via purely visual representations, as sequences of images
Visual Planning via Reinforcement Learning (VPRL), a novel RL framework
GRPO-based post-training of large vision models
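GRPO's key step is scoring each sampled rollout against its own sampling group rather than against a learned value baseline. A minimal sketch of that advantage computation, assuming scalar per-trajectory rewards (not the authors' implementation):

```python
import numpy as np

def grpo_advantages(group_rewards, eps=1e-8):
    """Group-relative advantages: for a group of rollouts sampled
    from the same starting state, center and scale each reward by
    the group's mean and standard deviation (no critic network)."""
    r = np.asarray(group_rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)
```

Rollouts that beat their group's average get positive advantage and are reinforced; below-average rollouts are suppressed. In the visual setting, each rollout would be a generated image sequence scored by a task-specific reward.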