Unifying Perception and Action: A Hybrid-Modality Pipeline with Implicit Visual Chain-of-Thought for Robotic Action Generation

📅 2025-11-24
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address key challenges in Vision-Language-Action (VLA) models—including insufficient modeling of scene details in complex spatial environments, a significant modality gap between visual perception and low-level actions, and misalignment between visual prediction and action generation objectives leading to training instability—this paper proposes a hybrid vision-action modality framework. The method introduces: (1) a shared discrete latent space unifying visual observations and primitive actions; (2) an Implicit Visual Chain-of-Thought mechanism that internalizes visual dynamics as an inductive bias for motion planning; and (3) the Vision-Integrated Trajectory Alignment (VITA) architecture, which jointly optimizes autoregressive action generation, future-frame prediction, and action decoding. Evaluated on the CALVIN, LIBERO, and SimplerEnv benchmarks, the approach achieves absolute improvements of 14.5%, 9.6%, and 12.1%, respectively, and attains an average success rate of 80.5% across six real-world tasks—substantially outperforming state-of-the-art methods.
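Read literally, the joint optimization described above suggests a weighted combination of the three objectives. The formulation below is only a plausible sketch of that reading; the loss symbols and weights λ₁, λ₂ are illustrative and not taken from the paper:

$$\mathcal{L}_{\mathrm{VITA}} \;=\; \mathcal{L}_{\mathrm{AR}} \;+\; \lambda_{1}\,\mathcal{L}_{\mathrm{frame}} \;+\; \lambda_{2}\,\mathcal{L}_{\mathrm{action}}$$

where $\mathcal{L}_{\mathrm{AR}}$ would be the autoregressive token-prediction loss over the shared discrete latent space, $\mathcal{L}_{\mathrm{frame}}$ the future-frame prediction loss, and $\mathcal{L}_{\mathrm{action}}$ the action-decoding loss.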

📝 Abstract
Vision-Language-Action (VLA) models built upon Chain-of-Thought (CoT) have achieved remarkable success in advancing general-purpose robotic agents, owing to their strong perceptual comprehension. Since text-only CoT struggles to adequately capture scene details in complex spatial environments, a highly promising strategy is to leverage visual priors to guide robotic action generation. Nevertheless, these strategies face two inherent challenges: (i) a modality gap between visual observations and low-level actions, and (ii) unstable training due to competing objectives between visual prediction and action generation. To address these challenges, we propose a Vision-Integrated Trajectory Alignment (VITA) framework that learns a shared discrete latent space for vision and action, enabling joint modeling of perception and motor control. VITA introduces an implicit visual CoT: autoregressively generated tokens are simultaneously decoded into future-frame predictions and robot actions, thereby internalizing visual dynamics as an inductive bias for motion planning. Extensive experiments in simulated and real-world environments demonstrate state-of-the-art performance. VITA improves upon existing baselines by 14.5%, 9.6%, and 12.1% on CALVIN, LIBERO, and SimplerEnv, respectively. Furthermore, VITA attains an average success rate of 80.5% across six real-world tasks, demonstrating its potential as a generalist robotic manipulation model.
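As a rough illustration of the implicit visual CoT described in the abstract, the following is a minimal sketch assuming a PyTorch-style implementation. The module and head names (ImplicitVisualCoT, frame_head, action_head) and all dimensions are hypothetical, not taken from the paper; the point is a single autoregressive token stream over the shared discrete vocabulary that feeds both a future-frame decoder and an action decoder:

```python
import torch
import torch.nn as nn

class ImplicitVisualCoT(nn.Module):
    """Hypothetical sketch: one autoregressive token stream, two decoding heads.

    Assumes a shared discrete codebook of size `vocab_size`; the actual VITA
    architecture and dimensions may differ.
    """
    def __init__(self, vocab_size=8192, dim=512, action_dim=7):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.backbone = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True),
            num_layers=4,
        )
        self.token_head = nn.Linear(dim, vocab_size)   # next-token (autoregressive) head
        self.frame_head = nn.Linear(dim, 3 * 16 * 16)  # toy future-frame patch decoder
        self.action_head = nn.Linear(dim, action_dim)  # low-level action decoder

    def forward(self, tokens):
        h = self.backbone(self.embed(tokens))
        return (
            self.token_head(h),   # logits for autoregressive generation
            self.frame_head(h),   # decoded future-frame prediction
            self.action_head(h),  # decoded robot action
        )

if __name__ == "__main__":
    model = ImplicitVisualCoT()
    tokens = torch.randint(0, 8192, (1, 32))          # a batch of shared discrete tokens
    logits, frames, actions = model(tokens)
    print(logits.shape, frames.shape, actions.shape)  # (1, 32, 8192) (1, 32, 768) (1, 32, 7)
```

Supervising the same token stream with both future-frame and action targets is one way visual dynamics could act as an inductive bias for motion planning, which is the intuition the abstract describes.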
Problem

Research questions and friction points this paper is trying to address.

Bridging the modality gap between visual observations and robot actions
Resolving training instability from competing visual-action objectives
Integrating visual reasoning with robotic motion planning
Innovation

Methods, ideas, or system contributions that make the work stand out.

Hybrid-modality pipeline with implicit visual chain-of-thought
Shared discrete latent space for vision and action (see the sketch after this list)
Autoregressive tokens decoded into frames and actions
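The shared discrete latent space in the bullets above could, for instance, be realized with a vector-quantization codebook used by both the vision and action tokenizers. The sketch below is an assumption for illustration (class and variable names are hypothetical), not the paper's actual tokenizer:

```python
import torch
import torch.nn as nn

class SharedCodebook(nn.Module):
    """Hypothetical VQ-style codebook shared by vision and action encoders."""
    def __init__(self, num_codes=8192, dim=512):
        super().__init__()
        self.codes = nn.Embedding(num_codes, dim)

    def quantize(self, features):
        # Nearest-neighbor lookup: map continuous features (B, T, dim)
        # to discrete token ids in the shared vocabulary.
        flat = features.reshape(-1, features.size(-1))    # (B*T, dim)
        dists = torch.cdist(flat, self.codes.weight)      # (B*T, num_codes)
        return dists.argmin(dim=-1).reshape(features.shape[:-1])

# Both modalities are tokenized into the same vocabulary, so a single
# autoregressive model can interleave visual and action tokens.
codebook = SharedCodebook()
vision_tokens = codebook.quantize(torch.randn(1, 64, 512))  # image patch features
action_tokens = codebook.quantize(torch.randn(1, 8, 512))   # action chunk features
```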
Xiangkai Ma
Nanjing University
Time series · Multi-modality · Vision-language-action
Lekai Xing
State Key Laboratory for Novel Software Technology, Nanjing University
Han Zhang
State Key Laboratory for Novel Software Technology, Nanjing University
Wenzhong Li
State Key Laboratory for Novel Software Technology, Nanjing University
Sanglu Lu
State Key Laboratory for Novel Software Technology, Nanjing University