InSpire: Vision-Language-Action Models with Intrinsic Spatial Reasoning

📅 2025-05-20
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing vision-language-action (VLA) models suffer from spurious correlations between task-irrelevant visual features and actions, undermining cross-scenario generalization. To address this, we propose Intrinsic Spatial Reasoning (InSpire), a lightweight mechanism that enhances spatial awareness and causal robustness in VLAs—without additional data or model parameters. InSpire leverages directional spatial questioning and alignment, built upon a pre-trained vision-language foundation model, and integrates instruction prefix augmentation, joint spatial-answer–action alignment, and autoregressive action decoding. Evaluated on both simulation and real-world robotic platforms, InSpire achieves significant improvements in cross-task and cross-environment generalization, while enabling plug-and-play deployment. The code, models, and demonstration videos are publicly released.
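A minimal sketch of the instruction prefix augmentation described above, assuming a string-based prompt pipeline; the helper name augment_instruction and the target-object argument are illustrative and not taken from the paper's released code.

```python
# Hypothetical sketch of InSpire-style instruction prefix augmentation.
# The directional question from the paper is prepended to the task instruction.

SPATIAL_QUESTION = "In which direction is the {object} relative to the robot?"

def augment_instruction(instruction: str, target_object: str) -> str:
    """Prepend the directional spatial question to the language instruction."""
    question = SPATIAL_QUESTION.format(object=target_object)
    return f"{question} {instruction}"

# Example:
# augment_instruction("pick up the red cup", "red cup")
# -> "In which direction is the red cup relative to the robot? pick up the red cup"
```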

📝 Abstract
Leveraging pretrained Vision-Language Models (VLMs) to map language instructions and visual observations to raw low-level actions, Vision-Language-Action models (VLAs) hold great promise for achieving general-purpose robotic systems. Despite their advancements, existing VLAs tend to spuriously correlate task-irrelevant visual features with actions, limiting their generalization capacity beyond the training data. To tackle this challenge, we propose Intrinsic Spatial Reasoning (InSpire), a simple yet effective approach that mitigates the adverse effects of spurious correlations by boosting the spatial reasoning ability of VLAs. Specifically, InSpire redirects the VLA's attention to task-relevant factors by prepending the question "In which direction is the [object] relative to the robot?" to the language instruction and aligning the answer "right/left/up/down/front/back/grasped" and predicted actions with the ground truth. Notably, InSpire can be used as a plugin to enhance existing autoregressive VLAs, requiring no extra training data or interaction with other large models. Extensive experimental results in both simulation and real-world environments demonstrate the effectiveness and flexibility of our approach. Our code, pretrained models, and demos are publicly available at: https://Koorye.github.io/proj/Inspire.
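A minimal sketch of the joint spatial-answer and action alignment described in the abstract, assuming a PyTorch-style autoregressive VLA that emits the spatial-answer tokens followed by discretized action tokens; the function name inspire_loss, the tensor layout, and the weighting term lambda_answer are assumptions for illustration, not the paper's actual implementation.

```python
import torch
import torch.nn.functional as F

def inspire_loss(logits: torch.Tensor,
                 answer_targets: torch.Tensor,
                 action_targets: torch.Tensor,
                 answer_len: int,
                 lambda_answer: float = 1.0) -> torch.Tensor:
    """Joint supervision sketch for one sample.

    logits: (T, vocab) autoregressive token logits, where the first
    `answer_len` positions carry the spatial answer (e.g. "left", "grasped")
    and the remaining positions carry discretized action tokens.
    answer_targets: (answer_len,) ground-truth answer token ids.
    action_targets: (T - answer_len,) ground-truth action token ids.
    """
    answer_loss = F.cross_entropy(logits[:answer_len], answer_targets)
    action_loss = F.cross_entropy(logits[answer_len:], action_targets)
    return lambda_answer * answer_loss + action_loss
```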
Problem

Research questions and friction points this paper is trying to address.

VLAs spuriously correlate irrelevant visual features with actions
Existing VLAs lack spatial reasoning for task-relevant factors
How to enhance VLAs' spatial reasoning without extra training data or added parameters
Innovation

Methods, ideas, or system contributions that make the work stand out.

Enhances VLAs with intrinsic spatial reasoning
Uses directional spatial questions to redirect attention to task-relevant factors
Requires no extra training data or interaction with other large models
Authors

Ji Zhang
Southwest Jiaotong University

Shihan Wu
MS Student, University of Electronic Science and Technology of China
Computer Vision, Vision-Language Models, Robotics

Xu Luo
UESTC
Machine Learning, Robotics

Hao Wu
University of Electronic Science and Technology of China

Lianli Gao
UESTC
Vision and Language

Heng Tao Shen
Tongji University

Jingkuan Song
Tongji University