Igniting VLMs toward the Embodied Space

📅 2025-09-15
📈 Citations: 0
Influential: 0
🤖 AI Summary
Current vision-language models (VLMs) suffer from fundamental limitations in embodied spatial understanding and action generation due to modality fragmentation, pretraining distribution shift, and objective misalignment—hindering their evolution toward general embodied intelligence. To address these challenges, we propose WALL-OSS: the first end-to-end differentiable embodied foundation model supporting instruction reasoning, subgoal decomposition, and fine-grained action generation. WALL-OSS employs a tightly coupled multimodal architecture jointly optimized with a Unified Cross-Level Chain-of-Thought (CoT) training paradigm, enabling deep integration of language, spatial reasoning, and motor control. Evaluated on complex, long-horizon manipulation tasks, WALL-OSS significantly outperforms strong baselines, demonstrating superior instruction following, spatial reasoning, and embodied execution capabilities. This work establishes a novel paradigm for advancing VLMs toward fully capable embodied agents.

📝 Abstract
While foundation models show remarkable progress in language and vision, existing vision-language models (VLMs) still have limited spatial and embodiment understanding. Transferring VLMs to embodied domains reveals fundamental mismatches between modalities, pretraining distributions, and training objectives, leaving action comprehension and generation as a central bottleneck on the path to AGI. We introduce WALL-OSS, an end-to-end embodied foundation model that leverages large-scale multimodal pretraining to achieve (1) embodiment-aware vision-language understanding, (2) strong language-action association, and (3) robust manipulation capability. Our approach employs a tightly coupled architecture and a multi-strategy training curriculum that enables Unified Cross-Level CoT, seamlessly unifying instruction reasoning, subgoal decomposition, and fine-grained action synthesis within a single differentiable framework. Our results show that WALL-OSS attains high success on complex long-horizon manipulations, demonstrates strong instruction-following, understanding, and reasoning capabilities, and outperforms strong baselines, thereby providing a reliable and scalable path from VLMs to embodied foundation models.
Problem

Research questions and friction points this paper is trying to address.

Addressing VLMs' limited spatial and embodiment understanding
Overcoming modality mismatches in embodied transfer for action comprehension
Bridging vision-language models to embodied action generation
Innovation

Methods, ideas, or system contributions that make the work stand out.

End-to-end embodied foundation model architecture
Unified cross-level reasoning and action synthesis
Multimodal pretraining with a multi-strategy curriculum
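The "single differentiable framework" described above jointly optimizes a language (CoT) objective and an action-generation objective so gradients flow through both. The sketch below is a minimal, hypothetical illustration of that idea only, not the paper's actual architecture or loss: it assumes cross-entropy over CoT tokens, mean-squared error over a predicted action chunk, and a weighting factor `lam`, all of which are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    z = x - x.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def joint_loss(token_logits, token_targets, pred_actions, target_actions, lam=1.0):
    """Toy unified objective: CoT token cross-entropy + action-chunk MSE.

    In an end-to-end differentiable model, both terms would backpropagate
    into the same shared multimodal trunk.
    """
    probs = softmax(token_logits)
    n = token_targets.shape[0]
    ce = -np.log(probs[np.arange(n), token_targets] + 1e-9).mean()
    mse = ((pred_actions - target_actions) ** 2).mean()
    return ce + lam * mse

# Toy shapes: 8 CoT tokens over a 100-word vocabulary,
# and a 16-step chunk of 7-DoF actions.
token_logits = rng.normal(size=(8, 100))
token_targets = rng.integers(0, 100, size=8)
pred_actions = rng.normal(size=(16, 7))
target_actions = rng.normal(size=(16, 7))

loss = joint_loss(token_logits, token_targets, pred_actions, target_actions)
```

The design point the sketch conveys is that reasoning (text) and control (actions) share one scalar objective, so neither is trained in isolation.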
Andy Zhai (X SQUARE ROBOT)
Brae Liu (X SQUARE ROBOT)
Bruno Fang (X SQUARE ROBOT)
Chalse Cai (X SQUARE ROBOT)
Ellie Ma (X SQUARE ROBOT)
Ethan Yin (X SQUARE ROBOT)
Hao Wang (X SQUARE ROBOT)
Hugo Zhou (X SQUARE ROBOT)
James Wang (Columbia University)
Lights Shi (X SQUARE ROBOT)
Lucy Liang (X SQUARE ROBOT)
Make Wang (X SQUARE ROBOT)
Qian Wang (X SQUARE ROBOT)
Roy Gan (X SQUARE ROBOT)
Ryan Yu (X SQUARE ROBOT)
Shalfun Li (X SQUARE ROBOT)
Starrick Liu (X SQUARE ROBOT)
Sylas Chen (X SQUARE ROBOT)
Vincent Chen (X SQUARE ROBOT)
Zach Xu (X SQUARE ROBOT)