🤖 AI Summary
Current vision-language models (VLMs) suffer from fundamental limitations in embodied spatial understanding and action generation due to modality fragmentation, pretraining distribution shift, and objective misalignment—hindering their evolution toward general embodied intelligence. To address these challenges, we propose WALL-OSS: the first end-to-end differentiable embodied foundation model supporting instruction reasoning, subgoal decomposition, and fine-grained action generation. WALL-OSS employs a tightly coupled multimodal architecture jointly optimized with a Unified Cross-Level Chain-of-Thought (CoT) training paradigm, enabling deep integration of language, spatial reasoning, and motor control. Evaluated on complex, long-horizon manipulation tasks, WALL-OSS significantly outperforms strong baselines, demonstrating superior instruction following, spatial reasoning, and embodied execution capabilities. This work establishes a novel paradigm for advancing VLMs toward fully capable embodied agents.
📝 Abstract
While foundation models show remarkable progress in language and vision, existing vision-language models (VLMs) still have limited spatial and embodiment understanding. Transferring VLMs to embodied domains reveals fundamental mismatches between modalities, pretraining distributions, and training objectives, leaving action comprehension and generation as a central bottleneck on the path to AGI.
We introduce WALL-OSS, an end-to-end embodied foundation model that leverages large-scale multimodal pretraining to achieve (1) embodiment-aware vision-language understanding, (2) strong language-action association, and (3) robust manipulation capability.
Our approach employs a tightly coupled architecture and a multi-strategy training curriculum that enables Unified Cross-Level Chain-of-Thought (CoT), seamlessly unifying instruction reasoning, subgoal decomposition, and fine-grained action synthesis within a single differentiable framework.
Our results show that WALL-OSS attains high success rates on complex long-horizon manipulation tasks, demonstrates strong instruction-following, complex scene understanding, and reasoning capabilities, and outperforms strong baselines, thereby providing a reliable and scalable path from VLMs to embodied foundation models.
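To make the idea of a single cross-level pass concrete, here is a minimal Python sketch of what jointly producing instruction reasoning, subgoal decomposition, and action chunks from one model call might look like. Every name, type, and stubbed output below is our own illustrative assumption; the paper does not publish this interface.

```python
# Hypothetical sketch of a Unified Cross-Level CoT inference step.
# All identifiers are illustrative assumptions, not WALL-OSS's actual API.
from dataclasses import dataclass
from typing import List

@dataclass
class Observation:
    image: bytes       # camera frame (placeholder)
    instruction: str   # natural-language task instruction

@dataclass
class CrossLevelCoT:
    reasoning: str               # high-level instruction reasoning
    subgoals: List[str]          # mid-level subgoal decomposition
    actions: List[List[float]]   # low-level continuous action chunks

def unified_cot_step(obs: Observation) -> CrossLevelCoT:
    """One forward pass of a (stub) embodied model: reasoning, subgoals,
    and actions are emitted jointly rather than by separate modules, so
    the whole chain could in principle be trained end to end."""
    reasoning = f"To '{obs.instruction}', first locate the target object."
    subgoals = ["locate object", "grasp object", "place object"]
    # Each subgoal yields a short chunk of continuous actions
    # (e.g. 7-DoF end-effector deltas, stubbed as zeros here).
    actions = [[0.0] * 7 for _ in subgoals]
    return CrossLevelCoT(reasoning, subgoals, actions)

out = unified_cot_step(Observation(image=b"", instruction="put the cup on the shelf"))
print(len(out.subgoals), len(out.actions))
```

The point of the sketch is structural: because one function returns all three levels, gradients from an action-level loss could flow back through the same parameters that produce the language-level reasoning, which is the property the abstract attributes to the unified framework.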