🤖 AI Summary
Current vision-language models (VLMs) suffer from fundamental limitations in embodied spatial understanding and action generation due to modality fragmentation, pretraining distribution shift, and objective misalignment—hindering their evolution toward general embodied intelligence. To address these challenges, we propose WALL-OSS: the first end-to-end differentiable embodied foundation model supporting instruction reasoning, subgoal decomposition, and fine-grained action generation. WALL-OSS employs a tightly coupled multimodal architecture jointly optimized with a Unified Cross-Level Chain-of-Thought (CoT) training paradigm, enabling deep integration of language, spatial reasoning, and motor control. Evaluated on complex, long-horizon manipulation tasks, WALL-OSS significantly outperforms strong baselines, demonstrating superior instruction following, spatial reasoning, and embodied execution capabilities. This work establishes a novel paradigm for advancing VLMs toward fully capable embodied agents.
📝 Abstract
While foundation models show remarkable progress in language and vision, existing vision-language models (VLMs) still have limited spatial and embodiment understanding. Transferring VLMs to embodied domains reveals fundamental mismatches between modalities, pretraining distributions, and training objectives, leaving action comprehension and generation as a central bottleneck on the path to AGI.
We introduce WALL-OSS, an end-to-end embodied foundation model that leverages large-scale multimodal pretraining to achieve (1) embodiment-aware vision-language understanding, (2) strong language-action association, and (3) robust manipulation capability.
Our approach employs a tightly coupled architecture and a multi-strategy training curriculum that enables Unified Cross-Level Chain-of-Thought (CoT), seamlessly unifying instruction reasoning, subgoal decomposition, and fine-grained action synthesis within a single differentiable framework.
Our results show that WALL-OSS attains high success rates on complex long-horizon manipulation tasks, demonstrates strong instruction-following, complex scene understanding, and reasoning capabilities, and outperforms strong baselines, thereby providing a reliable and scalable path from VLMs to embodied foundation models.
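To make the idea of a single cross-level pass concrete, here is a minimal Python sketch of what jointly producing instruction reasoning, subgoal decomposition, and action chunks from one model call might look like. Every name, type, and stubbed output below is our own illustrative assumption; the paper does not publish this interface.

```python
# Hypothetical sketch of a Unified Cross-Level CoT inference step.
# All identifiers are illustrative assumptions, not WALL-OSS's actual API.
from dataclasses import dataclass
from typing import List

@dataclass
class Observation:
    image: bytes       # camera frame (placeholder)
    instruction: str   # natural-language task instruction

@dataclass
class CrossLevelCoT:
    reasoning: str               # high-level instruction reasoning
    subgoals: List[str]          # mid-level subgoal decomposition
    actions: List[List[float]]   # low-level continuous action chunks

def unified_cot_step(obs: Observation) -> CrossLevelCoT:
    """One forward pass of a (stub) embodied model: reasoning, subgoals,
    and actions are emitted jointly rather than by separate modules, so
    the whole chain could in principle be trained end to end."""
    reasoning = f"To '{obs.instruction}', first locate the target object."
    subgoals = ["locate object", "grasp object", "place object"]
    # Each subgoal yields a short chunk of continuous actions
    # (e.g. 7-DoF end-effector deltas, stubbed as zeros here).
    actions = [[0.0] * 7 for _ in subgoals]
    return CrossLevelCoT(reasoning, subgoals, actions)

out = unified_cot_step(Observation(image=b"", instruction="put the cup on the shelf"))
print(len(out.subgoals), len(out.actions))
```

The point of the sketch is structural: because one function returns all three levels, gradients from an action-level loss could flow back through the same parameters that produce the language-level reasoning, which is the property the abstract attributes to the unified framework.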