Revisiting Embodied Chain-of-Thought for Generalizable Robot Manipulation

📅 2026-06-02
📈 Citations: 0
Influential: 0

🤖 AI Summary
This work addresses the instability and limited generalization of existing embodied chain-of-thought (CoT) methods, which the authors attribute to tight coupling between reasoning and action. The authors construct the largest embodied CoT corpus to date and propose ERVLA, a vision-language-action model that uses CoT as a representation-shaping supervision signal during training rather than as a mandatory prefix decoded at test time. A reasoning-dropout training strategy lets the model predict actions directly at inference, which avoids the compounding errors of autoregressive CoT decoding. ERVLA reaches success rates of 86.9% on LIBERO-Plus (state of the art) and 53.2% on VLABench, and it outperforms competitive baselines in real-robot experiments, especially on tasks requiring semantic disambiguation and long-horizon execution.
📝 Abstract
Embodied chain-of-thought (CoT) aims to bridge linguistic reasoning and robotic control, but its effective form and integration strategy remain underexplored. In this paper, we revisit embodied CoT for vision-language-action (VLA) models at large scale. We construct the largest embodied CoT corpus to date, comprising 978,743 trajectories, 226.3M samples, and 2592.5 hours of robot data. Through extensive experiments, we find that effective embodied CoT should ground high-level semantic understanding into concrete action guidance, such as end-effector movement descriptions and image-space trajectories, while high-level reasoning alone brings only marginal gains. We further show that explicit CoT does not scale reliably when used as an autoregressive action prefix, as it suffers from compounding inference errors and unstable reasoning-action coupling. To address these limitations, we propose ERVLA, a VLA model that uses embodied CoT as representation-shaping supervision rather than mandatory test-time reasoning. ERVLA is trained with a reasoning-dropout strategy, enabling the model to absorb rich reasoning traces during training while predicting actions directly without CoT decoding during inference. This design improves scalability with increasing pre-training data and avoids autoregressive instability. ERVLA achieves state-of-the-art performance on LIBERO-Plus with an 86.9% success rate and reaches 53.2% success rate on VLABench, demonstrating strong out-of-distribution generalization. In real-robot experiments, ERVLA further outperforms competitive state-of-the-art baselines, especially on tasks requiring semantic disambiguation and long-horizon execution. Code, data, and model checkpoints will be released.
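The reasoning-dropout idea described in the abstract can be illustrated with a minimal sketch. The abstract does not state the dropout rate, the sequence format, or how the supervision is applied, so everything below is an assumption for illustration: the `Sample` type, the `build_training_target` helper, the `drop_prob` parameter, and the segment labels are hypothetical and are not taken from the ERVLA code.

```python
import random
from dataclasses import dataclass
from typing import List, Optional, Tuple

# Hypothetical container for one training example (not the ERVLA data schema).
@dataclass
class Sample:
    instruction: str
    reasoning: Optional[str]  # embodied CoT trace, e.g. an end-effector movement description
    actions: List[float]      # flattened action chunk


def build_training_target(
    sample: Sample, drop_prob: float, rng: random.Random
) -> List[Tuple[str, object]]:
    """Assemble the supervised target for one sample.

    With probability `drop_prob` the reasoning trace is omitted, so the model
    must also learn to predict actions without any CoT in front of them.
    Otherwise the trace is kept and acts as extra supervision.
    """
    keep_reasoning = sample.reasoning is not None and rng.random() >= drop_prob
    segments: List[Tuple[str, object]] = []
    if keep_reasoning:
        segments.append(("reasoning", sample.reasoning))
    segments.append(("action", sample.actions))
    return segments


def inference_target_kinds() -> List[str]:
    """At test time only actions are decoded; no CoT prefix is generated."""
    return ["action"]


if __name__ == "__main__":
    rng = random.Random(0)
    sample = Sample(
        instruction="pick up the red mug",
        reasoning="move the gripper left, then lower it onto the handle",
        actions=[0.1, -0.2, 0.0],
    )
    for _ in range(4):
        print([kind for kind, _ in build_training_target(sample, 0.5, rng)])
```

Under these assumptions, training sees a mix of reasoning-plus-action and action-only targets, while inference always matches the action-only case. That is one plausible way to get the behavior the abstract describes: reasoning traces shape the representation during training, and no autoregressive CoT decoding is needed at test time.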
Problem

Research questions and friction points this paper is trying to address.

embodied chain-of-thought
robot manipulation
vision-language-action models
generalization
reasoning-action coupling
Innovation

Methods, ideas, or system contributions that make the work stand out.

Embodied Chain-of-Thought
Vision-Language-Action Models
Reasoning-Dropout
Representation-Shaping Supervision
Out-of-Distribution Generalization