🤖 AI Summary
To address the misalignment between local actions and global task objectives in robotic dexterous manipulation, this paper proposes a goal-oriented backward trajectory autoregressive modeling paradigm: full action sequences are generated step by step in reverse, starting from task-critical keyframes. The method represents continuous actions as tokens and models trajectories via autoregressive generation, enabling dynamic termination and temporal consistency. Key contributions include: (i) the first integration of action-level chain-of-thought (CoT) reasoning with backward autoregressive generation; and (ii) four synergistic designs: critical-frame anchoring, dynamic truncation, inverse-time integration, and multi-token joint prediction. Evaluated on 60 RLBench simulation tasks and 8 real-world robotic tasks, the approach achieves state-of-the-art performance, significantly improving spatial generalization and goal consistency.
📝 Abstract
We present Chain-of-Action (CoA), a novel visuo-motor policy paradigm built upon Trajectory Autoregressive Modeling. Unlike conventional approaches that predict the next action(s) forward in time, CoA generates an entire trajectory through explicit backward reasoning from task-specific goals via an action-level Chain-of-Thought (CoT) process. This process is unified within a single autoregressive structure: (1) the first token corresponds to a stable keyframe action that encodes the task-specific goals; and (2) subsequent action tokens are generated autoregressively, conditioned on the initial keyframe and previously predicted actions. This backward action reasoning enforces a global-to-local structure, allowing each local action to be tightly constrained by the final goal. To further realize this action reasoning structure, CoA incorporates four complementary designs: a continuous action token representation; dynamic stopping for variable-length trajectory generation; a reverse temporal ensemble; and multi-token prediction to balance action chunk modeling with global structure. As a result, CoA exhibits strong spatial generalization while preserving the flexibility and simplicity of a visuo-motor policy. Empirically, CoA achieves state-of-the-art performance across 60 RLBench tasks and 8 real-world manipulation tasks.
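The backward decoding process described above can be sketched in a few lines. This is a minimal toy illustration, not the paper's implementation: the predictor interfaces (`predict_keyframe`, `predict_prev_action`) and the stopping criterion (distance to the current observation) are hypothetical stand-ins for the learned policy networks and the dynamic-stopping head.

```python
import numpy as np

def chain_of_action_decode(obs, predict_keyframe, predict_prev_action,
                           max_len=50, stop_tol=1e-2):
    """Toy sketch of CoA-style backward decoding (hypothetical interfaces).

    predict_keyframe(obs)                 -> keyframe (goal) action vector
    predict_prev_action(obs, kf, seq)     -> action preceding seq[-1]

    Generation starts from the keyframe token and walks backward in time,
    stopping dynamically once the newest action is close to the current
    state (here approximated by the observation itself). Reversing the
    sequence yields the forward, executable trajectory.
    """
    keyframe = predict_keyframe(obs)
    seq = [keyframe]  # backward sequence: goal first
    for _ in range(max_len - 1):
        a = predict_prev_action(obs, keyframe, seq)
        seq.append(a)
        if np.linalg.norm(a - obs) < stop_tol:  # dynamic stopping
            break
    return seq[::-1]  # reverse time: current state -> goal

# Usage with dummy 1-D predictors that step linearly back toward obs:
obs = np.array([0.0])
goal = np.array([1.0])
traj = chain_of_action_decode(
    obs,
    predict_keyframe=lambda o: goal,
    predict_prev_action=lambda o, kf, s: np.maximum(s[-1] - 0.25, o),
)
```

With these dummy predictors the decoder emits `[1.0, 0.75, 0.5, 0.25, 0.0]` backward and returns the reversed trajectory, whose first element matches the current state and whose last element is the goal keyframe, illustrating the global-to-local constraint.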