Chain-of-Action: Trajectory Autoregressive Modeling for Robotic Manipulation

📅 2025-06-11
🤖 AI Summary
To address the misalignment between local actions and global task objectives in robotic manipulation, this paper proposes a goal-oriented backward trajectory autoregressive modeling paradigm: the policy first predicts a task-critical keyframe action and then generates the rest of the action sequence step by step in reverse. The method represents actions as continuous tokens and models trajectories via autoregressive generation, which supports dynamic termination and temporal consistency. Key contributions include: (i) the first integration of action-level chain-of-thought (CoT) reasoning with backward autoregressive generation; and (ii) four complementary designs: continuous action token representation, dynamic stopping, reverse temporal ensemble, and multi-token prediction. Evaluated on 60 RLBench simulation tasks and 8 real-world robotic tasks, the approach achieves state-of-the-art performance, with improved spatial generalization and goal consistency.

📝 Abstract
We present Chain-of-Action (CoA), a novel visuo-motor policy paradigm built upon Trajectory Autoregressive Modeling. Unlike conventional approaches that predict the next-step action(s) forward, CoA generates an entire trajectory by explicit backward reasoning from task-specific goals through an action-level Chain-of-Thought (CoT) process. This process is unified within a single autoregressive structure: (1) the first token corresponds to a stable keyframe action that encodes the task-specific goals; and (2) subsequent action tokens are generated autoregressively, conditioned on the initial keyframe and previously predicted actions. This backward action reasoning enforces a global-to-local structure, allowing each local action to be tightly constrained by the final goal. To further realize the action reasoning structure, CoA incorporates four complementary designs: continuous action token representation; dynamic stopping for variable-length trajectory generation; reverse temporal ensemble; and multi-token prediction to balance action chunk modeling with global structure. As a result, CoA exhibits strong spatial generalization while preserving the flexibility and simplicity of a visuo-motor policy. Empirically, CoA achieves state-of-the-art performance across 60 RLBench tasks and 8 real-world manipulation tasks.
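The backward generation loop described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: `toy_backward_step` is a hypothetical stand-in for the learned autoregressive decoder, the distance-based stopping rule is only an assumed form of dynamic stopping, and all names and thresholds are illustrative. The sketch shows only the control flow: start from a keyframe (goal) action, generate earlier actions one at a time conditioned on what has been generated so far, stop once the chain reaches the current pose, then reverse the chain for execution.

```python
from typing import Callable, List, Sequence

Action = List[float]


def distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean distance between two action vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


def toy_backward_step(chain: List[Action], current_pose: Sequence[float],
                      step: float = 0.25) -> Action:
    """Hypothetical stand-in for the learned decoder (an assumption).

    Moves from the most recently generated action toward the current pose
    by at most `step`, so the chain walks backward from goal to start.
    """
    last = chain[-1]
    d = distance(last, current_pose)
    if d <= step:
        return list(current_pose)
    scale = step / d
    return [l + (c - l) * scale for l, c in zip(last, current_pose)]


def generate_backward(
    keyframe_action: Sequence[float],
    current_pose: Sequence[float],
    predict_next: Callable[[List[Action], Sequence[float]], Action] = toy_backward_step,
    stop_threshold: float = 1e-3,
    max_len: int = 100,
) -> List[Action]:
    """Generate a goal-to-start action chain autoregressively.

    The first element is the keyframe action; each later element is
    predicted from the chain so far. Generation stops when the newest
    action is within `stop_threshold` of the current pose (assumed
    stopping rule) or when `max_len` actions have been produced.
    """
    chain: List[Action] = [list(keyframe_action)]
    while len(chain) < max_len:
        nxt = predict_next(chain, current_pose)
        chain.append(nxt)
        if distance(nxt, current_pose) < stop_threshold:
            break
    return chain


def to_execution_order(chain: List[Action]) -> List[Action]:
    """Reverse the goal-to-start chain so it can be executed start-to-goal."""
    return chain[::-1]


if __name__ == "__main__":
    chain = generate_backward(keyframe_action=[1.0, 0.0], current_pose=[0.0, 0.0])
    for action in to_execution_order(chain):
        print(action)
```

Because the chain length depends on how far the keyframe is from the current pose, the trajectory length varies per episode, which is the property the abstract attributes to dynamic stopping.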
Problem

Research questions and friction points this paper is trying to address.

Develops backward-reasoning policy for robotic trajectory generation
Enhances spatial generalization in visuo-motor manipulation tasks
Unifies goal-constrained action prediction via autoregressive modeling
Innovation

Methods, ideas, or system contributions that make the work stand out.

Backward reasoning for trajectory generation
Unified autoregressive action token structure
Dynamic stopping and multi-token prediction