🤖 AI Summary
Existing vision-language-action (VLA) models are prone to error accumulation in long-horizon tasks because they lack explicit intermediate planning. This work proposes Coarse-to-Control, a two-stage generative architecture that integrates planning and execution natively within a discrete action token space: it first predicts a compact, coarse-grained action token sequence as a high-level trajectory sketch, then conditions on this sketch to generate fine-grained, executable actions. Because planning and control share a unified discrete action vocabulary, the plan stays close to the control manifold and yields directly actionable guidance. Experiments show that the model consistently outperforms direct action generation on LIBERO, SimplerEnv-WidowX, and real-robot tasks, with the largest gains in multi-stage, long-horizon scenarios.
📝 Abstract
Most vision-language-action (VLA) models map observations directly to actions without explicit intermediate planning, which limits performance on long-horizon tasks where early mistakes compound. We propose Coarse-to-Control, a plan-execute VLA that introduces planning natively in the action-token space. The key idea is to let the policy first predict a compact sequence of coarse action tokens that summarize the intended future trajectory, and then generate executable action tokens conditioned on this plan. Because both planning and execution share a unified discrete action vocabulary, the plan stays close to the control manifold and provides directly actionable guidance rather than an abstract hint that must be translated back to motor commands. Experiments on LIBERO, SimplerEnv-WidowX, and real-world manipulation tasks show that action-token planning consistently improves over direct action generation, with the largest gains on long-horizon multi-stage tasks.
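The plan-then-execute decoding described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the vocabulary size, the `BOP`/`BOA` marker tokens, the function names, and the toy `toy_next_token` policy are all assumptions introduced here for illustration. The sketch only shows the control flow, in which coarse plan tokens and executable action tokens are drawn from one shared vocabulary and the action stage is conditioned on the predicted plan.

```python
from typing import Callable, List, Sequence, Tuple

# All names and sizes below are hypothetical, chosen only for illustration.
VOCAB_SIZE = 256        # shared discrete action vocabulary (token ids 0..255)
BOP = VOCAB_SIZE        # assumed marker token: beginning of the coarse plan
BOA = VOCAB_SIZE + 1    # assumed marker token: beginning of executable actions

# A policy is modeled as a function from a token prefix to the next token id.
NextToken = Callable[[Sequence[int]], int]


def decode(next_token: NextToken, prefix: Sequence[int], n: int) -> List[int]:
    """Greedily decode n tokens autoregressively after the given prefix."""
    seq = list(prefix)
    for _ in range(n):
        seq.append(next_token(seq))
    return seq[len(prefix):]


def plan_then_act(
    next_token: NextToken,
    obs_tokens: Sequence[int],
    plan_len: int,
    action_len: int,
) -> Tuple[List[int], List[int]]:
    """Two-stage decoding: coarse plan tokens first, then executable actions."""
    # Stage 1: predict a compact sequence of coarse action tokens.
    plan_prefix = list(obs_tokens) + [BOP]
    plan = decode(next_token, plan_prefix, plan_len)
    # Stage 2: generate executable action tokens conditioned on the plan.
    action_prefix = plan_prefix + plan + [BOA]
    actions = decode(next_token, action_prefix, action_len)
    return plan, actions


def toy_next_token(seq: Sequence[int]) -> int:
    """Stand-in for a trained VLA: a deterministic function of the prefix."""
    return (sum(seq) + len(seq)) % VOCAB_SIZE


if __name__ == "__main__":
    obs = [3, 7, 11]  # placeholder observation/instruction tokens
    plan, actions = plan_then_act(toy_next_token, obs, plan_len=4, action_len=8)
    print("coarse plan tokens:", plan)
    print("executable action tokens:", actions)
```

In a real system `toy_next_token` would be replaced by the trained policy's next-token prediction. The point of the sketch is that both stages emit ids from the same action vocabulary, so the plan needs no separate translation step before it can guide the executable actions.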