🤖 AI Summary
This work addresses a challenge in robotic manipulation: coarse positioning and fine contact interaction are tightly coupled within a single policy, which hinders effective policy learning. To overcome this, the authors propose explicitly decoupling manipulation tasks into two distinct phases: “move” (coarse relocation) and “operate” (contact-critical interaction). This structured inductive bias is realized as a dual-expert policy architecture paired with a learnable phase selector. Multimodal large language models (MLLMs) automatically generate human-aligned phase labels for training. Evaluated on the RoboTwin2 benchmark, the method achieves an average success rate of 68.9%, outperforming the monolithic $\pi_0$ baseline by 24%, and reaches peak performance in 40% fewer training steps. It also matches or exceeds models trained on ten times more data, indicating substantially improved sample efficiency.
📝 Abstract
We present Move-Then-Operate, a vision-language-action (VLA) framework that explicitly decouples robotic manipulation into two distinct behavioral phases: coarse relocation (move) and contact-critical interaction (operate). Unlike monolithic policies that conflate these heterogeneous regimes, our architecture employs a dual-expert policy routed by a learnable phase selector, introducing a structural inductive bias that isolates phase-specific dynamics. Phase labels are generated automatically by an MLLM-based pipeline conditioned on lightweight contextual cues, such as end-effector velocity and subtask decomposition, to ensure alignment with human motor patterns. Evaluated on the RoboTwin2 benchmark, our method achieves an average success rate of $68.9\%$, outperforming the monolithic $\pi_0$ baseline by $24\%$. It matches or exceeds models trained on $10\times$ more data and reaches peak performance in $40\%$ fewer training steps, demonstrating that architectural disentanglement of the move and operate phases is a highly effective and efficient strategy for mastering high-precision manipulation.
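To make the routing idea concrete, the following is a minimal sketch of a dual-expert policy with a learnable phase selector. It is not the authors' implementation: the names `PhaseSelector`, `DualExpertPolicy`, and `selector_loss`, the two-element feature vector (distance to target, end-effector speed), the logistic gate, and the toy experts are all assumptions made for illustration. The abstract does not specify how the selector is parameterized or trained, so the binary cross-entropy against phase labels shown here is only one plausible choice.

```python
import math
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple

Vector = List[float]


@dataclass
class PhaseSelector:
    """Hypothetical logistic gate estimating P(phase = operate | features)."""
    weights: Vector
    bias: float = 0.0

    def prob_operate(self, features: Sequence[float]) -> float:
        logit = sum(w * f for w, f in zip(self.weights, features)) + self.bias
        return 1.0 / (1.0 + math.exp(-logit))


@dataclass
class DualExpertPolicy:
    """Routes each observation to one of two experts (hard routing, assumed)."""
    selector: PhaseSelector
    move_expert: Callable[[Sequence[float]], Vector]
    operate_expert: Callable[[Sequence[float]], Vector]
    threshold: float = 0.5

    def act(self, features: Sequence[float]) -> Tuple[str, Vector]:
        if self.selector.prob_operate(features) >= self.threshold:
            return "operate", self.operate_expert(features)
        return "move", self.move_expert(features)


def selector_loss(selector: PhaseSelector,
                  features: Sequence[float],
                  label: int) -> float:
    """Binary cross-entropy against a phase label (1 = operate, 0 = move)."""
    p = min(max(selector.prob_operate(features), 1e-7), 1.0 - 1e-7)
    return -(label * math.log(p) + (1 - label) * math.log(1.0 - p))


# Toy setup. Features are [distance_to_target, end_effector_speed];
# the weights are hand-picked for illustration, not learned.
selector = PhaseSelector(weights=[-10.0, -5.0], bias=2.0)
policy = DualExpertPolicy(
    selector=selector,
    move_expert=lambda f: [0.5 * f[0]],      # large step for coarse relocation
    operate_expert=lambda f: [0.05 * f[0]],  # small step for contact-critical motion
)

far_and_fast = [0.5, 0.3]
near_and_slow = [0.02, 0.01]
print(policy.act(far_and_fast)[0])
print(policy.act(near_and_slow)[0])
```

In this sketch the selector sends far, fast states to the move expert and near, slow states to the operate expert, so each expert only has to model one regime. A soft mixture of the two expert outputs weighted by the selector probability would be an equally valid reading of "routed"; hard routing is used here only because it is the simpler case to show.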