🤖 AI Summary
Addressing the challenges of multi-stage skill coordination and poor generalization of action sequences in long-horizon dexterous robotic manipulation (e.g., pick-and-place and tight packing), this paper proposes a Multi-Head Skill Transformer architecture. Methodologically, it introduces three key innovations: (1) a novel skill-level “progress value” mechanism enabling interpretable skill selection and smooth transitions; (2) dynamic skill expansion capability and adaptive subtask sequencing; and (3) integration of motion primitive learning with progress-guided skill execution, coupled with a simulation-to-reality transfer training strategy. Evaluated on both simulated and real robotic platforms, the approach achieves significant improvements in task success rate, supports longer skill chains, and demonstrates superior cross-task generalization—outperforming state-of-the-art methods across all metrics.
📝 Abstract
Robot picking and packing tasks require dexterous manipulation skills, such as rearranging objects to establish a good grasping pose, or placing and pushing items to achieve tight packing. These tasks are challenging for robots due to the complexity and variability of the required actions. To tackle the difficulty of learning and executing long-horizon tasks, we propose a novel framework called the Multi-Head Skill Transformer (MuST). This model is designed to learn and sequentially chain together multiple motion primitives (skills), enabling robots to perform complex sequences of actions effectively. MuST introduces a "progress value" for each skill, guiding the robot on which skill to execute next and ensuring smooth transitions between skills. Additionally, our model is capable of expanding its skill set and managing various sequences of sub-tasks efficiently. Extensive experiments in both simulated and real-world environments demonstrate that MuST significantly enhances the robot's ability to perform long-horizon dexterous manipulation tasks.
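To make the progress-value idea concrete, here is a minimal illustrative sketch (not the paper's implementation; all names and the switching rule are assumptions) of how per-skill progress values could drive skill chaining: each head predicts a scalar progress value in [0, 1], and the executor hands off to the next skill once the active skill's progress saturates.

```python
import numpy as np

# Hypothetical sketch: each transformer head emits a progress value in [0, 1]
# for its skill. The executor keeps running the active skill until its
# progress exceeds a completion threshold, then switches to the unfinished
# skill with the highest progress value. This switching rule is an
# illustrative assumption, not the method described in the paper.

def select_skill(progress, active, done_thresh=0.95):
    """Return the index of the skill to execute next, or None if all done.

    progress: array of per-skill progress values in [0, 1]
    active:   index of the currently executing skill
    """
    if progress[active] < done_thresh:
        return active  # current skill is still in progress; keep executing it
    remaining = [i for i in range(len(progress)) if progress[i] < done_thresh]
    if not remaining:
        return None  # every skill in the chain has completed
    # hand off to the unfinished skill whose progress value is highest
    return max(remaining, key=lambda i: progress[i])

# Example: skill 0 (e.g. re-grasp) has finished, so execution transitions
# to skill 1 (e.g. place), which has the highest remaining progress value.
progress = np.array([0.97, 0.40, 0.05])
print(select_skill(progress, active=0))  # -> 1
```

In this toy form, the progress value doubles as both a termination signal for the active skill and a soft priority for choosing the next one, which is one plausible way to obtain the interpretable selection and smooth transitions the abstract describes.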