🤖 AI Summary
Existing diffusion-based policies often generate actions misaligned with high-level task intentions. To address this, the authors propose the Skill-conditioned Diffusion Policy (SDP), which decomposes complex manipulation tasks into reusable, fine-grained primitive skills—such as “move up” or “open gripper”—and leverages a vision-language model to extract discrete representations of both environmental states and language instructions. A lightweight routing network dynamically selects the most relevant skill at each step, guiding a single-skill diffusion policy to produce aligned actions. By integrating interpretable primitives with diffusion models, this approach establishes a skill-consistent action generation paradigm and improves cross-task generalization. Experiments demonstrate that SDP outperforms state-of-the-art methods across two simulation benchmarks and real-world robotic platforms, confirming its effectiveness and robustness.
📝 Abstract
Diffusion policies (DP) have recently shown great promise for generating actions in robotic manipulation. However, existing approaches often rely on global instructions to produce short-term control signals, which can result in misalignment between high-level task intentions and the generated actions. We conjecture that primitive skills, i.e., fine-grained, short-horizon manipulations such as ``move up'' and ``open the gripper'', provide a more intuitive and effective interface for robot learning. To bridge this gap, we propose SDP, a skill-conditioned DP that integrates interpretable skill learning with conditional action planning. SDP abstracts eight reusable primitive skills across tasks and employs a vision-language model to extract discrete representations from visual observations and language instructions. On top of these representations, a lightweight router network assigns a desired primitive skill to each state, which in turn selects a single-skill policy to generate skill-aligned actions. By decomposing complex tasks into sequences of primitive skills and routing each step to a single-skill policy, SDP ensures skill-consistent behavior across diverse tasks. Extensive experiments on two challenging simulation benchmarks and real-world robot deployments demonstrate that SDP consistently outperforms state-of-the-art methods, offering a new paradigm for skill-based robot learning with diffusion policies.
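The routing stage described above (discrete embedding → lightweight router → per-skill policy) can be sketched as follows. This is a minimal illustration, not the paper's code: the embedding dimension, action dimension, and the linear per-skill "policies" standing in for the diffusion denoising process are all assumptions for demonstration; only the count of eight primitive skills comes from the abstract.

```python
# Hypothetical sketch of SDP's skill-routing step. Shapes, names, and the
# linear stand-ins for the VLM encoder and diffusion policy are assumptions.
import math
import random

N_SKILLS = 8   # the paper abstracts eight reusable primitive skills
EMB_DIM = 16   # assumed size of the discrete state/instruction embedding
ACT_DIM = 7    # assumed action size (e.g. 6-DoF pose delta + gripper)

random.seed(0)

def rand_matrix(rows, cols):
    return [[random.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]

def linmap(z, W):
    # z: length-m vector, W: m x n matrix -> length-n vector (z @ W)
    return [sum(z[i] * W[i][k] for i in range(len(z))) for k in range(len(W[0]))]

def softmax(x):
    mx = max(x)
    exps = [math.exp(v - mx) for v in x]
    s = sum(exps)
    return [e / s for e in exps]

# Lightweight router: one linear layer + softmax over the primitive skills.
W_route = rand_matrix(EMB_DIM, N_SKILLS)

# One stand-in "policy" per skill; the actual SDP runs a single-skill
# diffusion denoising process conditioned on the selected skill instead.
policies = [rand_matrix(EMB_DIM, ACT_DIM) for _ in range(N_SKILLS)]

def generate_action(z):
    probs = softmax(linmap(z, W_route))
    skill = probs.index(max(probs))      # most relevant primitive skill
    action = linmap(z, policies[skill])  # skill-aligned action for this step
    return skill, probs, action

# Stand-in for the VLM's fused observation/instruction embedding.
z = [random.gauss(0.0, 1.0) for _ in range(EMB_DIM)]
skill, probs, action = generate_action(z)
```

At deployment, this selection would repeat at every control step, so a long-horizon task unfolds as a sequence of primitive-skill invocations rather than one monolithic policy rollout.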