Learning Diffusion Policy from Primitive Skills for Robot Manipulation

📅 2026-01-05
🏛️ arXiv.org
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing diffusion-based policies often generate actions misaligned with high-level task intentions. To address this, this work proposes Skill-conditioned Diffusion Policy (SDP), which decomposes complex manipulation tasks into reusable, fine-grained primitive skills—such as “move up” or “open gripper”—and leverages a vision-language model to extract discrete representations of both environmental states and language instructions. A lightweight routing network dynamically selects the most relevant skill at each step, guiding a single-skill diffusion policy to produce aligned actions. This approach establishes a novel, skill-consistent action generation paradigm by integrating interpretable primitives with diffusion models for the first time, significantly enhancing cross-task generalization. Experiments demonstrate that SDP outperforms state-of-the-art methods across two simulation benchmarks and real-world robotic platforms, confirming its effectiveness and robustness.

📝 Abstract
Diffusion policies (DP) have recently shown great promise for generating actions in robotic manipulation. However, existing approaches often rely on global instructions to produce short-term control signals, which can result in misalignment in action generation. We conjecture that primitive skills, referred to as fine-grained, short-horizon manipulations such as "move up" and "open the gripper", provide a more intuitive and effective interface for robot learning. To bridge this gap, we propose SDP, a skill-conditioned DP that integrates interpretable skill learning with conditional action planning. SDP abstracts eight reusable primitive skills across tasks and employs a vision-language model to extract discrete representations from visual observations and language instructions. Based on these representations, a lightweight router network is designed to assign a desired primitive skill to each state, which helps construct a single-skill policy to generate skill-aligned actions. By decomposing complex tasks into a sequence of primitive skills and selecting a single-skill policy, SDP ensures skill-consistent behavior across diverse tasks. Extensive experiments on two challenging simulation benchmarks and real-world robot deployments demonstrate that SDP consistently outperforms SOTA methods, providing a new paradigm for skill-based robot learning with diffusion policies.
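The routing step described above (a lightweight network assigns one of eight primitive skills to each state, then dispatches to that skill's policy) can be sketched as follows. The paper does not publish code, so all names here (`SKILLS`, `route_skill`, the similarity-based router, the stub policies) are illustrative assumptions, not the authors' implementation; a minimal router is modeled as a cosine-similarity lookup over discrete skill representations.

```python
import numpy as np

# Hypothetical skill vocabulary; the paper abstracts eight reusable
# primitives, of which "move up" and "open gripper" are named examples.
SKILLS = ["move_up", "move_down", "move_left", "move_right",
          "move_forward", "move_backward", "open_gripper", "close_gripper"]

def route_skill(state_embedding: np.ndarray, skill_embeddings: np.ndarray) -> int:
    """Pick the primitive skill whose discrete representation best matches
    the current state (here: highest cosine similarity)."""
    s = state_embedding / np.linalg.norm(state_embedding)
    k = skill_embeddings / np.linalg.norm(skill_embeddings, axis=1, keepdims=True)
    return int(np.argmax(k @ s))

def act(state_embedding, skill_embeddings, skill_policies):
    """Dispatch to the single-skill policy chosen by the router."""
    idx = route_skill(state_embedding, skill_embeddings)
    name = SKILLS[idx]
    return name, skill_policies[name](state_embedding)

# Toy usage: each "policy" is a stub returning a fixed one-hot action;
# in SDP each would be a skill-conditioned diffusion policy.
rng = np.random.default_rng(0)
skill_embeddings = rng.normal(size=(len(SKILLS), 16))
policies = {name: (lambda s, i=i: np.eye(len(SKILLS))[i])
            for i, name in enumerate(SKILLS)}
state = skill_embeddings[6] + 0.01 * rng.normal(size=16)  # near "open_gripper"
name, action = act(state, skill_embeddings, policies)
```

The design point this illustrates is the separation of concerns: the router resolves *which* short-horizon skill applies, and only the selected single-skill policy generates low-level actions, keeping them aligned with the chosen primitive.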
Problem

Research questions and friction points this paper is trying to address.

diffusion policy
robot manipulation
primitive skills
action generation
skill alignment
Innovation

Methods, ideas, or system contributions that make the work stand out.

diffusion policy
primitive skills
skill-conditioned control
vision-language model
robot manipulation
Zhihao Gu
Department of Computer Science, School of Computing and Data Science, The University of Hong Kong
Ming Yang
School of Software, Beihang University
Difan Zou
The University of Hong Kong
Machine Learning · Deep Learning · Optimization · Stochastic Algorithms · Signal Processing
Dong Xu
Master of Computer Science, Fudan University
Long Context Model · RAG · Hallucination