DeCo: Task Decomposition and Skill Composition for Zero-Shot Generalization in Long-Horizon 3D Manipulation

📅 2025-05-01
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the weak zero-shot generalization of language-conditioned multi-task imitation learning (IL) models on novel long-horizon 3D manipulation tasks. We propose a generalization framework based on task decomposition and skill composition. Our key contributions are: (1) a physics-interaction-driven atomic task decomposition mechanism, the first of its kind; (2) a vision-language model (VLM)-guided dynamic skill retrieval and spatially-aware skill-chaining scheduler, enabling end-to-end mapping from natural language instructions to executable skill sequences; and (3) a reusable atomic skill library, validated on the DeCoBench simulation benchmark. Experiments demonstrate an average success rate improvement of 48.7% (averaged over three multi-task IL models) across 12 novel long-horizon tasks. On a real robot, training on only six atomic tasks enables zero-shot execution of nine unseen tasks, with a 53.33% average success rate gain. The framework bridges compositional reasoning with embodied skill execution, significantly enhancing generalization in complex 3D manipulation settings.
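The physics-interaction-driven decomposition can be pictured as splitting each demonstration wherever the gripper's interaction with an object changes. The sketch below is a minimal illustration of that idea, assuming a per-frame boolean contact flag; the field names and segmentation criterion are illustrative stand-ins, not DeCo's actual implementation.

```python
# Hypothetical sketch: split a demonstration into atomic sub-trajectories
# at frames where the gripper-object interaction state flips
# (e.g., contact begins or ends). Not DeCo's real data format.

def segment_demo(contact_flags):
    """Return (start, end) frame-index pairs, one per atomic segment,
    cutting wherever the contact flag changes value."""
    segments, start = [], 0
    for i in range(1, len(contact_flags)):
        if contact_flags[i] != contact_flags[i - 1]:
            segments.append((start, i))  # interaction changed: close segment
            start = i
    segments.append((start, len(contact_flags)))  # close the final segment
    return segments

# Example: False = free-space motion, True = gripper in contact with an object
flags = [False, False, True, True, True, False, False]
print(segment_demo(flags))  # [(0, 2), (2, 5), (5, 7)]
```

Each resulting segment would then be labeled as one atomic task (e.g., reach, grasp, retract) and added to the atomic training dataset.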

📝 Abstract
Generalizing language-conditioned multi-task imitation learning (IL) models to novel long-horizon 3D manipulation tasks remains a significant challenge. To address this, we propose DeCo (Task Decomposition and Skill Composition), a model-agnostic framework compatible with various multi-task IL models, designed to enhance their zero-shot generalization to novel, compositional, long-horizon 3D manipulation tasks. DeCo first decomposes IL demonstrations into a set of modular atomic tasks based on the physical interaction between the gripper and objects, and constructs an atomic training dataset that enables models to learn a diverse set of reusable atomic skills during imitation learning. At inference time, DeCo leverages a vision-language model (VLM) to parse high-level instructions for novel long-horizon tasks, retrieve the relevant atomic skills, and dynamically schedule their execution; a spatially-aware skill-chaining module then ensures smooth, collision-free transitions between sequential skills. We evaluate DeCo in simulation using DeCoBench, a benchmark specifically designed to assess zero-shot generalization of multi-task IL models in compositional long-horizon 3D manipulation. Across three representative multi-task IL models (RVT-2, 3DDA, and ARP), DeCo achieves success rate improvements of 66.67%, 21.53%, and 57.92%, respectively, on 12 novel compositional tasks. Moreover, in real-world experiments, a DeCo-enhanced model trained on only 6 atomic tasks successfully completes 9 novel long-horizon tasks, yielding an average success rate improvement of 53.33% over the base multi-task IL model. Video demonstrations are available at: https://deco226.github.io.
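The inference-time flow described above (VLM parses the instruction, relevant atomic skills are retrieved, then scheduled for execution) can be sketched roughly as follows. The VLM call is mocked and the skill-library keys are invented for illustration; DeCo's actual prompting, retrieval, and scheduling logic differ.

```python
# Illustrative sketch of instruction -> skill-sequence mapping, assuming a
# skill library keyed by atomic-task name. `query_vlm` is a stand-in that
# fakes the VLM's decomposition output for this example.

SKILL_LIBRARY = {"pick": "pick_policy", "place": "place_policy", "open": "open_policy"}

def query_vlm(instruction):
    # Stand-in for a real VLM call that parses a long-horizon instruction
    # into an ordered list of atomic sub-goals.
    return ["open", "pick", "place"]

def schedule_skills(instruction):
    """Map a high-level instruction to an executable sequence of skills."""
    plan = query_vlm(instruction)
    missing = [step for step in plan if step not in SKILL_LIBRARY]
    if missing:
        raise ValueError(f"no atomic skill available for: {missing}")
    return [SKILL_LIBRARY[step] for step in plan]

print(schedule_skills("put the apple in the drawer"))
# ['open_policy', 'pick_policy', 'place_policy']
```

In the real system each retrieved skill is a learned policy executed on the robot, with the skill-chaining module handling the hand-off between consecutive policies.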
Problem

Research questions and friction points this paper is trying to address.

Enhancing zero-shot generalization for long-horizon 3D manipulation tasks
Decomposing tasks into modular atomic skills for imitation learning
Dynamically scheduling skills for novel compositional tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

DeCo decomposes demonstrations into modular atomic skills based on gripper-object physical interaction
Uses a VLM to parse high-level instructions and retrieve the relevant atomic skills
A spatially-aware skill-chaining module ensures smooth, collision-free transitions between sequential skills
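The skill-chaining idea can be reduced to a toy form: bridge the gap between the terminal pose of one skill and the initial pose of the next with intermediate waypoints. The linear interpolation below is only a simplified stand-in; DeCo's spatially-aware module additionally reasons about the scene to keep transitions collision-free, which this sketch omits.

```python
# Toy illustration of chaining two skills by interpolating transition
# waypoints between poses (each pose a tuple of coordinates).
# DeCo's actual module is collision-aware; this linear version is not.

def transition_waypoints(end_pose, start_pose, n=3):
    """Return n intermediate waypoints evenly spaced between end_pose
    (where the previous skill finished) and start_pose (where the
    next skill expects to begin)."""
    return [
        tuple(a + (b - a) * t / (n + 1) for a, b in zip(end_pose, start_pose))
        for t in range(1, n + 1)
    ]

print(transition_waypoints((0.0, 0.0, 0.0), (4.0, 0.0, 8.0), n=3))
# [(1.0, 0.0, 2.0), (2.0, 0.0, 4.0), (3.0, 0.0, 6.0)]
```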