CLASP: Language-Driven Robot Skill Selection and Composition using Task-Parameterized Learning

📅 2026-06-06
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the challenge of enabling robots to interpret natural language instructions and generate corresponding actions while staying data-efficient and robustly grounded in language. The authors propose a modular framework that combines task-parameterized kernelized movement primitives (TP-KMPs) with a pretrained vision-language model (VLM) used without fine-tuning. The VLM parses high-level instructions to select and compose skills, while TP-KMPs generate parameterized motor commands. The system composes skills through covariance-weighted fusion and includes an active learning mechanism that requests human demonstrations when no existing skill or composition suffices. Evaluated on a 7-degree-of-freedom robotic arm, the framework achieves task success rates ranging from 73.3% to 100% across scenarios requiring skill selection, composition, and active learning.
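The summary mentions covariance-weighted fusion without giving a formula. A common way to realize this idea is the product of Gaussians, where each skill's predicted mean is weighted by its inverse covariance, so the more confident skill dominates in each dimension. The sketch below assumes that standard rule; it may differ from the paper's exact formulation, and the function name `fuse_gaussians` and the toy numbers are illustrative only.

```python
import numpy as np

def fuse_gaussians(means, covs):
    """Precision-weighted fusion of Gaussian predictions (product of Gaussians).

    Assumption: this standard rule stands in for the paper's
    covariance-weighted composition, which the summary does not spell out.
    """
    precisions = [np.linalg.inv(c) for c in covs]
    fused_cov = np.linalg.inv(sum(precisions))
    fused_mean = fused_cov @ sum(p @ m for p, m in zip(precisions, means))
    return fused_mean, fused_cov

# Hypothetical predictions from two skills at one time step.
# Skill A is confident along x, skill B is confident along y.
mu_a, cov_a = np.array([0.0, 0.0]), np.diag([0.01, 1.0])
mu_b, cov_b = np.array([1.0, 1.0]), np.diag([1.0, 0.01])

mean, cov = fuse_gaussians([mu_a, mu_b], [cov_a, cov_b])
print(mean)  # x stays near skill A's value, y near skill B's value
```

With these inputs the fused point follows skill A along x and skill B along y, which is the behavior that makes covariance weighting useful for blending skills that each constrain a different part of the motion.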
📝 Abstract
Enabling robots to understand and execute tasks from natural language commands while maintaining data efficiency remains challenging. Foundation models such as vision-language-action (VLA) models and vision-language models (VLMs) provide intuitive interaction channels but require extensive data; task-parameterized imitation learning achieves data efficiency but lacks natural language grounding. This work bridges this gap through a modular architecture combining task-parameterized kernelized movement primitives (TP-KMPs) with pretrained VLMs. During learning, skills are acquired from 2 to 5 kinesthetic demonstrations, and the VLM generates skill schemas describing each skill's parameters and preconditions. During execution, the VLM interprets commands to select skills, reason about parameter bindings, and create novel behaviors through covariance-weighted composition. When no skill or composition suffices, the system identifies capability gaps and requests targeted demonstrations, all without fine-tuning. Validation on a 7-DoF manipulator shows success rates of 73.3%-100% in scenarios requiring skill selection, composition, and active learning.
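The abstract describes skill schemas that record each skill's parameters and preconditions, and a fallback that requests a demonstration when no skill fits. The sketch below illustrates one way such a schema and check could look. Everything in it is hypothetical: the `SkillSchema` fields, the `plan` function, and the returned action labels are assumptions for illustration, and the VLM call that would propose the skill name and bindings is left out.

```python
from dataclasses import dataclass, field

@dataclass
class SkillSchema:
    """Hypothetical schema: the paper's actual schema format is not given."""
    name: str
    parameters: list[str]
    preconditions: list[str] = field(default_factory=list)

def plan(skill_name, bindings, scene_facts, library):
    """Check a proposed skill (e.g. suggested by a VLM) against the library."""
    schema = library.get(skill_name)
    if schema is None:
        # Capability gap: no learned skill matches, so ask for a demonstration.
        return {"action": "request_demonstration", "skill": skill_name}
    missing = [p for p in schema.parameters if p not in bindings]
    unmet = [c for c in schema.preconditions if c not in scene_facts]
    if missing or unmet:
        # The skill exists but cannot run yet in the current scene.
        return {"action": "blocked", "missing": missing, "unmet": unmet}
    return {"action": "execute", "skill": skill_name, "bindings": bindings}

# Hypothetical library with a single learned skill.
library = {"pick": SkillSchema("pick", ["object"], ["gripper_empty"])}

print(plan("pick", {"object": "cup"}, {"gripper_empty"}, library))
print(plan("pour", {}, set(), library))
```

Keeping the schema as plain data means the language model only has to name a skill and bind its parameters, while the check itself stays deterministic and easy to inspect.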
Problem

Research questions and friction points this paper is trying to address.

natural language grounding
data efficiency
robot skill composition
task-parameterized learning
foundation models
Innovation

Methods, ideas, or system contributions that make the work stand out.

task-parameterized learning
vision-language models
movement primitives
skill composition
data-efficient imitation learning