🤖 AI Summary
This work addresses the challenge of enabling robots to interpret natural language instructions and generate corresponding actions while remaining data-efficient. The authors propose a modular framework that combines task-parameterized kernelized movement primitives (TP-KMPs) with a pretrained vision-language model (VLM) used without fine-tuning. The VLM parses high-level instructions to select and compose skills, while TP-KMPs generate the parameterized motions. Skills are composed through covariance-weighted fusion, and an active learning mechanism requests human demonstrations when no existing skill or composition suffices. Evaluated on a 7-degree-of-freedom robotic arm, the framework achieves task success rates of 73.3% to 100% across scenarios requiring skill selection, composition, and active learning.
📝 Abstract
Enabling robots to understand and execute tasks from natural language commands while maintaining data efficiency remains challenging. Foundation models such as vision-language-action (VLA) models and vision-language models (VLMs) provide intuitive interaction channels but require extensive data; task-parameterized imitation learning achieves data efficiency but lacks natural language grounding. This work bridges the gap through a modular architecture combining task-parameterized kernelized movement primitives (TP-KMPs) with pretrained VLMs. During learning, skills are acquired from 2 to 5 kinesthetic demonstrations, and the VLM generates skill schemas describing each skill's parameters and preconditions. During execution, the VLM interprets commands to select skills, reason about parameter bindings, and create novel behaviors through covariance-weighted composition. When no skill or composition suffices, the system identifies capability gaps and requests targeted demonstrations, all without fine-tuning. Validation on a 7-DoF manipulator shows success rates of 73.3%-100% in scenarios requiring skill selection, composition, and active learning.
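The abstract does not spell out how covariance-weighted composition is computed. A common reading, assumed here, is a product of Gaussians: each skill predicts a mean and a covariance for the same time step, and the fused mean weights each prediction by its precision (inverse covariance). The sketch below simplifies further by assuming diagonal covariances, so the fusion reduces to a per-dimension weighted average. The function name `fuse_gaussians` and the example values are hypothetical and do not come from the paper.

```python
from typing import List, Sequence, Tuple

# A prediction is (mean, per-dimension variance). Diagonal covariance is an
# assumption made for this sketch, not something stated in the abstract.
Gaussian = Tuple[Sequence[float], Sequence[float]]


def fuse_gaussians(predictions: List[Gaussian]) -> Tuple[List[float], List[float]]:
    """Precision-weighted product of Gaussians with diagonal covariances."""
    if not predictions:
        raise ValueError("need at least one prediction")
    dim = len(predictions[0][0])
    fused_mean: List[float] = []
    fused_var: List[float] = []
    for d in range(dim):
        # Precision (inverse variance) is the weight given to each skill.
        precisions = [1.0 / var[d] for _, var in predictions]
        total_precision = sum(precisions)
        weighted_sum = sum(
            p * mean[d] for p, (mean, _) in zip(precisions, predictions)
        )
        fused_mean.append(weighted_sum / total_precision)
        fused_var.append(1.0 / total_precision)
    return fused_mean, fused_var


# Hypothetical predictions of two skills for one time step (x, y).
skill_a = ([0.0, 0.0], [0.01, 1.0])  # confident in x, uncertain in y
skill_b = ([1.0, 1.0], [1.0, 0.01])  # uncertain in x, confident in y
mean, var = fuse_gaussians([skill_a, skill_b])
```

In this example the fused mean follows `skill_a` along x and `skill_b` along y, because each skill dominates the dimension in which its variance is small. The fused variance is never larger than the smallest input variance in any dimension. With full covariance matrices the same idea applies, using matrix inverses in place of the scalar reciprocals.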