KeyMPs: One-Shot Vision-Language Guided Motion Generation by Sequencing DMPs for Occlusion-Rich Tasks

📅 2025-04-14
📈 Citations: 0
Influential: 0
🤖 AI Summary
In scenarios with frequent occlusions (e.g., tool occlusion during cake decoration, hand occlusion during dough kneading), robots struggle to generate complex, temporally coherent, and intention-consistent motion sequences from single vision-language instructions. Method: This paper proposes a VLM-DMP fusion framework for sequential motion generation. It introduces two novel mechanisms: (1) a keyword-driven primitive retrieval module that maps high-level semantic instructions to Dynamic Movement Primitive (DMP) templates, and (2) a keypoint-pair-guided spatial parameter generalization module that enables robust adaptation of DMP parameters to varying object geometries and poses. The framework further supports dynamic composition of DMP sequences to represent multi-step actions. Results: Evaluated on simulated and real-world object cutting tasks, the method significantly outperforms existing VLM-augmented DMP approaches, demonstrating superior robustness under multiple occlusions and validating its capability to maintain intention consistency and motion generalization under high perceptual uncertainty.

📝 Abstract
Dynamic Movement Primitives (DMPs) provide a flexible framework in which smooth robotic motions are encoded into modular parameters. However, they face challenges in integrating the multimodal inputs commonly used in robotics, such as vision and language. To fully realize DMPs' potential, enabling them to handle multimodal inputs is essential. We also aim to extend DMPs' capability to object-focused tasks that require one-shot complex motion generation, since observation occlusion can easily occur mid-execution in such tasks (e.g., knife occlusion in cake icing, hand occlusion in dough kneading). A promising approach is to leverage Vision-Language Models (VLMs), which process multimodal data and can grasp high-level concepts. However, VLMs typically lack the knowledge and capability to directly infer low-level motion details; they instead serve as a bridge between high-level instructions and low-level control. To address this limitation, we propose Keyword Labeled Primitive Selection and Keypoint Pairs Generation Guided Movement Primitives (KeyMPs), a framework that combines VLMs with the sequencing of DMPs. KeyMPs uses a VLM's high-level reasoning to select a reference primitive (keyword labeled primitive selection) and its spatial awareness to generate the spatial scaling parameters used for sequencing DMPs (keypoint pairs generation), generalizing the overall motion. Together, these mechanisms enable one-shot vision-language guided motion generation that aligns with the intent expressed in the multimodal input. We validate our approach on an occlusion-rich manipulation task, object cutting, in both simulated and real-world environments, demonstrating superior performance over other DMP-based methods that integrate VLM support.
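As background on the DMP machinery the abstract refers to, the sketch below integrates a generic single-degree-of-freedom discrete DMP: a critically damped attractor toward goal `g`, perturbed by a learned forcing term that is spatially scaled by `(g - y0)`. This is a textbook formulation, not the paper's implementation; all parameter values and the basis-width heuristic are illustrative assumptions.

```python
import numpy as np

def rollout_dmp(w, y0, g, n_steps=200, dt=0.005,
                alpha_z=25.0, beta_z=6.25, alpha_x=3.0):
    """Euler-integrate a 1-D discrete DMP.

    w      : weights of the Gaussian basis functions (learned from a demo)
    y0, g  : start and goal positions; (g - y0) spatially scales the
             forcing term, which is how a DMP generalizes a motion shape
    """
    n_bfs = len(w)
    # Basis centers spread along the phase variable x in (0, 1]
    c = np.exp(-alpha_x * np.linspace(0.0, 1.0, n_bfs))
    h = n_bfs ** 1.5 / c  # widths; a common heuristic choice
    x, y, z = 1.0, y0, 0.0
    traj = []
    for _ in range(n_steps):
        psi = np.exp(-h * (x - c) ** 2)
        f = (psi @ w) * x / psi.sum() * (g - y0)   # forcing term
        dz = alpha_z * (beta_z * (g - y) - z) + f  # transformation system
        z += dz * dt
        y += z * dt
        x += -alpha_x * x * dt                     # canonical system decays the phase
        traj.append(y)
    return np.array(traj)
```

With zero weights the forcing term vanishes and the trajectory simply converges to the goal; nonzero weights reshape the path while keeping the same start and goal, which is the modularity the abstract describes.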
Problem

Research questions and friction points this paper is trying to address.

Integrate vision and language inputs into DMPs for robotics
Enable one-shot complex motion generation for occlusion-rich tasks
Combine VLMs with DMP sequencing for intent-aligned motion
Innovation

Methods, ideas, or system contributions that make the work stand out.

Combine VLMs with DMP sequencing
Keyword labeled primitive selection
Keypoint pairs guide motion generation
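The sequencing idea in the bullets above can be sketched as follows: each VLM-generated keypoint pair supplies the start and goal of one DMP segment, and the segments are chained into a full trajectory. The keypoint coordinates, the cutting-task framing, and the `linear_rollout` stand-in (used in place of a real spatially scaled DMP rollout) are all hypothetical illustrations, not values from the paper.

```python
import numpy as np

# Hypothetical keypoint pairs a VLM might output for a cutting task:
# each pair is (start_xy, goal_xy) in the workspace frame, and each
# pair parameterizes one DMP segment in the sequence.
keypoint_pairs = [
    (np.array([0.10, 0.30]), np.array([0.10, 0.05])),  # first cut
    (np.array([0.20, 0.30]), np.array([0.20, 0.05])),  # second cut
]

def sequence_from_pairs(pairs, rollout):
    """Chain one rollout per keypoint pair into a full trajectory.

    `rollout(y0, g)` is any function returning an (n, 2) segment from
    start y0 to goal g -- e.g. a spatially scaled reference DMP.
    """
    segments = [rollout(y0, g) for (y0, g) in pairs]
    return np.vstack(segments)

# Trivial stand-in rollout: straight-line interpolation per segment.
def linear_rollout(y0, g, n=50):
    s = np.linspace(0.0, 1.0, n)[:, None]
    return y0 + s * (g - y0)

traj = sequence_from_pairs(keypoint_pairs, linear_rollout)
```

Swapping `linear_rollout` for a DMP rollout keeps the sequencing logic unchanged, which is the point of separating primitive selection from spatial parameterization.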
Edgar Anarossi
Division of Information Science, Graduate School of Science and Technology, Nara Institute of Science and Technology, Japan
Yuhwan Kwon
Division of Information Science, Graduate School of Science and Technology, Nara Institute of Science and Technology, Japan
Hirotaka Tahara
Division of Information Science, Graduate School of Science and Technology, Nara Institute of Science and Technology, Japan
Shohei Tanaka
OMRON SINIC X
Natural Language Processing, Dialogue Systems, Multimodal Interaction, Robotics
Keisuke Shirai
AIST
Natural Language Processing, Robotics
Masashi Hamaya
OMRON SINIC X Corp.
Robot Learning, Soft Robotics, Robotics
C. C. Beltran-Hernandez
OMRON SINIC X Corporation, Tokyo, Japan
Atsushi Hashimoto
OMRON SINIC X Corporation, Tokyo, Japan
Takamitsu Matsubara
Nara Institute of Science and Technology
Robot Learning, Machine Learning, Reinforcement Learning, Robotics