Bridging VLM and KMP: Enabling Fine-grained Robotic Manipulation via Semantic Keypoints Representation

📅 2025-03-04
📈 Citations: 0
Influential: 0
🤖 AI Summary
Vision-language models (VLMs) excel at zero-shot task understanding but yield coarse-grained plans, whereas movement primitives (MPs) generate geometrically precise trajectories yet lack semantic grounding, creating a complementary bottleneck in autonomous manipulation. Method: The paper proposes VL-MP, a fusion framework that bridges this gap via (1) semantic keypoint representations serving as a low-distortion decision-execution interface for fine-grained task parameterization in ambiguous scenes, and (2) Kernelized Movement Primitives (KMP) enhanced with local trajectory features to preserve the geometric fidelity of complex motions. Contribution/Results: Evaluated in real-world settings, VL-MP significantly improves operational adaptability and execution accuracy. It is presented as the first approach to jointly achieve zero-shot task generalization and sub-centimeter trajectory planning precision. The framework demonstrates the feasibility of semantics-driven, fine-grained autonomous manipulation, unifying high-level reasoning with low-level geometric control in a single architecture.

📝 Abstract
From early Movement Primitive (MP) techniques to modern Vision-Language Models (VLMs), autonomous manipulation has remained a pivotal topic in robotics. As two extremes, VLM-based methods emphasize zero-shot and adaptive manipulation but struggle with fine-grained planning, whereas MP-based approaches excel at precise trajectory generalization but lack decision-making ability. To leverage the strengths of both frameworks, we propose VL-MP, which integrates a VLM with Kernelized Movement Primitives (KMP) via a low-distortion decision-information transfer bridge, enabling fine-grained robotic manipulation in ambiguous situations. One key element of VL-MP is the accurate representation of task decision parameters through semantic keypoint constraints, which leads to more precise task parameter generation. Additionally, we introduce a local-trajectory-feature-enhanced KMP to support VL-MP, thereby achieving shape preservation for complex trajectories. Extensive experiments conducted in complex real-world environments validate the effectiveness of VL-MP for adaptive and fine-grained manipulation.
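The KMP side of the framework builds on the standard kernelized movement primitive formulation, in which a trajectory learned from demonstrations is reproduced by kernel regression and adapted by inserting via-points with small covariance. The following is a minimal 1-D sketch of that idea, assuming an RBF kernel, diagonal covariances, and a sine-shaped reference trajectory as a stand-in for the demonstration model; it is not the paper's local-feature-enhanced variant, only the baseline mechanism it extends.

```python
import numpy as np

def rbf_kernel(a, b, length_scale=0.1):
    """Squared-exponential kernel between two vectors of time stamps."""
    d = a[:, None] - b[None, :]
    return np.exp(-0.5 * (d / length_scale) ** 2)

def kmp_mean(t_ref, mu_ref, var_ref, t_query, lam=1.0, length_scale=0.1):
    """KMP-style mean prediction: mu* = k* (K + lam * Sigma)^-1 mu,
    here for a 1-D output with a diagonal reference covariance."""
    K = rbf_kernel(t_ref, t_ref, length_scale)
    k_star = rbf_kernel(np.atleast_1d(np.asarray(t_query, dtype=float)),
                        t_ref, length_scale)
    alpha = np.linalg.solve(K + lam * np.diag(var_ref), mu_ref)
    return k_star @ alpha

# Reference trajectory: a 1-D sine path with uniform variance, playing
# the role of the demonstration model's output.
t_ref = np.linspace(0.0, 1.0, 20)
mu_ref = np.sin(2 * np.pi * t_ref)
var_ref = np.full_like(t_ref, 1e-2)

# Via-point adaptation: append the new point with a very small variance,
# which pulls the predicted trajectory through it.
t_via, y_via = 0.5, 1.2
t_aug = np.append(t_ref, t_via)
mu_aug = np.append(mu_ref, y_via)
var_aug = np.append(var_ref, 1e-6)

pred_via = kmp_mean(t_aug, mu_aug, var_aug, t_via)
```

Because the via-point's variance is several orders of magnitude smaller than the reference points', the regularized solve is forced to honor it almost exactly while the neighboring reference points yield, which is how KMP reconciles a learned trajectory with new task constraints.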
Problem

Research questions and friction points this paper is trying to address.

How can VLM-based decision-making and KMP-based trajectory generation be integrated for fine-grained robotic manipulation?
How can semantic keypoints enable precise task parameter generation?
How can the shapes of complex trajectories be preserved in ambiguous situations?
Innovation

Methods, ideas, or system contributions that make the work stand out.

Integrates VLM with Kernelized Movement Primitives
Uses semantic keypoints for precise task parameter generation
Introduces local trajectory feature-enhanced KMP
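The summary above does not spell out how detected semantic keypoints become KMP task parameters. One common route, sketched here purely as an illustrative assumption (the function name, camera intrinsics, and keypoint labels below are all hypothetical, not from the paper), is to back-project a detector's 2-D keypoints into 3-D via-points with a pinhole camera model and a depth map:

```python
import numpy as np

def keypoints_to_via_points(keypoints_px, depth, intrinsics):
    """Back-project named 2-D pixel keypoints into 3-D camera-frame
    via-points using a pinhole model. `intrinsics` = (fx, fy, cx, cy);
    `depth` is a dense depth image indexed [row, col]."""
    fx, fy, cx, cy = intrinsics
    via = {}
    for name, (u, v) in keypoints_px.items():
        z = depth[v, u]  # depth at the keypoint's pixel
        via[name] = np.array([(u - cx) * z / fx,
                              (v - cy) * z / fy,
                              z])
    return via

# Hypothetical example: one keypoint on a 640x480 depth image.
intrinsics = (500.0, 500.0, 320.0, 240.0)
depth = np.full((480, 640), 1.0)          # flat scene 1 m away
kps = {"mug_handle": (420, 240)}           # (u, v) pixel coordinates
via = keypoints_to_via_points(kps, depth, intrinsics)
```

Each resulting 3-D point could then be handed to the trajectory generator as a low-variance via-point, which is the kind of low-distortion decision-to-execution hand-off the framework's bridge is meant to provide.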
Junjie Zhu
Shanghai Jiao Tong University
Intrinsically Disordered Proteins · Generative Model · Enhanced Sampling
Huayu Liu
The State Key Laboratory of Fluid Power and Mechatronic Systems, School of Mechanical Engineering, Zhejiang University, Hangzhou 310058, China; Zhejiang Key Laboratory of Industrial Big Data and Robot Intelligent Systems, Zhejiang University, Hangzhou 310058, China; Robotics Research Center of Yuyao City, Ningbo 315400, China
Jin Wang
The State Key Laboratory of Fluid Power and Mechatronic Systems, School of Mechanical Engineering, Zhejiang University, Hangzhou 310058, China; Zhejiang Key Laboratory of Industrial Big Data and Robot Intelligent Systems, Zhejiang University, Hangzhou 310058, China; Robotics Research Center of Yuyao City, Ningbo 315400, China
Bangrong Wen
The State Key Laboratory of Fluid Power and Mechatronic Systems, School of Mechanical Engineering, Zhejiang University, Hangzhou 310058, China; Zhejiang Key Laboratory of Industrial Big Data and Robot Intelligent Systems, Zhejiang University, Hangzhou 310058, China; Robotics Research Center of Yuyao City, Ningbo 315400, China
Kaixiang Huang
Zhejiang University
Computer Vision · Human-Robot Interaction
Xiao-Fei Li
The State Key Laboratory of Fluid Power and Mechatronic Systems, School of Mechanical Engineering, Zhejiang University, Hangzhou 310058, China; Zhejiang Key Laboratory of Industrial Big Data and Robot Intelligent Systems, Zhejiang University, Hangzhou 310058, China; Robotics Research Center of Yuyao City, Ningbo 315400, China
Haiyun Zhan
School of Robotics, Ningbo University of Technology, Ningbo 315211, China
Guodong Lu
The State Key Laboratory of Fluid Power and Mechatronic Systems, School of Mechanical Engineering, Zhejiang University, Hangzhou 310058, China; Zhejiang Key Laboratory of Industrial Big Data and Robot Intelligent Systems, Zhejiang University, Hangzhou 310058, China; Robotics Research Center of Yuyao City, Ningbo 315400, China