🤖 AI Summary
Vision-language models (VLMs) excel at zero-shot task understanding but yield only coarse-grained plans, whereas movement primitives (MPs) generate geometrically precise trajectories yet lack semantic grounding; these complementary limitations bottleneck autonomous manipulation. Method: We propose VL-MP, a fusion framework that bridges this gap via (1) semantic keypoint constraints serving as a low-distortion decision-execution interface for fine-grained task parameterization in ambiguous scenes, and (2) Kernelized Movement Primitives (KMP) enhanced with local trajectory features to preserve the shape of complex trajectories. Contribution/Results: Evaluated in complex real-world settings, VL-MP significantly improves operational adaptability and execution accuracy, and is presented as the first approach to jointly achieve zero-shot task generalization and sub-centimeter trajectory planning precision. The framework demonstrates the feasibility of semantics-driven, fine-grained autonomous manipulation, unifying high-level reasoning with low-level geometric control in a single architecture.
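To make the decision-to-execution flow concrete, below is a minimal sketch of the pipeline shape, assuming a VLM that grounds an instruction into semantic keypoints and a primitive that deforms a demonstrated trajectory through them. Everything here is hypothetical glue code: `query_vlm_keypoints` and `plan_trajectory` are stand-in names (not the authors' API), the VLM call is hard-coded, and the via-point handling is a naive placeholder for the probabilistic KMP update sketched after the abstract.

```python
import numpy as np

def query_vlm_keypoints(image, instruction):
    """Stub for the VLM grounding step. In VL-MP this role is played by a
    vision-language model that maps the scene image and instruction to
    semantic keypoints; here the output is hard-coded for illustration."""
    return {"grasp": np.array([0.45, 0.10, 0.02]),
            "place": np.array([0.30, -0.25, 0.05])}

def plan_trajectory(keypoints, reference_traj):
    """Stub for the execution step: pin a demonstrated reference trajectory
    to the VLM-selected keypoints. A real KMP does this probabilistically
    (see the sketch after the abstract); here the endpoints are simply
    overwritten."""
    traj = reference_traj.copy()
    traj[0] = keypoints["grasp"]    # constrain the start of the motion
    traj[-1] = keypoints["place"]   # constrain the end of the motion
    return traj

# Toy usage: a straight-line demo reference of 50 Cartesian waypoints.
reference = np.linspace([0.40, 0.00, 0.00], [0.35, -0.20, 0.00], num=50)
keypoints = query_vlm_keypoints(image=None, instruction="put the mug on the coaster")
trajectory = plan_trajectory(keypoints, reference)
print(trajectory.shape)  # (50, 3)
```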
📝 Abstract
From early Movement Primitive (MP) techniques to modern Vision-Language Models (VLMs), autonomous manipulation has remained a pivotal topic in robotics. At the two extremes, VLM-based methods emphasize zero-shot and adaptive manipulation but struggle with fine-grained planning, whereas MP-based approaches excel at precise trajectory generalization but lack decision-making ability. To leverage the strengths of both frameworks, we propose VL-MP, which integrates a VLM with Kernelized Movement Primitives (KMP) via a low-distortion decision-information transfer bridge, enabling fine-grained robotic manipulation in ambiguous situations. One key component of VL-MP is the accurate representation of task decision parameters through semantic keypoint constraints, which leads to more precise task parameter generation. Additionally, we introduce a local-trajectory-feature-enhanced KMP to support VL-MP, thereby preserving the shape of complex trajectories. Extensive experiments conducted in complex real-world environments validate the effectiveness of VL-MP for adaptive and fine-grained manipulation.
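For readers unfamiliar with KMP, the sketch below implements the standard KMP mean prediction (after Huang et al.), not the local-feature-enhanced variant proposed in the paper: a probabilistic reference trajectory of means and covariances becomes a kernel regressor, and a desired keypoint can be enforced as a via-point with near-zero covariance. The kernel length-scale `ell` and regularizer `lam` are illustrative values, not the paper's settings.

```python
import numpy as np

def rbf(s1, s2, ell=0.05):
    """Scalar RBF kernel over the phase/time input."""
    return np.exp(-((s1 - s2) ** 2) / (2.0 * ell ** 2))

def kmp_predict(s_ref, mu_ref, sigma_ref, s_query, lam=1.0, ell=0.05):
    """Standard KMP mean prediction: mu(s*) = k*^T (K + lam * Sigma)^-1 mu.

    s_ref:     (N,)      phases of the probabilistic reference trajectory
    mu_ref:    (N, O)    reference means (e.g., from GMR over demonstrations)
    sigma_ref: (N, O, O) reference covariances
    s_query:   (M,)      phases at which to predict
    Returns:   (M, O)    predicted trajectory means
    """
    N, O = mu_ref.shape
    I = np.eye(O)

    # Block Gram matrix K (kernel value times identity per block) and
    # block-diagonal reference covariance Sigma.
    K = np.zeros((N * O, N * O))
    S = np.zeros((N * O, N * O))
    for i in range(N):
        for j in range(N):
            K[i*O:(i+1)*O, j*O:(j+1)*O] = rbf(s_ref[i], s_ref[j], ell) * I
        S[i*O:(i+1)*O, i*O:(i+1)*O] = sigma_ref[i]

    # Solve (K + lam * Sigma) w = mu once; reuse w for every query phase.
    weights = np.linalg.solve(K + lam * S, mu_ref.reshape(N * O))

    preds = np.zeros((len(s_query), O))
    for m, s in enumerate(s_query):
        k_star = np.zeros((N * O, O))
        for i in range(N):
            k_star[i*O:(i+1)*O] = rbf(s, s_ref[i], ell) * I
        preds[m] = k_star.T @ weights
    return preds

# Toy usage: a 2-D reference with small isotropic uncertainty.
s = np.linspace(0.0, 1.0, 20)
mu = np.stack([np.sin(2 * np.pi * s), np.cos(2 * np.pi * s)], axis=1)
sig = np.repeat(np.eye(2)[None] * 1e-2, 20, axis=0)
print(kmp_predict(s, mu, sig, np.array([0.5])))
```

A via-point such as a semantic keypoint is injected by appending its phase, desired mean, and a near-zero covariance to `s_ref`, `mu_ref`, and `sigma_ref` before building the regressor; the small covariance makes the regressor treat that point as a hard constraint, which is the standard KMP adaptation mechanism.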