🤖 AI Summary
Vision-language models (VLMs) excel at zero-shot task understanding but yield only coarse-grained plans, whereas movement primitives (MPs) generate geometrically precise trajectories yet lack semantic grounding; these complementary limitations bottleneck autonomous manipulation. Method: We propose VL-MP, a fusion framework that bridges this gap via (1) semantic keypoint constraints serving as a low-distortion decision-execution interface for fine-grained task parameterization in ambiguous scenes, and (2) Kernelized Movement Primitives (KMP) enhanced with local trajectory features to preserve the shape of complex trajectories. Contribution/Results: Evaluated in complex real-world settings, VL-MP significantly improves operational adaptability and execution accuracy, and is presented as the first approach to jointly achieve zero-shot task generalization and sub-centimeter trajectory planning precision. The framework demonstrates the feasibility of semantics-driven, fine-grained autonomous manipulation, unifying high-level reasoning with low-level geometric control in a single architecture.
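To make the decision-to-execution flow concrete, below is a minimal sketch of the pipeline shape, assuming a VLM that grounds an instruction into semantic keypoints and a primitive that deforms a demonstrated trajectory through them. Everything here is hypothetical glue code: `query_vlm_keypoints` and `plan_trajectory` are stand-in names (not the authors' API), the VLM call is hard-coded, and the via-point handling is a naive placeholder for the probabilistic KMP update sketched after the abstract.

```python
import numpy as np

def query_vlm_keypoints(image, instruction):
    """Stub for the VLM grounding step. In VL-MP this role is played by a
    vision-language model that maps the scene image and instruction to
    semantic keypoints; here the output is hard-coded for illustration."""
    return {"grasp": np.array([0.45, 0.10, 0.02]),
            "place": np.array([0.30, -0.25, 0.05])}

def plan_trajectory(keypoints, reference_traj):
    """Stub for the execution step: pin a demonstrated reference trajectory
    to the VLM-selected keypoints. A real KMP does this probabilistically
    (see the sketch after the abstract); here the endpoints are simply
    overwritten."""
    traj = reference_traj.copy()
    traj[0] = keypoints["grasp"]    # constrain the start of the motion
    traj[-1] = keypoints["place"]   # constrain the end of the motion
    return traj

# Toy usage: a straight-line demo reference of 50 Cartesian waypoints.
reference = np.linspace([0.40, 0.00, 0.00], [0.35, -0.20, 0.00], num=50)
keypoints = query_vlm_keypoints(image=None, instruction="put the mug on the coaster")
trajectory = plan_trajectory(keypoints, reference)
print(trajectory.shape)  # (50, 3)
```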
📝 Abstract
From early Movement Primitive (MP) techniques to modern Vision-Language Models (VLMs), autonomous manipulation has remained a pivotal topic in robotics. At the two extremes, VLM-based methods emphasize zero-shot and adaptive manipulation but struggle with fine-grained planning, whereas MP-based approaches excel at precise trajectory generalization but lack decision-making ability. To leverage the strengths of both frameworks, we propose VL-MP, which integrates a VLM with Kernelized Movement Primitives (KMP) via a low-distortion decision-information transfer bridge, enabling fine-grained robotic manipulation in ambiguous situations. One key component of VL-MP is the accurate representation of task decision parameters through semantic keypoint constraints, which leads to more precise task parameter generation. Additionally, we introduce a local-trajectory-feature-enhanced KMP to support VL-MP, thereby preserving the shape of complex trajectories. Extensive experiments conducted in complex real-world environments validate the effectiveness of VL-MP for adaptive and fine-grained manipulation.
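For readers unfamiliar with KMP, the sketch below implements the standard KMP mean prediction (after Huang et al.), not the local-feature-enhanced variant proposed in the paper: a probabilistic reference trajectory of means and covariances becomes a kernel regressor, and a desired keypoint can be enforced as a via-point with near-zero covariance. The kernel length-scale `ell` and regularizer `lam` are illustrative values, not the paper's settings.

```python
import numpy as np

def rbf(s1, s2, ell=0.05):
    """Scalar RBF kernel over the phase/time input."""
    return np.exp(-((s1 - s2) ** 2) / (2.0 * ell ** 2))

def kmp_predict(s_ref, mu_ref, sigma_ref, s_query, lam=1.0, ell=0.05):
    """Standard KMP mean prediction: mu(s*) = k*^T (K + lam * Sigma)^-1 mu.

    s_ref:     (N,)      phases of the probabilistic reference trajectory
    mu_ref:    (N, O)    reference means (e.g., from GMR over demonstrations)
    sigma_ref: (N, O, O) reference covariances
    s_query:   (M,)      phases at which to predict
    Returns:   (M, O)    predicted trajectory means
    """
    N, O = mu_ref.shape
    I = np.eye(O)

    # Block Gram matrix K (kernel value times identity per block) and
    # block-diagonal reference covariance Sigma.
    K = np.zeros((N * O, N * O))
    S = np.zeros((N * O, N * O))
    for i in range(N):
        for j in range(N):
            K[i*O:(i+1)*O, j*O:(j+1)*O] = rbf(s_ref[i], s_ref[j], ell) * I
        S[i*O:(i+1)*O, i*O:(i+1)*O] = sigma_ref[i]

    # Solve (K + lam * Sigma) w = mu once; reuse w for every query phase.
    weights = np.linalg.solve(K + lam * S, mu_ref.reshape(N * O))

    preds = np.zeros((len(s_query), O))
    for m, s in enumerate(s_query):
        k_star = np.zeros((N * O, O))
        for i in range(N):
            k_star[i*O:(i+1)*O] = rbf(s, s_ref[i], ell) * I
        preds[m] = k_star.T @ weights
    return preds

# Toy usage: a 2-D reference with small isotropic uncertainty.
s = np.linspace(0.0, 1.0, 20)
mu = np.stack([np.sin(2 * np.pi * s), np.cos(2 * np.pi * s)], axis=1)
sig = np.repeat(np.eye(2)[None] * 1e-2, 20, axis=0)
print(kmp_predict(s, mu, sig, np.array([0.5])))
```

A via-point such as a semantic keypoint is injected by appending its phase, desired mean, and a near-zero covariance to `s_ref`, `mu_ref`, and `sigma_ref` before building the regressor; the small covariance makes the regressor treat that point as a hard constraint, which is the standard KMP adaptation mechanism.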