Rethinking Intermediate Representation for VLM-based Robot Manipulation

📅 2025-11-24

📈 Citations: 0

✨ Influential: 0

career value

211K/year

🤖 AI Summary

Vision-language models (VLMs) struggle to simultaneously achieve interpretability and generalization when mapping natural-language instructions to parseable intermediate representations for robotic manipulation. Method: We propose Semantic Assembly Representation (SEAM), the first approach to model manipulation semantics using context-free grammars—decoupling open-vocabulary, semantically rich lexicons from VLM-friendly, structurally concise syntax. SEAM introduces an open-vocabulary segmentation paradigm for fine-grained part localization and defines novel, quantifiable metrics for VLM interpretability and action generalization. It integrates retrieval-augmented few-shot learning, semantic parsing, and open-vocabulary segmentation. Results: Evaluated across multiple real-world manipulation tasks, SEAM achieves state-of-the-art performance with the shortest inference latency, significantly improving interpretability, cross-task generalization, and deployment practicality.

Technology Category

Application Category

📝 Abstract

Vision-Language Model (VLM) is an important component to enable robust robot manipulation. Yet, using it to translate human instructions into an action-resolvable intermediate representation often needs a tradeoff between VLM-comprehensibility and generalizability. Inspired by context-free grammar, we design the Semantic Assembly representation named SEAM, by decomposing the intermediate representation into vocabulary and grammar. Doing so leads us to a concise vocabulary of semantically-rich operations and a VLM-friendly grammar for handling diverse unseen tasks. In addition, we design a new open-vocabulary segmentation paradigm with a retrieval-augmented few-shot learning strategy to localize fine-grained object parts for manipulation, effectively with the shortest inference time over all state-of-the-art parallel works. Also, we formulate new metrics for action-generalizability and VLM-comprehensibility, demonstrating the compelling performance of SEAM over mainstream representations on both aspects. Extensive real-world experiments further manifest its SOTA performance under varying settings and tasks.

Problem

Research questions and friction points this paper is trying to address.

Designs semantic representation balancing VLM understandability and task generalizability

Creates open-vocabulary segmentation for precise object part localization

Establishes new metrics evaluating action adaptability and model comprehension

Innovation

Methods, ideas, or system contributions that make the work stand out.

Decomposing representation into vocabulary and grammar

Open-vocabulary segmentation with retrieval-augmented learning

New metrics for action-generalizability and VLM-comprehensibility

🔎 Similar Papers

Learning Manipulation Skills through Robot Chain-of-Thought with Sparse Failure Guidance