Rethinking Intermediate Representation for VLM-based Robot Manipulation

📅 2025-11-24
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Vision-language models (VLMs) struggle to simultaneously achieve interpretability and generalization when mapping natural-language instructions to parseable intermediate representations for robotic manipulation. Method: We propose Semantic Assembly Representation (SEAM), the first approach to model manipulation semantics using context-free grammars—decoupling open-vocabulary, semantically rich lexicons from VLM-friendly, structurally concise syntax. SEAM introduces an open-vocabulary segmentation paradigm for fine-grained part localization and defines novel, quantifiable metrics for VLM interpretability and action generalization. It integrates retrieval-augmented few-shot learning, semantic parsing, and open-vocabulary segmentation. Results: Evaluated across multiple real-world manipulation tasks, SEAM achieves state-of-the-art performance with the shortest inference latency, significantly improving interpretability, cross-task generalization, and deployment practicality.

Technology Category

Application Category

📝 Abstract
Vision-Language Model (VLM) is an important component to enable robust robot manipulation. Yet, using it to translate human instructions into an action-resolvable intermediate representation often needs a tradeoff between VLM-comprehensibility and generalizability. Inspired by context-free grammar, we design the Semantic Assembly representation named SEAM, by decomposing the intermediate representation into vocabulary and grammar. Doing so leads us to a concise vocabulary of semantically-rich operations and a VLM-friendly grammar for handling diverse unseen tasks. In addition, we design a new open-vocabulary segmentation paradigm with a retrieval-augmented few-shot learning strategy to localize fine-grained object parts for manipulation, effectively with the shortest inference time over all state-of-the-art parallel works. Also, we formulate new metrics for action-generalizability and VLM-comprehensibility, demonstrating the compelling performance of SEAM over mainstream representations on both aspects. Extensive real-world experiments further manifest its SOTA performance under varying settings and tasks.
Problem

Research questions and friction points this paper is trying to address.

Designs semantic representation balancing VLM understandability and task generalizability
Creates open-vocabulary segmentation for precise object part localization
Establishes new metrics evaluating action adaptability and model comprehension
Innovation

Methods, ideas, or system contributions that make the work stand out.

Decomposing representation into vocabulary and grammar
Open-vocabulary segmentation with retrieval-augmented learning
New metrics for action-generalizability and VLM-comprehensibility
🔎 Similar Papers
No similar papers found.