🤖 AI Summary
Vision-language models (VLMs) struggle to simultaneously achieve interpretability and generalization when mapping natural-language instructions to parseable intermediate representations for robotic manipulation.
Method: We propose Semantic Assembly Representation (SEAM), the first approach to model manipulation semantics using context-free grammars—decoupling open-vocabulary, semantically rich lexicons from VLM-friendly, structurally concise syntax. SEAM introduces an open-vocabulary segmentation paradigm for fine-grained part localization and defines novel, quantifiable metrics for VLM interpretability and action generalization. It integrates retrieval-augmented few-shot learning, semantic parsing, and open-vocabulary segmentation.
Results: Evaluated across multiple real-world manipulation tasks, SEAM achieves state-of-the-art performance with the shortest inference latency, significantly improving interpretability, cross-task generalization, and deployment practicality.
📝 Abstract
Vision-Language Models (VLMs) are an important component for enabling robust robot manipulation. Yet, using them to translate human instructions into an action-resolvable intermediate representation often forces a trade-off between VLM-comprehensibility and generalizability. Inspired by context-free grammars, we design the Semantic Assembly representation, named SEAM, by decomposing the intermediate representation into a vocabulary and a grammar. Doing so yields a concise vocabulary of semantically rich operations and a VLM-friendly grammar for handling diverse unseen tasks. In addition, we design a new open-vocabulary segmentation paradigm with a retrieval-augmented few-shot learning strategy to localize fine-grained object parts for manipulation, achieving the shortest inference time among state-of-the-art parallel works. We also formulate new metrics for action-generalizability and VLM-comprehensibility, demonstrating the compelling performance of SEAM over mainstream representations on both aspects. Extensive real-world experiments further confirm its state-of-the-art performance across varying settings and tasks.
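To make the vocabulary/grammar decoupling concrete, here is a minimal, hypothetical sketch of a CFG-style intermediate representation: a small set of semantically rich operations (the vocabulary) validated against a simple compositional syntax (the grammar). The operation names, the `parse_program` function, and the action syntax are illustrative assumptions, not SEAM's actual definitions.

```python
# Hypothetical sketch: vocabulary (what operations mean) is decoupled
# from grammar (how actions compose). Not SEAM's actual API.

VOCABULARY = {"pick", "place", "open", "rotate"}  # semantically rich ops

# Informal CFG for the program syntax:
#   program := action ";" program | action
#   action  := OP "(" arg { "," arg } ")"

def parse_program(text):
    """Parse e.g. 'pick(mug handle); place(mug, shelf)' into a list of
    (operation, [args]) tuples, rejecting ops outside the vocabulary."""
    actions = []
    for chunk in text.split(";"):
        chunk = chunk.strip()
        op, _, rest = chunk.partition("(")
        op = op.strip()
        if op not in VOCABULARY:
            raise ValueError(f"unknown operation: {op}")
        args = [a.strip() for a in rest.rstrip(")").split(",") if a.strip()]
        actions.append((op, args))
    return actions
```

Because the grammar stays fixed while the argument strings remain open-vocabulary, a VLM only has to emit this concise structure; new objects or parts require no change to the syntax.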