Grounded Vision-Language Interpreter for Integrated Task and Motion Planning

📅 2025-06-03
📈 Citations: 0
✨ Influential: 0
📄 PDF
🤖 AI Summary
Existing vision-language model (VLM)-driven robotic planners lack formal safety guarantees and interpretability, whereas traditional symbolic planners require extensive domain-specific expert knowledge. Method: We propose ViLaIn-TAMP, a novel hybrid task-and-motion planning (TAMP) framework that integrates a vision-language interpreter with symbolic logic reasoning and geometric constraint solving in a closed-loop correction architecture. It enables safety-verifiable, interpretable, and end-to-end task-and-motion co-planning directly from natural language and visual inputs, without domain-specific fine-tuning. Contribution/Results: By unifying off-the-shelf VLMs, symbolic inference, geometric solvers, and learned manipulation skills, ViLaIn-TAMP supports fully traceable and explainable behavior generation. Evaluated on complex cooking tasks, its closed-loop feedback improves the average success rate by over 30%, significantly enhancing robustness and reliability in dynamic, real-world environments.

📝 Abstract
While recent advances in vision-language models (VLMs) have accelerated the development of language-guided robot planners, their black-box nature often lacks safety guarantees and interpretability crucial for real-world deployment. Conversely, classical symbolic planners offer rigorous safety verification but require significant expert knowledge for setup. To bridge the current gap, this paper proposes ViLaIn-TAMP, a hybrid planning framework for enabling verifiable, interpretable, and autonomous robot behaviors. ViLaIn-TAMP comprises three main components: (1) ViLaIn (Vision-Language Interpreter), a prior framework that converts multimodal inputs into structured problem specifications using off-the-shelf VLMs without additional domain-specific training; (2) a modular Task and Motion Planning (TAMP) system that grounds these specifications in actionable trajectory sequences through symbolic and geometric constraint reasoning, and can utilize learning-based skills for key manipulation phases; and (3) a corrective planning module that receives concrete feedback on failed solution attempts from the motion and task planning components and feeds adapted logic and geometric feasibility constraints back to ViLaIn to further refine the specification. We evaluate our framework on several challenging manipulation tasks in a cooking domain and demonstrate that the proposed closed-loop corrective architecture yields a more than 30% higher mean success rate than ViLaIn-TAMP without corrective planning.
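To make the closed-loop architecture concrete, here is a minimal sketch of the interpret-plan-correct cycle the abstract describes. It is an illustration under assumptions, not the paper's implementation: all names (ProblemSpec, Feedback, generate_spec, solve, plan_with_correction) are hypothetical placeholders standing in for the interpreter, the TAMP solver, and the corrective module.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class ProblemSpec:
    """Structured specification produced by the vision-language interpreter."""
    objects: list[str]
    init: list[str]   # symbolic facts, e.g. "(on tomato cutting-board)"
    goal: list[str]   # goal predicates, e.g. "(sliced tomato)"

@dataclass
class Feedback:
    """Concrete failure information returned by the TAMP layer."""
    stage: str        # "task" (symbolic) or "motion" (geometric)
    reason: str       # e.g. "grasp pose for 'knife' is unreachable"

def plan_with_correction(image, instruction, vlm, tamp, max_rounds: int = 3):
    """Closed-loop planning: interpret -> solve -> refine on failure."""
    feedback: Optional[Feedback] = None
    for _ in range(max_rounds):
        # 1. The VLM interpreter grounds language + vision into a symbolic
        #    spec, conditioned on feedback from previous failed attempts.
        spec: ProblemSpec = vlm.generate_spec(image, instruction, feedback)

        # 2. TAMP grounds the spec: symbolic task plan + geometric motion plan.
        result = tamp.solve(spec)
        if result.success:
            return result.trajectory  # executable trajectory sequence

        # 3. On failure, feed adapted constraints back to the interpreter.
        feedback = Feedback(stage=result.failed_stage, reason=result.reason)
    return None  # no feasible plan within the correction budget
```

The design point mirrored here is that failure feedback is consumed by the interpreter rather than the solver, so each retry produces a refined specification instead of a blind re-plan.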
Problem

Research questions and friction points this paper is trying to address.

Bridging the gap between vision-language models and symbolic planners
Ensuring verifiable and interpretable autonomous robot behaviors
Improving success rates in manipulation tasks via corrective planning
Innovation

Methods, ideas, or system contributions that make the work stand out.

Hybrid planning framework for verifiable robot behaviors
Modular TAMP system with symbolic-geometric reasoning
Closed-loop corrective planning for improved success (see the example after this list)
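As a usage example, a single corrective round for a cooking task might look as follows, reusing the hypothetical types from the sketch above. The predicates, object names, and failure message are invented for illustration and do not come from the paper.

```python
# Illustrative first-pass specification for a "slice the tomato" instruction.
spec = ProblemSpec(
    objects=["tomato", "knife", "cutting-board", "gripper"],
    init=["(on tomato cutting-board)", "(on knife counter)"],
    goal=["(sliced tomato)"],
)

# Hypothetical geometric failure reported by the motion planner; on the next
# round the interpreter regenerates the spec with this constraint in context.
feedback = Feedback(
    stage="motion",
    reason="no collision-free grasp on 'knife' from the current base pose",
)
```

The abstract attributes the reported gain of more than 30% in mean success rate to this corrective loop.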
Authors

Jeremy Siburian
Graduate Student, The University of Tokyo
Robotics

Keisuke Shirai
AIST
Natural Language Processing, Robotics

C. C. Beltran-Hernandez
OMRON SINIC X Corporation

Masashi Hamaya
OMRON SINIC X Corp.
Robot Learning, Soft Robotics, Robotics

Michael Gorner
University of Hamburg

Atsushi Hashimoto
OMRON SINIC X Corporation