Kinodynamic Task and Motion Planning using VLM-guided and Interleaved Sampling

📅 2025-10-30
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
Existing task and motion planning (TAMP) approaches incur prohibitive computational overhead on long-horizon tasks because of excessive motion sampling. Large language models (LLMs) encode commonsense priors but lack 3D geometric and dynamical reasoning capabilities. This paper proposes a vision-language model (VLM)-driven TAMP framework that unifies symbolic task states and continuous motion states within a hybrid state tree. Crucially, it tightly couples VLM-based visual reasoning with dynamical validation during search through VLM-guided sampling, interleaved search, joint verification using an off-the-shelf motion planner and physics simulator, and visual rendering of intermediate states to refine the search direction. Evaluated in simulation and in real-world settings, the method improves average success rates by 32.14% to 1166.67% over traditional and LLM-based baselines and reduces planning time on complex problems. Ablation studies further highlight the benefits of VLM guidance.

📝 Abstract
Task and Motion Planning (TAMP) integrates high-level task planning with low-level motion feasibility, but existing methods are costly on long-horizon problems because of excessive motion sampling. While LLMs provide commonsense priors, they lack 3D spatial reasoning and cannot ensure geometric or dynamic feasibility. We propose a kinodynamic TAMP framework based on a hybrid state tree that uniformly represents symbolic and numeric states during planning, enabling task and motion decisions to be made jointly. Kinodynamic constraints embedded in the TAMP problem are verified by an off-the-shelf motion planner and physics simulator, and a VLM guides the search for a TAMP solution and backtracks it based on visual renderings of the states. Experiments in simulated domains and in the real world show average success rates increased by 32.14% to 1166.67% compared to traditional and LLM-based TAMP planners, along with reduced planning time on complex problems; ablations further highlight the benefits of VLM guidance.

Problem

Research questions and friction points this paper is trying to address.

Reduces excessive motion sampling in long-horizon TAMP problems
Integrates symbolic and numeric states for joint task-motion decisions
Ensures kinodynamic feasibility using motion planner and physics simulator

Innovation

Methods, ideas, or system contributions that make the work stand out.

Hybrid state tree integrates symbolic and numeric states
VLM guides task planning with visual state rendering
Kinodynamic constraints verified by motion planner and simulator
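
The three points above can be illustrated with a minimal sketch. This is not the authors' implementation: `HybridNode`, `interleaved_search`, and the toy `propose`, `rank`, and `feasible` functions are hypothetical names chosen for this example. `rank` stands in for VLM guidance, and `feasible` stands in for the motion planner and physics simulator; both are trivial placeholders here.

```python
from dataclasses import dataclass, field
from typing import Dict, FrozenSet, List, Optional, Tuple

Pose = Tuple[float, float, float]


@dataclass
class HybridNode:
    """One tree node holding a symbolic state and a numeric state together."""
    symbolic: FrozenSet[str]             # e.g. {"on(block, table)"}
    numeric: Dict[str, Pose]             # object name -> position
    action: Optional[str] = None         # action that produced this node
    parent: Optional["HybridNode"] = None
    children: List["HybridNode"] = field(default_factory=list)

    def plan(self) -> List[str]:
        """Actions along the path from the root to this node."""
        node, steps = self, []
        while node is not None and node.action is not None:
            steps.append(node.action)
            node = node.parent
        return steps[::-1]


def interleaved_search(node, is_goal, propose, rank, feasible, max_depth=5):
    """Depth-first search that checks each candidate before expanding it."""
    if is_goal(node):
        return node.plan()
    if max_depth == 0:
        return None
    for action, symbolic, numeric in rank(node, propose(node)):
        if not feasible(node, action, numeric):
            continue                     # prune an infeasible candidate
        child = HybridNode(symbolic, numeric, action, node)
        node.children.append(child)
        result = interleaved_search(
            child, is_goal, propose, rank, feasible, max_depth - 1)
        if result is not None:
            return result
    return None                          # dead end: caller backtracks


# --- Toy domain (entirely hypothetical) ---
def propose(node):
    """Yield (action, next symbolic state, next numeric state) candidates."""
    if "holding(block)" in node.symbolic:
        for x in (2.0, 0.5):             # sampled placement positions
            yield (f"place(block, shelf, x={x})",
                   frozenset({"on(block, shelf)"}),
                   {"block": (x, 0.0, 1.0)})
    else:
        yield ("pick(block)", frozenset({"holding(block)"}), dict(node.numeric))


def rank(node, candidates):
    """Placeholder for VLM guidance: keeps the proposal order unchanged."""
    return list(candidates)


def feasible(node, action, numeric):
    """Placeholder for planner and simulator checks: a simple reach limit."""
    return all(abs(pose[0]) <= 1.0 for pose in numeric.values())


def is_goal(node):
    return "on(block, shelf)" in node.symbolic


root = HybridNode(frozenset({"on(block, table)"}), {"block": (0.0, 0.0, 0.0)})
plan = interleaved_search(root, is_goal, propose, rank, feasible)
print(plan)
```

In this toy run the placement at `x=2.0` fails the reach check and is pruned before any node is created for it, so the returned plan uses the placement at `x=0.5`. In a real system the placeholder functions would be replaced by a VLM query over rendered states and by calls to a motion planner and simulator.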
Minseo Kwon
Department of Computer Science and Engineering at Ewha Womans University in Korea
Young J. Kim
Ewha Womans University
computer graphics, robotics, haptics, games