🤖 AI Summary
This work addresses the difficulty that existing vision-language-action (VLA) models have in maintaining goal progress over long-horizon mobile manipulation tasks, where early mistakes accumulate. The authors propose MPVI (Motion Planner / VLA Interleaving), a framework that interleaves model-based motion planning with a VLA and requires no further training. MPVI uses open-vocabulary object detection, frontier exploration, and motion planning to locate and reach distant or occluded objects, and it switches between modules using vision-language model (VLM)-based completion checking with proprioceptive triggers. On the BEHAVIOR-1K benchmark, the method improves task progress by 113% over a top end-to-end VLA baseline, indicating better robustness on long-horizon tasks.
📝 Abstract
Vision-Language-Action (VLA) models have shown remarkable progress in mobile manipulation, but their performance on long-horizon tasks remains poor. These tasks are especially challenging because (1) progress toward high-level goals must be maintained across extended sequences of spatially distributed subtasks, and (2) early execution errors compound rapidly over the task horizon. These challenges persist despite fine-tuning on large amounts of human-teleoperated mobile manipulation data, indicating that more data alone may not resolve the problem. To address them, we propose MPVI: Motion Planner / VLA Interleaving, a framework that integrates model-based motion planning with VLAs to improve robustness without further training. The integration enables localization of, and navigation to, distant or occluded target objects in cluttered scenes using open-vocabulary object detection, frontier exploration, and motion planning. Such integration is non-trivial, however, because it requires reliable switching between modules; we show one way forward via VLM-based completion checking with proprioceptive triggers. We evaluate our approach on the BEHAVIOR-1K benchmark and demonstrate a 113% improvement in task progress over a top end-to-end VLA baseline. Additional details are available at the project page: https://mpvi.netlify.app/.
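The interleaving described above can be pictured as a small two-mode control loop. The Python sketch below is an illustration only, not the authors' implementation. Every name in it (`Modules`, `Subtask`, `run_subtask`, `run_task`, and the callables it takes) is hypothetical. The abstract does not say how the proprioceptive trigger and the VLM check are combined, so the sketch assumes the trigger gates the VLM query.

```python
from dataclasses import dataclass
from enum import Enum, auto
from typing import Callable, List, Optional, Tuple

Pose = Tuple[float, float]  # hypothetical 2D navigation goal


class Mode(Enum):
    NAVIGATE = auto()    # motion planner is in control
    MANIPULATE = auto()  # VLA policy is in control


@dataclass
class Subtask:
    instruction: str  # language instruction passed to the VLA and the VLM
    target: str       # open-vocabulary query for the object to reach


@dataclass
class Modules:
    """Hypothetical interfaces standing in for the real components."""
    detect: Callable[[str], Optional[Pose]]  # open-vocabulary detector
    explore: Callable[[], None]              # one frontier-exploration step
    navigate: Callable[[Pose], bool]         # motion planner, True on arrival
    vla_step: Callable[[str], None]          # one VLA action step
    trigger: Callable[[], bool]              # proprioceptive trigger
    vlm_done: Callable[[str], bool]          # VLM completion check


def run_subtask(sub: Subtask, m: Modules, max_steps: int = 200) -> bool:
    """Run one subtask, returning True if the VLM judges it complete."""
    mode = Mode.NAVIGATE
    for _ in range(max_steps):
        if mode is Mode.NAVIGATE:
            goal = m.detect(sub.target)
            if goal is None:
                m.explore()              # target not visible yet: keep exploring
            elif m.navigate(goal):
                mode = Mode.MANIPULATE   # arrived: hand control to the VLA
        else:
            m.vla_step(sub.instruction)
            # Assumption: query the VLM only when the trigger fires.
            if m.trigger() and m.vlm_done(sub.instruction):
                return True
    return False


def run_task(subtasks: List[Subtask], m: Modules, max_steps: int = 200) -> int:
    """Run subtasks in order, returning how many completed before a failure."""
    completed = 0
    for sub in subtasks:
        if not run_subtask(sub, m, max_steps):
            break
        completed += 1
    return completed
```

Under this assumption, the comparatively expensive VLM call runs only when a cheap proprioceptive signal suggests that something changed, for example a gripper opening or closing. The count returned by `run_task` is only a crude stand-in for task progress and is not the BEHAVIOR-1K metric.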