Make Your VLA More Robust Without More Data By Interleaving Motion Planning

📅 2026-05-30

📈 Citations: 0

✨ Influential: 0


🤖 AI Summary

This work addresses two weaknesses of existing vision-language-action (VLA) models in long-horizon mobile manipulation: they struggle to maintain progress toward the goal, and early mistakes compound over the task horizon. The authors propose MPVI (Motion Planner / VLA Interleaving), a framework that interleaves model-based motion planning with a VLA without further training. MPVI uses open-vocabulary object detection, frontier exploration, and motion planning to locate and navigate to distant or occluded objects, and it coordinates switching between modules through vision-language model (VLM)-based completion checking with proprioceptive triggers. On the BEHAVIOR-1K benchmark, the method improves task progress by 113% over a top end-to-end VLA baseline.
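To make the frontier exploration idea concrete, here is a minimal sketch of generic frontier detection on a 2D occupancy grid. It is not the authors' implementation: the grid encoding (`FREE`, `OCCUPIED`, `UNKNOWN`), the function names, and the nearest-frontier heuristic are all assumptions chosen for illustration.

```python
from typing import List, Optional, Tuple

# Hypothetical cell encoding for an occupancy grid (an assumption, not from the paper).
FREE, OCCUPIED, UNKNOWN = 0, 1, -1

Cell = Tuple[int, int]


def find_frontiers(grid: List[List[int]]) -> List[Cell]:
    """Return free cells that touch at least one unknown cell (4-connectivity)."""
    rows, cols = len(grid), len(grid[0])
    frontiers: List[Cell] = []
    for r in range(rows):
        for c in range(cols):
            if grid[r][c] != FREE:
                continue
            for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                nr, nc = r + dr, c + dc
                if 0 <= nr < rows and 0 <= nc < cols and grid[nr][nc] == UNKNOWN:
                    frontiers.append((r, c))
                    break  # one unknown neighbor is enough
    return frontiers


def nearest_frontier(grid: List[List[int]], robot: Cell) -> Optional[Cell]:
    """Pick the frontier closest to the robot by Manhattan distance, or None."""
    frontiers = find_frontiers(grid)
    if not frontiers:
        return None  # nothing left to explore
    return min(frontiers, key=lambda f: abs(f[0] - robot[0]) + abs(f[1] - robot[1]))


if __name__ == "__main__":
    grid = [
        [FREE, FREE, UNKNOWN],
        [FREE, OCCUPIED, UNKNOWN],
        [FREE, FREE, FREE],
    ]
    print(find_frontiers(grid))
    print(nearest_frontier(grid, robot=(2, 0)))
```

In a full system, the selected frontier would be handed to a motion planner as a navigation goal, and exploration would stop once the open-vocabulary detector reports the target object.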

📝 Abstract

Vision-Language-Action (VLA) models have shown remarkable progress for mobile manipulation, but their performance on long-horizon tasks remains poor. These tasks are especially challenging because (1) progress toward high-level goals must be maintained across extended sequences of spatially distributed subtasks, and (2) early execution errors compound rapidly over the task horizon. These challenges persist despite finetuning on large amounts of human-teleoperated mobile manipulation data, indicating that more data alone may not resolve the problem. To address these challenges, we propose MPVI: Motion Planner / VLA Interleaving, a framework that integrates model-based motion planning with VLAs to improve robustness without further training. The proposed integration enables localization and navigation to distant or occluded target objects through cluttered scenes using open-vocabulary object detection, frontier exploration, and motion planning. However, such integration is non-trivial, requiring reliable switching between modules; we show one way forward via VLM-based completion checking with proprioceptive triggers. We evaluate our approach on the BEHAVIOR-1K benchmark and demonstrate a 113% improvement in task progress over a top end-to-end VLA baseline. Additional details are available at the project page: https://mpvi.netlify.app/.
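The switching between modules described in the abstract can be sketched as a simple control loop. This is a hedged illustration under stated assumptions, not the MPVI implementation: `Subtask`, `run_interleaved`, and the four callbacks (`navigate_to`, `vla_step`, `proprio_trigger`, `vlm_says_done`) are hypothetical names, and the assumption that a proprioceptive event gates each VLM completion check is one possible reading of "VLM-based completion checking with proprioceptive triggers".

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Subtask:
    target: str       # object the motion planner should navigate to
    instruction: str  # language instruction handed to the VLA


def run_interleaved(
    subtasks: List[Subtask],
    navigate_to: Callable[[str], bool],     # hypothetical motion-planner call
    vla_step: Callable[[str], None],        # hypothetical single VLA action step
    proprio_trigger: Callable[[], bool],    # hypothetical proprioceptive event
    vlm_says_done: Callable[[str], bool],   # hypothetical VLM completion check
    max_vla_steps: int = 100,
) -> List[Tuple[str, str]]:
    """Alternate between planner-driven navigation and VLA-driven manipulation."""
    log: List[Tuple[str, str]] = []
    for sub in subtasks:
        # Phase 1: the motion planner brings the robot to the target object.
        if not navigate_to(sub.target):
            log.append(("navigate_failed", sub.target))
            break
        log.append(("navigate", sub.target))

        # Phase 2: the VLA acts until a proprioceptive trigger fires
        # and the VLM confirms that the subtask is complete.
        for _ in range(max_vla_steps):
            vla_step(sub.instruction)
            if proprio_trigger() and vlm_says_done(sub.instruction):
                log.append(("vla_done", sub.target))
                break
        else:
            # Step budget exhausted without a confirmed completion.
            log.append(("vla_timeout", sub.target))
            break
    return log
```

Gating the VLM query on a cheap proprioceptive signal avoids calling the VLM on every control step; whether MPVI does exactly this is an assumption of the sketch.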

Problem

Research questions and friction points this paper is trying to address.

Vision-Language-Action

long-horizon tasks

execution errors

mobile manipulation

task robustness

Innovation

Methods, ideas, or system contributions that make the work stand out.

Vision-Language-Action

Motion Planning

Long-horizon Tasks

Open-vocabulary Object Detection

Robustness
