🤖 AI Summary
This work addresses natural-language-guided mobile manipulation in unknown indoor environments. Methodologically, it introduces a modular vision-language-action framework that parses instructions into structured task graphs; integrates LiDAR-camera SLAM, metric-semantic mapping, and frontier-based exploration; and couples classical geometric navigation with a fine-tuned SmolVLA manipulation head, augmented by visibility- and reachability-aware pre-grasp planning. The core contribution is language-conditioned, cross-environment task generalization with robust exploration-manipulation coordination. Evaluated in multi-room laboratory settings, the fully onboard system runs in real time on consumer-grade hardware, attaining a 46% end-to-end task success rate while supporting embedded deployment.
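The summary above describes parsing a free-form instruction into a structured task graph that drives exploration and manipulation. The paper does not publish its parser, so the following is a deliberately toy, rule-based sketch (the regex patterns, `TaskNode` fields, and the leading `explore` node are all my own assumptions) showing what such a linear pick-and-place task graph might look like:

```python
import re
from dataclasses import dataclass


@dataclass
class TaskNode:
    """One step of a (linearized) task graph: an action and its target label."""
    action: str  # e.g. "explore", "pick", "place"
    target: str


def parse_prompt(prompt: str) -> list[TaskNode]:
    """Toy parser: extract (verb, object) pairs from the prompt into a linear
    task graph. Since the environment is unknown, the graph is prefixed with
    an exploration node that searches frontiers until the target is seen."""
    graph = [TaskNode("explore", "frontiers")]
    pattern = r"(pick up|place|put)\s+(?:the|it on the|a)?\s*(\w+)"
    for verb, obj in re.findall(pattern, prompt.lower()):
        action = "pick" if verb == "pick up" else "place"
        graph.append(TaskNode(action, obj))
    return graph


graph = parse_prompt("Pick up the cup and place it on the table")
# -> explore(frontiers), pick(cup), place(table)
```

A real system would of course use an LLM or grammar-based parser and a DAG with preconditions rather than keyword matching; the sketch only illustrates the data flow from text to structured sub-goals.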
📝 Abstract
We address natural-language pick-and-place in unseen, unpredictable indoor environments with AnywhereVLA, a modular framework for mobile manipulation. A user text prompt serves as the entry point and is parsed into a structured task graph that conditions classical LiDAR-camera SLAM, metric-semantic mapping, and a task-aware frontier exploration policy. An approach planner then selects visibility- and reachability-aware pre-grasp base poses. For interaction, a compact SmolVLA manipulation head is fine-tuned on platform-specific pick-and-place trajectories for the SO-101 arm by TheRobotStudio, grounding local visual context and sub-goals into grasp and place proposals. The full system runs fully onboard on consumer-level hardware, with a Jetson Orin NX for perception and the VLA and an Intel NUC for SLAM, exploration, and control, sustaining real-time operation. We evaluated AnywhereVLA in a multi-room lab under static scenes and normal human motion. In this setting, the system achieves a 46% overall task success rate while maintaining real-time throughput on embedded compute. By combining a classical navigation stack with a fine-tuned VLA manipulation head, the system inherits the reliability of geometry-based navigation and the agility and task generalization of language-conditioned manipulation.
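The abstract's approach planner scores candidate base poses by whether the target is both visible and reachable before handing control to the manipulation head. The paper does not give the scoring function, so this is a minimal sketch under assumed values (the 0.45 m reach, 70° field of view, standoff preference, and grid-based occupancy check are all hypothetical), illustrating how such a selection could work:

```python
import math
from dataclasses import dataclass


@dataclass
class BasePose:
    x: float
    y: float
    yaw: float  # base heading, radians


def score_pregrasp_pose(pose, target_xy, occupied, max_reach=0.45, fov_deg=70.0):
    """Score a candidate pre-grasp base pose: 0 if the target is out of arm
    reach, the cell is occupied, or the target lies outside the camera FOV;
    otherwise higher for a mid-range standoff with the target near center."""
    dx, dy = target_xy[0] - pose.x, target_xy[1] - pose.y
    dist = math.hypot(dx, dy)
    if dist > max_reach or (round(pose.x, 1), round(pose.y, 1)) in occupied:
        return 0.0
    # Visibility: bearing of the target relative to the base heading,
    # wrapped to [-pi, pi], must fall within half the field of view.
    bearing = abs((math.atan2(dy, dx) - pose.yaw + math.pi) % (2 * math.pi) - math.pi)
    if math.degrees(bearing) > fov_deg / 2:
        return 0.0
    # Reachability: prefer a standoff around 60% of max reach,
    # avoiding both collisions up close and the reach limit.
    reach_score = 1.0 - abs(dist - 0.6 * max_reach) / max_reach
    vis_score = 1.0 - math.degrees(bearing) / (fov_deg / 2)
    return reach_score * vis_score


def select_pregrasp_pose(candidates, target_xy, occupied):
    """Pick the highest-scoring candidate base pose."""
    return max(candidates, key=lambda p: score_pregrasp_pose(p, target_xy, occupied))
```

For a target at (1.0, 0.0), a pose at (0.7, 0.0) facing the target wins over the same position facing away (target outside the FOV) or a pose beyond arm reach. A full planner would use the metric-semantic map for the occupancy and visibility checks rather than a rounded-coordinate set.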