🤖 AI Summary
This work addresses natural-language-guided mobile manipulation in unknown indoor environments. Methodologically, it introduces a modular vision-language-action framework that parses instructions into structured task graphs; integrates LiDAR-camera SLAM, metric-semantic mapping, and frontier-based exploration; and couples classical geometric navigation with a fine-tuned SmolVLA manipulation head, augmented by visibility- and reachability-aware pre-grasp planning. The core contribution is language-conditioned, cross-environment task generalization with robust exploration-manipulation coordination. Evaluated in multi-room laboratory settings, the fully onboard system runs in real time on consumer-grade hardware, attaining a 46% end-to-end task success rate while supporting embedded deployment.
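The summary above describes parsing a free-form instruction into a structured task graph that drives exploration and manipulation. The paper does not publish its parser, so the following is a deliberately toy, rule-based sketch (the regex patterns, `TaskNode` fields, and the leading `explore` node are all my own assumptions) showing what such a linear pick-and-place task graph might look like:

```python
import re
from dataclasses import dataclass


@dataclass
class TaskNode:
    """One step of a (linearized) task graph: an action and its target label."""
    action: str  # e.g. "explore", "pick", "place"
    target: str


def parse_prompt(prompt: str) -> list[TaskNode]:
    """Toy parser: extract (verb, object) pairs from the prompt into a linear
    task graph. Since the environment is unknown, the graph is prefixed with
    an exploration node that searches frontiers until the target is seen."""
    graph = [TaskNode("explore", "frontiers")]
    pattern = r"(pick up|place|put)\s+(?:the|it on the|a)?\s*(\w+)"
    for verb, obj in re.findall(pattern, prompt.lower()):
        action = "pick" if verb == "pick up" else "place"
        graph.append(TaskNode(action, obj))
    return graph


graph = parse_prompt("Pick up the cup and place it on the table")
# -> explore(frontiers), pick(cup), place(table)
```

A real system would of course use an LLM or grammar-based parser and a DAG with preconditions rather than keyword matching; the sketch only illustrates the data flow from text to structured sub-goals.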
📝 Abstract
We address natural-language pick-and-place in unseen, unpredictable indoor environments with AnywhereVLA, a modular framework for mobile manipulation. A user text prompt serves as the entry point and is parsed into a structured task graph that conditions classical LiDAR-camera SLAM, metric-semantic mapping, and a task-aware frontier exploration policy. An approach planner then selects visibility- and reachability-aware pre-grasp base poses. For interaction, a compact SmolVLA manipulation head is fine-tuned on platform-specific pick-and-place trajectories for the SO-101 arm by TheRobotStudio, grounding local visual context and sub-goals into grasp and place proposals. The full system runs fully onboard on consumer-level hardware, with a Jetson Orin NX for perception and the VLA and an Intel NUC for SLAM, exploration, and control, sustaining real-time operation. We evaluated AnywhereVLA in a multi-room lab under static scenes and normal human motion. In this setting, the system achieves a 46% overall task success rate while maintaining real-time throughput on embedded compute. By combining a classical navigation stack with a fine-tuned VLA manipulation head, the system inherits the reliability of geometry-based navigation and the agility and task generalization of language-conditioned manipulation.
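The abstract's approach planner scores candidate base poses by whether the target is both visible and reachable before handing control to the manipulation head. The paper does not give the scoring function, so this is a minimal sketch under assumed values (the 0.45 m reach, 70° field of view, standoff preference, and grid-based occupancy check are all hypothetical), illustrating how such a selection could work:

```python
import math
from dataclasses import dataclass


@dataclass
class BasePose:
    x: float
    y: float
    yaw: float  # base heading, radians


def score_pregrasp_pose(pose, target_xy, occupied, max_reach=0.45, fov_deg=70.0):
    """Score a candidate pre-grasp base pose: 0 if the target is out of arm
    reach, the cell is occupied, or the target lies outside the camera FOV;
    otherwise higher for a mid-range standoff with the target near center."""
    dx, dy = target_xy[0] - pose.x, target_xy[1] - pose.y
    dist = math.hypot(dx, dy)
    if dist > max_reach or (round(pose.x, 1), round(pose.y, 1)) in occupied:
        return 0.0
    # Visibility: bearing of the target relative to the base heading,
    # wrapped to [-pi, pi], must fall within half the field of view.
    bearing = abs((math.atan2(dy, dx) - pose.yaw + math.pi) % (2 * math.pi) - math.pi)
    if math.degrees(bearing) > fov_deg / 2:
        return 0.0
    # Reachability: prefer a standoff around 60% of max reach,
    # avoiding both collisions up close and the reach limit.
    reach_score = 1.0 - abs(dist - 0.6 * max_reach) / max_reach
    vis_score = 1.0 - math.degrees(bearing) / (fov_deg / 2)
    return reach_score * vis_score


def select_pregrasp_pose(candidates, target_xy, occupied):
    """Pick the highest-scoring candidate base pose."""
    return max(candidates, key=lambda p: score_pregrasp_pose(p, target_xy, occupied))
```

For a target at (1.0, 0.0), a pose at (0.7, 0.0) facing the target wins over the same position facing away (target outside the FOV) or a pose beyond arm reach. A full planner would use the metric-semantic map for the occupancy and visibility checks rather than a rounded-coordinate set.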