AnywhereVLA: Language-Conditioned Exploration and Mobile Manipulation

📅 2025-09-25
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
This work addresses natural-language-guided mobile manipulation in unknown indoor environments. Methodologically, it introduces a modular vision-language-action framework that parses instructions into structured task graphs, integrates LiDAR-camera SLAM, metric-semantic mapping, and frontier-based exploration, and couples classical geometric navigation with a fine-tuned SmolVLA manipulation head, augmented by visibility- and reachability-aware pre-grasp planning. The core contribution is achieving language-conditioned cross-environment task generalization and robust exploration-manipulation coordination. Evaluated in multi-room laboratory settings, the fully onboard system runs in real time on consumer-grade hardware, attaining a 46% end-to-end task success rate while supporting embedded deployment.

๐Ÿ“ Abstract
We address natural-language pick-and-place in unseen, unpredictable indoor environments with AnywhereVLA, a modular framework for mobile manipulation. A user text prompt serves as the entry point and is parsed into a structured task graph that conditions classical SLAM with LiDAR and cameras, metric-semantic mapping, and a task-aware frontier exploration policy. An approach planner then selects visibility- and reachability-aware pre-grasp base poses. For interaction, a compact SmolVLA manipulation head is fine-tuned on platform pick-and-place trajectories for the SO-101 by TheRobotStudio, grounding local visual context and sub-goals into grasp and place proposals. The full system runs fully onboard on consumer-level hardware, with a Jetson Orin NX for perception and the VLA and an Intel NUC for SLAM, exploration, and control, sustaining real-time operation. We evaluated AnywhereVLA in a multi-room lab under static scenes and normal human motion. In this setting, the system achieves a 46% overall task success rate while maintaining throughput on embedded compute. By combining a classical navigation stack with a fine-tuned VLA manipulation head, the system inherits the reliability of geometry-based navigation and gains the agility and task generalization of language-conditioned manipulation.
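The abstract describes a user prompt being parsed into a structured task graph that sequences exploration, grasping, and placing. The paper does not publish this schema, so the node fields, labels, and helper below are purely illustrative assumptions, a minimal sketch of what such a dependency-ordered task graph could look like:

```python
from dataclasses import dataclass, field

@dataclass
class TaskNode:
    action: str                                  # e.g. "explore", "grasp", "place"
    target: str                                  # object or location label from the prompt
    depends_on: list = field(default_factory=list)

# Hypothetical decomposition of "bring the red cup to the kitchen table":
graph = {
    "find_cup":   TaskNode("explore", "red cup"),
    "grasp_cup":  TaskNode("grasp", "red cup", depends_on=["find_cup"]),
    "find_table": TaskNode("explore", "kitchen table", depends_on=["grasp_cup"]),
    "place_cup":  TaskNode("place", "kitchen table", depends_on=["find_table"]),
}

def execution_order(graph):
    """Kahn-style topological sort: emit nodes whose dependencies are done."""
    order, done, pending = [], set(), dict(graph)
    while pending:
        ready = [k for k, n in pending.items()
                 if all(d in done for d in n.depends_on)]
        if not ready:
            raise ValueError("task graph contains a cycle")
        for k in ready:
            order.append(k)
            done.add(k)
            pending.pop(k)
    return order
```

Ordering the nodes this way lets the exploration policy and manipulation head each consume only the sub-goal that is currently unblocked.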
Problem

Research questions and friction points this paper is trying to address.

Natural-language pick-and-place in unseen, unpredictable indoor environments
Combining classical SLAM with language-conditioned visual manipulation capabilities
Achieving reliable mobile manipulation on embedded consumer-level hardware
Innovation

Methods, ideas, or system contributions that make the work stand out.

Language-parsed task graph conditions SLAM
Visibility-aware planner selects pre-grasp poses
Fine-tuned VLA head grounds visual context
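Two of the innovations above are geometric: task-aware frontier exploration and visibility/reachability-aware pre-grasp base pose selection. The sketch below shows one plausible shape for each step; the scoring weights, ring radius, and collision check are illustrative assumptions, not the paper's published implementation:

```python
import math

def frontier_score(frontier_xy, robot_xy, semantic_hits, w_dist=1.0, w_sem=5.0):
    """Rank a frontier cell: prefer nearby frontiers, and boost frontiers
    whose neighborhood already contains detections of the task's target
    class (the 'task-aware' part). Weights are assumed, not from the paper."""
    dist = math.dist(frontier_xy, robot_xy)
    return w_sem * semantic_hits - w_dist * dist

def pregrasp_base_poses(target_xy, occupied, r=0.5, n=16):
    """Sample candidate base poses on a ring around the target object and
    keep the collision-free ones, each oriented to face the target. The
    occupancy lookup stands in for full visibility/reachability checks."""
    poses = []
    for i in range(n):
        theta = 2 * math.pi * i / n
        x = target_xy[0] + r * math.cos(theta)
        y = target_xy[1] + r * math.sin(theta)
        if (round(x, 1), round(y, 1)) not in occupied:
            yaw = math.atan2(target_xy[1] - y, target_xy[0] - x)
            poses.append((x, y, yaw))
    return poses
```

In a pipeline like AnywhereVLA's, the navigation stack would drive to the best-scoring frontier until the target is detected, then hand the highest-ranked pre-grasp pose to the base controller before invoking the manipulation head.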
Konstantin Gubernatorov
Skolkovo Institute of Science and Technology
Robotics, VLA, SLAM, Warehouse Automation, Multi-Agent Task Allocation
Artem Voronov
Intelligent Space Robotics Laboratory, Center for Digital Engineering, Skolkovo Institute of Science and Technology, Moscow, Russia
Roman Voronov
Intelligent Space Robotics Laboratory, Center for Digital Engineering, Skolkovo Institute of Science and Technology, Moscow, Russia
Sergei Pasynkov
Intelligent Space Robotics Laboratory, Center for Digital Engineering, Skolkovo Institute of Science and Technology, Moscow, Russia
Stepan Perminov
Intelligent Space Robotics Laboratory, Center for Digital Engineering, Skolkovo Institute of Science and Technology, Moscow, Russia
Ziang Guo
Intelligent Space Robotics Laboratory, Center for Digital Engineering, Skolkovo Institute of Science and Technology, Moscow, Russia
Dzmitry Tsetserukou
Associate Professor, Skolkovo Institute of Science and Technology (Skoltech)
Robotics, Haptics, UAV Swarm, AI, VR