NanoVLA: Routing Decoupled Vision-Language Understanding for Nano-sized Generalist Robotic Policies

📅 2025-10-28
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
Deploying Vision-Language-Action (VLA) models on resource-constrained edge devices (e.g., the Jetson Orin Nano) is hampered by high inference latency, poor energy efficiency, and deployment complexity. To address these issues, this paper proposes NanoVLA, a lightweight VLA model tailored for edge robotics. Its core innovations are: (1) a decoupled vision-language understanding architecture that enables feature caching and reduces computation latency; (2) a long-short action chunking mechanism that preserves semantic coherence in multi-step planning; and (3) task-complexity-aware dynamic routing that adaptively selects model backbones to maximize energy efficiency. Combining a lightweight visual encoder, late fusion, and collaborative long/short-term action-sequence decoding, NanoVLA matches or exceeds state-of-the-art models in task accuracy and generalization while achieving up to a 52× inference speedup and a 98% parameter reduction. Extensive evaluation on standard benchmarks and real-world robotic tasks validates its efficiency and practicality.
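The caching benefit of decoupling can be sketched as follows. This is a minimal illustration, not the paper's implementation: the class and encoder interfaces are hypothetical, and the key idea shown is simply that instruction features are reused across control steps while only vision features are recomputed per frame.

```python
# Hypothetical sketch of NanoVLA-style decoupling with late fusion.
# All names (DecoupledVLA, encoders, decoder) are illustrative assumptions.

class DecoupledVLA:
    def __init__(self, vision_encoder, language_encoder, action_decoder):
        self.vision_encoder = vision_encoder
        self.language_encoder = language_encoder
        self.action_decoder = action_decoder
        self._lang_cache = {}  # instruction text -> cached language features

    def encode_instruction(self, instruction):
        # The instruction is static across a rollout, so its features are
        # computed once and reused at every control step.
        if instruction not in self._lang_cache:
            self._lang_cache[instruction] = self.language_encoder(instruction)
        return self._lang_cache[instruction]

    def step(self, image, instruction):
        vision_feat = self.vision_encoder(image)          # recomputed each frame
        lang_feat = self.encode_instruction(instruction)  # served from cache
        # Late fusion: modalities are combined only at action decoding time.
        return self.action_decoder(vision_feat, lang_feat)
```

With early fusion, both encoders would run on every frame; here the language encoder runs once per instruction, which is where the latency saving comes from.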

πŸ“ Abstract
Vision-language-action (VLA) models have significantly advanced robotic manipulation by integrating vision-language models (VLMs) and action decoders into a unified architecture. However, their deployment on resource-constrained edge devices, such as mobile robots or embedded systems (e.g., Jetson Orin Nano), remains challenging due to high computational demands, especially in real-world scenarios where power, latency, and computational resources are critical. To close this gap, we introduce Nano-scale Vision-Language Action (NanoVLA), a family of lightweight VLA architectures that achieve high performance with minimal resources. Our core innovations include: (1) vision-language decoupling, which moves the conventional early fusion of vision and language inputs in VLMs to a late stage, achieving better performance while enabling caching and reducing inference overhead and latency; (2) long-short action chunking, which ensures smooth, coherent multi-step planning without sacrificing real-time responsiveness; and (3) dynamic routing, which adaptively assigns lightweight or heavy backbones based on task complexity, further optimizing inference efficiency. Experimental results on several benchmarks, as well as real-world deployments, demonstrate that NanoVLA achieves up to 52× faster inference on edge devices than previous state-of-the-art VLA models, with 98% fewer parameters, while maintaining or surpassing their task accuracy and generalization. Ablation studies confirm that our decoupling strategy preserves cross-task transferability and that the routing module improves cost-performance trade-offs, enabling practical, high-precision robotic manipulation on resource-constrained hardware.
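The long-short action chunking idea can be sketched as a control loop: plan a long-horizon chunk of actions for coherence, execute only a short prefix for responsiveness, then re-plan from a fresh observation. The function names and loop structure below are assumptions for illustration, not NanoVLA's actual interface.

```python
# Hedged sketch of a long-short action chunking loop.
# plan_chunk(obs, n): returns an n-step action chunk from observation obs.
# execute(action):    applies one action to the robot/environment.
# observe():          returns the current observation.
# All three callables are hypothetical stand-ins.

def long_short_loop(plan_chunk, execute, observe,
                    chunk_len=8, exec_len=2, total_steps=8):
    executed = []
    obs = observe()
    while len(executed) < total_steps:
        chunk = plan_chunk(obs, chunk_len)   # long horizon: coherent multi-step plan
        for action in chunk[:exec_len]:      # short horizon: execute only a prefix
            execute(action)
            executed.append(action)
            if len(executed) >= total_steps:
                break
        obs = observe()                      # re-plan from fresh observation
    return executed
```

The long chunk keeps the plan semantically coherent across steps, while the short execution window keeps the policy reactive to new observations.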
Problem

Research questions and friction points this paper is trying to address.

Enabling efficient vision-language-action models on resource-constrained edge devices
Reducing computational demands and latency for real-time robotic deployment
Maintaining high task accuracy with minimal parameters and faster inference
Innovation

Methods, ideas, or system contributions that make the work stand out.

Vision-language decoupling reduces latency and overhead
Long-short action chunking ensures smooth multi-step planning
Dynamic routing optimizes efficiency based on task complexity
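The routing innovation above reduces to a dispatch decision: cheap backbone for simple tasks, heavy backbone only when needed. The scalar complexity score and threshold below are hypothetical simplifications of the paper's routing module, shown only to make the cost-performance trade-off concrete.

```python
# Minimal sketch of task-complexity-aware dynamic routing, assuming the
# router produces a scalar complexity estimate in [0, 1]. The threshold
# heuristic is an illustrative assumption, not NanoVLA's actual criterion.

def select_backbone(complexity, threshold=0.5):
    """Route simple tasks to the lightweight backbone; reserve the heavy
    backbone (higher accuracy, higher energy cost) for complex tasks."""
    if not 0.0 <= complexity <= 1.0:
        raise ValueError("complexity score must lie in [0, 1]")
    return "light" if complexity <= threshold else "heavy"
```

Because most routine manipulation steps are simple, the lightweight path dominates at run time, which is what drives the energy-efficiency gains.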