AI Summary
Deploying Vision-Language-Action (VLA) models on resource-constrained edge devices (e.g., Jetson Orin Nano) faces challenges of high inference latency, poor energy efficiency, and deployment complexity. To address these, this paper proposes NanoVLA, a lightweight VLA model tailored for edge robotics. Its core innovations include: (1) a decoupled vision-language understanding architecture that enables feature caching and reduces inference latency; (2) a long-short action chunking mechanism that preserves semantic coherence in multi-step planning; and (3) task-complexity-aware dynamic routing that adaptively selects model backbones to maximize energy efficiency. Leveraging a lightweight visual encoder, late fusion, and collaborative long/short-term action sequence decoding, NanoVLA matches or exceeds state-of-the-art models in task accuracy and generalization while achieving up to 52× inference speedup and a 98% parameter reduction. Extensive evaluation on standard benchmarks and real-world robotic tasks validates its efficiency and practicality.
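The decoupling-plus-caching idea can be illustrated with a minimal sketch. This is not the paper's released code; all names (`DecoupledVLA`, `fusion_head`, the encoder callables) are illustrative assumptions. The point is that when vision and language are fused late, language features for a stable instruction can be computed once and reused every frame, while only the per-frame vision path is recomputed.

```python
# Hedged sketch of a NanoVLA-style decoupled pipeline, assuming the
# instruction stays fixed across an episode. Names are hypothetical.
class DecoupledVLA:
    def __init__(self, vision_encoder, language_encoder, fusion_head):
        self.vision_encoder = vision_encoder
        self.language_encoder = language_encoder
        self.fusion_head = fusion_head
        self._lang_cache = {}  # instruction text -> cached language features

    def act(self, image, instruction):
        vis = self.vision_encoder(image)  # vision features recomputed per frame
        if instruction not in self._lang_cache:
            # Language features are computed once per instruction and cached,
            # since the instruction rarely changes within a task episode.
            self._lang_cache[instruction] = self.language_encoder(instruction)
        lang = self._lang_cache[instruction]
        # Late fusion: the two modalities meet only at the action head.
        return self.fusion_head(vis, lang)
```

With early fusion, by contrast, every frame would re-run the joint vision-language backbone, so no such reuse is possible; this is the inference-overhead saving the summary refers to.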
Abstract
Vision-language-action (VLA) models have significantly advanced robotic manipulation by integrating vision-language models (VLMs) and action decoders into a unified architecture. However, their deployment on resource-constrained edge devices, such as mobile robots or embedded systems (e.g., Jetson Orin Nano), remains challenging due to high computational demands, especially in real-world scenarios where power, latency, and compute budgets are critical. To close this gap, we introduce Nano-scale Vision-Language Action (NanoVLA), a family of lightweight VLA architectures that achieve high performance with minimal resources. Our core innovations include: (1) vision-language decoupling, which moves the conventional early fusion of vision and language inputs in VLMs to a late stage, achieving better performance while enabling caching and reducing inference overhead and latency; (2) long-short action chunking, which ensures smooth, coherent multi-step planning without sacrificing real-time responsiveness; and (3) dynamic routing, which adaptively assigns lightweight or heavy backbones based on task complexity, further optimizing inference efficiency. Experimental results on several benchmarks, as well as real-world deployments, demonstrate that NanoVLA achieves up to 52× faster inference on edge devices than previous state-of-the-art VLA models, with 98% fewer parameters, while maintaining or surpassing their task accuracy and generalization. Ablation studies confirm that our decoupling strategy preserves cross-task transferability and that the routing module improves the cost-performance trade-off, enabling practical, high-precision robotic manipulation on resource-constrained hardware.
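The long-short chunking scheme can be sketched as a simple control loop, under stated assumptions: the policy decodes a long action chunk for planning coherence, but the controller commits only a short prefix before re-querying the policy, which keeps the robot responsive to new observations. The function names and horizon values below are hypothetical, not taken from the paper.

```python
# Hedged sketch of long-short action chunking: plan long, execute short.
# `policy`, `observe`, and `execute` are caller-supplied callables.
def long_short_chunking(policy, observe, execute, steps,
                        long_horizon=16, short_horizon=4):
    executed = []
    t = 0
    while t < steps:
        obs = observe()
        # Decode a long chunk so the plan stays semantically coherent...
        chunk = policy(obs, long_horizon)
        # ...but execute only a short prefix before replanning, so the
        # controller can react to fresh observations every few steps.
        for action in chunk[:short_horizon]:
            execute(action)
            executed.append(action)
            t += 1
            if t >= steps:
                break
    return executed
```

A pure long-horizon policy would replan rarely and react slowly; a pure short-horizon one would replan constantly and lose multi-step coherence. The split above is one way to read the trade-off the abstract describes.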