NanoVLA: Routing Decoupled Vision-Language Understanding for Nano-sized Generalist Robotic Policies

📅 2025-10-28
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
Deploying Vision-Language-Action (VLA) models on resource-constrained edge devices (e.g., the Jetson Orin Nano) is hampered by high inference latency, poor energy efficiency, and deployment complexity. To address these issues, this paper proposes NanoVLA, a lightweight VLA model tailored for edge robotics. Its core innovations are: (1) a decoupled vision-language understanding architecture that enables feature caching and reduces computation latency; (2) a long-short action chunking mechanism that preserves semantic coherence in multi-step planning; and (3) task-complexity-aware dynamic routing that adaptively selects model backbones to maximize energy efficiency. Combining a lightweight visual encoder, late fusion, and collaborative long/short-term action-sequence decoding, NanoVLA matches or exceeds state-of-the-art models in task accuracy and generalization while achieving up to a 52× inference speedup and a 98% parameter reduction. Extensive evaluation on standard benchmarks and real-world robotic tasks validates its efficiency and practicality.
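The caching benefit of decoupling can be sketched as follows. This is a minimal illustration, not the paper's implementation: the class and encoder interfaces are hypothetical, and the key idea shown is simply that instruction features are reused across control steps while only vision features are recomputed per frame.

```python
# Hypothetical sketch of NanoVLA-style decoupling with late fusion.
# All names (DecoupledVLA, encoders, decoder) are illustrative assumptions.

class DecoupledVLA:
    def __init__(self, vision_encoder, language_encoder, action_decoder):
        self.vision_encoder = vision_encoder
        self.language_encoder = language_encoder
        self.action_decoder = action_decoder
        self._lang_cache = {}  # instruction text -> cached language features

    def encode_instruction(self, instruction):
        # The instruction is static across a rollout, so its features are
        # computed once and reused at every control step.
        if instruction not in self._lang_cache:
            self._lang_cache[instruction] = self.language_encoder(instruction)
        return self._lang_cache[instruction]

    def step(self, image, instruction):
        vision_feat = self.vision_encoder(image)          # recomputed each frame
        lang_feat = self.encode_instruction(instruction)  # served from cache
        # Late fusion: modalities are combined only at action decoding time.
        return self.action_decoder(vision_feat, lang_feat)
```

With early fusion, both encoders would run on every frame; here the language encoder runs once per instruction, which is where the latency saving comes from.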

πŸ“ Abstract
Vision-language-action (VLA) models have significantly advanced robotic manipulation by integrating vision-language models (VLMs) and action decoders into a unified architecture. However, their deployment on resource-constrained edge devices, such as mobile robots or embedded systems (e.g., Jetson Orin Nano), remains challenging due to high computational demands, especially in real-world scenarios where power, latency, and computational resources are critical. To close this gap, we introduce Nano-scale Vision-Language Action (NanoVLA), a family of lightweight VLA architectures that achieve high performance with minimal resources. Our core innovations include: (1) vision-language decoupling, which moves the conventional early fusion of vision and language inputs in VLMs to a late stage, achieving better performance while enabling caching and reducing inference overhead and latency; (2) long-short action chunking, which ensures smooth, coherent multi-step planning without sacrificing real-time responsiveness; and (3) dynamic routing, which adaptively assigns lightweight or heavy backbones based on task complexity, further optimizing inference efficiency. Experimental results on several benchmarks, as well as real-world deployments, demonstrate that NanoVLA achieves up to 52× faster inference on edge devices than previous state-of-the-art VLA models, with 98% fewer parameters, while maintaining or surpassing their task accuracy and generalization. Ablation studies confirm that our decoupling strategy preserves cross-task transferability and that the routing module improves cost-performance trade-offs, enabling practical, high-precision robotic manipulation on resource-constrained hardware.
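The long-short action chunking idea can be sketched as a control loop: plan a long-horizon chunk of actions for coherence, execute only a short prefix for responsiveness, then re-plan from a fresh observation. The function names and loop structure below are assumptions for illustration, not NanoVLA's actual interface.

```python
# Hedged sketch of a long-short action chunking loop.
# plan_chunk(obs, n): returns an n-step action chunk from observation obs.
# execute(action):    applies one action to the robot/environment.
# observe():          returns the current observation.
# All three callables are hypothetical stand-ins.

def long_short_loop(plan_chunk, execute, observe,
                    chunk_len=8, exec_len=2, total_steps=8):
    executed = []
    obs = observe()
    while len(executed) < total_steps:
        chunk = plan_chunk(obs, chunk_len)   # long horizon: coherent multi-step plan
        for action in chunk[:exec_len]:      # short horizon: execute only a prefix
            execute(action)
            executed.append(action)
            if len(executed) >= total_steps:
                break
        obs = observe()                      # re-plan from fresh observation
    return executed
```

The long chunk keeps the plan semantically coherent across steps, while the short execution window keeps the policy reactive to new observations.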
Problem

Research questions and friction points this paper is trying to address.

Enabling efficient vision-language-action models on resource-constrained edge devices
Reducing computational demands and latency for real-time robotic deployment
Maintaining high task accuracy with minimal parameters and faster inference
Innovation

Methods, ideas, or system contributions that make the work stand out.

Vision-language decoupling reduces latency and overhead
Long-short action chunking ensures smooth multi-step planning
Dynamic routing optimizes efficiency based on task complexity
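The routing innovation above reduces to a dispatch decision: cheap backbone for simple tasks, heavy backbone only when needed. The scalar complexity score and threshold below are hypothetical simplifications of the paper's routing module, shown only to make the cost-performance trade-off concrete.

```python
# Minimal sketch of task-complexity-aware dynamic routing, assuming the
# router produces a scalar complexity estimate in [0, 1]. The threshold
# heuristic is an illustrative assumption, not NanoVLA's actual criterion.

def select_backbone(complexity, threshold=0.5):
    """Route simple tasks to the lightweight backbone; reserve the heavy
    backbone (higher accuracy, higher energy cost) for complex tasks."""
    if not 0.0 <= complexity <= 1.0:
        raise ValueError("complexity score must lie in [0, 1]")
    return "light" if complexity <= threshold else "heavy"
```

Because most routine manipulation steps are simple, the lightweight path dominates at run time, which is what drives the energy-efficiency gains.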