RhinoVLA Technical Report

📅 2026-06-05

🤖 AI Summary

Deploying Vision-Language-Action (VLA) models at the edge is hindered by the latency that visual and context tokens add to the VLM. This work proposes RhinoVLA, co-designed with the Huixi R1 edge SoC. It pairs a token-efficient Qwen3-VL backbone with a continuous Action Expert, and adds a unified multi-robot interface (a View Registry, a 72-dimensional physical state-action slot space, and robot-instance LoRA modules) that aligns heterogeneous robot observations and action schemas under a shared policy. With hardware-aware compilation, mixed-precision execution, and parallel visual encoding, the system reaches an end-to-end inference rate of 11.69 Hz on the Huixi R1, meeting the 10 Hz real-time closed-loop control target while achieving downstream task performance comparable to π0.5 at a similar parameter scale.
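
To make the 72-dimensional slot idea concrete, here is a minimal sketch of how robot-native state or action vectors could be scattered into a shared fixed-width vector with a validity mask. Only the width of 72 comes from the report. The robot names, the slot assignments in `SLOT_LAYOUT`, and the `pack`/`unpack` helpers are hypothetical and are not taken from RhinoVLA.

```python
from typing import Dict, List, Sequence, Tuple

SLOT_DIM = 72  # slot width stated in the report

# Hypothetical layout: which shared slots each robot type occupies.
# The real slot assignment is not described in this summary.
SLOT_LAYOUT: Dict[str, List[int]] = {
    "single_arm": list(range(0, 8)),   # assumed: 7 joints + 1 gripper
    "dual_arm": list(range(0, 16)),    # assumed: two such arms
}


def pack(robot: str, values: Sequence[float]) -> Tuple[List[float], List[bool]]:
    """Scatter a robot-native vector into the shared slot space.

    Returns the 72-wide slot vector and a mask marking which slots are valid.
    """
    indices = SLOT_LAYOUT[robot]
    if len(values) != len(indices):
        raise ValueError(f"{robot} expects {len(indices)} values, got {len(values)}")
    slots = [0.0] * SLOT_DIM
    mask = [False] * SLOT_DIM
    for idx, value in zip(indices, values):
        slots[idx] = float(value)
        mask[idx] = True
    return slots, mask


def unpack(robot: str, slots: Sequence[float]) -> List[float]:
    """Gather the robot-native vector back out of a 72-wide slot vector."""
    return [slots[idx] for idx in SLOT_LAYOUT[robot]]
```

Under these assumptions, every robot presents the policy with the same input and output width, and the mask lets a loss or controller ignore slots that a given robot does not use.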

📝 Abstract

Vision-Language-Action (VLA) models have shown strong potential for robotic manipulation, but real-time deployment on edge hardware remains challenging. In this work, we identify VLM visual and context tokens as a major source of deployment latency: for GEMM-dominated projection operators, computation grows linearly with the number of input tokens when model dimensions are fixed. Motivated by this observation, we propose RhinoVLA, a deployment-oriented VLA model co-designed with the Huixi R1 edge SoC. RhinoVLA adopts a token-efficient Qwen3-VL backbone and a continuous Action Expert, reducing the VLM-side token and computation burden while preserving pretrained multimodal capability. To support cross-robot learning, RhinoVLA further introduces a unified interface that combines a View Registry, a 72D physical state-action slot space, and robot-instance LoRA, allowing heterogeneous robot observations and action schemas to be aligned under a shared policy. On the deployment side, RhinoVLA is optimized through hardware-aware compilation, mixed-precision execution, and parallel visual encoding. Experiments show that RhinoVLA achieves downstream performance comparable to π0.5 at a similar parameter scale, while reaching 11.69 Hz end-to-end inference on Huixi R1, meeting the 10 Hz real-time closed-loop control target. The project will be open-sourced at https://github.com/HuixiAI/RhinoVLA.
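
The linear-scaling claim and the control-rate target can be checked with a short back-of-the-envelope sketch. It uses the standard count of 2·N·d_in·d_out floating-point operations for a dense projection of N tokens, and converts a rate in Hz to a per-step time. The hidden size `D_MODEL` and the token counts are illustrative assumptions, not figures from the report.

```python
def projection_flops(num_tokens: int, d_in: int, d_out: int) -> int:
    """FLOPs of a dense projection (num_tokens x d_in) @ (d_in x d_out).

    Each output element needs d_in multiply-adds, counted as 2 FLOPs each.
    """
    return 2 * num_tokens * d_in * d_out


def period_ms(rate_hz: float) -> float:
    """Time available per control step, in milliseconds, at a given rate."""
    return 1000.0 / rate_hz


D_MODEL = 2048  # hypothetical hidden size, for illustration only

if __name__ == "__main__":
    # With model dimensions fixed, doubling the tokens doubles the GEMM cost.
    for n in (256, 512, 1024):  # illustrative token counts
        print(n, projection_flops(n, D_MODEL, D_MODEL))

    # Per-step time at the reported rate versus the 10 Hz target.
    print(f"{period_ms(11.69):.2f} ms per step vs {period_ms(10.0):.2f} ms budget")
```

At 11.69 Hz each inference step takes roughly 85.5 ms, which fits inside the 100 ms budget implied by a 10 Hz control loop.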

Problem

Research questions and friction points this paper is trying to address.

Vision-Language-Action

real-time deployment

edge hardware

token efficiency

robotic manipulation

Innovation

Methods, ideas, or system contributions that make the work stand out.

Vision-Language-Action (VLA)

token efficiency

edge deployment