EdgeVLA: Efficient Vision-Language-Action Models

📅 2025-07-18
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the slow inference and deployment challenges of large-scale vision-language-action (VLA) models on resource-constrained mobile devices, this paper proposes an efficient edge-oriented VLA framework. First, it introduces a non-autoregressive end-effector pose prediction module that replaces conventional autoregressive decoding, yielding a 7× inference speedup. Second, it substitutes large language models (LLMs) with lightweight small language models (SLMs), significantly reducing computational and memory overhead. Third, it combines vision-language alignment pretraining with end-to-end inference optimization. Experiments indicate that the proposed method maintains downstream task performance comparable to OpenVLA while substantially reducing inference latency and GPU memory consumption, enabling real-time visuomotor control on mobile platforms.
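The 7× figure follows directly from the decoding scheme: if each decoded token costs one transformer forward pass, emitting a 7-DoF end-effector pose (position, orientation, gripper) autoregressively costs seven passes, while joint non-autoregressive prediction costs one. The toy decoder below is a hypothetical illustration of that accounting, not EVLA's actual implementation; all class and method names are assumptions.

```python
import numpy as np

ACTION_DIMS = 7  # x, y, z, roll, pitch, yaw, gripper

class ToyDecoder:
    """Toy stand-in for a VLA action decoder that counts forward passes.

    Illustrative only: names and behavior are assumptions, not EVLA's API.
    """

    def __init__(self, seed: int = 0):
        rng = np.random.default_rng(seed)
        self.w = rng.standard_normal((ACTION_DIMS, ACTION_DIMS))
        self.forward_calls = 0

    def forward(self, ctx: np.ndarray) -> np.ndarray:
        # one full "transformer" pass: context -> values for all action dims
        self.forward_calls += 1
        return np.tanh(self.w @ ctx)

    def decode_autoregressive(self, ctx: np.ndarray) -> np.ndarray:
        # conventional decoding: one forward pass per action token,
        # each conditioned on the tokens emitted so far
        action = np.zeros(ACTION_DIMS)
        for i in range(ACTION_DIMS):
            action[i] = self.forward(ctx)[i]
            ctx = np.append(ctx[1:], action[i])  # fold prediction back in
        return action

    def decode_joint(self, ctx: np.ndarray) -> np.ndarray:
        # non-autoregressive decoding: the whole pose in a single pass
        return self.forward(ctx)

ctx = np.ones(ACTION_DIMS)
ar, nar = ToyDecoder(), ToyDecoder()
ar.decode_autoregressive(ctx)
nar.decode_joint(ctx)
print(ar.forward_calls, nar.forward_calls)  # 7 vs. 1 forward passes
```

In a real model the per-pass cost dominates latency, so collapsing seven sequential passes into one roughly accounts for the reported speedup, at the cost of giving up conditioning on previously emitted action tokens.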

📝 Abstract
Vision-Language Models (VLMs) have emerged as a promising approach to address the data scarcity challenge in robotics, enabling the development of generalizable visuomotor control policies. While models like OpenVLA showcase the potential of this paradigm, deploying large-scale VLMs on resource-constrained mobile manipulation systems remains a significant hurdle. This paper introduces Edge VLA (EVLA), a novel approach designed to significantly enhance the inference speed of Vision-Language-Action (VLA) models. EVLA maintains the representational power of these models while enabling real-time performance on edge devices. We achieve this through two key innovations: 1) Eliminating the autoregressive requirement for end-effector position prediction, leading to a 7x speedup in inference, and 2) Leveraging the efficiency of Small Language Models (SLMs), demonstrating comparable training performance to larger models with significantly reduced computational demands. Our early results demonstrate that EVLA achieves comparable training characteristics to OpenVLA while offering substantial gains in inference speed and memory efficiency. We release our model checkpoints and training codebase (https://github.com/kscalelabs/evla) to foster further research.
Problem

Research questions and friction points this paper is trying to address.

Enhancing VLA model speed for edge devices
Reducing computational demands in robotics VLMs
Enabling real-time visuomotor control on mobile devices
Innovation

Methods, ideas, or system contributions that make the work stand out.

Eliminates autoregressive position prediction for speed
Uses Small Language Models for efficiency
Enables real-time VLA performance on edge devices
Authors

Paweł Budzianowski (Unknown affiliation)
Wesley Maa (Unknown affiliation)
Matthew Freed (K-Scale Labs)
Jingxiang Mo (McGill University)
Winston Hsiao (K-Scale Labs)
Aaron Xie (K-Scale Labs)
Tomasz Młoduchowski (K-Scale Labs)
Viraj Tipnis (K-Scale Labs)
Benjamin Bolte (K-Scale Labs)