EdgeVLA: Efficient Vision-Language-Action Models

📅 2025-07-18
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the slow inference and deployment challenges of large-scale vision-language-action (VLA) models on resource-constrained mobile devices, this paper proposes an efficient edge-oriented VLA framework. First, it introduces a non-autoregressive end-effector pose prediction module that replaces conventional autoregressive decoding, yielding a 7× inference speedup. Second, it substitutes large language models (LLMs) with lightweight small language models (SLMs), significantly reducing computational and memory overhead. Third, it combines vision-language alignment pretraining with end-to-end inference optimization. Experiments indicate that the proposed method maintains downstream task performance comparable to OpenVLA while substantially reducing inference latency and GPU memory consumption, enabling real-time visuomotor control on mobile platforms.
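The 7× figure follows directly from the decoding scheme: if each decoded token costs one transformer forward pass, emitting a 7-DoF end-effector pose (position, orientation, gripper) autoregressively costs seven passes, while joint non-autoregressive prediction costs one. The toy decoder below is a hypothetical illustration of that accounting, not EVLA's actual implementation; all class and method names are assumptions.

```python
import numpy as np

ACTION_DIMS = 7  # x, y, z, roll, pitch, yaw, gripper

class ToyDecoder:
    """Toy stand-in for a VLA action decoder that counts forward passes.

    Illustrative only: names and behavior are assumptions, not EVLA's API.
    """

    def __init__(self, seed: int = 0):
        rng = np.random.default_rng(seed)
        self.w = rng.standard_normal((ACTION_DIMS, ACTION_DIMS))
        self.forward_calls = 0

    def forward(self, ctx: np.ndarray) -> np.ndarray:
        # one full "transformer" pass: context -> values for all action dims
        self.forward_calls += 1
        return np.tanh(self.w @ ctx)

    def decode_autoregressive(self, ctx: np.ndarray) -> np.ndarray:
        # conventional decoding: one forward pass per action token,
        # each conditioned on the tokens emitted so far
        action = np.zeros(ACTION_DIMS)
        for i in range(ACTION_DIMS):
            action[i] = self.forward(ctx)[i]
            ctx = np.append(ctx[1:], action[i])  # fold prediction back in
        return action

    def decode_joint(self, ctx: np.ndarray) -> np.ndarray:
        # non-autoregressive decoding: the whole pose in a single pass
        return self.forward(ctx)

ctx = np.ones(ACTION_DIMS)
ar, nar = ToyDecoder(), ToyDecoder()
ar.decode_autoregressive(ctx)
nar.decode_joint(ctx)
print(ar.forward_calls, nar.forward_calls)  # 7 vs. 1 forward passes
```

In a real model the per-pass cost dominates latency, so collapsing seven sequential passes into one roughly accounts for the reported speedup, at the cost of giving up conditioning on previously emitted action tokens.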

📝 Abstract
Vision-Language Models (VLMs) have emerged as a promising approach to address the data scarcity challenge in robotics, enabling the development of generalizable visuomotor control policies. While models like OpenVLA showcase the potential of this paradigm, deploying large-scale VLMs on resource-constrained mobile manipulation systems remains a significant hurdle. This paper introduces Edge VLA (EVLA), a novel approach designed to significantly enhance the inference speed of Vision-Language-Action (VLA) models. EVLA maintains the representational power of these models while enabling real-time performance on edge devices. We achieve this through two key innovations: 1) Eliminating the autoregressive requirement for end-effector position prediction, leading to a 7x speedup in inference, and 2) Leveraging the efficiency of Small Language Models (SLMs), demonstrating comparable training performance to larger models with significantly reduced computational demands. Our early results demonstrate that EVLA achieves comparable training characteristics to OpenVLA while offering substantial gains in inference speed and memory efficiency. We release our model checkpoints and training codebase (https://github.com/kscalelabs/evla) to foster further research.
Problem

Research questions and friction points this paper is trying to address.

Enhancing VLA model speed for edge devices
Reducing computational demands in robotics VLMs
Enabling real-time visuomotor control on mobile devices
Innovation

Methods, ideas, or system contributions that make the work stand out.

Eliminates autoregressive position prediction for speed
Uses Small Language Models for efficiency
Enables real-time VLA performance on edge devices
Authors

Paweł Budzianowski (Unknown affiliation)
Wesley Maa (Unknown affiliation)
Matthew Freed (K-Scale Labs)
Jingxiang Mo (McGill University)
Winston Hsiao (K-Scale Labs)
Aaron Xie (K-Scale Labs)
Tomasz Młoduchowski (K-Scale Labs)
Viraj Tipnis (K-Scale Labs)
Benjamin Bolte (K-Scale Labs)