🤖 AI Summary
Current vision-language-action (VLA) models suffer from excessive parameter counts and reliance on large-scale robotic pretraining data, leading to high computational costs, deployment challenges, and degradation of the vision-language backbone's representational capacity, which in turn causes overfitting and poor generalization on downstream tasks. To address this, we propose a lightweight VLA framework featuring a two-stage training paradigm and a cross-modulated diffusion Transformer architecture that preserves the native semantic representations of the multimodal vision-language model (VLM) without any robot-data pretraining. An optimized multimodal integration module further enhances coordination between perception and action control. With only 770 million parameters, our model achieves state-of-the-art performance, outperforming prior methods by 12.4% on Meta-World and 6.9% on RoboTwin, and attains a 78% task success rate in real-world settings while enabling efficient inference with a low memory footprint.
📝 Abstract
Vision-Language-Action (VLA) models have emerged as a powerful framework that unifies perception, language, and control, enabling robots to perform diverse tasks through multimodal understanding. However, current VLA models typically contain massive parameter counts and rely heavily on large-scale robot-data pretraining, leading to high computational costs during training as well as limited deployability for real-time inference. Moreover, prevailing training paradigms often degrade the perceptual representations of the vision-language backbone, resulting in overfitting and poor generalization to downstream tasks. In this work, we present Evo-1, a lightweight VLA model that reduces computation and improves deployment efficiency while maintaining strong performance without pretraining on robot data. Evo-1 builds on a native multimodal vision-language model (VLM), incorporating a novel cross-modulated diffusion Transformer along with an optimized integration module, together forming an effective architecture. We further introduce a two-stage training paradigm that progressively aligns action with perception, preserving the representations of the VLM. Notably, with only 0.77 billion parameters, Evo-1 achieves state-of-the-art results on the Meta-World and RoboTwin benchmarks, surpassing the previous best models by 12.4% and 6.9%, respectively, and also attains a competitive 94.8% success rate on LIBERO. In real-world evaluations, Evo-1 attains a 78% success rate with high inference frequency and low memory overhead, outperforming all baseline methods. We release code, data, and model weights to facilitate future research on lightweight and efficient VLA models.