VOTE: Vision-Language-Action Optimization with Trajectory Ensemble Voting

📅 2025-07-07
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Current Vision-Language-Action (VLA) models exhibit limited generalization to novel objects and unseen environments, while auxiliary modules—such as depth estimation, segmentation, or diffusion models—introduce substantial computational overhead and impair inference efficiency. This work proposes an efficient generalization framework that eliminates tokenizers, employs parallel fine-tuning, and integrates trajectory-level voting for rapid, robust action prediction—without relying on external visual modules. Key innovations include a tokenizer-free fine-tuning paradigm, parallel action decoding, and a lightweight ensemble strategy. Evaluated across multiple robotic manipulation benchmarks, the method achieves state-of-the-art (SOTA) performance, accelerates inference by 35×, and attains a throughput of 145 Hz—demonstrating unprecedented balance between generalization capability and real-time execution.

📝 Abstract
Recent large-scale Vision Language Action (VLA) models have shown superior performance in robotic manipulation tasks guided by natural language. However, their generalization remains limited when applied to novel objects or unfamiliar environments that lie outside the training distribution. To address this, many existing approaches integrate additional components such as depth estimation, segmentation, or even diffusion models to improve generalization, at the cost of significant computational overhead, resulting in low efficiency. This motivates the exploration of efficient action prediction methods that are independent of additional high-level visual representations or diffusion techniques. In this work, we propose VOTE, an efficient and general framework for the optimization and acceleration of VLA models. In detail, we propose a novel tokenizer-free fine-tuning approach for parallel, accurate action prediction, which reduces computational overhead and accelerates inference speed. Additionally, we adopt an ensemble voting strategy for action sampling, which significantly improves model performance and enhances generalization. Experimental results show that our method achieves state-of-the-art performance with 35× faster inference and 145 Hz throughput. All the details and codes will be open-sourced.
Problem

Research questions and friction points this paper is trying to address.

Improving generalization of Vision-Language-Action models for novel objects and environments
Reducing computational overhead without additional visual representations
Accelerating inference speed while maintaining high performance
Innovation

Methods, ideas, or system contributions that make the work stand out.

Tokenizer-free fine-tuning for parallel action prediction
Ensemble voting strategy enhances generalization
Achieves 35× faster inference with 145 Hz throughput
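The paper describes trajectory-level ensemble voting for action sampling but the summary above does not spell out the mechanism. A minimal illustrative sketch, assuming candidate action chunks are continuous vectors and the vote is a majority-style consensus in action space (the function name, tolerance parameter, and clustering rule are hypothetical, not taken from the paper):

```python
import numpy as np

def ensemble_vote(action_chunks, tolerance=0.05):
    """Pick a consensus action from several candidate action chunks.

    action_chunks: list of (T, D) arrays; each row 0 is the candidate
    action for the current timestep. We count, for each candidate, how
    many other candidates lie within `tolerance`, then return the mean
    of the best-supported cluster (a simple continuous-space vote).
    """
    firsts = np.stack([c[0] for c in action_chunks])            # (K, D)
    # pairwise distances between the K current-step candidates
    dists = np.linalg.norm(firsts[:, None, :] - firsts[None, :, :], axis=-1)
    support = (dists < tolerance).sum(axis=1)                   # votes per candidate
    winner = support.argmax()
    members = firsts[dists[winner] < tolerance]                 # winning cluster
    return members.mean(axis=0)
```

With three near-identical candidates and one outlier, the outlier is outvoted and the returned action is the mean of the agreeing cluster, which is the intuition behind rejecting unreliable predictions without extra visual modules.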
Authors
Juyi Lin, Northeastern University
Amir Taherin, Northeastern University
Arash Akbari, Northeastern University
Arman Akbari, Northeastern University
Lei Lu, Northeastern University
Guangyu Chen, Northeastern University
Taskin Padir, Professor, Northeastern University; Scholar, Amazon (Robotics)
Xiaomeng Yang, Northeastern University
Weiwei Chen, EmbodyX Inc
Yiqian Li, Northeastern University
Xue Lin, Northeastern University (electrical and computer engineering)
David Kaeli, Northeastern University
Pu Zhao, Northeastern University
Yanzhi Wang, Northeastern University