FASTer: Toward Efficient Autoregressive Vision-Language-Action Modeling via Neural Action Tokenization

📅 2025-12-04
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
Autoregressive vision-language-action (VLA) models face a fundamental trade-off between reconstruction fidelity and inference efficiency in action tokenization. To address this, we propose FASTer, a novel end-to-end action modeling framework that encodes continuous action sequences as single-channel action images. FASTer integrates a learnable tokenizer, vector quantization (VQ), and block-wise autoregressive decoding, augmented by a lightweight action expert network. This design enables high-fidelity action reconstruction at aggressive compression ratios while significantly improving cross-task and cross-morphology generalization. Evaluated on both simulation and real-robot benchmarks, FASTer achieves higher task success rates with substantially faster inference speed, consistently outperforming existing state-of-the-art methods.
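The core tokenization idea can be illustrated with a minimal sketch. This is not the paper's implementation: the chunk size, patch size, codebook size, and the random "learned" codebook are all assumptions made for illustration. It shows the general shape of the technique: an action chunk (timesteps × action dims) is treated as a single-channel image, split into patches, and each patch is assigned to its nearest codebook vector, yielding a short sequence of discrete tokens.

```python
import numpy as np

# Illustrative sketch only (not FASTer's actual implementation): treat an
# action chunk of T timesteps x D action dims as a single-channel "image",
# then vector-quantize patches of it against a codebook.

rng = np.random.default_rng(0)

T, D = 16, 7            # chunk length, action dimension (assumed values)
K, patch = 64, 4        # codebook size, patch length (assumed values)

chunk = rng.normal(size=(T, D))          # continuous action chunk
image = chunk[None, :, :]                # add channel axis -> (1, T, D)

# Flatten non-overlapping patches of `patch` timesteps into vectors.
patches = image[0].reshape(T // patch, patch * D)   # (4, 28)

# In FASTer the codebook is learned; here it is random for illustration.
codebook = rng.normal(size=(K, patch * D))

# Nearest-neighbour assignment: one discrete token id per patch.
d2 = ((patches[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
tokens = d2.argmin(axis=1)               # (4,) token ids in [0, K)

# Decoding reconstructs the chunk from the selected codebook vectors.
recon = codebook[tokens].reshape(T, D)
print(tokens.shape, recon.shape)         # 16x7 floats compressed to 4 ids
```

The compression ratio here is aggressive: 112 continuous values become 4 discrete tokens, which is what makes autoregressive decoding over the tokens cheap.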


πŸ“ Abstract
Autoregressive vision-language-action (VLA) models have recently demonstrated strong capabilities in robotic manipulation. However, their core process of action tokenization often involves a trade-off between reconstruction fidelity and inference efficiency. We introduce FASTer, a unified framework for efficient and generalizable robot learning that integrates a learnable tokenizer with an autoregressive policy built upon it. FASTerVQ encodes action chunks as single-channel images, capturing global spatio-temporal dependencies while maintaining a high compression ratio. FASTerVLA builds on this tokenizer with block-wise autoregressive decoding and a lightweight action expert, achieving both faster inference and higher task performance. Extensive experiments across simulated and real-world benchmarks show that FASTerVQ delivers superior reconstruction quality, high token utilization, and strong cross-task and cross-embodiment generalization, while FASTerVLA further improves overall capability, surpassing previous state-of-the-art VLA models in both inference speed and task performance.
Problem

Research questions and friction points this paper is trying to address.

Improves action tokenization efficiency in VLA models
Enhances robot task performance with faster inference
Achieves superior cross-task generalization in robotics
Innovation

Methods, ideas, or system contributions that make the work stand out.

Encodes action chunks as single-channel images
Uses block-wise autoregressive decoding for efficiency
Integrates lightweight action expert for performance
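The efficiency gain from block-wise decoding can be sketched as follows. Everything here is a hedged toy: `fake_policy`, the block size, and the vocabulary are illustrative stand-ins, not FASTer's API. The point is structural: emitting a block of B tokens per forward pass divides the number of sequential model calls by B relative to token-by-token decoding.

```python
import numpy as np

# Hedged sketch of block-wise autoregressive decoding: the policy emits a
# block of B tokens per forward pass instead of one token at a time,
# reducing sequential model calls. `fake_policy` is a stand-in for the
# VLA backbone (assumed interface, not the paper's API).

rng = np.random.default_rng(1)
VOCAB, B, N_BLOCKS = 64, 4, 3   # codebook size, block size, blocks/chunk

def fake_policy(prefix):
    """Predict B token ids given all previously decoded tokens."""
    # A real model would run a transformer forward pass over `prefix`;
    # here we return argmax over random logits for illustration.
    logits = rng.normal(size=(B, VOCAB))
    return logits.argmax(axis=1)

tokens = []
calls = 0
for _ in range(N_BLOCKS):
    block = fake_policy(np.array(tokens, dtype=int))
    tokens.extend(block.tolist())
    calls += 1

print(len(tokens), calls)   # 12 tokens decoded in only 3 forward passes
```

Token-by-token decoding of the same chunk would need 12 sequential passes; block-wise decoding needs 3, which is the source of the inference speedup the summary describes.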