AI Summary
Autoregressive vision-language-action (VLA) models face a fundamental trade-off between reconstruction fidelity and inference efficiency in action tokenization. To address this, we propose FASTer, a novel end-to-end action modeling framework that encodes continuous action sequences as single-channel action images. FASTer integrates a learnable tokenizer, vector quantization (VQ), and block-wise autoregressive decoding, augmented by a lightweight action expert network. This design enables high-fidelity action reconstruction at aggressive compression ratios while significantly improving cross-task and cross-morphology generalization. Evaluated on both simulation and real-robot benchmarks, FASTer achieves higher task success rates with substantially faster inference, consistently outperforming existing state-of-the-art methods.
Abstract
Autoregressive vision-language-action (VLA) models have recently demonstrated strong capabilities in robotic manipulation. However, their core process of action tokenization often involves a trade-off between reconstruction fidelity and inference efficiency. We introduce FASTer, a unified framework for efficient and generalizable robot learning that integrates a learnable tokenizer with an autoregressive policy built upon it. FASTerVQ encodes action chunks as single-channel images, capturing global spatio-temporal dependencies while maintaining a high compression ratio. FASTerVLA builds on this tokenizer with block-wise autoregressive decoding and a lightweight action expert, achieving both faster inference and higher task performance. Extensive experiments across simulated and real-world benchmarks show that FASTerVQ delivers superior reconstruction quality, high token utilization, and strong cross-task and cross-embodiment generalization, while FASTerVLA further improves overall capability, surpassing previous state-of-the-art VLA models in both inference speed and task performance.
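To make the "action chunk as single-channel image" idea concrete, here is a minimal sketch of the two steps the abstract describes: arranging a continuous action chunk of shape (timesteps, action dims) as a one-channel image, then compressing it with nearest-neighbor vector quantization against a codebook. All names (`actions_to_image`, `vq_encode`), the normalization scheme, and the shapes are illustrative assumptions, not FASTerVQ's actual implementation:

```python
import numpy as np

def actions_to_image(chunk):
    """Treat a (T, D) continuous action chunk as a single-channel image.

    Illustrative only: the paper's normalization and patching are not
    specified here, so we just min-max scale to [0, 1].
    """
    lo, hi = chunk.min(), chunk.max()
    img = (chunk - lo) / (hi - lo + 1e-8)  # normalize values to [0, 1]
    return img[None, :, :]                 # add channel axis -> (1, T, D)

def vq_encode(img, codebook):
    """Nearest-neighbor vector quantization of each row of the image."""
    flat = img.reshape(-1, codebook.shape[1])             # (N, d) vectors
    d2 = ((flat[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    return d2.argmin(axis=1)                              # discrete token ids

rng = np.random.default_rng(0)
chunk = rng.normal(size=(16, 8))       # 16 timesteps, 8 action dimensions
codebook = rng.normal(size=(256, 8))   # hypothetical codebook: 256 codes
img = actions_to_image(chunk)
tokens = vq_encode(img, codebook)
print(tokens.shape)                    # 128 continuous values -> 16 tokens
```

The compression here is the point: the autoregressive policy then only has to predict the short discrete token sequence rather than every continuous action value.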