AC²-VLA: Action-Context-Aware Adaptive Computation in Vision-Language-Action Models for Efficient Robotic Manipulation

📅 2026-01-27
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the high latency and computational overhead of Vision-Language-Action (VLA) models in robotic manipulation, which stem from repeatedly executing large vision-language backbones at every timestep while existing acceleration methods neglect action context. To close this gap, the authors propose AC²-VLA, a framework that, for the first time, incorporates action context (current visual inputs, language instructions, and historical action states) into an adaptive computation mechanism. This enables cognition reuse and selective execution across temporal, spatial, and depth dimensions via token pruning and dynamic activation of model components. In addition, an action-guided self-distillation strategy yields structured sparsity that generalizes across tasks and environments. On standard robotic manipulation benchmarks, AC²-VLA achieves up to a 1.79× speedup and reduces FLOPs to 29.4% of the baseline while maintaining comparable task success rates.
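The paper does not publish implementation details here, but the two core ideas in the summary, action-context-conditioned token pruning and cognition reuse across timesteps, can be illustrated with a minimal NumPy sketch. Everything below (`W_score`, `keep_ratio`, `tau`, the 7-dim action state) is an illustrative assumption, not the authors' actual method:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical shapes: 256 visual tokens of dim 64; a 7-dim previous action
# state (e.g. end-effector pose + gripper), as in many manipulation setups.
visual_tokens = rng.standard_normal((256, 64))
action_state = rng.standard_normal(7)
W_score = rng.standard_normal(64 + 7) * 0.1  # hypothetical learned scoring vector

def prune_tokens(tokens, action, keep_ratio=0.5):
    """Score each visual token against the action context; keep the top fraction."""
    ctx = np.concatenate([tokens, np.tile(action, (tokens.shape[0], 1))], axis=1)
    scores = ctx @ W_score                      # action-conditioned relevance per token
    k = max(1, int(keep_ratio * tokens.shape[0]))
    keep = np.argsort(scores)[-k:]              # indices of the k highest-scoring tokens
    return tokens[keep]

def should_reuse(prev_action, curr_action, tau=0.05):
    """Reuse cached backbone output when the action state barely changed."""
    return np.linalg.norm(curr_action - prev_action) < tau

pruned = prune_tokens(visual_tokens, action_state, keep_ratio=0.25)
print(pruned.shape)  # (64, 64): 25% of the 256 tokens survive
```

The pruning and the reuse test are the spatial and temporal halves of the adaptive mechanism the summary describes; the actual framework additionally gates model components along the depth dimension.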

📝 Abstract
Vision-Language-Action (VLA) models have demonstrated strong performance in robotic manipulation, yet their closed-loop deployment is hindered by the high latency and compute cost of repeatedly running large vision-language backbones at every timestep. We observe that VLA inference exhibits structured redundancies across temporal, spatial, and depth dimensions, and that most existing efficiency methods ignore action context, despite its central role in embodied tasks. To address this gap, we propose Action-Context-aware Adaptive Computation for VLA models (AC²-VLA), a unified framework that conditions computation on current visual observations, language instructions, and previous action states. Based on this action-centric context, AC²-VLA adaptively performs cognition reuse across timesteps, token pruning, and selective execution of model components within a unified mechanism. To train the adaptive policy, we introduce an action-guided self-distillation scheme that preserves the behavior of the dense VLA policy while enabling structured sparsification that transfers across tasks and settings. Extensive experiments on robotic manipulation benchmarks show that AC²-VLA achieves up to a 1.79× speedup while reducing FLOPs to 29.4% of the dense baseline, with comparable task success.
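The action-guided self-distillation the abstract mentions trains the sparse (adaptive) policy to preserve the dense policy's behavior. A plausible minimal form is a regression loss between the two policies' action outputs, optionally weighted by a per-step importance signal; the loss shape, the 7-dim action vectors, and the weighting below are all assumptions for illustration, not the paper's training objective:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical batch: dense (teacher) and sparse (student) policies each
# predict a 7-dim continuous action for 32 timesteps.
dense_actions = rng.standard_normal((32, 7))
sparse_actions = dense_actions + 0.01 * rng.standard_normal((32, 7))

def self_distill_loss(student, teacher, weights=None):
    """MSE between sparse-policy and dense-policy actions.

    `weights` is a hypothetical per-sample importance term (e.g. emphasizing
    contact-rich steps where action fidelity matters most).
    """
    per_sample = ((student - teacher) ** 2).mean(axis=1)
    if weights is not None:
        per_sample = per_sample * weights
    return per_sample.mean()

loss = self_distill_loss(sparse_actions, dense_actions)
print(float(loss))  # small, since the student is near the teacher
```

Because the supervision signal is the dense policy's own outputs rather than task labels, the sparsified model can be trained without new demonstrations, which is consistent with the abstract's claim that the sparsification transfers across tasks and settings.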
Problem

Research questions and friction points this paper is trying to address.

Vision-Language-Action models
computational efficiency
action context
robotic manipulation
adaptive computation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Action-Context-Aware
Adaptive Computation
Vision-Language-Action Models
Structured Sparsification
Self-Distillation