🤖 AI Summary
This work proposes a recurrent-depth vision-language-action (VLA) architecture based on implicit iterative inference, addressing a limitation of existing VLA models: they employ a fixed computational depth at inference and cannot allocate computation according to task complexity. By leveraging a weight-shared recurrent action head, the model supports arbitrary inference depths while keeping memory consumption constant. The approach introduces implicit latent iteration into VLA for the first time, combining truncated backpropagation through time with a convergence-based adaptive stopping mechanism in the latent space to enable test-time computational scaling. On complex manipulation tasks, the method boosts success rates from 0% to over 90% with four inference iterations, and achieves up to an 80× speedup over prior reasoning-based approaches without increasing memory overhead.
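The core loop described above, a single weight-tied update applied repeatedly until the latent stops changing, can be sketched in a few lines. This is a minimal pure-Python toy, not the authors' implementation: the weights `W`, context `ctx`, and function names are illustrative, and the map is made contractive so the iteration provably converges.

```python
import math

# Toy sketch of RD-VLA-style latent refinement (illustrative, not the
# authors' code). One weight-tied step is applied until convergence.
W = [[0.3, -0.1], [0.2, 0.25]]   # shared weights, reused at every depth
ctx = [0.7, -0.4]                # stands in for fused vision-language context

def step(z):
    """One weight-tied refinement step: z <- tanh(W z + ctx)."""
    return [math.tanh(sum(W[i][j] * z[j] for j in range(2)) + ctx[i])
            for i in range(2)]

def refine(z, tol=1e-6, max_iters=64):
    """Adaptive stopping: iterate until the latent converges or a depth
    cap is hit. Only the current latent is kept, so memory is constant
    regardless of how many iterations are run."""
    for k in range(1, max_iters + 1):
        z_next = step(z)
        if max(abs(a - b) for a, b in zip(z_next, z)) < tol:
            return z_next, k     # converged: spend no further compute
        z = z_next
    return z, max_iters

z_star, depth = refine([0.0, 0.0])
```

Because depth is decided per input by the convergence test, easy inputs exit after a few steps while harder ones iterate longer, which is the test-time compute allocation the summary describes.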
📝 Abstract
Current Vision-Language-Action (VLA) models rely on fixed computational depth, expending the same amount of compute on simple adjustments and complex multi-step manipulation. While Chain-of-Thought (CoT) prompting enables variable computation, it scales memory linearly and is ill-suited for continuous action spaces. We introduce Recurrent-Depth VLA (RD-VLA), an architecture that achieves computational adaptivity via latent iterative refinement rather than explicit token generation. RD-VLA employs a recurrent, weight-tied action head that supports arbitrary inference depth with a constant memory footprint. The model is trained using truncated backpropagation through time (TBPTT) to efficiently supervise the refinement process. At inference, RD-VLA dynamically allocates compute using an adaptive stopping criterion based on latent convergence. Experiments on challenging manipulation tasks show that recurrent depth is critical: tasks that fail entirely (0% success) with single-iteration inference exceed 90% success with four iterations, while simpler tasks saturate rapidly. RD-VLA provides a scalable path to test-time compute in robotics, replacing token-based reasoning with latent reasoning to achieve constant memory usage and up to 80× inference speedup over prior reasoning-based VLA models. Project page: https://rd-vla.github.io/
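The TBPTT training mentioned in the abstract can be illustrated on a scalar weight-tied recurrence: gradients flow through only the last few refinement steps, with earlier latents treated as constants, so training cost does not grow with the full unroll depth. Everything below (the recurrence `tanh(w*z + c)`, the squared-error loss, the truncation length) is a hedged toy assumption, not the paper's actual objective.

```python
import math

# Hedged sketch of truncated backprop through time for a weight-tied
# scalar recurrence z <- tanh(w*z + c); names are illustrative only.

def rollout(w, c, z0, depth):
    """Full forward unroll; returns all latents z_0..z_depth."""
    zs = [z0]
    for _ in range(depth):
        zs.append(math.tanh(w * zs[-1] + c))
    return zs

def tbptt_grad_w(w, c, zs, target, trunc):
    """dL/dw for L = (z_K - target)^2, backpropagating through only the
    last `trunc` steps; earlier latents are treated as constants."""
    K = len(zs) - 1
    dz = 2.0 * (zs[K] - target)          # dL/dz_K
    gw = 0.0
    for k in range(K, max(K - trunc, 0), -1):
        dpre = dz * (1.0 - zs[k] ** 2)   # tanh'(pre) = 1 - tanh(pre)^2
        gw += dpre * zs[k - 1]           # shared weight: accumulate per step
        dz = dpre * w                    # flow to previous latent, then truncate
    return gw

# One gradient-descent step on w: depth-8 rollout, 3-step truncation.
w, c, target = 0.4, 0.5, 0.9
zs = rollout(w, c, 0.0, 8)
w_new = w - 0.5 * tbptt_grad_w(w, c, zs, target, trunc=3)
```

The truncation window bounds memory during training the same way the recurrent head bounds it at inference: only a fixed number of latents ever need to be retained.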