$μ$VLA: On Recurrent Memory for Partially Observable Manipulation in VLA Models

📅 2026-06-10

🤖 AI Summary

This work addresses the failure of vision-language-action (VLA) models in partially observable environments, where decisions depend on information that is no longer visible and the model keeps no memory of past observations. The authors propose a lightweight recurrent memory mechanism that adds only a small set of learnable memory tokens to a pretrained VLA backbone (OpenVLA-OFT). The tokens are carried across timesteps, updated via self-attention, and trained end to end with truncated backpropagation through time, without architectural changes or auxiliary losses. Because recurrence is the only varying factor, the design permits a controlled, isolated study of recurrence in VLA systems. On MIKASA-Robo, the average success rate on five training tasks improves from 0.42 to 0.84 at the strongest setting, and held-out tasks with the same memory structure reach 0.23 versus 0.07 for the memoryless baseline; on tasks requiring a different memory structure, performance remains near baseline. On fully observable LIBERO tasks, the strongest recurrent variant achieves 96.2% average success, indicating no regression.

📝 Abstract

Vision-language-action (VLA) models predict chunks of future actions from the current observation, an assumption that fails under partial observability, where decisions depend on information no longer visible. Existing memory-augmented VLAs simultaneously introduce recurrence, retrieval, compression modules, auxiliary objectives, hierarchical memory, or task-specific architectural changes, so the contribution of recurrence itself remains entangled with surrounding machinery. We present a controlled isolation study of recurrence in a strong pretrained VLA backbone. Our formulation augments the transformer with a small set of learnable memory tokens carried across timesteps and updated through self-attention, trained end to end with truncated backpropagation through time, with no auxiliary losses and no architectural changes. We instantiate this as $μ$VLA, a family of OpenVLA-OFT variants parameterized by memory width m, TBPTT length K, and the memory update rule (cross-step gradients or a detached EMA), so that recurrence is the only varying factor. On MIKASA-Robo, $μ$VLA improves average success rate on five training tasks from 0.42 to 0.84 at the strongest setting and reaches 0.23 on held-out tasks with the same memory structure versus 0.07 for the memoryless baseline. On tasks requiring different memory structure, performance remains near baseline. On LIBERO, the strongest recurrent variant achieves 96.2% average success, indicating no regression under full observability. We interpret these results as a calibration of the capability envelope of minimal in-backbone recurrence, identifying the regime in which it is sufficient and the regime where additional memory structure is required. Demos and videos can be found in https://avanturist322.github.io/mu-vla/.
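The recurrence described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: it assumes a single attention head with identity projections, two-dimensional tokens, and no autograd, so truncated backpropagation appears only as window boundaries. The names `attend`, `step_memory`, `rollout`, `tbptt_windows`, and `ema_alpha` are hypothetical and chosen for illustration.

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of floats.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attend(queries, keys_values):
    # Single-head dot-product attention with identity projections
    # (an assumption; a real backbone uses learned Q/K/V projections).
    d = len(queries[0])
    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d)
                  for k in keys_values]
        w = softmax(scores)
        out.append([sum(wj * kv[i] for wj, kv in zip(w, keys_values))
                    for i in range(d)])
    return out

def step_memory(memory, obs_tokens, ema_alpha=None):
    # One timestep: memory tokens attend over [memory; observation tokens].
    candidate = attend(memory, memory + obs_tokens)
    if ema_alpha is None:
        # Direct update; in training, gradients would flow across steps.
        return candidate
    # EMA update; in training, the old memory would be detached.
    return [[ema_alpha * old + (1.0 - ema_alpha) * new
             for old, new in zip(m_old, m_new)]
            for m_old, m_new in zip(memory, candidate)]

def rollout(memory, episode, ema_alpha=None):
    # Carry the memory tokens across every timestep of an episode.
    for obs_tokens in episode:
        memory = step_memory(memory, obs_tokens, ema_alpha)
    return memory

def tbptt_windows(num_steps, k):
    # Split an episode into windows of at most k steps; gradients would be
    # truncated at each window boundary.
    return [(s, min(s + k, num_steps)) for s in range(0, num_steps, k)]

# A cue is visible only at t = 0; later observations are blank.
init = [[0.5, 0.5]]
blank = [[0.0, 0.0]]
episode_a = [[[2.0, 0.0]]] + [blank] * 4
episode_b = [[[0.0, 2.0]]] + [blank] * 4
mem_a = rollout(init, episode_a)
mem_b = rollout(init, episode_b)
mem_a_ema = rollout(init, episode_a, ema_alpha=0.5)
```

In this toy setting the two episodes differ only in their first observation, yet the final memory still encodes which cue was shown: the first coordinate dominates in `mem_a` and the second in `mem_b`. A memoryless policy that sees only the final blank observation could not tell the episodes apart.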

Problem

Research questions and friction points this paper is trying to address.

partially observable environments

vision-language-action models

recurrent memory

memory-augmented models

action prediction

Innovation

Methods, ideas, or system contributions that make the work stand out.

recurrent memory

vision-language-action models

partial observability