Action-Effect Memory Pretraining for Robot Manipulation

📅 2026-06-10

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses a core difficulty in robot manipulation: under partial observability, a policy that conditions only on the current visual frame often lacks the information it needs to act. In contrast to prior pretraining methods that mainly encode single frames, the authors propose Action-Effect Memory (AEM), a pretraining framework that builds an action-effect memory from vision-action history. AEM combines a Mamba encoder, masked modeling, and interleaved vision-action fusion to learn compact, action-conditioned temporal state representations. The encoder output at the final vision token serves as a single-vector temporal bottleneck that carries global context while keeping inference efficient. Paired with Diffusion Policy and Flow Policy, AEM outperforms baselines in both simulation and real-world settings across clean scenes, cluttered and random scenes, and non-Markovian tasks. Ablations indicate that history-aware pretraining surpasses single-frame pretraining and direct frame stacking while reducing inference latency and computational cost.
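To make the interleaved vision-action fusion and masked modeling mentioned above more concrete, here is a minimal sketch in plain Python. It is an illustration under stated assumptions, not the authors' implementation: the function names `interleave` and `mask_tokens`, the `(kind, feature)` token layout, the choice to end the sequence on a vision token, and uniform random masking are all hypothetical.

```python
import random

MASK = None  # hypothetical placeholder marking a masked slot


def interleave(vision_feats, action_feats):
    """Build [v_0, a_0, v_1, a_1, ..., v_T] from T+1 vision and T action features.

    Assumption: each action sits between the frame it was taken in and the
    frame showing its effect, so the history ends on a vision token.
    """
    if len(vision_feats) != len(action_feats) + 1:
        raise ValueError("expected one more vision feature than action features")
    tokens = []
    for t, action in enumerate(action_feats):
        tokens.append(("vision", vision_feats[t]))
        tokens.append(("action", action))
    tokens.append(("vision", vision_feats[-1]))
    return tokens


def mask_tokens(tokens, mask_ratio, rng):
    """Hide a random subset of tokens and return (corrupted, targets).

    `targets` maps each masked position to its original feature, which is
    what a masked-modeling objective would ask the model to reconstruct.
    """
    n_mask = int(round(mask_ratio * len(tokens)))
    masked_idx = set(rng.sample(range(len(tokens)), n_mask))
    corrupted = [
        (kind, MASK if i in masked_idx else feat)
        for i, (kind, feat) in enumerate(tokens)
    ]
    targets = {i: tokens[i][1] for i in masked_idx}
    return corrupted, targets


if __name__ == "__main__":
    # Toy history: 4 frames (2-d features) and the 3 actions between them.
    vision = [[float(t), float(t)] for t in range(4)]
    actions = [[10.0 + t] for t in range(3)]
    history = interleave(vision, actions)
    corrupted, targets = mask_tokens(history, mask_ratio=0.4, rng=random.Random(0))
    print(len(history), "tokens,", len(targets), "masked")
```

Masking both vision and action slots in one interleaved sequence means a reconstruction loss has to use neighbouring actions to explain missing frames and neighbouring frames to explain missing actions, which is one plausible reading of "action-conditioned" state learning.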

📝 Abstract

We present AEM, an Action-Effect Memory pretraining framework for robot manipulation that learns compact temporal representations from vision-action history. Unlike prior robot representation pretraining methods that mainly focus on single-frame visual encoding, AEM targets the temporal nature of manipulation, where the current observation alone is often insufficient under partial observability. AEM models manipulation as an action-driven interaction process by interleaving visual and action features and applying masked modeling to recover missing content from incomplete histories, thereby learning action-conditioned state evolution. The Mamba-encoded output of the final vision token is used as a compact history representation, serving as the global context for decoding and downstream control. This design preserves a single-vector temporal bottleneck while keeping inference efficient. We evaluate AEM with Diffusion Policy and Flow Policy. AEM consistently improves manipulation performance in both simulation and real-world settings, outperforming baselines across clean scenes, cluttered and random scenes, and non-Markovian tasks. Ablation studies further show that history-aware pretraining surpasses single-frame pretraining and direct frame stacking, while reducing inference latency and computational cost.
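The single-vector readout described in the abstract can be sketched as follows. This is a hedged illustration only: a fixed diagonal linear recurrence stands in for the Mamba encoder (it has none of Mamba's input-dependent selectivity), and the name `encode_history`, the `decay` parameter, and the assumption that vision and action features are already projected to a common width are all hypothetical.

```python
def encode_history(tokens, decay):
    """Scan an interleaved history and return the state at the final vision token.

    tokens: list of (kind, feature) pairs, kind in {"vision", "action"},
            every feature a list of d floats (assumed pre-projected to width d).
    decay:  list of d floats in (0, 1); per-channel memory of the recurrence.

    Stand-in recurrence: h_t = decay * h_{t-1} + (1 - decay) * x_t, elementwise.
    """
    d = len(decay)
    state = [0.0] * d
    last_vision_state = None
    for kind, feat in tokens:
        if len(feat) != d:
            raise ValueError("all features must have the same width as decay")
        state = [a * h + (1.0 - a) * x for a, h, x in zip(decay, state, feat)]
        if kind == "vision":
            # Remember the encoder output at the most recent vision token.
            last_vision_state = list(state)
    if last_vision_state is None:
        raise ValueError("history contains no vision token")
    return last_vision_state


if __name__ == "__main__":
    history = [
        ("vision", [1.0, 0.0]),
        ("action", [0.0, 1.0]),
        ("vision", [1.0, 1.0]),
    ]
    context = encode_history(history, decay=[0.5, 0.5])
    print(context)  # one d-dimensional vector summarising the whole history
```

Because the downstream policy receives one fixed-size vector regardless of history length, its per-step cost does not grow with the number of past frames, in contrast to stacking raw frames as policy input.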

Problem

Research questions and friction points this paper is trying to address.

robot manipulation

partial observability

temporal representation

action-effect memory

history-aware pretraining

Innovation

Methods, ideas, or system contributions that make the work stand out.

Action-Effect Memory

Temporal Representation Learning

Masked Modeling