Action-Effect Memory Pretraining for Robot Manipulation

📅 2026-06-10
🤖 AI Summary
This work addresses a core challenge in robotic manipulation: under partial observability, the current visual frame alone is often insufficient for decision-making. The authors propose Action-Effect Memory (AEM), a pretraining framework that learns compact, action-conditioned temporal state representations from vision-action history. AEM interleaves visual and action features, applies masked modeling to recover missing content from incomplete histories, and encodes the sequence with a Mamba encoder. The encoded output of the final vision token serves as a single-vector temporal bottleneck that carries global context while keeping inference efficient. Evaluated with Diffusion Policy and Flow Policy in simulation and real-world settings, AEM outperforms baselines across clean scenes, cluttered and random scenes, and non-Markovian tasks. Ablations indicate that history-aware pretraining surpasses single-frame pretraining and direct frame stacking while reducing inference latency and computational cost.
📝 Abstract
We present AEM, an Action-Effect Memory pretraining framework for robot manipulation that learns compact temporal representations from vision-action history. Unlike prior robot representation pretraining methods that mainly focus on single-frame visual encoding, AEM targets the temporal nature of manipulation, where the current observation alone is often insufficient under partial observability. AEM models manipulation as an action-driven interaction process by interleaving visual and action features and applying masked modeling to recover missing content from incomplete histories, thereby learning action-conditioned state evolution. The Mamba-encoded output of the final vision token is used as a compact history representation, serving as the global context for decoding and downstream control. This design preserves a single-vector temporal bottleneck while keeping inference efficient. We evaluate AEM with Diffusion Policy and Flow Policy. AEM consistently improves manipulation performance in both simulation and real-world settings, outperforming baselines across clean scenes, cluttered and random scenes, and non-Markovian tasks. Ablation studies further show that history-aware pretraining surpasses single-frame pretraining and direct frame stacking, while reducing inference latency and computational cost.
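The data flow described in the abstract (interleave vision and action features, mask part of the history, encode the sequence, read out the final vision token as a single history vector) can be illustrated with a rough sketch. This is not the authors' implementation. The token ordering (v_1, a_1, ..., v_T), the zero mask token, the mask ratio, and all function names are assumptions made for illustration, and a simple decaying linear recurrence stands in for the Mamba encoder.

```python
import numpy as np


def interleave(vision, action):
    """Interleave features as v_1, a_1, v_2, a_2, ..., v_T (assumed ordering).

    vision: (T, D) per-frame features; action: (T-1, D) per-step features.
    Returns a (2T-1, D) token sequence that ends on a vision token.
    """
    T, D = vision.shape
    assert action.shape == (T - 1, D)
    tokens = np.empty((2 * T - 1, D), dtype=vision.dtype)
    tokens[0::2] = vision   # even positions hold vision tokens
    tokens[1::2] = action   # odd positions hold action tokens
    return tokens


def random_mask(tokens, mask_ratio, mask_token, rng):
    """Replace a random subset of tokens with a mask token (hypothetical scheme)."""
    n = tokens.shape[0]
    n_mask = int(mask_ratio * n)  # floor of the requested fraction
    idx = rng.choice(n, size=n_mask, replace=False)
    masked = tokens.copy()
    masked[idx] = mask_token
    is_masked = np.zeros(n, dtype=bool)
    is_masked[idx] = True
    return masked, is_masked


def causal_scan(tokens, decay, w_in):
    """Stand-in sequence encoder, NOT Mamba: h_t = decay * h_{t-1} + x_t @ w_in."""
    h = np.zeros(w_in.shape[1])
    outputs = []
    for x in tokens:
        h = decay * h + x @ w_in
        outputs.append(h)
    return np.stack(outputs)


def history_vector(vision, action, decay, w_in):
    """Encode the interleaved history and return the final vision token's output."""
    outputs = causal_scan(interleave(vision, action), decay, w_in)
    return outputs[-1]  # the sequence ends on v_T, so this is the final vision token


rng = np.random.default_rng(0)
vision = rng.normal(size=(4, 8))   # T = 4 frames, D = 8 features
action = rng.normal(size=(3, 8))   # T - 1 = 3 actions
w_in = rng.normal(size=(8, 16))    # projection into a 16-dim state

tokens = interleave(vision, action)
masked, is_masked = random_mask(tokens, 0.5, np.zeros(8), rng)
z = history_vector(vision, action, decay=0.9, w_in=w_in)
print(tokens.shape, int(is_masked.sum()), z.shape)
```

During pretraining, a decoder would be trained to reconstruct the masked tokens from the encoded sequence; at inference, only the single vector `z` would be handed to the downstream policy, which is what keeps the bottleneck at one vector regardless of history length.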
Problem

Research questions and friction points this paper is trying to address.

robot manipulation
partial observability
temporal representation
action-effect memory
history-aware pretraining
Innovation

Methods, ideas, or system contributions that make the work stand out.

Action-Effect Memory
Temporal Representation Learning
Masked Modeling
Mamba Architecture
Robot Manipulation Pretraining
Yijing Zhou
Hong Kong University of Science and Technology (Guangzhou), Shenzhen University
Qiwei Liang
Hong Kong University of Science and Technology (Guangzhou), Shenzhen University
Sitong Zhuang
Shenzhen University
Jiaxi Li
Shenzhen University
Xianpeng Wang
Hong Kong University of Science and Technology (Guangzhou)
Boyang Cai
Hong Kong University of Science and Technology (Guangzhou), Shenzhen University
Yunyang Mo
Hong Kong University of Science and Technology (Guangzhou)
Renjing Xu
HKUST(GZ)
Brain-inspired Computing
Humanoid Computing