🤖 AI Summary
Current robot manipulation policies generalize poorly to execution variations unseen during training, largely because standard attention mechanisms fail to model the temporal structure, such as failure-and-recovery patterns, embedded in demonstrations. To address this, the authors propose State-Transition Attention (STA), an attention mechanism explicitly designed to learn state-evolution dynamics; it is integrated into a Transformer architecture and paired with temporal masking during training, in which visual information is randomly withheld from recent timesteps to strengthen sequential reasoning over historical context. In simulation, STA consistently outperforms standard cross-attention as well as Temporal Convolutional Network (TCN) and Long Short-Term Memory (LSTM) baselines, achieving more than a two-fold improvement over cross-attention on precision-critical manipulation tasks. These results support the claim that explicitly modeling temporal structure improves policy robustness and generalization.
📝 Abstract
Learning robotic manipulation policies through supervised learning from demonstrations remains challenging when policies encounter execution variations not explicitly covered during training. While incorporating historical context through attention mechanisms can improve robustness, standard approaches process all past states in a sequence without explicitly modeling the temporal structure that demonstrations may include, such as failure-and-recovery patterns. We propose a Cross-State Transition Attention Transformer that employs a novel State Transition Attention (STA) mechanism to modulate standard attention weights based on learned state-evolution patterns, enabling policies to better adapt their behavior based on execution history. Our approach combines this structured attention with temporal masking during training, where visual information is randomly removed from recent timesteps to encourage temporal reasoning from historical context. Evaluation in simulation shows that STA consistently outperforms standard cross-attention and temporal modeling baselines such as Temporal Convolutional Networks (TCNs) and Long Short-Term Memory (LSTM) networks across all tasks, achieving more than 2x improvement over cross-attention on precision-critical tasks.
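The abstract does not spell out how STA modulates the attention weights, so the following is only a minimal NumPy sketch of the general idea, not the paper's implementation: standard cross-attention weights over the state history are gated by scores computed from consecutive state transitions, then renormalized. The function name `sta_attention` and the learned transition projection `W_t` are hypothetical names introduced here for illustration.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over the last axis.
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def sta_attention(q, K, V, W_t):
    """Illustrative State-Transition Attention for a single query.

    q:   (d,)   query from the current state
    K:   (T, d) keys from the state history
    V:   (T, d) values from the state history
    W_t: (d, d) hypothetical learned projection scoring state transitions
    """
    d = q.shape[-1]
    attn = softmax(q @ K.T / np.sqrt(d))       # standard cross-attention weights, (T,)
    # Consecutive state deltas K[t] - K[t-1] capture state evolution;
    # the first delta is defined as zero via prepend.
    trans = np.diff(K, axis=0, prepend=K[:1])  # (T, d)
    gate = softmax(trans @ W_t @ q)            # transition-based modulation, (T,)
    w = attn * gate
    w = w / w.sum()                            # renormalize the modulated weights
    return w @ V, w
```

Under this reading, the temporal-masking regularizer described above would amount to zeroing out the visual features feeding the most recent rows of `K`/`V` with some probability during training, forcing the policy to lean on the transition-gated history.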