Enhancing Cross-domain Pre-Trained Decision Transformers with Adaptive Attention

📅 2024-09-11
🏛️ arXiv.org
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work investigates how pre-trained language models (PLMs) degrade long-horizon planning capability when their knowledge is transferred to Decision Transformers (DTs). The authors first identify a class of Markovian attention heads in PLMs, termed “Markov heads”: while effective for short-horizon tasks, they severely impair long-horizon planning, and this degradation persists even under embedding-layer retraining or standard fine-tuning. To address this, they propose GPT-DTMA, a general-purpose adaptation framework featuring a Mixture-of-Attention (MoA) mechanism that enables adaptive switching between attention patterns during fine-tuning. Experiments show that GPT-DTMA surpasses baselines on short-horizon tasks and, in long-horizon environments such as PointMaze, mitigates the performance drop induced by pre-training, reaching results comparable to those of DTs trained from scratch. Overall, GPT-DTMA substantially alleviates the temporal imbalance, i.e., the trade-off between short- and long-horizon performance, in cross-domain pre-trained DTs.
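As a rough sketch of the Mixture-of-Attention idea, the snippet below blends a branch that stands in for the pre-trained attention pattern with a freshly initialized branch through a learned per-token gate, so fine-tuning can re-weight the two adaptively. This is a minimal assumption about the general design, not the paper's implementation; all names and shapes (MixtureOfAttention, gate, d_model=64) are illustrative.

```python
import torch
import torch.nn as nn

class MixtureOfAttention(nn.Module):
    """Toy MoA layer: a learned gate mixes two attention branches per token."""
    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        # One branch stands in for the pre-trained (possibly Markov-like) heads;
        # the other is freshly initialized and free to learn long-range patterns.
        self.pretrained_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.fresh_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.gate = nn.Linear(d_model, 2)  # per-token mixing weights over the two branches

    def forward(self, x: torch.Tensor, causal_mask: torch.Tensor) -> torch.Tensor:
        out_pre, _ = self.pretrained_attn(x, x, x, attn_mask=causal_mask)
        out_new, _ = self.fresh_attn(x, x, x, attn_mask=causal_mask)
        w = torch.softmax(self.gate(x), dim=-1)            # (B, T, 2)
        return w[..., :1] * out_pre + w[..., 1:] * out_new

# Toy usage: a batch of 2 trajectories, 16 timesteps, 64-dim token embeddings.
x = torch.randn(2, 16, 64)
mask = torch.triu(torch.ones(16, 16, dtype=torch.bool), diagonal=1)  # causal mask
moa = MixtureOfAttention(d_model=64, n_heads=4)
print(moa(x, mask).shape)  # torch.Size([2, 16, 64])
```

The gate lets the fine-tuning signal decide, token by token, how much of each attention pattern to use, which is one plausible way to realize the adaptive switching described above.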

📝 Abstract
Recently, the pre-training of decision transformers (DT) on data from a different domain, such as natural language text, has attracted significant attention in offline reinforcement learning (offline RL). Although this cross-domain pre-training approach achieves superior performance compared to training from scratch in environments that require short-term planning ability, the mechanisms by which pre-training benefits the fine-tuning phase remain unclear. Furthermore, we point out that cross-domain pre-training hinders the extraction of distant information in environments like PointMaze that require long-term planning ability, leading to performance much worse than training a DT from scratch. This work first analyzes these issues and finds that the Markov Matrix, a component present in pre-trained attention heads, is the key to explaining the significant performance disparity of pre-trained models across different planning abilities. Inspired by this analysis, we propose a general method, GPT-DTMA, which equips a pre-trained DT with a Mixture of Attention (MoA) to enable adaptive learning and accommodate diverse attention requirements during fine-tuning. Extensive experiments demonstrate the effectiveness of GPT-DTMA: it achieves superior performance in short-term environments compared to baselines, and in long-term environments it mitigates the negative impact caused by the Markov Matrix, achieving results comparable to those of a DT trained from scratch.
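To make the “Markov Matrix” intuition concrete, here is a toy contrast (an illustrative assumption, not the paper's measured attention weights) between a head whose attention mass sits almost entirely on the immediately preceding token and a head that spreads mass over the entire past; the former suffices for short-term credit assignment but cannot extract the distant information needed for long-term planning.

```python
import torch

T = 6  # sequence length of a toy trajectory

# Markov-like attention: each position attends (almost) only to position t-1.
markov_attn = torch.zeros(T, T)
for t in range(T):
    markov_attn[t, max(t - 1, 0)] = 1.0

# A contrasting pattern: uniform causal attention over all past positions.
uniform_attn = torch.tril(torch.ones(T, T))
uniform_attn = uniform_attn / uniform_attn.sum(dim=-1, keepdim=True)

print(markov_attn)   # near-diagonal band: local, short-horizon information only
print(uniform_attn)  # mass reaches distant tokens needed for long-horizon planning
```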
Problem

Research questions and friction points this paper is trying to address.

Identify knowledge transferred from PLMs to offline RL
Analyze Markov heads' limitations in long-term environments
Propose GPT-DTMA to improve attention diversity
Innovation

Methods, ideas, or system contributions that make the work stand out.

Identifies Markov heads in PLMs and their role in RL transfer
Proposes GPT-DTMA with a Mixture of Attention (MoA)
Mitigates pre-training's negative impact in long-term environments
Authors
Wenhao Zhao
Computer Science Department, Renmin University of China, Beijing, China
Qiushui Xu
Department of Industrial and Manufacturing Engineering, Pennsylvania State University, University Park, PA, USA
Linjie Xu
Queen Mary University of London
Lei Song
Microsoft Research Asia, Beijing, China
Jinyu Wang
Microsoft Research Asia, Beijing, China
Chunlai Zhou
Computer Science Department, Renmin University of China, Beijing, China
Jiang Bian
Microsoft Research Asia, Beijing, China