🤖 AI Summary
This work investigates why pre-trained language models (PLMs) degrade long-horizon planning capability when their knowledge is transferred to Decision Transformers (DTs). The authors identify a class of Markovian attention heads, termed "Markov heads", in PLMs: while effective on short-horizon tasks, they severely impair long-horizon planning, and this degradation persists even under embedding-layer retraining or standard fine-tuning. To address this, they propose GPT-DTMA, a general-purpose adaptation framework featuring a Mixture-of-Attention (MoA) mechanism that enables adaptive switching between attention patterns during fine-tuning. Experiments demonstrate that GPT-DTMA surpasses baselines on short-horizon tasks and, in long-horizon environments such as PointMaze, largely closes the performance gap induced by pre-training, matching DTs trained from scratch. Overall, GPT-DTMA significantly alleviates the temporal imbalance (i.e., the trade-off between short- and long-horizon task performance) in cross-domain pre-trained DTs.
📝 Abstract
Recently, pre-training decision transformers (DT) on data from a different domain, such as natural language text, has attracted significant attention in offline reinforcement learning (offline RL). Although this cross-domain pre-training approach achieves superior performance compared to training from scratch in environments that require short-term planning ability, the mechanisms by which pre-training benefits the fine-tuning phase remain unclear. Furthermore, we point out that cross-domain pre-training hinders the extraction of distant information in environments like PointMaze that require long-term planning ability, leading to performance much worse than that of a DT trained from scratch. This work first analyzes these issues and finds that the Markov Matrix, a component present in pre-trained attention heads, is key to explaining the significant performance disparity of pre-trained models across planning horizons. Inspired by this analysis, we propose a general method, GPT-DTMA, which equips a pre-trained DT with Mixture of Attention (MoA) to enable adaptive learning and accommodate diverse attention requirements during fine-tuning. Extensive experiments demonstrate the effectiveness of GPT-DTMA: it achieves superior performance in short-term environments compared to baselines, and in long-term environments it mitigates the negative impact caused by the Markov Matrix, achieving results comparable to those of a DT trained from scratch.
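To make the Mixture-of-Attention idea concrete, here is a minimal NumPy sketch (not the paper's implementation): one "Markov-like" expert that attends only to the current and previous step, one standard causal-attention expert, and a learned gate that blends them. The function names, the two-step Markov mask, and the scalar gate are all simplifying assumptions for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def causal_attention(q, k, v):
    """Standard causal self-attention: each step may attend to all earlier steps."""
    T, d = q.shape
    scores = q @ k.T / np.sqrt(d)
    mask = np.tril(np.ones((T, T), dtype=bool))  # lower-triangular causal mask
    return softmax(np.where(mask, scores, -1e9)) @ v

def markov_attention(q, k, v):
    """'Markov-like' pattern (an assumption mirroring the short-horizon heads the
    paper describes): each step attends only to itself and the previous step."""
    T, d = q.shape
    scores = q @ k.T / np.sqrt(d)
    idx = np.arange(T)
    mask = (idx[None, :] <= idx[:, None]) & (idx[:, None] - idx[None, :] <= 1)
    return softmax(np.where(mask, scores, -1e9)) @ v

def mixture_of_attention(q, k, v, gate_logits):
    """Blend the two experts with a softmax gate; in GPT-DTMA the gate would be
    learned during fine-tuning so the model can favor either pattern."""
    w = softmax(gate_logits)  # shape (2,): weight per expert
    return w[0] * markov_attention(q, k, v) + w[1] * causal_attention(q, k, v)
```

With a strongly positive first gate logit the mixture collapses onto the Markov expert; a learned gate lets the model interpolate between local and long-range attention per task.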