🤖 AI Summary
To address the challenge of learning optimal policies from suboptimal, reward-labeled trajectories in offline reinforcement learning, this paper proposes Diffusion-Refined Decision TTT (DRDT3), a decision-sequence modeling framework that integrates Test-Time Training (TTT) layers with a conditional diffusion model. Methodologically, its Decision TTT (DT3) module combines TTT layers with self-attention to capture recent dynamic context, enable trajectory stitching, and produce coarse action predictions; distinctively within the Decision Transformer paradigm, a diffusion model then refines these into fine-grained actions, with both components trained end-to-end under a unified optimization objective. Evaluated on the Gym and AntMaze domains of the D4RL benchmark, DT3 alone outperforms the standard Decision Transformer, and the full DRDT3 surpasses state-of-the-art conventional offline RL methods and DT variants. These results demonstrate the framework's effectiveness in modeling suboptimal trajectories and synthesizing high-quality, robust policies.
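To give intuition for the TTT layer at the heart of this design, here is a minimal sketch of the idea: an RNN-like layer whose hidden state is itself a small linear model, updated at test time by a gradient step on a self-supervised reconstruction loss per token. The class name `TTTLinear`, the `train_view`/`label_view` projections, and the single-step inner loop are simplifying assumptions for illustration, not the paper's implementation.

```python
import torch
import torch.nn as nn


class TTTLinear(nn.Module):
    """Simplified TTT-style layer (illustrative assumption, not the
    paper's code). The hidden state is a linear model W; for each token
    we take one gradient step on a self-supervised reconstruction loss,
    then read the token out through the updated W."""

    def __init__(self, dim: int, inner_lr: float = 0.1):
        super().__init__()
        self.train_view = nn.Linear(dim, dim, bias=False)  # input view of token
        self.label_view = nn.Linear(dim, dim, bias=False)  # target view of token
        self.inner_lr = inner_lr

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, T, D = x.shape
        W = x.new_zeros(B, D, D)  # per-sequence fast weights (the hidden state)
        outputs = []
        for t in range(T):
            xt = x[:, t]                                        # (B, D)
            q, k = self.train_view(xt), self.label_view(xt)
            # inner-loop self-supervised loss: || q @ W - k ||^2
            err = torch.bmm(q.unsqueeze(1), W).squeeze(1) - k
            grad = torch.bmm(q.unsqueeze(2), err.unsqueeze(1))  # outer product
            W = W - self.inner_lr * grad                        # test-time update
            outputs.append(torch.bmm(xt.unsqueeze(1), W).squeeze(1))
        return torch.stack(outputs, dim=1)                      # (B, T, D)


# Usage: y = TTTLinear(dim=64)(torch.randn(2, 20, 64))  # -> (2, 20, 64)
```

Because the update runs once per token, the cost grows linearly with sequence length, which is the RNN-like property that motivates pairing TTT layers with (quadratic) self-attention in the DT3 module.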
📝 Abstract
Decision Transformer (DT), a trajectory modeling method, has shown performance competitive with traditional offline reinforcement learning (RL) approaches on various classic control tasks. However, it struggles to learn optimal policies from suboptimal, reward-labeled trajectories. In this study, we explore conditional generative modeling as a means of facilitating trajectory stitching, given its ability to generate high-quality data. Additionally, recent advances in Recurrent Neural Networks (RNNs) have demonstrated linear complexity together with sequence modeling performance competitive with Transformers. We leverage the Test-Time Training (TTT) layer, an RNN whose hidden state is updated during testing, to model trajectories in the manner of DT. We introduce a unified framework, called Diffusion-Refined Decision TTT (DRDT3), to achieve performance beyond DT models. Specifically, we propose the Decision TTT (DT3) module, which harnesses the sequence modeling strengths of both self-attention and the TTT layer to capture recent contextual information and make coarse action predictions. We further integrate DT3 with a diffusion model through a unified optimization objective. In experiments on multiple Gym and AntMaze tasks from the D4RL benchmark, our DT3 model without diffusion refinement outperforms standard DT, while DRDT3 achieves superior results compared with state-of-the-art conventional offline RL and DT-based methods.
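To make the two-stage design described above concrete, the following is a hedged sketch of how a DT3-style backbone could produce a coarse action plus a conditioning feature, and how a conditional denoiser could be trained jointly with it under a single loss. Every name, shape, and hyperparameter here is an assumption (`DT3Coarse`, `NoisePredictor`, `unified_loss` are hypothetical, and a GRU stands in for the TTT layer purely for brevity); the paper's actual architecture and objective may differ.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class DT3Coarse(nn.Module):
    """Hypothetical DT3-style backbone: self-attention over embedded
    (return-to-go, state, action) tokens plus a recurrent path standing
    in for the TTT layer, ending in a coarse action head."""

    def __init__(self, dim: int, act_dim: int, n_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.ttt_stand_in = nn.GRU(dim, dim, batch_first=True)  # placeholder for TTT
        self.norm = nn.LayerNorm(dim)
        self.action_head = nn.Linear(dim, act_dim)

    def forward(self, tokens: torch.Tensor):
        h, _ = self.attn(tokens, tokens, tokens, need_weights=False)
        h = tokens + h                        # attention captures recent context
        r, _ = self.ttt_stand_in(self.norm(h))
        ctx = (h + r)[:, -1]                  # feature at the last timestep
        return self.action_head(ctx), ctx     # coarse action + diffusion condition


class NoisePredictor(nn.Module):
    """Assumed conditional denoiser eps_theta(a_k, k, cond) for actions."""

    def __init__(self, act_dim: int, cond_dim: int, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(act_dim + cond_dim + 1, hidden), nn.Mish(),
            nn.Linear(hidden, hidden), nn.Mish(),
            nn.Linear(hidden, act_dim),
        )

    def forward(self, noisy_action, k_frac, cond):
        return self.net(torch.cat([noisy_action, k_frac, cond], dim=-1))


def unified_loss(dt3, denoiser, tokens, target_action, alphas_cumprod, lam=1.0):
    """One step of the assumed joint objective: coarse-action regression
    plus a standard DDPM noise-prediction loss conditioned on DT3 features."""
    coarse, cond = dt3(tokens)
    B, K = target_action.shape[0], len(alphas_cumprod)
    k = torch.randint(0, K, (B,), device=tokens.device)
    a_bar = alphas_cumprod[k].unsqueeze(-1)              # (B, 1)
    eps = torch.randn_like(target_action)
    noisy = a_bar.sqrt() * target_action + (1 - a_bar).sqrt() * eps
    eps_hat = denoiser(noisy, k.unsqueeze(-1).float() / K, cond)
    return F.mse_loss(coarse, target_action) + lam * F.mse_loss(eps_hat, eps)


# Usage sketch (all sizes are assumptions):
# dt3, den = DT3Coarse(dim=128, act_dim=6), NoisePredictor(6, 128)
# betas = torch.linspace(1e-4, 0.02, 100)
# loss = unified_loss(dt3, den, torch.randn(8, 20, 128), torch.randn(8, 6),
#                     torch.cumprod(1 - betas, dim=0))
```

Coupling the regression and denoising terms in one loss is what lets the coarse predictor and the refiner be optimized end-to-end, rather than training the diffusion model on top of a frozen sequence model.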