🤖 AI Summary
This work addresses key challenges in offline meta-reinforcement learning that stem from context and policy distribution shifts: poor generalization, exploitation of model errors, and difficulty adapting under sparse rewards. To mitigate these issues, the authors propose an approach that integrates information-theoretic task representation learning with a Transformer-based stochastic world model. The framework learns task latent variables that are invariant to the behavior policy, which mitigates context distribution shift. In addition, a conservative value penalty is applied to imagined rollouts to handle policy shift and keep the policy from exploiting inaccuracies in the learned model. The authors report that the method outperforms state-of-the-art approaches in out-of-distribution and sparse-reward settings, with better stability and generalization.
📝 Abstract
Offline meta-reinforcement learning uses static datasets to train agents that generalize to unseen environments, combining the efficiency of offline learning with the adaptability of meta-learning. However, it faces key challenges from context and policy distribution shifts. These shifts hinder agents from adapting to online environments and are further exacerbated under sparse-reward settings. As a result, agents often become trapped in an inherent pattern dilemma and fail to achieve robust generalization. In this work, we propose a novel framework that integrates information-theoretic task representation learning with a Transformer-based stochastic world model. Our approach extracts task-defining latent variables that are invariant to the behavior policy, thereby mitigating context distribution shift. To further handle policy shift and model exploitation, we apply a conservative value penalty to imagination-based rollouts, which prevents the policy from exploiting model inaccuracies while maintaining robust adaptation. Extensive evaluations demonstrate that our method outperforms state-of-the-art approaches, with superior stability and generalization under out-of-distribution and sparse-reward settings.
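The abstract does not give the form of the conservative value penalty. As a rough illustration only, the sketch below uses one common form from the offline RL literature (a COMBO-style regularizer): a temporal-difference error on dataset transitions plus a term that pushes down critic values on imagined samples relative to dataset samples. The function name `conservative_value_loss`, its arguments, and the weight `beta` are hypothetical and are not taken from the paper.

```python
from statistics import fmean
from typing import Sequence


def conservative_value_loss(
    q_real: Sequence[float],
    td_targets: Sequence[float],
    q_imagined: Sequence[float],
    beta: float = 1.0,
) -> float:
    """TD error on dataset transitions plus a penalty on imagined-rollout values.

    Assumed form (COMBO-style, not the paper's stated objective):
        loss = mean((q_real - td_target)^2)
               + beta * (mean(q_imagined) - mean(q_real))
    """
    if len(q_real) != len(td_targets):
        raise ValueError("q_real and td_targets must have the same length")

    # Standard squared TD error, computed only on real (dataset) transitions.
    td_error = fmean((q - y) ** 2 for q, y in zip(q_real, td_targets))

    # Positive when the critic rates model-generated samples above dataset
    # samples; minimizing the loss therefore shrinks optimistic imagined values.
    gap = fmean(q_imagined) - fmean(q_real)

    return td_error + beta * gap


# Hypothetical critic outputs: the TD error is zero, the imagined values are
# higher than the dataset values, so the whole loss comes from the penalty.
loss = conservative_value_loss(
    q_real=[1.0, 2.0],
    td_targets=[1.0, 2.0],
    q_imagined=[3.0, 5.0],
    beta=0.5,
)
```

In this toy call the critic fits the dataset targets exactly, so the loss consists only of the penalty on the gap between imagined and dataset values. Under this assumed form, a larger `beta` makes the critic more pessimistic about states and actions that appear only in world-model rollouts.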