Behavior-Invariant Task Representation Learning with Transformer-based World Models for Offline Meta-Reinforcement Learning

📅 2026-05-30

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses key challenges in offline meta-reinforcement learning: poor generalization, exploitation of an inaccurate learned model, and difficulty adapting under sparse rewards. These challenges stem from two distribution shifts, context distribution shift and policy distribution shift. To mitigate them, the authors propose an approach that integrates information-theoretic task representation learning with a Transformer-based stochastic world model. Central to the framework is a mechanism for learning task latent variables that are invariant to the behavior policy, which mitigates context distribution shift. In addition, a conservative value penalty is applied to imagined rollouts to limit policy shift and keep the policy from exploiting model inaccuracies. The reported evaluations show the method outperforming state-of-the-art approaches in out-of-distribution and sparse-reward settings, with better stability and generalization.

📝 Abstract

Offline meta-reinforcement learning leverages static datasets to enable agents to generalize to unseen environments by combining offline efficiency with meta-learning adaptability, yet it faces key challenges from context and policy distribution shifts. These issues hinder agents from adapting to online environments, and are further exacerbated under sparse-reward settings. As a result, agents often become trapped in an inherent pattern dilemma, failing to achieve robust generalization. In this work, we propose a novel framework that integrates information-theoretic task representation learning with a Transformer-based stochastic world model. Our approach extracts task-defining latent variables that are invariant to behavior policy, thereby effectively mitigating the context distribution shift. To further handle policy shift and model exploitation, we apply a conservative value penalty to imagination-based rollouts, preventing the policy from exploiting model inaccuracies while maintaining robust adaptation. Extensive evaluations demonstrate that our method outperforms state-of-the-art approaches, with superior stability and generalization under out-of-distribution and sparse-reward settings.
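The abstract does not give the exact form of the conservative value penalty. As a rough illustration only, the sketch below assumes a common conservative-critic pattern: a standard TD error on dataset transitions plus a term that pushes Q-values down on world-model-generated (imagined) samples and up on dataset samples. The function name `conservative_critic_loss`, the weight `beta`, and the array arguments are all hypothetical and are not taken from the paper.

```python
import numpy as np


def conservative_critic_loss(q_data, q_imagined, td_target, beta=1.0):
    """Illustrative conservative critic loss (an assumption, not the paper's objective).

    q_data:     Q-values on dataset state-action pairs, shape (N,)
    q_imagined: Q-values on imagined (world-model) state-action pairs, shape (M,)
    td_target:  Bellman targets for the dataset pairs, shape (N,)
    beta:       hypothetical weight of the conservative penalty
    """
    q_data = np.asarray(q_data, dtype=float)
    q_imagined = np.asarray(q_imagined, dtype=float)
    td_target = np.asarray(td_target, dtype=float)

    # Standard squared TD error on real (dataset) transitions.
    td_error = np.mean((q_data - td_target) ** 2)

    # Conservative term: positive when imagined samples are valued above
    # dataset samples on average, so minimizing it discourages the policy
    # from relying on over-optimistic values in imagined rollouts.
    penalty = np.mean(q_imagined) - np.mean(q_data)

    return 0.5 * td_error + beta * penalty


if __name__ == "__main__":
    loss = conservative_critic_loss(
        q_data=[1.0, 2.0],
        q_imagined=[3.0, 5.0],
        td_target=[1.0, 2.0],
        beta=0.5,
    )
    print(loss)
```

With `beta=0` the expression reduces to the plain TD loss; larger values of `beta` trade value accuracy on the data for more pessimism on imagined samples.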

Problem

Research questions and friction points this paper is trying to address.

offline meta-reinforcement learning

distribution shift

sparse-reward

generalization

behavior policy

Innovation

Methods, ideas, or system contributions that make the work stand out.

behavior-invariant representation

Transformer-based world model

offline meta-reinforcement learning