Behavior-Invariant Task Representation Learning with Transformer-based World Models for Offline Meta-Reinforcement Learning

📅 2026-05-30
📈 Citations: 0
Influential: 0

🤖 AI Summary
This work addresses key challenges in offline meta-reinforcement learning: poor generalization, exploitation of learned-model errors, and difficulty adapting under sparse rewards, all of which stem from context and policy distribution shifts. The authors propose a framework that integrates information-theoretic task representation learning with a Transformer-based stochastic world model. Its core is a mechanism that learns task latent variables invariant to the behavior policy, which mitigates context distribution shift. In addition, a conservative value penalty is applied to imagined rollouts to handle policy shift and keep the policy from exploiting model inaccuracies. According to the authors' evaluations, the method outperforms state-of-the-art approaches in out-of-distribution and sparse-reward settings, with better stability and generalization.
📝 Abstract
Offline meta-reinforcement learning leverages static datasets to enable agents to generalize to unseen environments by combining offline efficiency with meta-learning adaptability, yet it faces key challenges from context and policy distribution shifts. These issues hinder agents from adapting to online environments and are further exacerbated under sparse-reward settings. As a result, agents often become trapped in an inherent pattern dilemma, failing to achieve robust generalization. In this work, we propose a novel framework that integrates information-theoretic task representation learning with a Transformer-based stochastic world model. Our approach extracts task-defining latent variables that are invariant to the behavior policy, thereby effectively mitigating the context distribution shift. To further handle policy shift and model exploitation, we apply a conservative value penalty to imagination-based rollouts, preventing the policy from exploiting model inaccuracies while maintaining robust adaptation. Extensive evaluations demonstrate that our method outperforms state-of-the-art approaches, with superior stability and generalization under out-of-distribution and sparse-reward settings.
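The abstract does not state the form of the conservative value penalty. As a rough illustration only, the sketch below assumes a COMBO-style regularizer, which is a substitute technique and not necessarily the authors' formulation: Q-values are pushed down on state-action pairs from imagined rollouts and pushed up on pairs from the offline dataset. The function names, the `beta` weight, and the plain-list inputs are all hypothetical.

```python
from statistics import fmean


def conservative_penalty(q_imagined, q_dataset, beta=1.0):
    """Assumed COMBO-style regularizer (not confirmed by the source).

    q_imagined: Q-values on state-action pairs from world-model rollouts.
    q_dataset:  Q-values on state-action pairs from the offline dataset.
    Minimizing this term lowers Q on imagined pairs and raises it on data.
    """
    return beta * (fmean(q_imagined) - fmean(q_dataset))


def critic_loss(td_errors, q_imagined, q_dataset, beta=1.0):
    """Mean squared TD error plus the conservative penalty."""
    bellman = fmean([e * e for e in td_errors])
    return bellman + conservative_penalty(q_imagined, q_dataset, beta)
```

Under this assumption, a larger `beta` makes the critic more pessimistic about imagined transitions, which is one way to discourage a policy from exploiting model inaccuracies.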
Problem

Research questions and friction points this paper is trying to address.

offline meta-reinforcement learning
distribution shift
sparse-reward
generalization
behavior policy
Innovation

Methods, ideas, or system contributions that make the work stand out.

behavior-invariant representation
Transformer-based world model
offline meta-reinforcement learning
conservative value penalty
task representation learning
Fuyuan Qian
Department of Biomedical Engineering, Southern University of Science and Technology, Shenzhen, Guangdong, China
Menglong Zhang
AI Thrust, The Hong Kong University of Science and Technology (Guangzhou), Guangzhou, Guangdong, China
Song Wang
Southern University of Science and Technology
Quanying Liu
Department of Biomedical Engineering, Southern University of Science and Technology, Shenzhen, Guangdong, China; Omni-Intelligence, Shenzhen, Guangdong, China; Shenzhen Loop Area Institute, Shenzhen, Guangdong, China