Towards Robust Zero-Shot Reinforcement Learning

📅 2025-10-17
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing zero-shot reinforcement learning (RL) methods, such as forward-backward (FB) representation learning, suffer from limited representational capacity and from representation bias induced by out-of-distribution actions, undermining generalization and robustness in offline settings. To address these issues, the authors propose BREEZE, which (1) applies behavior regularization to constrain policy optimization within the behavioral action distribution, mitigating extrapolation error; (2) extracts the policy with a task-conditioned diffusion model that generates high-fidelity, multimodal action distributions; and (3) uses expressive attention-based architectures to enhance representation quality. BREEZE unifies policy learning and representation construction in a single framework. Evaluated on the ExORL and D4RL Kitchen benchmarks, it achieves state-of-the-art (SOTA) or near-SOTA performance while significantly improving stability, robustness, and cross-task generalization in zero-shot transfer.

📝 Abstract
The recent development of zero-shot reinforcement learning (RL) has opened a new avenue for learning pre-trained generalist policies that can adapt to arbitrary new tasks in a zero-shot manner. While the popular Forward-Backward representations (FB) and related methods have shown promise in zero-shot RL, we empirically found that their modeling lacks expressivity and that extrapolation errors caused by out-of-distribution (OOD) actions during offline learning sometimes lead to biased representations, ultimately resulting in suboptimal performance. To address these issues, we propose Behavior-REgularizEd Zero-shot RL with Expressivity enhancement (BREEZE), an upgraded FB-based framework that simultaneously enhances learning stability, policy extraction capability, and representation learning quality. BREEZE introduces behavioral regularization in zero-shot RL policy learning, transforming policy optimization into a stable in-sample learning paradigm. Additionally, BREEZE extracts the policy using a task-conditioned diffusion model, enabling the generation of high-quality and multimodal action distributions in zero-shot RL settings. Moreover, BREEZE employs expressive attention-based architectures for representation modeling to capture the complex relationships between environmental dynamics. Extensive experiments on ExORL and D4RL Kitchen demonstrate that BREEZE achieves the best or near-the-best performance while exhibiting superior robustness compared to prior offline zero-shot RL methods. The official implementation is available at: https://github.com/Whiterrrrr/BREEZE.
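For context, the forward-backward (FB) recipe that BREEZE builds on works roughly as follows: after pretraining, a task embedding z is estimated from a handful of reward-labeled states, and the Q-function factorizes as F(s, a)ᵀz, so a new task needs no further training. A minimal NumPy sketch under toy assumptions (discrete states and actions, pretrained F and B stubbed as random arrays; all names here are hypothetical, not the paper's code):

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions, d = 10, 4, 8  # toy problem sizes, embedding dim d

# Stand-ins for the pretrained representations:
# F[s, a] is the forward embedding, B[s] the backward embedding.
F = rng.normal(size=(n_states, n_actions, d))
B = rng.normal(size=(n_states, d))

def infer_task_embedding(states, rewards):
    """Zero-shot task inference: z = E[r(s) * B(s)] over reward-labeled samples."""
    return (rewards[:, None] * B[states]).mean(axis=0)

def q_values(z):
    """FB Q-function factorization: Q_z(s, a) = F(s, a)^T z
    (the z-conditioning of F is dropped in this toy version)."""
    return F @ z  # shape (n_states, n_actions)

def greedy_policy(z):
    """Greedy action per state under the inferred task."""
    return q_values(z).argmax(axis=1)

# Label a batch of states with rewards for some downstream task,
# then act zero-shot -- no gradient updates needed.
states = rng.integers(0, n_states, size=128)
rewards = rng.normal(size=128)
z = infer_task_embedding(states, rewards)
actions = greedy_policy(z)
```

The paper's critique is that this pipeline's expressivity and its greedy, unconstrained policy extraction are exactly where OOD actions bias the learned representations.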
Problem

Research questions and friction points this paper is trying to address.

Addresses expressivity limitations in zero-shot reinforcement learning representations
Mitigates extrapolation errors from out-of-distribution actions during offline learning
Enhances stability and quality of policy learning in zero-shot RL
Innovation

Methods, ideas, or system contributions that make the work stand out.

Behavioral regularization stabilizes zero-shot policy learning
Task-conditioned diffusion model generates multimodal actions
Attention-based architecture captures complex environmental dynamics
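The first two bullets can be illustrated together with a toy sketch: a task-conditioned denoiser iteratively refines Gaussian noise into an action, and because such a denoiser is trained only on dataset actions, sampling stays close to the behavioral distribution. Everything below (the linear noise schedule, the closed-form stand-in denoiser, all names) is illustrative, not the paper's implementation:

```python
import numpy as np

rng = np.random.default_rng(1)
action_dim, T = 2, 50                 # action size, number of diffusion steps
betas = np.linspace(1e-4, 0.1, T)     # toy linear noise schedule
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)

def denoiser(a_t, t, state, z):
    """Stand-in for a learned task-conditioned noise predictor eps(a_t, t | s, z).
    Here we pretend the behavioral action for (state, z) is `target` and return
    the noise that separates a_t from it (purely illustrative)."""
    target = np.tanh(state[:action_dim] + z[:action_dim])  # fake in-distribution action
    return (a_t - np.sqrt(alpha_bars[t]) * target) / np.sqrt(1.0 - alpha_bars[t])

def sample_action(state, z):
    """DDPM-style reverse process: start from pure noise, denoise step by step.
    Conditioning on z is what makes the sampler task-aware."""
    a = rng.normal(size=action_dim)
    for t in reversed(range(T)):
        eps = denoiser(a, t, state, z)
        a = (a - betas[t] / np.sqrt(1.0 - alpha_bars[t]) * eps) / np.sqrt(alphas[t])
        if t > 0:  # inject noise at every step except the last
            a += np.sqrt(betas[t]) * rng.normal(size=action_dim)
    return a

state, z = rng.normal(size=4), rng.normal(size=8)
action = sample_action(state, z)
```

Because the sampler can only produce actions the (stand-in) denoiser steers toward, this construction naturally keeps policy extraction in-sample, which is the stability argument behind behavioral regularization.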
🔎 Similar Papers
Kexin Zheng
Undergraduate student, the Chinese University of Hong Kong
Autonomous Driving · Reinforcement Learning
Lauriane Teyssier
Tsinghua University
Yinan Zheng
Tsinghua University
Reinforcement Learning · Diffusion Models · Autonomous Driving · Robotics
Yu Luo
Huawei Noah's Ark Lab
Xiayuan Zhan
Tsinghua University, Shanghai Artificial Intelligence Laboratory