Generative Actor Critic

📅 2025-12-25
📈 Citations: 0
Influential: 0
🤖 AI Summary
Offline pre-trained policies often suffer from poor online fine-tuning performance due to difficulties in accurate policy evaluation and effective improvement under sparse or terminal-only rewards. Method: We reformulate policy evaluation as generative modeling of the trajectory-return joint distribution $p(\tau, y)$, and introduce a disentangled decision-making framework grounded in continuous latent variables—termed "planning vectors"—to jointly enable goal-directed exploitation and conditional exploration. Our approach integrates latent-variable generative modeling, variational inference, planning-space optimization, and reward-free conditional sampling, enabling robust policy improvement using only terminal returns. Results: Experiments on Gym-MuJoCo and Maze2D demonstrate substantial improvements over state-of-the-art methods: superior offline performance, significantly higher offline-to-online transfer gains, and strong robustness even in the absence of dense reward signals.

📝 Abstract
Conventional Reinforcement Learning (RL) algorithms, typically focused on estimating or maximizing expected returns, face challenges when refining offline pretrained models with online experiences. This paper introduces Generative Actor Critic (GAC), a novel framework that decouples sequential decision-making by reframing policy evaluation as learning a generative model of the joint distribution over trajectories and returns, $p(\tau, y)$, and policy improvement as performing versatile inference on this learned model. To operationalize GAC, we introduce a specific instantiation based on a latent variable model that features continuous latent plan vectors. We develop novel inference strategies for both exploitation, by optimizing latent plans to maximize expected returns, and exploration, by sampling latent plans conditioned on dynamically adjusted target returns. Experiments on Gym-MuJoCo and Maze2D benchmarks demonstrate GAC's strong offline performance and significantly enhanced offline-to-online improvement compared to state-of-the-art methods, even in the absence of step-wise rewards.
Problem

Research questions and friction points this paper is trying to address.

Refines offline pretrained models with online experiences
Decouples policy evaluation and improvement via generative modeling
Enhances offline-to-online learning without step-wise rewards
Innovation

Methods, ideas, or system contributions that make the work stand out.

Decouples decision-making via generative trajectory-return modeling
Uses latent plan vectors for flexible policy inference
Enables exploitation-exploration through optimized and sampled plans
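The two inference modes above can be illustrated with a minimal, dependency-free sketch. Here `predicted_return`, `optimize_plan`, and `sample_plans_near_target` are hypothetical stand-ins for the paper's learned generative model $p(\tau, y)$ and its conditional sampler, not the authors' implementation: exploitation is gradient ascent on a latent plan vector, and exploration is crude rejection sampling of plans near a target return.

```python
import random

random.seed(0)

# Stand-in return model y_hat = R(z): a smooth toy function of a 2-D
# latent "plan vector" z. In GAC this role is played by the learned
# generative model of trajectories and returns.
def predicted_return(z):
    return -((z[0] - 1.0) ** 2 + (z[1] + 0.5) ** 2)

# Exploitation: ascend the predicted return in plan space.
# Finite-difference gradients keep the sketch dependency-free.
def optimize_plan(z, steps=200, lr=0.1, eps=1e-4):
    z = list(z)
    for _ in range(steps):
        grad = []
        for i in range(len(z)):
            zp, zm = list(z), list(z)
            zp[i] += eps
            zm[i] -= eps
            grad.append((predicted_return(zp) - predicted_return(zm)) / (2 * eps))
        z = [zi + lr * gi for zi, gi in zip(z, grad)]
    return z

# Exploration: keep sampled plans whose predicted return is close to a
# (dynamically adjusted) target -- a crude stand-in for conditional
# sampling z ~ p(z | y = target).
def sample_plans_near_target(target, n=2000, tol=0.1):
    keep = []
    for _ in range(n):
        z = [random.gauss(0, 1), random.gauss(0, 1)]
        if abs(predicted_return(z) - target) < tol:
            keep.append(z)
    return keep

z_star = optimize_plan([0.0, 0.0])          # converges toward the maximizer (1, -0.5)
exploratory = sample_plans_near_target(-1.0)  # plans with predicted return near -1
```

The design point this illustrates is the decoupling: both behaviors are pure inference procedures over the same fixed return model, so no step-wise reward signal is needed once that model is learned.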
Aoyang Qin
Department of Automation, Tsinghua University
Deqian Kong
Department of Statistics and Data Science, UCLA
Wei Wang
Beijing Institute of General Artificial Intelligence (BIGAI)
Ying Nian Wu
Department of Statistics and Data Science, UCLA
Generative AI · Representation learning · Computer vision · Computational neuroscience · Bioinformatics
Song-Chun Zhu
Department of Automation, Tsinghua University
Sirui Xie
Research Scientist, Google DeepMind
Machine Learning · Artificial Intelligence