🤖 AI Summary
Sim-to-real policy degradation in Hidden-Parameter Markov Decision Processes (HIP-MDPs)—common in autonomous driving and robotic manipulation—is caused by unobservable latent variables that induce dynamics mismatch between simulation and reality.
Method: We propose PrivilegedDreamer, the first model-based reinforcement learning framework that explicitly estimates latent parameters. The approach employs a dual-loop architecture to recursively infer hidden parameters online from limited historical data, then conditions the dynamics model, policy network, and value network on these estimates. Built upon the Dreamer architecture, it jointly optimizes actor-critic networks while integrating latent-state recursive inference and parameter-conditioned modeling.
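To make the core idea concrete, here is a minimal sketch of online hidden-parameter estimation with a parameter-conditioned dynamics model. This is not the paper's implementation (which uses a dual recurrent architecture inside Dreamer); the scalar dynamics `x' = x + a / theta`, the blending rate `lr`, and all function names are illustrative assumptions.

```python
# Hedged toy sketch: recursively estimate a hidden dynamics parameter (theta,
# e.g. a mass) from observed transitions, then condition a model on the estimate.
# NOT the paper's method; assumes simple scalar dynamics x' = x + a / theta.

def true_step(x, a, theta):
    """Environment dynamics governed by the hidden parameter theta."""
    return x + a / theta

def recursive_estimate(theta_hat, x, a, x_next, lr=0.5):
    """One online update of the parameter estimate from a single transition,
    blending the new observation with the running estimate (a crude stand-in
    for the recurrent estimator in the paper)."""
    if x_next == x:
        return theta_hat  # uninformative transition
    observed = a / (x_next - x)  # invert the assumed dynamics
    return (1 - lr) * theta_hat + lr * observed

def conditioned_model(x, a, theta_hat):
    """Dynamics model conditioned on the current parameter estimate."""
    return x + a / theta_hat

theta = 2.0       # hidden parameter, unknown to the agent
theta_hat = 1.0   # initial guess
x = 0.0
for a in [1.0, -0.5, 0.8, 1.2]:
    x_next = true_step(x, a, theta)
    theta_hat = recursive_estimate(theta_hat, x, a, x_next)
    x = x_next

print(round(theta_hat, 3))  # → 1.938 (estimate converging toward theta = 2.0)
```

In the full method, this recursive estimate would condition not only the dynamics model but also the actor and critic, so the policy itself can adapt as the estimate sharpens.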
Contribution/Results: The method achieves significant improvements over state-of-the-art model-based, model-free, and domain-adaptation methods across five HIP-MDP benchmarks. Ablation studies confirm that explicit latent-parameter estimation and conditional injection are critical for rapid adaptation to unseen environmental shifts.
📝 Abstract
Numerous real-world control problems involve dynamics and objectives affected by unobservable hidden parameters, ranging from autonomous driving to robotic manipulation, which cause performance degradation during sim-to-real transfer. To represent these kinds of domains, we adopt hidden-parameter Markov decision processes (HIP-MDPs), which model sequential decision problems where hidden variables parameterize transition and reward functions. Existing approaches, such as domain randomization, domain adaptation, and meta-learning, simply treat the effect of hidden parameters as additional variance and often struggle to effectively handle HIP-MDP problems, especially when the rewards are parameterized by hidden variables. We introduce PrivilegedDreamer, a model-based reinforcement learning framework that extends the existing model-based approach by incorporating an explicit parameter estimation module. PrivilegedDreamer features a novel dual recurrent architecture that explicitly estimates hidden parameters from limited historical data and enables us to condition the model, actor, and critic networks on these estimated parameters. Our empirical analysis on five diverse HIP-MDP tasks demonstrates that PrivilegedDreamer outperforms state-of-the-art model-based, model-free, and domain-adaptation learning algorithms. Additionally, we conduct ablation studies to justify the inclusion of each component in the proposed architecture.