Evaluation-Time Policy Switching for Offline Reinforcement Learning

๐Ÿ“… 2025-03-15
๐Ÿ“ˆ Citations: 0
โœจ Influential: 0
๐Ÿ“„ PDF
๐Ÿค– AI Summary
Offline reinforcement learning (offline RL) faces two key challenges: poor policy generalization and distributional shift. Existing methods tend to overestimate out-of-distribution actions and require hyperparameter re-tuning when transferring across tasks or datasets. To address this, the paper proposes a dynamic policy switching mechanism applied at evaluation time, which adaptively combines an offline RL policy with a behaviour cloning policy based on two uncertainty estimates: epistemic uncertainty (quantified via Monte Carlo Dropout) and aleatoric uncertainty (estimated via data density modelling). The approach transfers across tasks without hyperparameter tuning and supports safe offline-to-online fine-tuning. Evaluated on standard offline RL benchmarks, it matches or exceeds state-of-the-art methods, with faster fine-tuning convergence and strong final performance.

๐Ÿ“ Abstract
Offline reinforcement learning (RL) studies how to learn optimal task-solving behaviour from a fixed dataset of environment interactions. Many off-policy algorithms developed for online learning struggle in the offline setting because they tend to over-estimate the value of out-of-distribution actions. Existing offline RL algorithms adapt off-policy methods, employing techniques such as constraining the policy or modifying the value function, to achieve good performance on individual datasets, but they struggle to adapt to different tasks or datasets of varying quality without tuning hyper-parameters. We introduce a policy switching technique that dynamically combines the behaviour of a pure off-policy RL agent, for improving behaviour, and a behavioural cloning (BC) agent, for staying close to the data. We achieve this by combining epistemic uncertainty, quantified by our RL model, with a metric of aleatoric uncertainty extracted from the dataset. We show empirically that our policy switching technique can outperform not only the individual algorithms used in the switching process but also compete with state-of-the-art methods on numerous benchmarks. Our use of epistemic uncertainty for policy switching also allows us to naturally extend the method to offline-to-online fine-tuning, letting the model adapt quickly and safely from online data and either match or exceed the performance of current methods, which typically require additional modification or hyper-parameter tuning.
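The switching rule described in the abstract can be sketched as follows. This is a minimal illustration, not the paper's implementation: the threshold values and the form of the two uncertainty estimators are assumptions made here for clarity. The core idea is to act with the RL agent only when both the model's epistemic uncertainty and the dataset's aleatoric uncertainty are low, and otherwise fall back to behavioural cloning.

```python
def select_action(state, rl_policy, bc_policy,
                  epistemic_uncertainty, aleatoric_uncertainty,
                  eps_threshold=0.1, alea_threshold=0.5):
    """Evaluation-time policy switching sketch (thresholds are illustrative).

    rl_policy / bc_policy: callables mapping a state to an action.
    epistemic_uncertainty: callable giving the RL model's confidence estimate.
    aleatoric_uncertainty: callable giving a data-noise/coverage proxy.
    """
    eps = epistemic_uncertainty(state)   # how unsure the RL model is here
    alea = aleatoric_uncertainty(state)  # how noisy/sparse the data is here
    if eps < eps_threshold and alea < alea_threshold:
        return rl_policy(state)   # trust the (potentially improved) RL behaviour
    return bc_policy(state)       # stay close to the dataset behaviour
```

In this sketch either high uncertainty alone is enough to defer to BC, which matches the abstract's emphasis on staying close to the data when the RL agent cannot be trusted.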
Problem

Research questions and friction points this paper is trying to address.

Overcoming over-estimation of out-of-distribution actions in offline RL.
Adapting offline RL algorithms to diverse tasks without hyper-parameter tuning.
Enhancing offline to online fine-tuning using epistemic uncertainty metrics.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Dynamic policy switching between an off-policy RL agent and a BC agent.
Uses both epistemic and aleatoric uncertainty for decision-making.
Enables offline-to-online fine-tuning without hyper-parameter tuning.
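The summary mentions Monte Carlo Dropout as the epistemic-uncertainty estimator. A generic sketch of that idea, independent of any particular network library, is shown below: run several stochastic forward passes (with dropout left active) and use the spread of the predictions as the uncertainty signal. The function name and interface are hypothetical, chosen here for illustration.

```python
import numpy as np

def mc_dropout_uncertainty(stochastic_forward, state, n_passes=20):
    """Mean and standard deviation of predictions over stochastic passes.

    stochastic_forward: a callable that performs one forward pass with
    dropout active, so repeated calls on the same state may differ.
    The std. dev. across passes serves as the epistemic-uncertainty estimate.
    """
    preds = np.array([stochastic_forward(state) for _ in range(n_passes)])
    return preds.mean(axis=0), preds.std(axis=0)
```

A deterministic model yields zero spread, i.e. zero estimated epistemic uncertainty; the more the dropout-perturbed passes disagree, the larger the estimate.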
๐Ÿ”Ž Similar Papers
No similar papers found.