Optimistic Critic Reconstruction and Constrained Fine-Tuning for General Offline-to-Online RL

📅 2024-12-25
📈 Citations: 0
Influential: 0
🤖 AI Summary
Offline reinforcement learning (RL) policies often suffer from performance degradation and training instability when transferred to online settings, primarily due to *evaluation mismatch* (offline critic overestimation) and *improvement mismatch* (policy–critic misalignment). To address these challenges, we propose O2O-RL, a general offline-to-online transfer framework that jointly models and mitigates both mismatches. Specifically, we introduce an *optimistic critic reconstruction* mechanism to correct value miscalibration, and integrate *actor–critic alignment calibration* with KL-constrained policy fine-tuning to suppress distributional shift and behavioral drift. O2O-RL is algorithm-agnostic, compatible with any offline pretraining method and any online RL algorithm. Empirical evaluation across multiple simulated control tasks shows that O2O-RL converges stably and efficiently and consistently outperforms current state-of-the-art transfer approaches.
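The paper's actual losses are not reproduced on this page; as a rough illustration of what a KL-constrained fine-tuning objective looks like, the toy function below (all names hypothetical, discrete action space assumed) scores a candidate policy by its expected Q-value minus a KL penalty to the offline policy:

```python
import numpy as np

def kl_constrained_objective(pi, pi_offline, q_values, beta=0.1):
    """Toy KL-constrained fine-tuning objective over a discrete action space.

    Maximizing E_{a~pi}[Q(s, a)] - beta * KL(pi || pi_offline) lets the
    online policy improve while staying close to the reliable offline
    policy, limiting distribution shift. Illustrative sketch only; the
    paper's actual formulation may differ.
    """
    pi = np.asarray(pi, dtype=float)
    pi_offline = np.asarray(pi_offline, dtype=float)
    expected_q = np.dot(pi, q_values)            # E_{a~pi}[Q(s, a)]
    kl = np.sum(pi * np.log(pi / pi_offline))    # KL(pi || pi_offline)
    return expected_q - beta * kl
```

With `beta = 0` this reduces to greedy policy improvement; larger `beta` anchors the policy to the offline behavior during the early, unstable phase of online fine-tuning.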

📝 Abstract
Offline-to-online (O2O) reinforcement learning (RL) provides an effective means of leveraging an offline pre-trained policy as initialization to improve performance rapidly with limited online interactions. Recent studies often design fine-tuning strategies for a specific offline RL method and cannot perform general O2O learning from any offline method. To deal with this problem, we disclose that there are evaluation and improvement mismatches between the offline dataset and the online environment, which hinder the direct application of pre-trained policies to online fine-tuning. In this paper, we propose to handle these two mismatches simultaneously, aiming to achieve general O2O learning from any offline method to any online method. Before online fine-tuning, we re-evaluate the pessimistic critic trained on the offline dataset in an optimistic way and then calibrate the misaligned critic with the reliable offline actor to avoid erroneous updates. After obtaining an optimistic and aligned critic, we perform constrained fine-tuning to combat distribution shift during online learning. We show empirically that the proposed method achieves stable and efficient performance improvement on multiple simulated tasks when compared to state-of-the-art methods.
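The abstract's "re-evaluate the pessimistic critic in an optimistic way" can be pictured with twin Q-networks: pessimistic offline training typically bootstraps from the minimum of the two estimates, and an optimistic re-evaluation removes that downward bias before online fine-tuning. The sketch below is a hypothetical illustration of that contrast, not the paper's procedure:

```python
import numpy as np

def critic_target(q1, q2, mode="optimistic"):
    """Combine twin Q-value estimates into a bootstrap target.

    Offline RL methods in the clipped-double-Q style use min(Q1, Q2) to
    stay pessimistic about out-of-distribution actions; an optimistic
    re-evaluation (here, max) lifts that bias so online exploration is
    not suppressed. Purely illustrative of the idea.
    """
    q1 = np.asarray(q1, dtype=float)
    q2 = np.asarray(q2, dtype=float)
    if mode == "pessimistic":
        return np.minimum(q1, q2)  # conservative offline target
    return np.maximum(q1, q2)      # optimistic re-evaluation
```

The optimistic target is never below the pessimistic one, which is the direction of correction the abstract describes before the critic is calibrated against the offline actor.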
Problem

Research questions and friction points this paper is trying to address.

O2O Reinforcement Learning
Online Adaptation
Learning Efficiency
Innovation

Methods, ideas, or system contributions that make the work stand out.

Smooth Transition
Adaptive Learning
Content Relevance Tuning
Qin-Wen Luo
Nanjing University of Aeronautics and Astronautics, Nanjing, China
Ming-Kun Xie
RIKEN Center for Advanced Intelligence Project
Machine Learning
Ye-Wen Wang
Nanjing University of Aeronautics and Astronautics, Nanjing, China
Sheng-Jun Huang
Nanjing University of Aeronautics and Astronautics, Nanjing, China
Machine Learning