FOSP: Fine-tuning Offline Safe Policy through World Models

📅 2024-07-06

🏛️ arXiv.org

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

To address the weak generalization of offline safe policies and the cost of safe online exploration for vision-based robots in unseen safety-constrained scenarios, this paper proposes an “offline pretraining + online safe fine-tuning” paradigm. Methodologically, it introduces (1) a world-model-based offline pretraining framework that uses in-sample optimization for training efficiency and reachability guidance for safety, and (2) a safe policy expansion mechanism for online fine-tuning. The authors state that, to the best of their knowledge, this is the first work to explore offline-to-online RL for safe generalization tasks. Evaluated on five vision-only simulated tasks and in real-robot deployment with limited data, the approach significantly improves the generalization of offline policies to unseen safety-constrained scenarios, reducing their dependence on static datasets.

📝 Abstract

Offline Safe Reinforcement Learning (RL) seeks to address safety constraints by learning from static datasets and restricting exploration. However, such approaches rely heavily on the dataset and struggle to generalize safely to unseen scenarios. In this paper, we aim to improve safety during the deployment of vision-based robotic tasks by fine-tuning an offline pretrained policy online. To facilitate effective fine-tuning, we adopt model-based RL, which is known for its data efficiency. Specifically, our method employs in-sample optimization to improve offline training efficiency while incorporating reachability guidance to ensure safety. After obtaining an offline safe policy, a safe policy expansion approach is leveraged for online fine-tuning. The performance of our method is validated on simulation benchmarks with five vision-only tasks and through real-world robot deployment using limited data. The results demonstrate that our approach significantly improves the generalization of offline policies to unseen safety-constrained scenarios. To the best of our knowledge, this is the first work to explore offline-to-online RL for safe generalization tasks.

Problem

Research questions and friction points this paper is trying to address.

Offline safe policies are not reliably safe when vision-based robotic tasks are deployed.

Offline policies depend heavily on their static dataset and generalize poorly to unseen scenarios.

Offline-to-online fine-tuning must be data-efficient, which the paper approaches with model-based RL.

Innovation

Methods, ideas, or system contributions that make the work stand out.

Model-based RL for data efficiency

In-sample optimization for offline training

Safe policy expansion for online fine-tuning
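
The abstract does not spell out the form of the in-sample optimization it uses. As a rough illustration of the general idea only, the sketch below uses expectile regression, the in-sample objective popularized by Implicit Q-Learning; this is an assumption swapped in for illustration, not necessarily the paper's loss. The function names `expectile_loss` and `fit_expectile`, and the parameters `tau`, `lr`, and `steps`, are hypothetical.

```python
def expectile_loss(value_estimate, target, tau=0.7):
    """Asymmetric squared error between a value estimate and a target.

    Targets above the estimate are weighted by tau, targets below it by
    1 - tau, so tau > 0.5 pulls the estimate toward the higher targets.
    """
    diff = target - value_estimate
    weight = tau if diff > 0 else 1.0 - tau
    return weight * diff * diff


def fit_expectile(targets, tau=0.7, lr=0.1, steps=500):
    """Fit one scalar value to dataset targets by gradient descent.

    Only targets that appear in the dataset are used, which is what makes
    the objective "in-sample": no out-of-dataset action is ever queried.
    """
    v = 0.0
    for _ in range(steps):
        grad = 0.0
        for t in targets:
            diff = t - v
            weight = tau if diff > 0 else 1.0 - tau
            grad += -2.0 * weight * diff  # derivative of weight * diff**2 w.r.t. v
        v -= lr * grad / len(targets)
    return v


if __name__ == "__main__":
    returns_in_dataset = [0.0, 1.0]
    print(fit_expectile(returns_in_dataset, tau=0.5))  # close to the mean
    print(fit_expectile(returns_in_dataset, tau=0.9))  # pulled toward the larger target
```

With `tau = 0.5` the fit reduces to the ordinary mean of the targets. Raising `tau` toward 1 moves the estimate toward the best target seen in the dataset, which approximates a maximum over in-dataset actions without evaluating unseen ones.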

🔎 Similar Papers

Balance Reward and Safety Optimization for Safe Reinforcement Learning: A Perspective of Gradient Manipulation