🤖 AI Summary
To address safety risks arising from environmental discrepancies in simulation-to-real (Sim-to-Real) transfer for reinforcement learning, this paper proposes a zero-shot safe transfer method. The core innovation integrates pessimistic optimization into the domain randomization framework: it explicitly models uncertainty about the simulation-reality gap and folds conservative estimates of that uncertainty into the safety constraints, yielding policies with provable safety guarantees without real-world fine-tuning. The method is compatible with standard RL training pipelines and supports end-to-end, scalable learning of safety-aware policies. Evaluated on multiple simulation benchmarks and two physical robot platforms, the method improves safety under transfer while preserving strong task performance. Key contributions include: (i) a theoretically grounded pessimistic domain randomization framework for Sim-to-Real safety; (ii) zero-shot safety guarantees under model uncertainty; and (iii) empirical validation demonstrating robust safety and performance across diverse robotic tasks.
📝 Abstract
Safety remains a major concern for deploying reinforcement learning (RL) in real-world applications. Simulators provide safe, scalable training environments, but the inevitable sim-to-real gap introduces additional safety concerns, as policies must satisfy constraints in real-world conditions that differ from simulation. Robust safe RL techniques offer principled methods for this challenge, but they are often incompatible with standard scalable training pipelines. In contrast, domain randomization, a simple and popular sim-to-real technique, stands out as a promising alternative, although it often results in unsafe behaviors in practice. We present SPiDR (Sim-to-real via Pessimistic Domain Randomization), a scalable algorithm with provable guarantees for safe sim-to-real transfer. SPiDR uses domain randomization to incorporate the uncertainty about the sim-to-real gap into the safety constraints, making it versatile and highly compatible with existing training pipelines. Through extensive experiments on sim-to-sim benchmarks and two distinct real-world robotic platforms, we demonstrate that SPiDR effectively ensures safety despite the sim-to-real gap while maintaining strong performance.
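The idea of folding sim-to-real uncertainty into the safety constraint can be illustrated with a small sketch. The abstract does not give SPiDR's exact penalty, so everything below is an assumption made for illustration: the spread of costs across randomized simulators stands in for uncertainty, and `pessimistic_cost`, `satisfies_constraint`, `toy_rollout_cost`, `randomized_costs`, and `penalty_scale` are hypothetical names, not the authors' API.

```python
import random
from statistics import mean, pstdev
from typing import Callable, List, Sequence


def pessimistic_cost(costs: Sequence[float], penalty_scale: float) -> float:
    """Average cost over randomized simulators plus a disagreement penalty.

    Assumption: the population standard deviation of the costs is used as a
    proxy for uncertainty about the sim-to-real gap. The paper's actual
    penalty may differ.
    """
    return mean(costs) + penalty_scale * pstdev(costs)


def satisfies_constraint(
    costs: Sequence[float], budget: float, penalty_scale: float = 1.0
) -> bool:
    """Check the safety budget against the pessimistic cost, not the mean."""
    return pessimistic_cost(costs, penalty_scale) <= budget


def toy_rollout_cost(friction: float) -> float:
    """Hypothetical simulator: lower friction leads to a higher safety cost."""
    return max(0.0, 3.0 - 2.0 * friction)


def randomized_costs(
    cost_fn: Callable[[float], float],
    low: float,
    high: float,
    n: int,
    seed: int = 0,
) -> List[float]:
    """Domain randomization: sample a simulator parameter and record its cost."""
    rng = random.Random(seed)
    return [cost_fn(rng.uniform(low, high)) for _ in range(n)]


if __name__ == "__main__":
    costs = randomized_costs(toy_rollout_cost, low=0.5, high=1.5, n=64)
    print("mean cost:", mean(costs))
    print("pessimistic cost:", pessimistic_cost(costs, penalty_scale=1.0))
```

With costs `[1.0, 2.0, 3.0]` and a budget of 2.5, the plain average (2.0) would pass the constraint, while the pessimistic estimate (about 2.82 with `penalty_scale=1.0`) would not. This is the sense in which a policy that looks safe on average under domain randomization can still be rejected once disagreement across simulators is taken into account.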