🤖 AI Summary
This work addresses the problem that reinforcement learning policies for legged locomotion have no mechanism for respecting safety constraints that were absent during training, which poses risks in real-world environments. To mitigate this, the authors propose a predictive safety filter that post-hoc filters the nominal contact locations fed to the policy: when a collision is predicted, it asynchronously searches for safer contact sequences by combining a full-physics model with a sampling-based optimizer, while a learned value function bootstraps long-horizon returns. Three components (geometric projection of sampled contacts, momentum-augmented updates, and replica exchange) make the optimization tractable in a discontinuous contact landscape. In experiments with a quadruped robot in dense, cluttered environments, both in simulation and on physical hardware, the filter substantially reduces safety violations with minimal deviation from the nominal input.
📝 Abstract
Reinforcement learning (RL) policies enable dynamic legged locomotion but lack mechanisms to avoid violations of safety constraints that are absent during training. Large-scale offline safe learning is impractical for covering all edge cases. Existing safety frameworks either rely on reduced-order models that cannot reason about whole-body behaviors or require conservative recovery controllers that degrade task performance. We propose a predictive safety filter that post-hoc filters the nominal contact locations fed to the RL policy. When a collision is predicted, a sampling-based optimizer asynchronously searches for safer contact sequences using a full-physics model, while a learned value function bootstraps long-horizon returns. Three algorithmic components (geometric projection of sampled contacts, momentum-augmented updates, and replica exchange) make the optimization tractable in a discontinuous contact landscape. We validate the filter on a quadruped robot in dense, cluttered environments, both in simulation and in the real world, showing substantial reductions in safety violations with minimal deviation from the nominal input.
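To make the three components named in the abstract concrete, the following is a minimal toy sketch, not the authors' implementation. It assumes a single 2D foothold, one circular keep-out region, and a hand-written cost standing in for the full-physics rollout and the learned value function. All names (`project_to_safe`, `toy_value`, `optimize_contact`), the temperatures, and the momentum coefficient are hypothetical choices made only for illustration.

```python
import math
import random

# Hypothetical toy scene: one circular keep-out region in the ground plane.
OBSTACLE_CENTER = (0.0, 0.0)
OBSTACLE_RADIUS = 0.3


def project_to_safe(p, margin=1e-3):
    """Geometric projection: move a point inside the disk to just outside it."""
    dx, dy = p[0] - OBSTACLE_CENTER[0], p[1] - OBSTACLE_CENTER[1]
    dist = math.hypot(dx, dy)
    if dist >= OBSTACLE_RADIUS:
        return p
    if dist < 1e-9:  # degenerate case: choose an arbitrary direction
        dx, dy, dist = 1.0, 0.0, 1.0
    scale = (OBSTACLE_RADIUS + margin) / dist
    return (OBSTACLE_CENTER[0] + dx * scale, OBSTACLE_CENTER[1] + dy * scale)


def toy_value(p):
    """Assumed stand-in for a learned value function (prefers forward progress)."""
    return 0.1 * p[0]


def cost(p, nominal):
    """Assumed stand-in for rollout cost: deviation from nominal minus value."""
    return (p[0] - nominal[0]) ** 2 + (p[1] - nominal[1]) ** 2 - toy_value(p)


def optimize_contact(nominal, temps=(0.02, 0.1, 0.3), iters=40, samples=32,
                     beta=0.7, seed=0):
    rng = random.Random(seed)
    # One replica per temperature; each keeps a mean and a velocity.
    means = [project_to_safe(nominal) for _ in temps]
    vels = [(0.0, 0.0) for _ in temps]
    best = means[0]
    for _ in range(iters):
        for k, temp in enumerate(temps):
            # Sample around the replica mean and project every sample to safety.
            cands = [project_to_safe((means[k][0] + rng.gauss(0.0, temp),
                                      means[k][1] + rng.gauss(0.0, temp)))
                     for _ in range(samples)]
            costs = [cost(q, nominal) for q in cands]
            c_min = min(costs)
            # Softmin weights; the exponent is never positive, so no overflow.
            w = [math.exp(-(c - c_min) / temp) for c in costs]
            total = sum(w)
            tx = sum(wi * q[0] for wi, q in zip(w, cands)) / total
            ty = sum(wi * q[1] for wi, q in zip(w, cands)) / total
            # Momentum-augmented update toward the weighted sample average.
            vx = beta * vels[k][0] + (1.0 - beta) * (tx - means[k][0])
            vy = beta * vels[k][1] + (1.0 - beta) * (ty - means[k][1])
            vels[k] = (vx, vy)
            # The safe set is non-convex, so the updated mean is projected again.
            means[k] = project_to_safe((means[k][0] + vx, means[k][1] + vy))
            if cost(means[k], nominal) < cost(best, nominal):
                best = means[k]
        # Replica exchange: Metropolis swap test between adjacent temperatures.
        for k in range(len(temps) - 1):
            delta = (1.0 / temps[k] - 1.0 / temps[k + 1]) * (
                cost(means[k], nominal) - cost(means[k + 1], nominal))
            if delta >= 0.0 or rng.random() < math.exp(delta):
                means[k], means[k + 1] = means[k + 1], means[k]
                vels[k], vels[k + 1] = vels[k + 1], vels[k]
    return best


safe_contact = optimize_contact(nominal=(0.1, 0.0))
```

In this sketch the nominal contact `(0.1, 0.0)` lies inside the keep-out disk, so the returned contact is a nearby point outside it. The high-temperature replicas explore widely and the swap step lets a good candidate migrate to the low-temperature replica, which is the usual motivation for replica exchange in landscapes with discontinuities.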