🤖 AI Summary
This work addresses the challenge of enforcing runtime constraints, such as obstacle avoidance, joint limits, and center-of-mass stability, when deploying reinforcement learning policies on humanoid robots. The authors propose ConstrainedMimic, a framework that, for the first time, integrates differentiable control barrier functions (CBFs) with full-body dynamics into a reinforcement learning-based tracking policy. By embedding kinematic and dynamic constraints directly into operational space control, the method enforces safety and feasibility in real time while minimally perturbing the original policy, and it accommodates new constraints specified after training. The fully differentiable system runs on CPU, GPU, or TPU and can be deployed at up to 300-500 Hz. On a simulated Unitree G1, it demonstrates whole-body motion tracking and teleoperation under self-collision avoidance, external obstacle avoidance, joint-limit, and center-of-mass stability constraints.
📝 Abstract
Recent advances in reinforcement learning (RL) have demonstrated impressive whole-body agility for humanoid robots, yet ensuring safety and satisfying constraints, particularly those specified after training, remains a challenge. Towards this goal, we present ConstrainedMimic, a control framework that leverages whole-body kinematics and dynamics for real-time constraint enforcement within RL tracking policies. By integrating principles from operational space control and control barrier functions (CBFs), we enable the satisfaction of arbitrary runtime constraints on both the kinematic reference motion and the underlying dynamics. In whole-body motion-tracking and teleoperation experiments on a simulated Unitree G1 with a learned policy, we demonstrate enforcement of collision-avoidance (both self-collision and external obstacles), joint-limit, and center-of-mass stability constraints. By remaining consistent with the current contact mode and tracking objectives, we minimally restrict the capabilities of the policy when constraints are active. Our method is fully differentiable, runs on CPU, GPU, and TPU, and can be deployed at up to 300-500 Hz. All software will be freely available upon publication.
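To make the idea of a CBF acting as a minimally invasive filter on a policy's command concrete, here is a small illustrative sketch. It is not the ConstrainedMimic implementation: it assumes a 2D point with single-integrator dynamics (the command is the velocity), one circular obstacle, and a proportional controller standing in for the learned policy. All function names and parameters (`cbf_filter`, `obstacle_barrier`, `simulate`, `alpha`) are hypothetical. With a single constraint, the usual CBF quadratic program has a closed-form solution, which the sketch uses in place of a QP solver.

```python
def cbf_filter(u_nom, grad_h, h, alpha=5.0):
    """Return the command closest to u_nom satisfying grad_h . u + alpha * h >= 0.

    Closed-form solution of the single-constraint QP
        min_u ||u - u_nom||^2   s.t.   grad_h . u + alpha * h >= 0.
    """
    slack = sum(g * u for g, u in zip(grad_h, u_nom)) + alpha * h
    if slack >= 0.0:
        return list(u_nom)  # constraint inactive: nominal command is unchanged
    norm_sq = sum(g * g for g in grad_h)
    if norm_sq == 0.0:
        return list(u_nom)  # degenerate gradient: no direction to correct along
    scale = -slack / norm_sq  # smallest shift along grad_h that restores the constraint
    return [u + scale * g for u, g in zip(u_nom, grad_h)]


def obstacle_barrier(p, center, radius):
    """Barrier h(p) = ||p - center||^2 - radius^2 and its gradient (h >= 0 is safe)."""
    d = [pi - ci for pi, ci in zip(p, center)]
    h = sum(di * di for di in d) - radius ** 2
    grad = [2.0 * di for di in d]
    return h, grad


def simulate(steps=400, dt=0.01):
    """Drive a point toward a goal past a circular obstacle, filtering each command."""
    p = [-2.0, 0.1]                      # assumed start position
    goal = [2.0, 0.0]                    # assumed goal position
    center, radius = [0.0, 0.0], 1.0     # assumed obstacle
    min_h = float("inf")
    for _ in range(steps):
        # Stand-in for the learned policy: a proportional pull toward the goal.
        u_nom = [g - pi for g, pi in zip(goal, p)]
        h, grad = obstacle_barrier(p, center, radius)
        min_h = min(min_h, h)
        u = cbf_filter(u_nom, grad, h)
        p = [pi + dt * ui for pi, ui in zip(p, u)]  # explicit Euler step
    return p, min_h


if __name__ == "__main__":
    final_p, min_h = simulate()
    print("final position:", final_p)
    print("smallest barrier value seen:", min_h)
```

The filter leaves the nominal command untouched whenever the constraint already holds, which mirrors the abstract's point about minimally restricting the policy when constraints are inactive. A full-body version would replace the single-integrator model with the robot's dynamics and handle many constraints at once, which generally requires a QP solver rather than this closed form.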