DoublyAware: Dual Planning and Policy Awareness for Temporal Difference Learning in Humanoid Locomotion

📅 2025-06-12
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address inefficient exploration and policy instability in humanoid robot gait learning, caused by the coupling of environmental stochasticity and model uncertainty, this paper proposes DoublyAware, an uncertainty-aware extension of TD-MPC. The method decouples uncertainty into two disjoint components: conformal prediction quantifies planning uncertainty to enable risk-aware trajectory filtering, while a Group-Relative Policy Constraint (GRPC) optimizer models policy uncertainty via an adaptive trust region in the latent action space. This combination lets the agent prioritize high-confidence behavior while retaining targeted exploration. Evaluated on the HumanoidBench locomotion suite with the 26-DoF Unitree H1-2 humanoid, it improves sample efficiency, convergence speed, and motion feasibility over reinforcement learning baselines.

📝 Abstract
Achieving robust robot learning for humanoid locomotion is a fundamental challenge in model-based reinforcement learning (MBRL), where environmental stochasticity and randomness can hinder efficient exploration and learning stability. The environmental, so-called aleatoric, uncertainty can be amplified in high-dimensional action spaces with complex contact dynamics, and further entangled with epistemic uncertainty in the models during learning phases. In this work, we propose DoublyAware, an uncertainty-aware extension of Temporal Difference Model Predictive Control (TD-MPC) that explicitly decomposes uncertainty into two disjoint interpretable components, i.e., planning and policy uncertainties. To handle the planning uncertainty, DoublyAware employs conformal prediction to filter candidate trajectories using quantile-calibrated risk bounds, ensuring statistical consistency and robustness against stochastic dynamics. Meanwhile, policy rollouts are leveraged as structured informative priors to support the learning phase with Group-Relative Policy Constraint (GRPC) optimizers that impose a group-based adaptive trust-region in the latent action space. This principled combination enables the robot agent to prioritize high-confidence, high-reward behavior while maintaining effective, targeted exploration under uncertainty. Evaluated on the HumanoidBench locomotion suite with the Unitree 26-DoF H1-2 humanoid, DoublyAware demonstrates improved sample efficiency, accelerated convergence, and enhanced motion feasibility compared to RL baselines. Our simulation results emphasize the significance of structured uncertainty modeling for data-efficient and reliable decision-making in TD-MPC-based humanoid locomotion learning.
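The quantile-calibrated trajectory filtering described above can be sketched with standard split-conformal prediction. This is an illustrative reconstruction, not the authors' code: the function names and the use of a scalar nonconformity score per candidate trajectory are assumptions.

```python
import numpy as np

def conformal_quantile(calib_scores, alpha=0.1):
    """Split-conformal (1 - alpha) quantile of calibration nonconformity scores.

    Uses the finite-sample-corrected level ceil((n + 1)(1 - alpha)) / n.
    """
    n = len(calib_scores)
    q_level = min(1.0, np.ceil((n + 1) * (1 - alpha)) / n)
    return np.quantile(calib_scores, q_level, method="higher")

def filter_trajectories(candidate_scores, calib_scores, alpha=0.1):
    """Keep only candidate trajectories whose nonconformity score
    falls within the calibrated risk bound."""
    q = conformal_quantile(calib_scores, alpha)
    return np.asarray(candidate_scores) <= q
```

In a planner loop, `candidate_scores` would be the risk estimates of sampled rollouts; surviving candidates are then scored by the usual TD-MPC objective.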
Problem

Research questions and friction points this paper is trying to address.

Addresses robust humanoid locomotion learning under environmental stochasticity.
Decomposes uncertainty into planning and policy components for interpretability.
Enhances sample efficiency and motion feasibility in TD-MPC frameworks.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Decomposes uncertainty into planning and policy components.
Uses conformal prediction for robust trajectory filtering.
Applies GRPC optimizers for adaptive trust-region learning.
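In spirit, the group-relative constraint with a latent-space trust region could look roughly like the sketch below. The names (`group_relative_advantages`, `trust_region_step`), the GRPO-style group normalization, and the norm-projection rule are illustrative assumptions, not the paper's actual formulation.

```python
import numpy as np

def group_relative_advantages(rewards):
    """Normalize rewards within a sampled group so each update signal
    is relative to the group's own mean and spread (GRPO-style)."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + 1e-8)

def trust_region_step(z_old, z_proposed, radius):
    """Project a proposed latent-action update back into a trust region
    of the given radius around the current latent action."""
    delta = z_proposed - z_old
    norm = np.linalg.norm(delta)
    if norm > radius:
        delta = delta * (radius / norm)
    return z_old + delta
```

The adaptive part would come from varying `radius` with the group's measured uncertainty, tightening updates when policy rollouts disagree.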
👥 Authors

Khang Nguyen (UIT)
An T. Le (Intelligent Autonomous Systems Lab, TU Darmstadt, Germany)
Jan Peters (Intelligent Autonomous Systems Lab, TU Darmstadt, Germany; German Research Center for AI (DFKI), SAIROL, Darmstadt, Germany; Hessian.AI, Darmstadt, Germany)
Minh Nhat Vu (Automation & Control Institute (ACIN), Vienna, Austria)