🤖 AI Summary
Existing human-in-the-loop reinforcement learning (HiL-RL) methods depend on frequent human interventions to pull the policy out of unproductive exploration, which raises labor cost and limits scalability. This work proposes UniIntervene, a framework that shifts most of the intervention burden from human operators to the agent itself. Using future-conditioned action-value estimation and a temporal value-risk critic, the agent detects sustained stagnation or degradation in its estimated value, retrieves a high-value recovery target from a memory of past intervention episodes, and generates corrective actions with a goal-conditioned recovery policy. Intervention thus changes from passive human correction into a value-aware recovery process. Across diverse real-world manipulation tasks, UniIntervene improves the average success rate by 8.6% and reduces human interventions by 57% relative to state-of-the-art HiL-RL baselines.
📝 Abstract
Human-in-the-loop reinforcement learning (HiL-RL) has emerged as an effective paradigm for real-world robotic manipulation, enabling online policy improvement with human guidance. However, current HiL-RL frameworks remain intervention-intensive, relying on frequent human corrections to redirect the policy out of unproductive exploration, which incurs high labor cost and limits real-world scalability. To address this, we propose UniIntervene, an agentic intervention model that detects unproductive exploration and autonomously recovers the policy toward high-value states, taking over the bulk of interventions from human operators. Specifically, UniIntervene first performs future-conditioned action-value estimation, predicting the latent consequence of the current action and evaluating its induced value, which provides a more stable progress signal. Building on this, a temporal value-risk critic aggregates recent value dynamics and triggers intervention when the estimated value exhibits sustained stagnation or degradation. When intervention is required, UniIntervene retrieves a high-value recovery target from a memory of past intervention episodes and produces executable corrective actions through a goal-conditioned recovery policy. In this way, UniIntervene turns intervention from passive human correction into a value-aware recovery process for efficient real-world RL. Extensive experiments on diverse real-world manipulation tasks demonstrate that UniIntervene improves the average success rate by 8.6% while reducing human interventions by 57% relative to state-of-the-art HiL-RL baselines.
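The trigger-and-retrieve logic described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the paper's temporal value-risk critic is presumably learned, whereas the sketch substitutes a hand-written sliding-window rule, and the names `StagnationMonitor`, `MemoryEntry`, `retrieve_target`, `window`, `min_gain`, `patience`, and `radius` are all hypothetical. It assumes a scalar value estimate per step and a Euclidean distance between states, neither of which the abstract specifies.

```python
from collections import deque
from dataclasses import dataclass
from typing import Deque, List, Optional, Sequence


@dataclass
class MemoryEntry:
    """One stored recovery target (hypothetical structure, not from the paper)."""
    state: Sequence[float]
    value: float


class StagnationMonitor:
    """Hand-written stand-in for a temporal value-risk critic (assumption)."""

    def __init__(self, window: int = 5, min_gain: float = 0.01, patience: int = 3):
        self.window = window
        self.values: Deque[float] = deque(maxlen=window)
        self.min_gain = min_gain
        self.patience = patience
        self.stalled_steps = 0

    def update(self, value: float) -> bool:
        """Record one value estimate; return True when intervention should trigger."""
        self.values.append(value)
        if len(self.values) < self.window:
            return False  # not enough history yet
        # Net change in value across the window; negative means degradation.
        gain = self.values[-1] - self.values[0]
        # Count consecutive steps whose windowed gain stays below the threshold.
        self.stalled_steps = self.stalled_steps + 1 if gain < self.min_gain else 0
        return self.stalled_steps >= self.patience


def retrieve_target(
    memory: List[MemoryEntry], state: Sequence[float], radius: float
) -> Optional[MemoryEntry]:
    """Return the highest-value stored entry within `radius` of `state`, if any."""

    def dist(a: Sequence[float], b: Sequence[float]) -> float:
        return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

    nearby = [m for m in memory if dist(m.state, state) <= radius]
    return max(nearby, key=lambda m: m.value, default=None)


if __name__ == "__main__":
    monitor = StagnationMonitor(window=3, min_gain=0.05, patience=2)
    memory = [MemoryEntry((0.0, 0.0), 0.2), MemoryEntry((1.0, 0.0), 0.9)]
    for v in [0.1, 0.2, 0.3, 0.4, 0.4, 0.4, 0.39]:
        if monitor.update(v):
            print("intervene toward", retrieve_target(memory, (0.5, 0.0), radius=1.0))
```

In a full system the retrieved target would be passed to a goal-conditioned recovery policy, which is omitted here. The `patience` counter is one simple way to require that stagnation be sustained before triggering, so that a single noisy value estimate does not cause an intervention.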