🤖 AI Summary
Offline goal-conditioned reinforcement learning (GCRL) faces two key bottlenecks in long-horizon tasks: (1) high-level policies struggle to generate semantically meaningful subgoals, and (2) long-horizon advantage signals suffer from sign ambiguity caused by difficult temporal credit assignment. This paper proposes Option-aware Temporally Abstracted value learning (OTA), a novel approach that integrates option-based temporal abstraction directly into temporal-difference (TD) updates, simultaneously compressing the value function's effective horizon and correcting the sign of the advantage signal. Built on the HIQL framework, the method unifies option modeling, temporally abstracted TD learning, and BCQ-style offline policy extraction. Evaluated on the OGBench benchmark, including maze navigation and vision-based robotic manipulation environments, OTA significantly outperforms state-of-the-art baselines such as HIQL in high-level policy performance, demonstrating stronger generalization, training stability, and subgoal generation for long-horizon tasks.
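The sign-ambiguity problem can be illustrated with a toy calculation (an illustrative sketch of ours, not code from the paper): under a discount factor γ, an idealized state n TD hops from the goal has value γⁿ, so for large n the value gap between a state and a nearby subgoal shrinks toward zero and is easily swamped by estimation noise, flipping the advantage's sign. Options of length k cut the hop count from n to roughly n/k, widening the gap. The `advantage_sign_error` helper and its noise level are hypothetical choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def advantage_sign_error(n_hops, noise=0.005, gamma=0.99, trials=10_000):
    """Fraction of noisy advantage estimates V(subgoal) - V(state) that come
    out negative when the subgoal is truly one TD hop closer to the goal."""
    v_state = gamma ** n_hops          # idealized value: gamma^(hops to goal)
    v_subgoal = gamma ** (n_hops - 1)  # subgoal is one hop closer
    noisy_adv = (v_subgoal + rng.normal(0.0, noise, trials)) - (
        v_state + rng.normal(0.0, noise, trials)
    )
    return float(np.mean(noisy_adv < 0))  # sign-flip rate

# Goal 500 env steps away; options of length k = 10 contract this to 50 hops.
e_flat = advantage_sign_error(n_hops=500)  # flat one-step value learning
e_opt = advantage_sign_error(n_hops=50)    # option-level (contracted) horizon
print(e_flat, e_opt)  # the flat sign-flip rate is far higher
```

With the flat horizon the true advantage is nearly zero relative to the noise, so almost half of the estimates carry the wrong sign, while the contracted horizon keeps the sign mostly correct.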
📝 Abstract
Offline goal-conditioned reinforcement learning (GCRL) offers a practical learning paradigm where goal-reaching policies are trained from abundant unlabeled (reward-free) datasets without additional environment interaction. However, offline GCRL still struggles with long-horizon tasks, even with recent advances that employ hierarchical policy structures, such as HIQL. Investigating the root cause of this challenge, we make the following observations: First, performance bottlenecks mainly stem from the high-level policy's inability to generate appropriate subgoals. Second, when learning the high-level policy in the long-horizon regime, the sign of the advantage signal frequently becomes incorrect. We therefore argue that improving the value function so that it produces a clear advantage signal for learning the high-level policy is essential. In this paper, we propose a simple yet effective solution: Option-aware Temporally Abstracted value learning, dubbed OTA, which incorporates temporal abstraction into the temporal-difference learning process. By making the value update option-aware, the proposed learning scheme contracts the effective horizon length, enabling better advantage estimates even in long-horizon regimes. We experimentally show that a high-level policy extracted with the OTA value function achieves strong performance on complex tasks from OGBench, a recently proposed offline GCRL benchmark, including maze navigation and visual robotic manipulation environments.
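The horizon-contraction effect can be sketched on a toy deterministic chain (a minimal illustration under our own assumptions, not the paper's actual update rule): with one-step TD targets, value information propagates one state per synchronous sweep, so it takes as many sweeps as the chain is long to reach the start; bootstrapping from a state k steps ahead, as an option-level update would, propagates it k states per sweep. The `td_sweep` helper and the chain setup are hypothetical.

```python
import numpy as np

def td_sweep(V, goal, k, gamma=0.99):
    """One synchronous sweep of k-step bootstrapped TD targets on a
    deterministic chain where the policy always moves right toward the goal."""
    V_new = V.copy()
    for s in range(len(V)):
        if s == goal:
            continue                   # goal value stays fixed
        s_next = min(s + k, goal)      # state reached after (up to) k steps
        steps = s_next - s
        V_new[s] = gamma**steps * V[s_next]  # discounted bootstrap target
    return V_new

n, goal, sweeps = 12, 11, 3
V1 = np.zeros(n); V1[goal] = 1.0       # one-step TD (k = 1)
Vk = V1.copy()                         # option-level TD (k = 5)
for _ in range(sweeps):
    V1 = td_sweep(V1, goal, k=1)
    Vk = td_sweep(Vk, goal, k=5)

print(V1[0], Vk[0])  # one-step: still 0.0 at the start state; k-step: nonzero
```

After three sweeps the one-step value has only reached the three states nearest the goal, while the k-step value has already propagated a useful (and correctly ordered) signal back to the start state — the contracted effective horizon the abstract refers to.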