🤖 AI Summary
In multi-objective sequential decision-making, conventional goal-conditioned (GC) policies—optimized solely for the current goal—can create path blockages that render subsequent goals unreachable, because they neglect inter-goal dependencies.
Method: We propose two MDP formulations—Dual-Objective Conditional MDPs and Two-Step Lookahead Goal-Conditioned MDPs—that explicitly model sequential goal dependencies within the GC policy’s conditioning mechanism. Built on the TD3+HER architecture, the approach jointly optimizes for both current-goal and subsequent-goal reachability during policy training.
Contribution/Results: Evaluated on navigation and inverted-pendulum tasks, our method significantly improves policy stability and sample efficiency, consistently outperforming standard GC-MDPs and single-objective GC baselines. By incorporating lookahead goal dependencies into the conditioning structure, it mitigates the myopic optimization inherent in traditional GC policies.
📝 Abstract
Several hierarchical reinforcement learning methods leverage planning to create a graph or sequence of intermediate goals, guiding a lower-level goal-conditioned (GC) policy toward a final goal. The low-level policy is typically conditioned on the current goal, with the aim of reaching it as quickly as possible. However, this approach can fail when an intermediate goal can be reached in multiple ways, some of which may make it impossible to continue toward subsequent goals. To address this issue, we introduce two Markov Decision Process (MDP) formulations in which the optimization objective favors policies that not only reach the current goal but also subsequent ones. In the first, the agent is conditioned on both the current and final goals; in the second, it is conditioned on the next two goals in the sequence. We conduct a series of experiments on navigation and pole-balancing tasks in which sequences of intermediate goals are given. By evaluating policies trained with TD3+HER on both the standard GC-MDP and our proposed MDPs, we show that, in most cases, conditioning on the next two goals improves stability and sample efficiency over other approaches.
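The core idea of the second formulation—conditioning the policy on the next two goals rather than only the current one—can be sketched as follows. This is a minimal illustration, not the paper's actual implementation; the function names, the concatenation-based conditioning, and the sparse reward shape are all assumptions for exposition.

```python
import numpy as np

def two_goal_policy_input(state, current_goal, next_goal):
    """Build the actor/critic input for a two-step lookahead GC policy.

    Instead of conditioning on the current goal alone, the state is
    concatenated with both the current goal and the next goal in the
    sequence, so the policy can prefer ways of reaching the current
    goal that keep the next one reachable. (Illustrative assumption:
    conditioning is done by simple concatenation.)
    """
    return np.concatenate([state, current_goal, next_goal])

def lookahead_reward(achieved, current_goal, next_goal_reachable, eps=0.05):
    """Hypothetical sparse reward for the lookahead objective.

    The agent earns the non-penalty reward only if it both reaches the
    current goal and does so without blocking the next goal; otherwise
    it receives the usual -1 sparse-HER-style penalty. The reachability
    predicate `next_goal_reachable` is assumed to be provided by the
    environment or a planner.
    """
    reached = np.linalg.norm(achieved - current_goal) < eps
    return float(reached and next_goal_reachable) - 1.0  # in {-1.0, 0.0}
```

Under this sketch, a TD3+HER agent would simply treat the concatenated vector as its observation, with hindsight relabeling applied to the goal components as usual.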