DIPPER: Direct Preference Optimization to Accelerate Primitive-Enabled Hierarchical Reinforcement Learning

๐Ÿ“… 2024-06-16
๐Ÿ›๏ธ arXiv.org
๐Ÿ“ˆ Citations: 0
โœจ Influential: 0
๐Ÿ“„ PDF
๐Ÿค– AI Summary
Hierarchical reinforcement learning (HRL) for complex robotic tasks suffers from the scarcity of human preference data and is prone to non-stationarity and infeasible subgoal generation. Method: We propose the first method integrating direct preference optimization (DPO) into an HRL framework, featuring primitive-informed regularization derived from a bi-level optimization formulation of the HRL problem: the high-level policy learns to abstract task structure from preference signals via DPO, while the low-level policy executes primitive actions via standard RL, eliminating the need for explicit reward engineering. Contribution/Results: The approach substantially mitigates infeasible subgoal generation and training instability. Evaluated on multiple challenging robotics tasks, it outperforms both hierarchical and non-hierarchical baselines, demonstrating the effectiveness and robustness of preference-driven hierarchical learning without handcrafted rewards.

๐Ÿ“ Abstract
Learning control policies to perform complex robotics tasks from human preference data presents significant challenges. On the one hand, the complexity of such tasks typically requires learning policies to perform a variety of subtasks, then combining them to achieve the overall goal. At the same time, comprehensive, well-engineered reward functions are typically unavailable in such problems, while limited human preference data often is; making efficient use of such data to guide learning is therefore essential. Methods for learning to perform complex robotics tasks from human preference data must overcome both these challenges simultaneously. In this work, we introduce DIPPER: Direct Preference Optimization to Accelerate Primitive-Enabled Hierarchical Reinforcement Learning, an efficient hierarchical approach that leverages direct preference optimization to learn a higher-level policy and reinforcement learning to learn a lower-level policy. DIPPER enjoys improved computational efficiency due to its use of direct preference optimization instead of standard preference-based approaches such as reinforcement learning from human feedback, while it also mitigates the well-known hierarchical reinforcement learning issues of non-stationarity and infeasible subgoal generation due to our use of primitive-informed regularization inspired by a novel bi-level optimization formulation of the hierarchical reinforcement learning problem. To validate our approach, we perform extensive experimental analysis on a variety of challenging robotics tasks, demonstrating that DIPPER outperforms hierarchical and non-hierarchical baselines, while ameliorating the non-stationarity and infeasible subgoal generation issues of hierarchical reinforcement learning.
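
For context, since the abstract only names direct preference optimization, here is the standard DPO objective in the form the high-level policy would plausibly optimize over subgoal preferences (the conditioning on states s and preferred/dispreferred subgoal pairs (g_w, g_l) is our assumption, not taken from the paper):

\mathcal{L}_{\text{DPO}}(\pi_\theta; \pi_{\text{ref}}) = -\,\mathbb{E}_{(s,\, g_w,\, g_l)}\Big[\log \sigma\Big(\beta \log \tfrac{\pi_\theta(g_w \mid s)}{\pi_{\text{ref}}(g_w \mid s)} - \beta \log \tfrac{\pi_\theta(g_l \mid s)}{\pi_{\text{ref}}(g_l \mid s)}\Big)\Big]

Here \sigma is the logistic function, \pi_{\text{ref}} is a reference policy (the paper's primitive-informed regularization plausibly enters here, though the abstract does not give its exact form), and \beta controls how strongly \pi_\theta is kept close to \pi_{\text{ref}}. The low-level policy is trained separately with standard reinforcement learning to reach the subgoals emitted by the high-level policy.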
Problem

Research questions and friction points this paper is trying to address.

Human Preferences
Hierarchical Reinforcement Learning
Non-Stationarity and Infeasible Subgoal Generation
Innovation

Methods, ideas, or system contributions that make the work stand out.

DIPPER Method
Hierarchical Learning
Reinforcement Learning
Utsav Singh
CSE Dept., IIT Kanpur, India
Souradip Chakraborty
University of Maryland, College Park | Past: ML Research @ Walmart Labs
Reinforcement Learning · Deep Learning · Robustness · Uncertainty
Wesley A. Suttle
U.S. Army Research Laboratory, Adelphi, MD, USA
Brian M. Sadler
University of Texas, Austin, Texas, USA
Vinay P. Namboodiri
Department of Computer Science, University of Bath
Computer Vision · Image Processing · Machine Learning
A. S. Bedi
CS Dept., University of Central Florida, Orlando, Florida, USA