SENIOR: Efficient Query Selection and Preference-Guided Exploration in Preference-based Reinforcement Learning

📅 2025-06-17
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the low human feedback efficiency and poor sample efficiency of preference-based reinforcement learning (PbRL), this paper proposes SENIOR. First, it introduces a Motion-Distinction-based Selection (MDS) scheme that uses kernel density estimation over state distributions to automatically identify easily distinguishable, task-relevant trajectory segment pairs, making queries easier for humans to label. Second, it designs a Preference-Guided Exploration (PGE) module that converts the learned preference model into intrinsic rewards, directing exploration toward highly preferred, rarely visited states. SENIOR integrates preference modeling, intrinsic reward shaping, and online policy optimization. Evaluated on six simulated and four real-robot manipulation tasks, SENIOR significantly improves feedback efficiency and policy convergence speed, consistently outperforming five state-of-the-art baselines.

📝 Abstract
Preference-based Reinforcement Learning (PbRL) methods offer a way to avoid reward engineering by learning reward models from human preferences. However, poor feedback and sample efficiency remain obstacles to the application of PbRL. In this paper, we present a novel efficient query selection and preference-guided exploration method, called SENIOR, which selects meaningful, easy-to-compare behavior segment pairs to improve human feedback efficiency and accelerates policy learning with designed preference-guided intrinsic rewards. Our key idea is twofold: (1) We design a Motion-Distinction-based Selection scheme (MDS). It selects segment pairs with apparent motion in different directions through kernel density estimation of states; such pairs are more task-related and easier for humans to label with preferences. (2) We propose a novel Preference-Guided Exploration method (PGE). It encourages exploration toward states with high preference and few visits and continuously guides the agent to valuable samples. The synergy between the two mechanisms significantly accelerates reward and policy learning. Our experiments show that SENIOR outperforms five existing methods in both human feedback efficiency and policy convergence speed on six complex robot manipulation tasks in simulation and four in the real world.
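The abstract describes MDS only at a high level, so the following is a minimal sketch of the idea, not the paper's algorithm: it scores a candidate segment pair by how little the two segments' state distributions overlap under a Gaussian kernel density estimate, on the assumption that low overlap corresponds to apparent, differently directed motion that a human can label easily. All function names, the cross-density score, and the bandwidth are illustrative choices.

```python
import numpy as np

def gaussian_kde(samples, points, bandwidth=0.5):
    """Evaluate an (unnormalised) Gaussian kernel density estimate.
    samples: (n, d) states from one segment; points: (m, d) query states."""
    diffs = points[:, None, :] - samples[None, :, :]       # (m, n, d)
    sq = np.sum(diffs ** 2, axis=-1) / (2 * bandwidth ** 2)
    return np.mean(np.exp(-sq), axis=1)                    # (m,) densities

def motion_distinction_score(seg_a, seg_b, bandwidth=0.5):
    """Hypothetical MDS-style score: a pair whose state distributions
    overlap little is easy to tell apart, so it scores high.
    seg_a, seg_b: (T, d) arrays of states from two trajectory segments."""
    # Average density of each segment's states under the other's KDE:
    # low cross-density means the two motions occupy different regions.
    cross_ab = gaussian_kde(seg_a, seg_b, bandwidth).mean()
    cross_ba = gaussian_kde(seg_b, seg_a, bandwidth).mean()
    return -0.5 * (cross_ab + cross_ba)  # higher = more distinguishable

def select_query_pairs(segments, n_queries, bandwidth=0.5):
    """Rank all candidate segment pairs and keep the most distinguishable."""
    scored = []
    for i in range(len(segments)):
        for j in range(i + 1, len(segments)):
            s = motion_distinction_score(segments[i], segments[j], bandwidth)
            scored.append((s, i, j))
    scored.sort(reverse=True)
    return [(i, j) for _, i, j in scored[:n_queries]]
```

Under this sketch, two segments tracing nearly the same motion score low, while a pair whose states lie in different regions of the workspace is ranked first for querying.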
Problem

Research questions and friction points this paper is trying to address.

Low human feedback efficiency in PbRL query selection
Slow policy learning without informative reward signals
Poor sample efficiency on complex robot manipulation tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Motion-Distinction-based Selection for meaningful segments
Preference-guided exploration with intrinsic rewards
Synergy between MDS and PGE accelerates learning
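The PGE idea above, exploring toward states the learned preference model rates highly but that have been visited rarely, can be sketched as an intrinsic reward. This is an assumption-laden illustration, not the paper's formulation: `reward_model` stands in for any learned preference score, and visitation is tracked with a simple hashed count table rather than whatever density model the paper uses.

```python
import numpy as np

class PreferenceGuidedBonus:
    """Hypothetical PGE-style intrinsic reward combining a learned
    preference score with a count-based novelty bonus (illustrative)."""

    def __init__(self, reward_model, n_bins=50, beta=1.0):
        self.reward_model = reward_model  # callable: state -> preference score
        self.n_bins = n_bins              # discretisation for visit counting
        self.beta = beta                  # bonus scale
        self.counts = {}

    def _key(self, state):
        # Coarse discretisation of the state vector for counting visits.
        return tuple(np.floor(np.asarray(state) * self.n_bins).astype(int))

    def intrinsic_reward(self, state):
        k = self._key(state)
        self.counts[k] = self.counts.get(k, 0) + 1
        novelty = 1.0 / np.sqrt(self.counts[k])   # few visits -> large bonus
        preference = self.reward_model(state)     # learned preference score
        return self.beta * preference * novelty
```

With this shaping, revisiting the same state decays its bonus, while an unvisited state with a high preference score yields the largest intrinsic reward, nudging the policy toward valuable, under-explored regions.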
Hexian Ni
State Key Laboratory of Multimodal Artificial Intelligence Systems, Institute of Automation, Chinese Academy of Sciences, Beijing 100190, China, and also with the School of Artificial Intelligence, University of Chinese Academy of Sciences, Beijing 100049, China
Tao Lu
State Key Laboratory of Multimodal Artificial Intelligence Systems, Institute of Automation, Chinese Academy of Sciences, Beijing 100190, China
Haoyuan Hu
State Key Laboratory of Multimodal Artificial Intelligence Systems, Institute of Automation, Chinese Academy of Sciences, Beijing 100190, China, and also with the School of Artificial Intelligence, University of Chinese Academy of Sciences, Beijing 100049, China
Yinghao Cai
Institute of Automation, Chinese Academy of Sciences
Shuo Wang
State Key Laboratory of Multimodal Artificial Intelligence Systems, Institute of Automation, Chinese Academy of Sciences, Beijing 100190, China