CRED: Counterfactual Reasoning and Environment Design for Active Preference Learning

📅 2026-03-09
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the challenge of specifying and balancing optimization objectives for robots in complex environments, where manual design is difficult and existing active preference learning methods are constrained by fixed trajectory sets, limiting query informativeness and diversity. To overcome this, the authors propose a novel approach that jointly optimizes environment design and trajectory selection for the first time. By leveraging counterfactual reasoning, the method generates trajectory pairs that effectively reveal differences among candidate reward functions, thereby eliciting more informative user preferences. The framework integrates Bayesian reward belief sampling, learnable environment parameterization, and a strategic trajectory-pair generation policy to substantially enhance both the information content and diversity of queries. Experiments demonstrate that the proposed method outperforms existing approaches in reward accuracy and sample efficiency, and achieves higher user ratings in human subject studies.
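The counterfactual query-selection idea in the summary can be illustrated with a minimal sketch. This is not the authors' implementation: the Bradley-Terry preference model, linear reward features, and the variance-of-predictions disagreement score are all illustrative assumptions standing in for the paper's Bayesian belief sampling and trajectory-pair generation policy.

```python
import math

def preference_prob(w, feats_a, feats_b):
    """Assumed Bradley-Terry model: probability trajectory A is preferred
    under linear reward weights w over feature vectors."""
    diff = sum(wi * (fa - fb) for wi, fa, fb in zip(w, feats_a, feats_b))
    return 1.0 / (1.0 + math.exp(-diff))

def query_disagreement(reward_samples, feats_a, feats_b):
    """Counterfactual check: for each reward sampled from the current belief
    ("what if this were the true preference?"), predict the user's answer.
    High variance across samples means the query separates competing
    reward hypotheses; zero variance means the query is uninformative."""
    probs = [preference_prob(w, feats_a, feats_b) for w in reward_samples]
    mean = sum(probs) / len(probs)
    return sum((p - mean) ** 2 for p in probs) / len(probs)

def select_query(reward_samples, candidate_queries):
    """Pick the candidate (environment + trajectory pair, here reduced to
    precomputed feature vectors) whose predicted preference varies most
    across the sampled rewards."""
    return max(
        candidate_queries,
        key=lambda q: query_disagreement(reward_samples, q["feats_a"], q["feats_b"]),
    )
```

For example, with two sampled rewards of opposite sign, a pair of identical trajectories scores zero disagreement, while a pair differing along the disputed feature is selected.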

📝 Abstract
As a robot's operational environment and the tasks it performs grow in complexity, explicitly specifying and balancing optimization objectives to achieve a preferred behavior profile moves increasingly out of reach. These systems benefit strongly from being able to align their behavior with human preferences and respond to corrections, but manually encoding this feedback is infeasible. Active preference learning (APL) learns human reward functions by presenting trajectories for ranking. However, existing methods sample from fixed trajectory sets or replay buffers that limit query diversity and often fail to identify informative comparisons. We propose CRED, a novel trajectory generation method for APL that improves reward inference by jointly optimizing environment design and trajectory selection to efficiently query and extract preferences from users. CRED "imagines" new scenarios through environment design and leverages counterfactual reasoning -- by sampling possible rewards from its current belief and asking "What if this were the true preference?" -- to generate trajectory pairs that expose differences between competing reward functions. Comprehensive experiments and a user study show that CRED significantly outperforms state-of-the-art methods in reward accuracy and sample efficiency and receives higher user ratings.
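The abstract's core APL step, updating a belief over reward functions from a single pairwise ranking, can be sketched as a discrete Bayes update. This is an illustrative simplification, not CRED itself: the Bradley-Terry likelihood, the finite set of reward hypotheses, and the feature-vector inputs are all assumptions made for the sketch.

```python
import math

def bt_likelihood(w, feats_pref, feats_other):
    """Assumed Bradley-Terry likelihood of the observed preference
    (feats_pref ranked above feats_other) under linear weights w."""
    diff = sum(wi * (fp - fo) for wi, fp, fo in zip(w, feats_pref, feats_other))
    return 1.0 / (1.0 + math.exp(-diff))

def update_belief(belief, feats_pref, feats_other):
    """One APL step: reweight each reward hypothesis (tuple of weights) by
    how well it explains the user's ranking, then renormalize."""
    posterior = {
        w: p * bt_likelihood(w, feats_pref, feats_other) for w, p in belief.items()
    }
    z = sum(posterior.values())
    return {w: p / z for w, p in posterior.items()}

# Usage: a uniform belief over two opposed hypotheses shifts toward the
# one consistent with the user preferring the higher-feature trajectory.
belief = {(1.0,): 0.5, (-1.0,): 0.5}
posterior = update_belief(belief, [2.0], [0.0])
```

Repeating this update over a sequence of informative queries is what concentrates the belief on the user's true reward, which is why query informativeness drives sample efficiency.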
Problem

Research questions and friction points this paper is trying to address.

active preference learning
reward inference
trajectory generation
preference elicitation
human-robot interaction
Innovation

Methods, ideas, or system contributions that make the work stand out.

Counterfactual Reasoning
Environment Design
Active Preference Learning
Reward Inference
Trajectory Generation