PAWS: Preference Learning with Advantage-Weighted Segments

📅 2026-06-10

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses degraded temporal credit assignment in preference-based reinforcement learning (PbRL). The degradation stems from a mismatch: utility functions are trained on trajectory- or segment-level preferences, but policy optimization consumes their per-step estimates, which induces a distribution shift. To resolve this, the paper proposes PAWS, a segment-based preference learning method that performs policy updates directly with segment-level advantage functions. Aligning utility training with policy optimization preserves trajectory-level preference information and avoids unreliable per-step learning signals. Experiments on simulated robotic manipulation and locomotion tasks show that PAWS consistently outperforms existing PbRL approaches.

📝 Abstract

Preference-based reinforcement learning (PbRL) learns policies from human trajectory-level comparisons, avoiding explicit reward design and expert demonstrations. Existing methods typically train utility functions on trajectory or segment-level preferences while relying on per-step utility estimates during policy optimization. This training and inference mismatch induces a distribution shift that severely degrades temporal credit assignment and limits policy learning. We analyze this issue and propose PAWS, a segment-based preference learning method that performs policy updates directly using segment-level advantage functions. By aligning utility training with policy optimization, PAWS preserves trajectory-level preference information and avoids unreliable per-step learning signals. Experiments on simulated robotic manipulation and locomotion tasks demonstrate that PAWS consistently outperforms existing PbRL approaches, highlighting the importance of distribution-consistent preference learning.
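The mismatch described above can be illustrated with a minimal sketch. It assumes the Bradley-Terry preference model over summed per-step utilities, a common choice in PbRL; the abstract does not state that PAWS uses this exact form. All function names and numbers are hypothetical. The point is that a segment-level preference loss constrains only segment sums, so two utility models that assign credit to different timesteps can fit the preference data equally well.

```python
import math

def segment_utility(per_step_utilities):
    """Sum per-step utility estimates over one segment."""
    return sum(per_step_utilities)

def preference_probability(segment_a, segment_b):
    """Bradley-Terry probability that segment_a is preferred over segment_b
    (assumed model, not confirmed as the one used by PAWS)."""
    diff = segment_utility(segment_a) - segment_utility(segment_b)
    return 1.0 / (1.0 + math.exp(-diff))

def preference_loss(segment_a, segment_b, label):
    """Cross-entropy loss; label is 1.0 if segment_a was preferred, else 0.0."""
    p = preference_probability(segment_a, segment_b)
    return -(label * math.log(p) + (1.0 - label) * math.log(1.0 - p))

# Two hypothetical utility models scoring the same preferred segment.
# They agree on the segment sum (3.0) but disagree on which timestep
# deserves the credit.
model_1_a = [3.0, 0.0, 0.0]  # credit assigned to the first step
model_2_a = [0.0, 0.0, 3.0]  # credit assigned to the last step
segment_b = [1.0, 1.0, 0.0]  # rejected segment, sum 2.0

loss_1 = preference_loss(model_1_a, segment_b, 1.0)
loss_2 = preference_loss(model_2_a, segment_b, 1.0)

# The preference loss cannot tell the two models apart.
print(loss_1 == loss_2)
```

Because the loss is identical for both models, preference data alone gives no reason to trust either per-step assignment. A policy update that consumes only segment-level quantities sidesteps this ambiguity.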

Problem

Research questions and friction points this paper is trying to address.

Preference-based reinforcement learning

distribution shift

temporal credit assignment

policy optimization

segment-level preferences

Innovation

Methods, ideas, or system contributions that make the work stand out.

Preference-based reinforcement learning

Advantage-weighted segments

Distribution-consistent learning