Pref-GUIDE: Continual Policy Learning from Real-Time Human Feedback via Preference-Based Learning

📅 2025-08-09
📈 Citations: 0
Influential: 0
🤖 AI Summary
In online reinforcement learning, real-time human scalar feedback suffers from high noise, temporal inconsistency, and poor generalization. To address these challenges, this paper proposes Pref-GUIDE, a framework that dynamically converts real-time scalar feedback into short-horizon behavioral preference pairs, filters ambiguous feedback to mitigate temporal inconsistency, and introduces a crowd-sourced reward modeling mechanism that uses sliding-window pairwise comparisons and multi-user voting to establish robust consensus preferences. Pref-GUIDE integrates real-time feedback transformation, preference-based policy optimization, and ensemble reward modeling. Evaluated on three challenging tasks, it significantly outperforms scalar-feedback baselines; notably, its voting variant even surpasses expert-designed dense reward functions. The approach enables more stable and generalizable online learning driven by human feedback.

📝 Abstract
Training reinforcement learning agents with human feedback is crucial when task objectives are difficult to specify through dense reward functions. While prior methods rely on offline trajectory comparisons to elicit human preferences, such data is unavailable in online learning scenarios where agents must adapt on the fly. Recent approaches address this by collecting real-time scalar feedback to guide agent behavior and train reward models for continued learning after human feedback becomes unavailable. However, scalar feedback is often noisy and inconsistent, limiting the accuracy and generalization of learned rewards. We propose Pref-GUIDE, a framework that transforms real-time scalar feedback into preference-based data to improve reward model learning for continual policy training. Pref-GUIDE Individual mitigates temporal inconsistency by comparing agent behaviors within short windows and filtering ambiguous feedback. Pref-GUIDE Voting further enhances robustness by aggregating reward models across a population of users to form consensus preferences. Across three challenging environments, Pref-GUIDE significantly outperforms scalar-feedback baselines, with the voting variant exceeding even expert-designed dense rewards. By reframing scalar feedback as structured preferences with population feedback, Pref-GUIDE offers a scalable and principled approach for harnessing human input in online reinforcement learning.
Problem

Research questions and friction points this paper is trying to address.

Improving reward model learning from noisy real-time human feedback
Mitigating temporal inconsistency in scalar feedback for policy training
Enhancing robustness via consensus preferences from multiple users
Innovation

Methods, ideas, or system contributions that make the work stand out.

Transforms real-time scalar feedback into preference-based data
Mitigates inconsistency by comparing short-window behaviors
Aggregates user feedback to form consensus preferences
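The two steps above (windowed scalar-to-preference conversion with ambiguity filtering, then multi-user voting) can be sketched roughly as follows. This is a hypothetical illustration, not the paper's implementation: the window length, the ambiguity margin, and all function names are assumptions made for the example.

```python
from collections import defaultdict

def scalar_to_preferences(feedback, window=5, margin=0.2):
    """Turn a stream of scalar feedback into preference pairs.

    Compares mean feedback over adjacent short windows and returns
    (i, j, label) tuples, where label=1 means the segment starting at
    index i is preferred over the one starting at j. Near-ties
    (ambiguous feedback) are filtered out via `margin`.
    """
    prefs = []
    for start in range(0, len(feedback) - 2 * window + 1, window):
        a = feedback[start:start + window]
        b = feedback[start + window:start + 2 * window]
        diff = sum(a) / window - sum(b) / window
        if abs(diff) < margin:        # ambiguous feedback: drop the pair
            continue
        label = 1 if diff > 0 else 0  # 1: first window preferred
        prefs.append((start, start + window, label))
    return prefs

def vote_preferences(per_user_prefs):
    """Aggregate per-user preference labels into a consensus by majority vote."""
    votes = defaultdict(list)
    for prefs in per_user_prefs:
        for i, j, label in prefs:
            votes[(i, j)].append(label)
    consensus = {}
    for pair, labels in votes.items():
        mean = sum(labels) / len(labels)
        if mean != 0.5:               # drop exact ties across users
            consensus[pair] = 1 if mean > 0.5 else 0
    return consensus
```

The resulting preference pairs could then feed a standard pairwise reward-model loss (e.g. Bradley-Terry style), which is the usual route in preference-based RL; the paper's actual loss and aggregation details are in the PDF.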