Pairwise Preference Reward and Group-Based Diversity Enhancement for Superior Open-Ended Generation

📅 2026-05-18
📈 Citations: 0
Influential: 0
🤖 AI Summary
Open-ended generation tasks lack verifiable scalar rewards, human annotation is costly, and existing reinforcement learning approaches tend to collapse output diversity. To address these challenges, this work proposes PPR-GDE, a method that dispenses with scalar rewards and instead builds a reinforcement learning framework on pairwise preferences. It mitigates judge position bias by repeating each comparison with the response order swapped, and it adds a group-level semantic diversity reward that encourages responses within a group to spread out semantically. Both signals are combined in a single group-relative policy optimization objective. Experiments on role-playing tasks show that PPR-GDE outperforms strong RL baselines, improving generation quality and expressive diversity together. Further analysis indicates that pairwise preferences drive alignment with subjective judgments, while the group diversity reward accounts for the broader semantic coverage.
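The swapped-order comparison described above can be sketched roughly as follows. This is a minimal illustration, not the paper's procedure: the `judge` callable, the win-rate aggregation over all pairs in a group, and the toy judges are all assumptions introduced here.

```python
from itertools import combinations
from typing import Callable, List

# Hypothetical judge signature (an assumption): returns 0 if the response
# shown first is preferred, 1 if the response shown second is preferred.
Judge = Callable[[str, str, str], int]


def pairwise_rewards(prompt: str, responses: List[str], judge: Judge) -> List[float]:
    """Win-rate reward per response, judging every pair in both orders."""
    n = len(responses)
    if n < 2:
        raise ValueError("need at least two responses to compare")
    wins = [0.0] * n
    for i, j in combinations(range(n), 2):
        # Original order: response i is shown first.
        first = judge(prompt, responses[i], responses[j])
        # Swapped order: response j is shown first.
        second = judge(prompt, responses[j], responses[i])
        wins[i] += (first == 0) + (second == 1)
        wins[j] += (first == 1) + (second == 0)
    # Each response takes part in 2 * (n - 1) comparisons.
    denom = 2 * (n - 1)
    return [w / denom for w in wins]


def longer_is_better(prompt: str, a: str, b: str) -> int:
    """Toy stand-in for an LLM judge: prefers the longer response."""
    return 0 if len(a) >= len(b) else 1


def always_first(prompt: str, a: str, b: str) -> int:
    """A maximally position-biased judge: always picks the first slot."""
    return 0
```

With a judge that always picks whichever response is shown first, every response ends up with the same reward, which is the sense in which swapping the order cancels a pure position preference in this sketch.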
📝 Abstract
Current reinforcement learning (RL) methods are broadly applicable and powerful in verifiable settings where scalar rewards can be provided. However, in open-ended generation tasks, verifying the correctness of responses remains challenging, and training reward models incurs substantial computational and annotation costs. Moreover, reinforcement learning with verifiable rewards (RLVR) often leads to diversity collapse and produces stereotypical or rigid outputs, which are particularly undesirable in open-domain scenarios. We propose Pairwise Preference Reward and Group-based Diversity Enhancement (PPR-GDE), an RL method better suited to open-ended generation. PPR-GDE does not require scalar rewards and incorporates group-level diversity into the reward signal. It preserves the comparative structure of subjective evaluation through a pairwise preference reward, mitigates judge position bias via repeated comparisons with swapped response order, and introduces a group-based diversity reward that explicitly encourages semantic dispersion within a response group. All of these reward signals are integrated into a unified group-relative policy optimization objective. We instantiate PPR-GDE on a role-playing task; experiments show that PPR-GDE achieves better alignment quality and expressive diversity than strong RL baselines. Further analysis shows that the pairwise preference reward is critical for aligning with subjective preferences, while the diversity metric plays an essential role in achieving superior expressive diversity and broader semantic coverage.
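One way to picture how a group-based diversity reward could feed a group-relative objective is sketched below. The abstract does not give the formulas, so everything specific here is an assumption for illustration: diversity is taken as the mean cosine distance from a response's embedding to the rest of its group, the two rewards are combined with a hypothetical additive `weight`, and advantages are the group-normalized rewards in the style commonly used for group-relative policy optimization.

```python
import math
from typing import List, Sequence


def cosine_distance(u: Sequence[float], v: Sequence[float]) -> float:
    """1 - cosine similarity; assumes both vectors are non-zero."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm


def diversity_rewards(embeddings: List[Sequence[float]]) -> List[float]:
    """Mean cosine distance from each response to the others in its group."""
    n = len(embeddings)
    if n < 2:
        raise ValueError("need at least two responses in a group")
    return [
        sum(cosine_distance(embeddings[i], embeddings[j])
            for j in range(n) if j != i) / (n - 1)
        for i in range(n)
    ]


def group_relative_advantages(
    preference: List[float],
    diversity: List[float],
    weight: float = 0.5,   # hypothetical mixing weight, not from the paper
    eps: float = 1e-8,
) -> List[float]:
    """Combine both rewards, then normalize within the group."""
    total = [p + weight * d for p, d in zip(preference, diversity)]
    mean = sum(total) / len(total)
    std = math.sqrt(sum((t - mean) ** 2 for t in total) / len(total))
    return [(t - mean) / (std + eps) for t in total]


# Toy group: two near-duplicate responses and one semantically distinct one.
embeddings = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
div = diversity_rewards(embeddings)
adv = group_relative_advantages([0.5, 0.5, 0.5], div)
```

In this toy group the preference rewards are tied, so the response whose embedding differs from the other two receives the only positive advantage.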
Problem

Research questions and friction points this paper is trying to address.

open-ended generation
reinforcement learning
diversity collapse
reward modeling
subjective evaluation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Pairwise Preference Reward
Group-based Diversity
Reinforcement Learning
Open-ended Generation
Diversity Collapse Mitigation
Guining Cao
Xingchen AGI Lab, China Telecom Artificial Intelligence Technology (Beijing) Co., Ltd; School of Software and Microelectronics, Peking University
Jiaxin Peng
Xingchen AGI Lab, China Telecom Artificial Intelligence Technology (Beijing) Co., Ltd
Chu Zeng
Xingchen AGI Lab, China Telecom Artificial Intelligence Technology (Beijing) Co., Ltd; Tsinghua University
Yu Zhao
University of Electronic Science and Technology of China
video coding, video compression
Shuangyong Song
Xingchen AGI Lab, China Telecom Artificial Intelligence Technology (Beijing) Co., Ltd
Yongxiang
Xingchen AGI Lab, China Telecom Artificial Intelligence Technology (Beijing) Co., Ltd