RAIDEN-R1: Improving Role-awareness of LLMs via GRPO with Verifiable Reward

📅 2025-05-15
🤖 AI Summary
Role-playing conversational agents (RPCAs) suffer from persistent challenges in maintaining role consistency during dialogue generation. To address this, we propose a GRPO-based reinforcement learning framework grounded in Verifiable Role-Aware Rewards (VRAR). First, we introduce a role-key mining approach—leveraging single- and multi-term role descriptors—to generate quantifiable, verifiable rewards for role consistency. Second, we employ collaborative large language models (LLMs) to construct a high-quality, role-aware chain-of-thought (CoT) dataset, enabling the first quantitative evaluation and optimization of role awareness. This framework bridges the critical gap in RPCA training: the lack of quantifiable, verifiable reward signals. Evaluated on the RAIDEN benchmark, our 14B-GRPO model achieves 88.04% on Script-Based Knowledge and 88.65% on Conversation Memory—substantially outperforming all baselines—and demonstrates exceptional robustness in resolving conflicting contexts and preserving first-person narrative consistency.

📝 Abstract
Role-playing conversational agents (RPCAs) face persistent challenges in maintaining role consistency. To address this, we propose RAIDEN-R1, a novel reinforcement learning framework that integrates Verifiable Role-Awareness Reward (VRAR). The method introduces both singular and multi-term mining strategies to generate quantifiable rewards by assessing role-specific keys. Additionally, we construct a high-quality, role-aware Chain-of-Thought dataset through multi-LLM collaboration, and conduct experiments to enhance reasoning coherence. Experiments on the RAIDEN benchmark demonstrate RAIDEN-R1's superiority: our 14B-GRPO model achieves 88.04% and 88.65% accuracy on the Script-Based Knowledge and Conversation Memory metrics, respectively, outperforming baseline models while maintaining robustness. Case analyses further reveal the model's enhanced ability to resolve conflicting contextual cues and sustain first-person narrative consistency. This work bridges the non-quantifiability gap in RPCA training and provides insights into role-aware reasoning patterns, advancing the development of RPCAs.
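The paper does not publish its reward implementation; the following is only a minimal sketch of what a verifiable role-aware reward based on single- and multi-term key mining could look like. The function name `vrar_reward` and its scoring scheme (a single key counts if it appears verbatim; a multi-term key counts only if all of its terms co-occur) are assumptions for illustration, not the authors' code.

```python
def vrar_reward(response: str,
                single_keys: list[str],
                multi_keys: list[list[str]]) -> float:
    """Hypothetical verifiable role-aware reward: the fraction of mined
    role keys the response satisfies. Single keys must appear as
    substrings; a multi-term key is satisfied only when all of its
    terms co-occur in the response (case-insensitive)."""
    text = response.lower()
    hits = sum(1 for key in single_keys if key.lower() in text)
    hits += sum(1 for group in multi_keys
                if all(term.lower() in text for term in group))
    total = len(single_keys) + len(multi_keys)
    return hits / total if total else 0.0
```

Because the score is computed by string matching against mined keys rather than by a learned judge, it is directly verifiable, which is the property GRPO needs from its reward signal.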
Problem

Research questions and friction points this paper is trying to address.

Enhancing role consistency in role-playing conversational agents (RPCAs)
Developing verifiable role-awareness rewards for reinforcement learning
Improving reasoning coherence via role-aware Chain-of-Thought datasets
Innovation

Methods, ideas, or system contributions that make the work stand out.

GRPO reinforcement learning with verifiable rewards
Multi-term mining for quantifiable role-awareness rewards
Multi-LLM collaboration for role-aware dataset
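The paper's full GRPO objective is not reproduced on this page; as a reference point, the group-relative advantage normalization at the core of GRPO can be sketched as below. This is a generic illustration (using the population standard deviation; the paper's exact normalization may differ), not the authors' implementation.

```python
import statistics

def grpo_advantages(rewards: list[float]) -> list[float]:
    """Group-relative advantages as used in GRPO: each sampled
    response's reward is normalized by the mean and standard deviation
    of its sampling group, so no learned value function (critic) is
    needed."""
    mu = statistics.mean(rewards)
    sigma = statistics.pstdev(rewards)  # population std over the group
    if sigma == 0:
        # All responses scored identically: no learning signal.
        return [0.0 for _ in rewards]
    return [(r - mu) / sigma for r in rewards]
```

Each group here would be a set of responses sampled for the same role-play prompt and scored by the verifiable role-aware reward.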
Zongsheng Wang — Platform and Content Group, Tencent
Kaili Sun — Platform and Content Group, Tencent
Bowen Wu — School of Software & Microelectronics, Peking University, Beijing, China; Platform and Content Group, Tencent
Qun Yu — Platform and Content Group, Tencent
Ying Li — School of Software & Microelectronics, Peking University, Beijing, China
Baoxun Wang — Platform and Content Group, Tencent
Natural Language Processing · Deep Learning · Chat-Bot