🤖 AI Summary
This work addresses social bias mitigation in large language models, a setting where rewards are subjective and high-variance because bias has no single ground truth. To this end, we propose BiasGRPO, a framework that applies Group Relative Policy Optimization (GRPO) to bias mitigation. By replacing the learned value function with a baseline computed from rewards normalized within a group of sampled completions, BiasGRPO avoids reliance on potentially unreliable critic estimates while keeping the exploration benefits of online training. To adapt GRPO, we synthetically extend a dataset spanning multiple domains and contexts, and we build a custom bias reward model that is compute-efficient and avoids knowledge degradation. BiasGRPO outperforms both DPO and PPO across multiple benchmarks. The reward model is publicly released and can be integrated into multi-objective RLHF pipelines.
📝 Abstract
Mitigating social bias in Large Language Models (LLMs) presents a distinct alignment challenge: unlike verifiable tasks, bias lacks a single ground truth, creating a high-variance, subjective reward landscape. Previous preference-based fine-tuning methods involve major trade-offs: Direct Preference Optimization (DPO) is limited by the lack of exploration inherent in offline training, while Proximal Policy Optimization (PPO) can suffer from training instability due to potentially unreliable critic estimates. In this paper, we propose BiasGRPO, a framework using Group Relative Policy Optimization (GRPO) to stabilize alignment by normalizing rewards across a group of sampled completions. By substituting the value function with a group-relative baseline, our approach reduces instability while maintaining the exploration benefits of online training. We find that BiasGRPO outperforms DPO and PPO across multiple benchmarks. To adapt GRPO, we synthetically extend a dataset spanning multiple domains and contexts. We also create and release a custom bias reward model that guides generation while being compute-efficient and avoiding knowledge degradation. The reward model can be integrated into multi-objective RLHF pipelines.
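The group-relative baseline described in the abstract can be illustrated with a minimal sketch. It assumes the commonly used GRPO normalization, in which each completion's reward is centered on the group mean and scaled by the group standard deviation; the abstract does not spell out the exact formula, so the function name `group_relative_advantages`, the use of the population standard deviation, the `eps` stabilizer, and the example reward values are all illustrative assumptions rather than the authors' implementation.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of completions for the same prompt.

    Hypothetical helper: the group mean stands in for a learned critic's
    value estimate, so no value function is needed.
    """
    if not rewards:
        return []
    mu = mean(rewards)        # group baseline (replaces the critic)
    sigma = pstdev(rewards)   # population std; an assumption of this sketch
    # eps keeps the division defined when every completion scores the same.
    return [(r - mu) / (sigma + eps) for r in rewards]


# Made-up bias-reward scores for four sampled completions of one prompt.
rewards = [0.2, 0.9, 0.5, 0.4]
advantages = group_relative_advantages(rewards)

for r, a in zip(rewards, advantages):
    print(f"reward={r:.2f}  advantage={a:+.3f}")
```

Completions that score above the group mean receive positive advantages and those below receive negative ones, so the policy update depends only on how completions compare with each other rather than on the absolute scale of a subjective reward.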