Optimizing Safe and Aligned Language Generation: A Multi-Objective GRPO Approach

📅 2025-03-26
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Addressing the alignment challenge in large language models (LLMs), where conflicting objectives such as helpfulness, truthfulness, and safety must be jointly optimized, this paper proposes Group Relative Policy Optimization (GRPO). GRPO introduces a novel intra-group response contrastive learning mechanism grounded in a multi-label reward regression model, explicitly modeling and jointly optimizing multiple alignment dimensions without requiring a value network or critic, thereby significantly reducing computational overhead. The authors provide theoretical justification for the effectiveness of multi-dimensional score fusion. Empirically, GRPO consistently outperforms PPO-RLHF and DPO across SafetyBench, HelpSteer2, and other benchmarks on 0.5B-, 7B-, and 14B-parameter models, achieving simultaneous improvements in both safety and helpfulness. All code and trained models are publicly released.
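The critic-free, group-relative update the summary describes can be sketched as within-group reward normalization: each sampled response's advantage is its reward relative to the group mean, scaled by the group's standard deviation. This is a minimal illustration under those stated assumptions, not the paper's exact implementation, and the function name is hypothetical:

```python
import statistics

def group_relative_advantages(rewards):
    """Critic-free advantage estimates for one group of sampled responses:
    mean-center each reward and scale by the group's standard deviation,
    so no separate value network is needed (as in GRPO-style training)."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # guard against zero variance
    return [(r - mean) / std for r in rewards]
```

Responses scoring above the group average receive positive advantages and are reinforced; below-average responses are penalized, all without fitting a critic.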

📝 Abstract
Aligning large language models (LLMs) with human values and safety constraints is challenging, especially when objectives like helpfulness, truthfulness, and avoidance of harm conflict. Reinforcement Learning from Human Feedback (RLHF) has achieved notable success in steering models, but is complex and can be unstable. Recent approaches such as Direct Preference Optimization (DPO) simplify preference-based fine-tuning but may introduce bias or trade off certain objectives. In this work, we propose a Group Relative Policy Optimization (GRPO) framework with a multi-label reward regression model to achieve safe and aligned language generation. The GRPO algorithm optimizes a policy by comparing groups of sampled responses, eliminating the need for a separate value critic and improving training efficiency. We train a reward model to predict multiple alignment scores (e.g., safety and helpfulness), which are combined into a single reward signal. We provide a theoretical derivation for using this learned multi-aspect reward within GRPO and discuss its advantages and limitations. Empirically, our approach improves all evaluated safety and quality metrics in language generation tasks across model scales (0.5B, 7B, and 14B parameters), demonstrating a robust balance of objectives. We compare GRPO to PPO-based RLHF and DPO, highlighting that GRPO achieves alignment with significantly lower computational cost and explicit multi-objective handling. We will open-source all trained models at https://huggingface.co/hydroxai.
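The abstract's "combined into a single reward signal" step can be sketched as a weighted sum over per-dimension scores from the multi-label reward model. The dimension names, weights, and function name below are illustrative assumptions, since the abstract does not specify the exact fusion scheme:

```python
def fuse_rewards(scores, weights):
    """Fuse per-dimension alignment scores (e.g., safety, helpfulness)
    predicted by a multi-label reward model into one scalar reward
    via a weighted sum. Weights are hypothetical, not from the paper."""
    assert scores.keys() == weights.keys(), "score/weight dimensions must match"
    return sum(weights[k] * scores[k] for k in scores)

# Example: safety weighted more heavily than helpfulness.
reward = fuse_rewards(
    {"safety": 0.9, "helpfulness": 0.7},
    {"safety": 0.6, "helpfulness": 0.4},
)  # 0.6*0.9 + 0.4*0.7 = 0.82
```

A linear fusion like this keeps the scalar reward compatible with standard policy-gradient updates while letting practitioners rebalance objectives by adjusting weights rather than retraining the reward model.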
Problem

Research questions and friction points this paper is trying to address.

Aligning LLMs with human values and safety constraints
Balancing conflicting objectives like helpfulness and truthfulness
Reducing computational cost in multi-objective optimization
Innovation

Methods, ideas, or system contributions that make the work stand out.

Group Relative Policy Optimization (GRPO) framework
Multi-label reward regression model
Efficient multi-objective alignment handling
🔎 Similar Papers
2024-02-05 · Annual Meeting of the Association for Computational Linguistics · Citations: 37