SAW: Stage-Aware Dynamic Weighting for Multi-Objective Reinforcement Learning in Large Language Models

📅 2026-06-05

🤖 AI Summary

This work addresses a key limitation of multi-objective reinforcement learning: static reward weighting ignores the fact that objectives are learned asynchronously, so noise from already-converged objectives interferes with the high-value signals of under-trained ones. To mitigate this, the authors propose SAW, a lightweight, algorithm-agnostic dynamic weighting mechanism that, according to the summary, is the first to make each reward dimension's weight depend on its real-time information content. SAW uses the coefficient of variation as a scale-invariant proxy for information content and adjusts per-objective reward or advantage weights from batch statistics alone, adding negligible computational overhead. Evaluated within the GRPO and GDPO frameworks, SAW consistently improves both training efficiency and final performance on tool-use and text summarization tasks, supporting its use as a general-purpose plug-in for aligning large language models with multiple reward signals.

📝 Abstract

Although multi-objective reinforcement learning (MORL) is central to aligning large language models with complex human preferences, the prevailing practice of static weighted summation overlooks a more fundamental phenomenon: reward learning is markedly asynchronous across objectives. Well-learned dimensions quickly produce homogeneous, low-variance signals whose residual noise contaminates the aggregated reward (in GRPO) or occupies a fixed share of the advantage budget (in GDPO), interfering with the scarce yet high-value signals carried by under-learned dimensions. To address this asynchrony, we propose Stage-Aware Dynamic Weighting (SAW), a lightweight, algorithm-agnostic dynamic weighting mechanism. SAW utilizes the coefficient of variation (CV) as a scale-invariant proxy for real-time informativeness, reweighting each dimension's reward or advantage contribution by its relative informativeness within the batch. Unlike gradient-based methods that require multiple forward and backward passes, SAW relies solely on batch-level statistics, introducing nearly negligible computational overhead. Experiments on tool-calling and text summarization tasks demonstrate that SAW consistently improves both training efficiency and final performance under both GRPO and GDPO frameworks, confirming it as a general-purpose plug-in for multi-reward LLM alignment. Our code is available at https://github.com/Zhaolutuan/SAW
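The weighting idea described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation (see the linked repository for that). The abstract states only that the coefficient of variation serves as the informativeness proxy and that each dimension is reweighted by its relative informativeness within the batch. The function names, the sum-to-one normalization, the `eps` stabilizer, and the uniform fallback below are all assumptions made for illustration.

```python
import numpy as np


def cv_weights(rewards, eps=1e-8):
    """Per-objective weights from batch statistics (illustrative only).

    rewards: array of shape (batch, K), one column per reward dimension.
    Returns K non-negative weights that sum to 1.
    """
    rewards = np.asarray(rewards, dtype=float)
    mean = rewards.mean(axis=0)
    std = rewards.std(axis=0)
    # Coefficient of variation: std relative to |mean|, hence scale-invariant.
    cv = std / (np.abs(mean) + eps)
    total = cv.sum()
    if total <= eps:
        # Assumed fallback: no dimension varies, so weight them uniformly.
        return np.full(rewards.shape[1], 1.0 / rewards.shape[1])
    # Assumed normalization: relative informativeness within the batch.
    return cv / total


def weighted_reward(rewards):
    """Reward-level variant: aggregate the raw rewards with CV weights."""
    rewards = np.asarray(rewards, dtype=float)
    return rewards @ cv_weights(rewards)


def weighted_advantage(rewards, eps=1e-8):
    """Advantage-level variant: standardize each dimension, then weight."""
    rewards = np.asarray(rewards, dtype=float)
    adv = (rewards - rewards.mean(axis=0)) / (rewards.std(axis=0) + eps)
    return adv @ cv_weights(rewards)


if __name__ == "__main__":
    # Column 0 is nearly converged (low variance); column 1 is still noisy.
    batch = [[0.9, 0.0], [1.0, 1.0], [1.1, 0.0], [1.0, 1.0]]
    print(cv_weights(batch))
```

In this toy batch the nearly converged first column receives a small weight and the still-varying second column dominates, which mirrors the behavior the abstract describes. Because the coefficient of variation divides by the mean, rescaling one reward dimension by a constant leaves the weights essentially unchanged.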

Problem

Research questions and friction points this paper is trying to address.

multi-objective reinforcement learning

reward asynchrony

large language models

reward weighting

preference alignment

Innovation

Methods, ideas, or system contributions that make the work stand out.

Stage-Aware Dynamic Weighting

multi-objective reinforcement learning

coefficient of variation