ShapE-GRPO: Shapley-Enhanced Reward Allocation for Multi-Candidate LLM Training

📅 2026-03-31
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses a key limitation in existing multi-candidate training methods for large language models, where uniform reward assignment across all candidates leads to weak candidates free-riding on stronger ones, resulting in noisy training signals and inefficient exploration. To overcome this, the study introduces Shapley values from cooperative game theory into the GRPO framework, proposing a theoretically grounded reward decomposition mechanism that satisfies the Shapley axioms and yields fine-grained, individualized rewards for each candidate. The method incorporates a polynomial-time approximation algorithm, ensuring both theoretical rigor and computational feasibility. Experimental results demonstrate that the proposed approach significantly outperforms standard GRPO across multiple datasets, accelerating training convergence and enhancing the overall utility of the generated candidate set.
📝 Abstract
In user-agent interaction scenarios such as recommendation, brainstorming, and code suggestion, Large Language Models (LLMs) often generate sets of candidate recommendations where the objective is to maximize the collective utility of the entire set rather than the utility of each candidate independently. However, existing reinforcement learning post-training paradigms, such as Group Relative Policy Optimization (GRPO), typically assign the same set-level scalar reward to every candidate in the set. This leads to noisy training signals in which poor candidates free-ride on the high reward produced by a single strong peer, resulting in suboptimal exploration. To address this, we propose Shapley-Enhanced GRPO (ShapE-GRPO). By leveraging the permutation-invariant nature of set-level utility, we derive a Shapley-enhanced formulation from cooperative game theory to decompose set-level rewards into granular, candidate-specific signals. We show that our formulation preserves the fundamental axioms of the Shapley value while remaining computationally efficient with polynomial-time complexity. Empirically, ShapE-GRPO consistently outperforms standard GRPO across diverse datasets with accelerated convergence during training.
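The paper does not spell out its approximation algorithm here, but the general idea of decomposing a set-level reward into per-candidate Shapley values can be sketched with standard Monte Carlo permutation sampling, which gives an unbiased polynomial-time estimate of the exact Shapley value. The function names and the toy "max quality" utility below are our own illustrative assumptions, not the paper's method:

```python
import random

def shapley_rewards(candidates, set_utility, num_permutations=200, seed=0):
    """Monte Carlo Shapley estimate (illustrative sketch, not the paper's algorithm).

    `set_utility` maps a subset (tuple) of candidates to a scalar reward.
    Averaging each candidate's marginal contribution over random orderings
    approximates its Shapley value in O(num_permutations * |candidates|)
    utility evaluations.
    """
    rng = random.Random(seed)
    values = {c: 0.0 for c in candidates}
    for _ in range(num_permutations):
        order = list(candidates)
        rng.shuffle(order)
        coalition = []
        prev = set_utility(tuple(coalition))  # utility of the empty set
        for c in order:
            coalition.append(c)
            cur = set_utility(tuple(coalition))
            values[c] += cur - prev  # marginal contribution of c
            prev = cur
    return {c: v / num_permutations for c, v in values.items()}

# Hypothetical free-riding scenario: set utility is the best candidate's quality,
# so under uniform set-level rewards the weak candidates "b" and "c" would
# receive the same credit as the strong candidate "a".
quality = {"a": 1.0, "b": 0.2, "c": 0.1}
util = lambda subset: max((quality[c] for c in subset), default=0.0)
rewards = shapley_rewards(["a", "b", "c"], util)
```

By the efficiency axiom the per-candidate rewards sum exactly to the utility of the full set (here 1.0, by telescoping within each permutation), while the strong candidate "a" receives far more credit than the free-riders.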
Problem

Research questions and friction points this paper is trying to address.

reward allocation
multi-candidate LLM training
collective utility
free-riding
set-level reward
Innovation

Methods, ideas, or system contributions that make the work stand out.

Shapley value
reward decomposition
multi-candidate LLM training
GRPO
cooperative game theory
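Standard GRPO computes a group-relative advantage by normalizing rewards within the candidate group (mean-subtract, divide by standard deviation). A minimal sketch of where candidate-specific rewards would slot in, replacing the identical set-level scalar that every candidate otherwise receives (function name is ours):

```python
import statistics

def grpo_advantages(rewards, eps=1e-8):
    """Group-relative advantages as in standard GRPO: normalize each
    candidate's reward against the group mean and standard deviation.
    With a uniform set-level reward every advantage collapses to zero;
    per-candidate (e.g. Shapley-decomposed) rewards yield distinct signals.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

# Per-candidate rewards (illustrative values): the strong candidate gets a
# positive advantage, the free-riders get negative ones.
advs = grpo_advantages([0.88, 0.08, 0.03])
```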