A Regret Minimization Framework on Preference Learning in Large Language Models

📅 2026-06-08

🤖 AI Summary

This work addresses how to align large language models with human feedback on language tasks that lack reliable automatic verifiers. The authors propose Regret-based Preference Optimization (RePO), which reframes preference learning as regret minimization rather than reward maximization. RePO models a preference as a behavior-conditioned assessment of relative suboptimality, reflecting the observation that human judgments are often shaped by prospective anticipation of outcomes and counterfactual comparison to alternative behaviors. Experiments on mathematical reasoning benchmarks and human preference datasets report consistent performance gains, which the authors take as evidence that RePO is an effective and human-aligned training approach.

📝 Abstract

Reinforcement learning with verifiable rewards (RLVR) has enabled progress on reasoning-intensive tasks by relying on task-specific verifiers that provide automated correctness signals. However, many realistic language tasks are difficult to equip with reliable verifiers, motivating a growing reliance on reinforcement learning from human feedback (RLHF). In this setting, we argue that a closer examination of how human feedback should be interpreted is essential. We introduce Regret-based Preference Optimization (RePO), which reframes RLHF through regret minimization rather than reward maximization. Human preferences are often shaped by prospective anticipation of outcomes and counterfactual comparisons to alternative behaviors, rather than by immediate, outcome-independent utility. RePO captures this structure by modeling preferences as behavior-conditioned assessments of relative suboptimality. Experiments on mathematical reasoning benchmarks and human preference datasets demonstrate consistent performance gains, indicating that RePO is an effective and human-aligned approach for training large language models.
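
The abstract does not state the RePO objective, so the following is only a toy sketch of the general distinction it draws between reward-based and regret-based readings of a preference. It is not the authors' method. All function names, the logistic (Bradley-Terry style) link, and the numbers are assumptions chosen for illustration. Regret is taken here to be the gap between what a response achieved and the best alternative assumed to be available in its own context.

```python
import math


def sigmoid(x: float) -> float:
    """Standard logistic function."""
    return 1.0 / (1.0 + math.exp(-x))


def reward_preference(reward_a: float, reward_b: float) -> float:
    """P(A preferred over B) when preferences track raw reward (assumed logistic link)."""
    return sigmoid(reward_a - reward_b)


def regret(achieved: float, best_available: float) -> float:
    """Shortfall of a response relative to the best alternative in its own context."""
    return best_available - achieved


def regret_preference(regret_a: float, regret_b: float) -> float:
    """P(A preferred over B) when preferences track lower regret (assumed logistic link)."""
    return sigmoid(regret_b - regret_a)


# Hypothetical values: B scores higher in absolute terms, but A is closer
# to the best outcome that was assumed to be reachable in its context.
regret_a = regret(achieved=0.6, best_available=0.7)
regret_b = regret(achieved=0.8, best_available=1.5)

p_reward = reward_preference(0.6, 0.8)           # below 0.5: reward view favors B
p_regret = regret_preference(regret_a, regret_b)  # above 0.5: regret view favors A
```

Under these made-up numbers the two readings disagree about which response is preferred, which is the kind of gap a counterfactual, behavior-conditioned interpretation of feedback is meant to expose.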

Problem

Research questions and friction points this paper is trying to address.

preference learning

human feedback

regret minimization

large language models

reinforcement learning

Innovation

Methods, ideas, or system contributions that make the work stand out.

regret minimization

preference learning

reinforcement learning from human feedback