🤖 AI Summary
To address likelihood displacement and over-optimization in direct preference alignment (e.g., DPO, SimPO) caused by suboptimal reward function geometry, this paper proposes an α-parameterized reward shaping method: a tunable curvature parameter α modulates the nonlinearity of the reward function, enabling fine-grained control over the policy's output probability distribution. The authors provide a systematic analysis revealing the impact of reward function geometry—particularly its curvature—on LLM alignment performance, thereby challenging the conventional assumption of a fixed logarithmic reward. Empirical evaluation on instruction-tuning tasks using the instruct versions of Mistral-7B and Llama3-8B shows consistent improvements: alignment performance increases by about 7–10% relative to SimPO, with significant mitigation of likelihood displacement.
📝 Abstract
Reinforcement Learning with Human Feedback (RLHF) and its variants have made significant strides toward the effective alignment of large language models (LLMs) to follow instructions and reflect human values. More recently, Direct Alignment Algorithms (DAAs) have emerged, in which the reward modeling stage of RLHF is skipped by characterizing the reward directly as a function of the policy being learned. Examples include Direct Preference Optimization (DPO) and Simple Preference Optimization (SimPO). These methods often suffer from likelihood displacement, a phenomenon by which the probabilities of preferred responses are undesirably reduced. In this paper, we argue that, for DAAs, the shape of the reward function matters. We introduce AlphaPO, a new DAA method that leverages an $\alpha$-parameter to change the shape of the reward function beyond the standard log reward. AlphaPO helps maintain fine-grained control over likelihood displacement and over-optimization. Compared to SimPO, one of the best-performing DAAs, AlphaPO leads to about 7% to 10% relative improvement in alignment performance for the instruct versions of Mistral-7B and Llama3-8B. The analysis and results presented highlight the importance of the reward shape, and how one can systematically change it to affect training dynamics as well as improve alignment performance.
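To make the idea of an $\alpha$-shaped reward concrete, here is a minimal sketch of a Box-Cox-style generalization of the log function, in which a curvature parameter `alpha` interpolates between the standard log reward (as `alpha` approaches 0) and more or less concave alternatives. This is an illustrative family only; the exact parameterization and sign conventions used by AlphaPO are defined in the paper, and `alpha_log`, `beta`, and `avg_prob` here are hypothetical names, not the paper's notation.

```python
import math

def alpha_log(p: float, alpha: float) -> float:
    """Box-Cox-style generalized log: (p**alpha - 1) / alpha.

    Recovers math.log(p) in the limit alpha -> 0; larger |alpha|
    bends the curvature of the reward away from the log shape.
    Illustrative only -- not AlphaPO's exact parameterization.
    """
    if abs(alpha) < 1e-12:
        return math.log(p)
    return (p ** alpha - 1.0) / alpha

# A SimPO-style length-normalized reward uses beta * log(avg_prob);
# an alpha-shaped variant would replace log with the family above.
beta = 2.0       # hypothetical reward-scaling hyperparameter
avg_prob = 0.6   # hypothetical per-token average probability of a response

log_reward = beta * alpha_log(avg_prob, 0.0)    # standard log reward
shaped_reward = beta * alpha_log(avg_prob, 0.5) # alpha-shaped reward
```

Note that for any `alpha`, `alpha_log(1.0, alpha) == 0.0`, matching the log reward at probability 1; varying `alpha` changes only the curvature of the reward as the probability moves away from 1.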