🤖 AI Summary
This paper investigates the fundamental differences between DPO (supervised preference optimization) and PPO (reinforcement-learning-based preference optimization) for aligning large language models with human preferences. We analyze gradient-direction stability, perform component-wise ablation experiments, and characterize how token-level advantages correlate with loss reweighting. Framed through the lens of optimization dynamics, our analysis reveals that DPO implicitly assumes static targets and induces regularization, whereas PPO relies on dynamic advantage estimation and uses negative samples to drive exploration. We further show that loss reweighting acts as implicit regularization, while negative learning enhances policy-space exploration. Ablation studies confirm that explicitly controlling these optimization dynamics improves both training efficiency and alignment performance. This work establishes a novel theoretical framework for preference optimization and provides principled guidance for designing more efficient and robust alignment algorithms.
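As a concrete illustration of the DPO-side claim (static targets with an implicit regularizer), the standard per-pair DPO objective can be sketched as below. This is not the paper's code; the function name, the example log-probabilities, and the `beta` value are illustrative. The scalar weight on the gradient shrinks as the policy already prefers the chosen response, which is the regularization effect described above.

```python
import math

def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))

def dpo_loss_and_weight(logp_w: float, logp_l: float,
                        ref_logp_w: float, ref_logp_l: float,
                        beta: float = 0.1):
    """DPO loss for one preference pair, plus the scalar gradient weight.

    margin_* are the implicit rewards (policy log-prob minus reference
    log-prob). The gradient of the loss scales both the positive-learning
    term (on the chosen response y_w) and the negative-learning term (on
    the rejected response y_l) by sigma(beta * (margin_l - margin_w)),
    which decays toward 0 once the pair is well separated -- an implicit
    regularizer rather than a reward signal.
    """
    margin_w = logp_w - ref_logp_w  # implicit reward of chosen response
    margin_l = logp_l - ref_logp_l  # implicit reward of rejected response
    loss = -math.log(sigmoid(beta * (margin_w - margin_l)))
    grad_weight = sigmoid(beta * (margin_l - margin_w))
    return loss, grad_weight
```

For example, a pair the policy already ranks correctly by a wide margin yields a smaller `grad_weight` than a barely separated pair, so further updates on it are damped.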
📝 Abstract
Preference optimization (PO) is indispensable for aligning large language models (LLMs), with methods such as direct preference optimization (DPO) and proximal policy optimization (PPO) achieving great success. A common belief is that DPO is supervised learning while PPO is reinforcement learning, yet deeper analyses of the reasons underlying this difference remain lacking. To fill this gap, we analyze their optimization dynamics, revealing distinct algorithmic behaviors and explaining their underlying causes. First, we examine the target directions of gradient-based updates and find that DPO follows stable targets, whereas PPO follows dynamic targets that balance exploration and exploitation, thus validating the common belief from a new perspective. Second, we examine the roles of positive learning, negative learning, and loss reweighting, three key components of PO methods. Our analyses reveal that these components play markedly different roles. In DPO, positive and negative learning jointly shape the learning targets while mutually offsetting each other, and loss reweighting acts less as a reward signal than as a regularizer that mitigates overfitting. In PPO, negative learning primarily supports exploration rather than determining the targets, while loss reweighting, tied to the absolute values of token-level advantages, reflects the distinct roles that token groups play in updating the targets. Given these findings, we conduct carefully designed ablation studies to examine how controlling these dynamics impacts optimization efficiency and practical performance. The insights gained from our analyses not only deepen the understanding of PO methods but also inspire the development of more preference-aligned LLMs.
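To make the PPO-side observation concrete, the standard clipped surrogate for a single token can be sketched as follows. This is a minimal illustration, not the paper's implementation; the function name and numeric values are assumptions. The sign of the token-level advantage determines whether the token contributes positive or negative learning, while its absolute value reweights how strongly that token influences the update, which is the token-group reweighting the abstract refers to.

```python
import math

def ppo_token_surrogate(logp_new: float, logp_old: float,
                        advantage: float, clip_eps: float = 0.2) -> float:
    """Clipped PPO surrogate objective for a single token.

    ratio compares the current policy to the behavior policy that sampled
    the token. The surrogate is linear in the advantage: its sign selects
    positive vs. negative learning, and |advantage| scales the token's
    weight in the loss. Clipping (and taking the pessimistic min) bounds
    how far the ratio can push any single update.
    """
    ratio = math.exp(logp_new - logp_old)
    unclipped = ratio * advantage
    clipped = max(min(ratio, 1.0 + clip_eps), 1.0 - clip_eps) * advantage
    return min(unclipped, clipped)
```

At `ratio = 1` the surrogate equals the advantage itself, so doubling `|advantage|` doubles the token's contribution; once the ratio leaves the clip band, the pessimistic branch takes over and further movement in that direction is not rewarded.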