Multi-Objective Preference Optimization: Improving Human Alignment of Generative Models

📅 2025-05-16
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the alignment of generative models with multiple human preference objectives—such as helpfulness and harmlessness—where conflicting goals complicate optimization. We propose MOPO, the first multi-objective preference optimization framework, which formulates alignment as a KL-regularized multi-objective optimization problem subject to safety threshold constraints. MOPO requires neither pointwise reward assumptions nor prompt engineering, enables closed-form iterative updates, and provably converges to the Pareto frontier. Crucially, it provides the first explicit, interpretable modeling of trade-offs among conflicting objectives, and is scalable to large language models. On synthetic benchmarks and real human preference data (using a 1.3B-parameter model), MOPO achieves significant improvements across multiple reward dimensions, demonstrates clear Pareto superiority over baselines, and exhibits robust training dynamics and hyperparameter insensitivity.
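As a rough, non-authoritative sketch of the formulation described above (reconstructed from this summary, not taken from the paper), the constrained objective can be read as maximizing a primary objective under KL regularization while lower-bounding the secondary objectives:

```latex
\max_{\pi}\;
  \mathbb{E}_{x \sim \mathcal{D},\; y \sim \pi(\cdot \mid x)}\bigl[\, r_1(x, y) \,\bigr]
  \;-\; \beta\, \mathrm{KL}\bigl(\pi \,\|\, \pi_{\mathrm{ref}}\bigr)
\quad \text{s.t.} \quad
  \mathbb{E}_{x \sim \mathcal{D},\; y \sim \pi(\cdot \mid x)}\bigl[\, r_j(x, y) \,\bigr] \;\ge\; b_j,
  \qquad j = 2, \dots, m.
```

Here r_1 stands for the primary objective (e.g., helpfulness), r_j for the secondary objectives (e.g., harmlessness), b_j for the tunable safety thresholds, π_ref for the reference policy, and β for the KL weight; all of these symbols are placeholders. Since MOPO works from pairwise preferences rather than pointwise rewards, the paper's actual objective is expressed differently; the display only conveys the "maximize one objective subject to lower bounds on the rest, regularized toward the reference policy" structure.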

📝 Abstract
Post-training of LLMs with RLHF, and subsequently preference optimization algorithms such as DPO, IPO, etc., made a big difference in improving human alignment. However, all such techniques can only work with a single (human) objective. In practice, human users have multiple objectives, such as helpfulness and harmlessness, and there is no natural way to aggregate them into a single objective. In this paper, we address the multi-objective preference-alignment problem, where a policy must optimize several, potentially conflicting, objectives. We introduce the Multi-Objective Preference Optimization (MOPO) algorithm, which frames alignment as a constrained KL-regularized optimization: the primary objective is maximized while secondary objectives are lower-bounded by tunable safety thresholds. Unlike prior work, MOPO operates directly on pairwise preference data, requires no point-wise reward assumption, and avoids heuristic prompt-context engineering. The method recovers policies on the Pareto front whenever the front is attainable; practically, it reduces to simple closed-form iterative updates suitable for large-scale training. On synthetic benchmarks with diverse canonical preference structures, we show that MOPO approximates the Pareto front. When fine-tuning a 1.3B-parameter language model on real-world human-preference datasets, MOPO attains higher rewards and yields policies that Pareto-dominate baselines; ablation studies confirm optimization stability and robustness to hyperparameters.
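The following is a minimal, hypothetical sketch (not the paper's actual MOPO update) of how a constrained, KL-regularized objective over pairwise preference data might be assembled in practice: one DPO-style implicit-reward margin per objective, with secondary objectives entering through Lagrange-style multipliers that would be raised whenever their safety thresholds are violated. All function names, the `beta` and `lambdas` parameters, and the multiplier handling are assumptions for illustration only.

```python
import torch
import torch.nn.functional as F

def pairwise_margin(logp_chosen, logp_rejected, ref_chosen, ref_rejected, beta):
    """DPO-style implicit-reward margin for one objective's preference pairs."""
    return beta * ((logp_chosen - ref_chosen) - (logp_rejected - ref_rejected))

def constrained_preference_loss(margins, lambdas):
    """Combine per-objective pairwise losses.

    margins: dict mapping objective name -> tensor of margins over a batch.
    lambdas: dict mapping secondary objective name -> non-negative multiplier.
    The primary objective gets weight 1; each secondary objective is weighted
    by its multiplier, which a full method would adjust whenever the matching
    safety threshold is violated (that update is not shown here).
    """
    loss = -F.logsigmoid(margins["primary"]).mean()
    for name, lam in lambdas.items():
        loss = loss - lam * F.logsigmoid(margins[name]).mean()
    return loss

# Toy usage: random numbers stand in for policy / reference log-probabilities.
torch.manual_seed(0)
n_pairs = 8
def fake_margins():
    return pairwise_margin(torch.randn(n_pairs), torch.randn(n_pairs),
                           torch.randn(n_pairs), torch.randn(n_pairs), beta=0.1)

margins = {"primary": fake_margins(), "harmlessness": fake_margins()}
lambdas = {"harmlessness": 0.5}  # would be tuned against a safety threshold
print(constrained_preference_loss(margins, lambdas))
```

A real implementation would compute the log-probabilities from the policy and a frozen reference model over chosen/rejected responses and adjust the multipliers from measured constraint violations; the paper instead reports closed-form iterative updates, which this sketch does not attempt to reproduce.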
Problem

Research questions and friction points this paper is trying to address.

Optimizing multiple, potentially conflicting human objectives (e.g., helpfulness vs. harmlessness) in generative models
Addressing the lack of a natural way to aggregate diverse human preferences into a single objective
Improving alignment without pointwise reward assumptions or heuristic prompt engineering
Innovation

Methods, ideas, or system contributions that make the work stand out.

Introduces the Multi-Objective Preference Optimization (MOPO) algorithm
Frames alignment as constrained, KL-regularized optimization with tunable safety thresholds
Operates directly on pairwise preference data, with no pointwise reward model (see the sketch below)
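To make the last point concrete, here is an illustrative example of what a multi-objective pairwise preference record could look like; this is hypothetical, and the datasets used in the paper may encode preferences differently.

```python
# Illustrative only (not taken from the paper): one multi-objective pairwise
# preference record. Each objective supplies its own "which response wins"
# label for the same prompt, so no scalar reward is ever attached.
example_pair = {
    "prompt": "How do I dispose of old medication safely?",
    "response_a": "Take it to a pharmacy take-back program ...",
    "response_b": "Just flush everything down the toilet.",
    "preferences": {          # per-objective pairwise labels
        "helpfulness": "a",
        "harmlessness": "a",
    },
}
```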