🤖 AI Summary
This work addresses the alignment of generative models with multiple human preference objectives—such as helpfulness and harmlessness—where conflicting goals complicate optimization. We propose MOPO, the first multi-objective preference optimization framework, which formulates alignment as a KL-regularized multi-objective optimization problem subject to safety threshold constraints. MOPO requires neither pointwise reward assumptions nor prompt engineering, enables closed-form iterative updates, and provably converges to the Pareto frontier. Crucially, it provides the first explicit, interpretable modeling of trade-offs among conflicting objectives, and is scalable to large language models. On synthetic benchmarks and real human preference data (using a 1.3B-parameter model), MOPO achieves significant improvements across multiple reward dimensions, demonstrates clear Pareto superiority over baselines, and exhibits robust training dynamics and hyperparameter insensitivity.
📝 Abstract
Post-training of LLMs with RLHF, and subsequently with preference-optimization algorithms such as DPO and IPO, has substantially improved human alignment. However, all such techniques optimize only a single (human) objective. In practice, human users have multiple objectives, such as helpfulness and harmlessness, and there is no natural way to aggregate them into a single objective. In this paper, we address the multi-objective preference-alignment problem, where a policy must optimize several potentially conflicting objectives. We introduce the Multi-Objective Preference Optimization (MOPO) algorithm, which frames alignment as a constrained KL-regularized optimization: the primary objective is maximized while secondary objectives are lower-bounded by tunable safety thresholds. Unlike prior work, MOPO operates directly on pairwise preference data, requires no pointwise reward assumption, and avoids heuristic prompt-context engineering. The method recovers policies on the Pareto front whenever the front is attainable; practically, it reduces to simple closed-form iterative updates suitable for large-scale training. On synthetic benchmarks with diverse canonical preference structures, we show that MOPO approximates the Pareto front. When fine-tuning a 1.3B-parameter language model on real-world human-preference datasets, MOPO attains higher rewards and yields policies that Pareto-dominate baselines; ablation studies confirm optimization stability and robustness to hyperparameters.
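The constrained formulation described in the abstract can be sketched as follows. The notation here is illustrative rather than the paper's own: \(J_k(\pi)\) denotes the value of objective \(k\) under policy \(\pi\), \(\pi_{\mathrm{ref}}\) is the reference (pre-trained) policy, \(\beta\) the KL-regularization strength, and \(\tau_k\) the tunable safety thresholds on the secondary objectives:

```latex
% Primary objective maximized under a KL anchor to the reference policy,
% with each secondary objective lower-bounded by its safety threshold.
\max_{\pi} \; J_1(\pi) \;-\; \beta \, \mathrm{KL}\!\left(\pi \,\|\, \pi_{\mathrm{ref}}\right)
\quad \text{s.t.} \quad J_k(\pi) \,\ge\, \tau_k, \qquad k = 2, \dots, m.
```

Raising a threshold \(\tau_k\) trades primary-objective performance for a stronger guarantee on objective \(k\), which is how the framework exposes the trade-offs among conflicting objectives in an explicit, interpretable way.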