🤖 AI Summary
This work addresses the alignment of generative models with multiple human preference objectives—such as helpfulness and harmlessness—where conflicting goals complicate optimization. We propose MOPO, the first multi-objective preference optimization framework, which formulates alignment as a KL-regularized multi-objective optimization problem subject to safety threshold constraints. MOPO requires neither pointwise reward assumptions nor prompt engineering, enables closed-form iterative updates, and provably converges to the Pareto frontier. Crucially, it provides the first explicit, interpretable modeling of trade-offs among conflicting objectives, and is scalable to large language models. On synthetic benchmarks and real human preference data (using a 1.3B-parameter model), MOPO achieves significant improvements across multiple reward dimensions, demonstrates clear Pareto superiority over baselines, and exhibits robust training dynamics and hyperparameter insensitivity.
📝 Abstract
Post-training of LLMs with RLHF, and subsequently with preference-optimization algorithms such as DPO and IPO, has substantially improved human alignment. However, all such techniques optimize only a single (human) objective. In practice, human users have multiple objectives, such as helpfulness and harmlessness, and there is no natural way to aggregate them into a single objective. In this paper, we address the multi-objective preference-alignment problem, where a policy must optimize several potentially conflicting objectives. We introduce the Multi-Objective Preference Optimization (MOPO) algorithm, which frames alignment as a constrained KL-regularized optimization: the primary objective is maximized while secondary objectives are lower-bounded by tunable safety thresholds. Unlike prior work, MOPO operates directly on pairwise preference data, requires no pointwise reward assumption, and avoids heuristic prompt-context engineering. The method recovers policies on the Pareto front whenever the front is attainable; practically, it reduces to simple closed-form iterative updates suitable for large-scale training. On synthetic benchmarks with diverse canonical preference structures, we show that MOPO approximates the Pareto front. When fine-tuning a 1.3B-parameter language model on real-world human-preference datasets, MOPO attains higher rewards and yields policies that Pareto-dominate baselines; ablation studies confirm optimization stability and robustness to hyperparameters.
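The constrained formulation described in the abstract can be sketched as follows. The notation here is illustrative rather than the paper's own: \(J_k(\pi)\) denotes the value of objective \(k\) under policy \(\pi\), \(\pi_{\mathrm{ref}}\) is the reference (pre-trained) policy, \(\beta\) the KL-regularization strength, and \(\tau_k\) the tunable safety thresholds on the secondary objectives:

```latex
% Primary objective maximized under a KL anchor to the reference policy,
% with each secondary objective lower-bounded by its safety threshold.
\max_{\pi} \; J_1(\pi) \;-\; \beta \, \mathrm{KL}\!\left(\pi \,\|\, \pi_{\mathrm{ref}}\right)
\quad \text{s.t.} \quad J_k(\pi) \,\ge\, \tau_k, \qquad k = 2, \dots, m.
```

Raising a threshold \(\tau_k\) trades primary-objective performance for a stronger guarantee on objective \(k\), which is how the framework exposes the trade-offs among conflicting objectives in an explicit, interpretable way.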