Meta-Learning Objectives for Preference Optimization

📅 2024-11-10
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study addresses the high evaluation cost, substantial noise, and poor generalizability of evaluating Preference Optimization (PO) algorithms for aligning large language models (LLMs). To this end, the authors build a lightweight, low-noise, and controllable diagnostic benchmark for PO from MuJoCo tasks and datasets. Methodologically, they propose a meta-learning approach to PO objective design: a family of mirror-descent-based algorithms, Mirror Preference Optimization (MPO), searched with evolutionary strategies to automatically discover data-adaptive PO algorithms specialized to dataset properties such as mixed-quality or noisy preferences. Key contributions: (1) the MuJoCo-based benchmark makes PO algorithm evaluation substantially cheaper and more reliable; (2) a PO algorithm designed from insights gained on this benchmark significantly outperforms existing baselines on an LLM alignment task.
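To make the MPO idea concrete, the sketch below shows a generic per-pair preference loss in which DPO's `-log sigmoid` link is one member of a wider family of convex link functions, the kind of parameterized objective space a mirror-descent-inspired search could explore. This is a hypothetical illustration: the function name, signature, and default link are assumptions, not the paper's actual formulation.

```python
import math

def preference_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1, f=None):
    """Loss for one (chosen, rejected) pair under a generic link function f.

    Hypothetical sketch: DPO corresponds to f(x) = -log(sigmoid(x));
    an MPO-style family would swap in other convex links derived from
    a mirror map. Names and defaults here are illustrative only.
    """
    if f is None:
        # Default to the DPO link: -log(sigmoid(margin)) = log(1 + e^{-margin}).
        f = lambda x: math.log(1.0 + math.exp(-x))
    # Implicit-reward margin between chosen and rejected responses,
    # measured relative to the frozen reference policy.
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return f(margin)
```

Swapping `f` changes how aggressively the loss penalizes small or negative margins, which is one plausible knob for adapting an objective to noisy or mixed-quality preference data.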

📝 Abstract
Evaluating preference optimization (PO) algorithms on LLM alignment is a challenging task with prohibitive costs, high noise, and many confounding variables such as model size and hyper-parameters. In this work, we show that it is possible to gain insights into the efficacy of PO algorithms on much simpler benchmarks. We design a diagnostic suite of MuJoCo tasks and datasets, which we use to systematically evaluate PO algorithms, establishing a more controlled and cheaper benchmark. We then propose a novel family of PO algorithms based on mirror descent, which we call Mirror Preference Optimization (MPO). Through evolutionary strategies, we search this class to discover algorithms specialized to specific properties of preference datasets, such as mixed-quality or noisy data. We demonstrate that our discovered PO algorithms outperform all known algorithms in the targeted MuJoCo settings. Finally, based on the insights gained from our MuJoCo experiments, we design a novel PO algorithm that significantly outperforms existing baselines in an LLM alignment task.
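The evolutionary-strategies search described above can be sketched as a simple mutate-and-select loop over a vector of objective parameters. This toy is an assumption about the general shape of such a search, not the paper's actual system: the real fitness would come from training and evaluating each candidate objective on the MuJoCo preference datasets, whereas here `fitness` is an arbitrary callable.

```python
import random

def evolve(fitness, init, sigma=0.1, pop=16, gens=20, seed=0):
    """Minimal (1+lambda) evolutionary search over loss parameters.

    Hypothetical sketch of the kind of search run over an MPO-style
    family: each generation mutates the best-so-far parameter vector
    with Gaussian noise and keeps any candidate with higher fitness.
    """
    rng = random.Random(seed)
    best, best_fit = list(init), fitness(init)
    for _ in range(gens):
        for _ in range(pop):
            cand = [x + rng.gauss(0.0, sigma) for x in best]
            f = fitness(cand)
            if f > best_fit:  # greedy elitist selection
                best, best_fit = cand, f
    return best, best_fit
```

Because candidate evaluation dominates the cost, a cheap and low-noise benchmark (the MuJoCo suite) is what makes this kind of search practical at all.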
Problem

Research questions and friction points this paper is trying to address.

Evaluate PO algorithms efficiently
Develop controlled, cheaper benchmarks
Propose novel PO algorithms for LLM alignment
Innovation

Methods, ideas, or system contributions that make the work stand out.

Mirror Preference Optimization algorithm
MuJoCo tasks for systematic evaluation
Evolutionary strategies for specialized algorithms