🤖 AI Summary
This work addresses the challenges of preference alignment and insufficient fine-grained optimization in text-to-image generation. We propose the first framework to integrate curriculum learning into direct preference optimization (DPO). Methodologically, we use a reward model to rank generated samples and quantify the difficulty of each pair via the rank difference between its members, enabling the dynamic construction of preference pairs ordered from easy to hard. These pairs are then used to progressively optimize diffusion or consistency models in stages, without reinforcement learning, thereby achieving gradual, fine-grained alignment improvements. Our approach achieves state-of-the-art performance on three benchmarks, evaluated in terms of text alignment, aesthetic quality, and human preference, outperforming existing fine-tuning methods while improving generalization and controllability under complex, multi-faceted preference constraints.
📝 Abstract
Direct Preference Optimization (DPO) has been proposed as an effective and efficient alternative to reinforcement learning from human feedback (RLHF). In this paper, we propose a novel and enhanced version of DPO based on curriculum learning for text-to-image generation. Our method is divided into two training stages. First, a ranking of the examples generated for each prompt is obtained by employing a reward model. Then, increasingly difficult pairs of examples are sampled and provided to a text-to-image generative (diffusion or consistency) model. Generated samples that are far apart in the ranking are considered to form easy pairs, while those that are close in the ranking form hard pairs. In other words, we use the rank difference between samples as a measure of difficulty. The sampled pairs are split into batches according to their difficulty levels, which are gradually used to train the generative model. Our approach, Curriculum DPO, is compared against state-of-the-art fine-tuning approaches on three benchmarks, outperforming the competing methods in terms of text alignment, aesthetics and human preference. Our code is available at https://anonymous.4open.science/r/Curriculum-DPO-EE14.
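The curriculum construction described above (rank samples per prompt with a reward model, treat the rank difference of a pair as its difficulty, and split pairs into easy-to-hard batches) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `curriculum_pairs`, the `num_levels` parameter, and the scalar reward scores are all hypothetical stand-ins for the paper's reward-model ranking.

```python
from itertools import combinations

def curriculum_pairs(rewards, num_levels=3):
    """Order preference pairs from easy to hard by rank difference.

    rewards: scalar reward-model scores, one per sample generated for a prompt.
    Returns num_levels lists of (winner_idx, loser_idx) pairs; earlier levels
    hold easier pairs, i.e. pairs whose samples are far apart in the ranking.
    """
    # Rank samples by reward (rank 0 = highest-scoring sample).
    order = sorted(range(len(rewards)), key=lambda i: -rewards[i])
    rank = {idx: r for r, idx in enumerate(order)}

    # Each unordered pair becomes (winner, loser); its difficulty is the
    # rank gap: a large gap means an easy pair, a small gap a hard one.
    pairs = []
    for i, j in combinations(range(len(rewards)), 2):
        winner, loser = (i, j) if rank[i] < rank[j] else (j, i)
        pairs.append((abs(rank[i] - rank[j]), winner, loser))

    # Sort easy-to-hard (largest gap first) and split into difficulty batches
    # that would be fed to DPO training in successive stages.
    pairs.sort(key=lambda t: -t[0])
    per_level = -(-len(pairs) // num_levels)  # ceiling division
    return [
        [(w, l) for _, w, l in pairs[k * per_level:(k + 1) * per_level]]
        for k in range(num_levels)
    ]

# Example: five samples for one prompt, with hypothetical reward scores.
scores = [0.9, 0.1, 0.5, 0.7, 0.3]
levels = curriculum_pairs(scores, num_levels=3)
```

In a full pipeline, each difficulty level would form one training stage, with the generative model fine-tuned via the DPO objective on the easy pairs first before moving on to the harder ones.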