2D-Curri-DPO: Two-Dimensional Curriculum Learning for Direct Preference Optimization

📅 2025-04-10
📈 Citations: 0
✨ Influential: 0
📄 PDF
🤖 AI Summary
Traditional direct preference optimization (DPO) relies on single preference pairs, while existing curriculum learning approaches (e.g., Curriculum-DPO) model only response distinguishability, neglecting prompt semantic complexity. To address these limitations, the authors propose 2D-Curri-DPO, a two-dimensional curriculum learning framework. The method jointly models prompt complexity and response-pair distinguishability to define a dual difficulty metric; introduces a selectable curriculum strategy space and a KL-divergence-driven adaptive reference model update mechanism; and integrates dynamic KL regularization with semantic-complexity-aware training. Extensive evaluation on MT-Bench, Vicuna Bench, WizardLM, and UltraFeedback shows that 2D-Curri-DPO significantly outperforms standard DPO and Curriculum-DPO, achieving state-of-the-art performance on UltraFeedback. Ablation studies validate the effectiveness of both the two-dimensional difficulty modeling and the adaptive mechanisms.

๐Ÿ“ Abstract
Aligning large language models with human preferences is crucial for their safe deployment. While Direct Preference Optimization (DPO) offers an efficient alternative to reinforcement learning from human feedback, traditional DPO methods are limited by their reliance on single preference pairs. Recent work like Curriculum-DPO integrates multiple pairs using a one-dimensional difficulty curriculum based on pairwise distinguishability (PD), but overlooks the complexity of the input prompt itself. To address this, we propose 2D-Curri-DPO, a novel framework employing a two-dimensional curriculum that jointly models Prompt Complexity (PC) and Pairwise Distinguishability. This framework introduces dual difficulty metrics to quantify prompt semantic complexity and response preference clarity, defines a curriculum strategy space encompassing multiple selectable strategies for task adaptation, and incorporates a KL-divergence-based adaptive mechanism for dynamic reference model updates to enhance training stability. Comprehensive experiments demonstrate that 2D-Curri-DPO significantly outperforms standard DPO and prior curriculum methods across multiple benchmarks, including MT-Bench, Vicuna Bench, and WizardLM. Our approach achieves state-of-the-art performance on challenging test sets like UltraFeedback. Ablation studies confirm the benefits of the 2D structure and adaptive mechanisms, while analysis provides guidance for strategy selection. These findings demonstrate that effective alignment requires modeling both prompt complexity and pairwise distinguishability, establishing adaptive, multi-dimensional curriculum learning as a powerful and interpretable new paradigm for preference-based language model optimization.
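The two-dimensional curriculum described in the abstract can be pictured as a simple ordering rule: each preference pair receives a scalar difficulty that rises with Prompt Complexity (PC) and falls with Pairwise Distinguishability (PD), and training visits pairs easy-to-hard. The following minimal sketch assumes normalized scores in [0, 1], a linear mixing weight `alpha`, and toy field names; these are illustrative choices, not the paper's exact metrics or strategy space.

```python
def curriculum_order(pairs, alpha=0.5):
    """Order preference pairs easy-to-hard along two difficulty axes.

    Each pair is a dict with hypothetical fields (assumed, not from the paper):
      'pc' - prompt complexity in [0, 1] (higher = harder prompt)
      'pd' - pairwise distinguishability in [0, 1] (higher = clearer preference)
    Difficulty rises with prompt complexity and falls with distinguishability.
    """
    def difficulty(p):
        return alpha * p['pc'] + (1 - alpha) * (1 - p['pd'])
    return sorted(pairs, key=difficulty)

pairs = [
    {'id': 'a', 'pc': 0.9, 'pd': 0.2},  # complex prompt, ambiguous pair -> hard
    {'id': 'b', 'pc': 0.1, 'pd': 0.9},  # simple prompt, clear pair -> easy
    {'id': 'c', 'pc': 0.5, 'pd': 0.5},  # middling on both axes
]
print([p['id'] for p in curriculum_order(pairs)])  # prints ['b', 'c', 'a']
```

The one-dimensional Curriculum-DPO baseline corresponds to fixing `alpha = 0`, i.e., sorting by distinguishability alone; the 2D framework's strategy space amounts to different traversal orders over the (PC, PD) grid.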
Problem

Research questions and friction points this paper is trying to address.

Aligning large language models with human preferences efficiently
Overcoming limitations of single preference pairs in DPO methods
Addressing prompt complexity and pairwise distinguishability jointly
Innovation

Methods, ideas, or system contributions that make the work stand out.

Two-dimensional curriculum learning for DPO
Dual difficulty metrics for prompt and response
KL-divergence adaptive reference model updates
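The KL-divergence-driven reference update listed above can be sketched as: monitor the mean KL between the current policy and the frozen reference on recent batches, and snapshot the policy into the reference once drift exceeds a threshold. The hard-sync rule, the threshold value, and the toy categorical distributions below are illustrative assumptions, not the paper's exact mechanism.

```python
import math

def mean_kl(policy_probs, ref_probs):
    """Mean KL(policy || reference) over a batch of categorical distributions."""
    total = 0.0
    for p, q in zip(policy_probs, ref_probs):
        total += sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return total / len(policy_probs)

def maybe_update_reference(policy_state, ref_state, policy_probs, ref_probs,
                           threshold=0.1):
    """Sync the reference model to the policy once drift exceeds threshold.

    `policy_state` / `ref_state` stand in for model weights; a real
    implementation would copy parameters (e.g., load_state_dict).
    """
    if mean_kl(policy_probs, ref_probs) > threshold:
        ref_state.update(policy_state)  # reference <- policy snapshot
        return True
    return False
```

Updating the reference keeps the DPO KL anchor close to the current policy, which is the stability motivation the summary points to; the dynamic KL regularization would additionally adjust the DPO temperature-like coefficient during training.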