2D-Curri-DPO: Two-Dimensional Curriculum Learning for Direct Preference Optimization

📅 2025-04-10
📈 Citations: 0
✨ Influential: 0
📄 PDF
🤖 AI Summary
Traditional direct preference optimization (DPO) relies on single preference pairs, while existing curriculum learning approaches (e.g., Curriculum-DPO) model only response distinguishability, neglecting prompt semantic complexity. To address these limitations, the authors propose 2D-Curri-DPO, a two-dimensional curriculum learning framework. The method jointly models prompt complexity and response-pair distinguishability to define a dual difficulty metric; introduces a selectable curriculum strategy space and a KL-divergence-driven adaptive reference model update mechanism; and integrates dynamic KL regularization with semantic-complexity-aware training. Extensive evaluation on MT-Bench, Vicuna Bench, WizardLM, and UltraFeedback shows that 2D-Curri-DPO significantly outperforms standard DPO and Curriculum-DPO, achieving state-of-the-art performance on UltraFeedback. Ablation studies validate the effectiveness of both the two-dimensional difficulty modeling and the adaptive mechanisms.

๐Ÿ“ Abstract
Aligning large language models with human preferences is crucial for their safe deployment. While Direct Preference Optimization (DPO) offers an efficient alternative to reinforcement learning from human feedback, traditional DPO methods are limited by their reliance on single preference pairs. Recent work like Curriculum-DPO integrates multiple pairs using a one-dimensional difficulty curriculum based on pairwise distinguishability (PD), but overlooks the complexity of the input prompt itself. To address this, we propose 2D-Curri-DPO, a novel framework employing a two-dimensional curriculum that jointly models Prompt Complexity (PC) and Pairwise Distinguishability. This framework introduces dual difficulty metrics to quantify prompt semantic complexity and response preference clarity, defines a curriculum strategy space encompassing multiple selectable strategies for task adaptation, and incorporates a KL-divergence-based adaptive mechanism for dynamic reference model updates to enhance training stability. Comprehensive experiments demonstrate that 2D-Curri-DPO significantly outperforms standard DPO and prior curriculum methods across multiple benchmarks, including MT-Bench, Vicuna Bench, and WizardLM. Our approach achieves state-of-the-art performance on challenging test sets like UltraFeedback. Ablation studies confirm the benefits of the 2D structure and adaptive mechanisms, while analysis provides guidance for strategy selection. These findings demonstrate that effective alignment requires modeling both prompt complexity and pairwise distinguishability, establishing adaptive, multi-dimensional curriculum learning as a powerful and interpretable new paradigm for preference-based language model optimization.
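The two-dimensional curriculum described in the abstract can be pictured as a simple ordering rule: each preference pair receives a scalar difficulty that rises with Prompt Complexity (PC) and falls with Pairwise Distinguishability (PD), and training visits pairs easy-to-hard. The following minimal sketch assumes normalized scores in [0, 1], a linear mixing weight `alpha`, and toy field names; these are illustrative choices, not the paper's exact metrics or strategy space.

```python
def curriculum_order(pairs, alpha=0.5):
    """Order preference pairs easy-to-hard along two difficulty axes.

    Each pair is a dict with hypothetical fields (assumed, not from the paper):
      'pc' - prompt complexity in [0, 1] (higher = harder prompt)
      'pd' - pairwise distinguishability in [0, 1] (higher = clearer preference)
    Difficulty rises with prompt complexity and falls with distinguishability.
    """
    def difficulty(p):
        return alpha * p['pc'] + (1 - alpha) * (1 - p['pd'])
    return sorted(pairs, key=difficulty)

pairs = [
    {'id': 'a', 'pc': 0.9, 'pd': 0.2},  # complex prompt, ambiguous pair -> hard
    {'id': 'b', 'pc': 0.1, 'pd': 0.9},  # simple prompt, clear pair -> easy
    {'id': 'c', 'pc': 0.5, 'pd': 0.5},  # middling on both axes
]
print([p['id'] for p in curriculum_order(pairs)])  # prints ['b', 'c', 'a']
```

The one-dimensional Curriculum-DPO baseline corresponds to fixing `alpha = 0`, i.e., sorting by distinguishability alone; the 2D framework's strategy space amounts to different traversal orders over the (PC, PD) grid.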
Problem

Research questions and friction points this paper is trying to address.

Aligning large language models with human preferences efficiently
Overcoming limitations of single preference pairs in DPO methods
Addressing prompt complexity and pairwise distinguishability jointly
Innovation

Methods, ideas, or system contributions that make the work stand out.

Two-dimensional curriculum learning for DPO
Dual difficulty metrics for prompt and response
KL-divergence adaptive reference model updates
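The KL-divergence-driven reference update listed above can be sketched as: monitor the mean KL between the current policy and the frozen reference on recent batches, and snapshot the policy into the reference once drift exceeds a threshold. The hard-sync rule, the threshold value, and the toy categorical distributions below are illustrative assumptions, not the paper's exact mechanism.

```python
import math

def mean_kl(policy_probs, ref_probs):
    """Mean KL(policy || reference) over a batch of categorical distributions."""
    total = 0.0
    for p, q in zip(policy_probs, ref_probs):
        total += sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return total / len(policy_probs)

def maybe_update_reference(policy_state, ref_state, policy_probs, ref_probs,
                           threshold=0.1):
    """Sync the reference model to the policy once drift exceeds threshold.

    `policy_state` / `ref_state` stand in for model weights; a real
    implementation would copy parameters (e.g., load_state_dict).
    """
    if mean_kl(policy_probs, ref_probs) > threshold:
        ref_state.update(policy_state)  # reference <- policy snapshot
        return True
    return False
```

Updating the reference keeps the DPO KL anchor close to the current policy, which is the stability motivation the summary points to; the dynamic KL regularization would additionally adjust the DPO temperature-like coefficient during training.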