🤖 AI Summary
This paper identifies the “reference mismatch” problem in text-to-image diffusion model alignment: preference-based methods such as DPO suffer substantial performance degradation when the reference model diverges from the target model. To address this, we propose **Margin-aware Preference Optimization (MaPO)**, a reference-free framework that models pairwise preferences via the Bradley–Terry model and directly optimizes the likelihood margin between preferred and dispreferred samples, eliminating reliance on a reference model altogether. MaPO introduces the first margin-optimization paradigm for diffusion alignment, with gains that amplify as reference mismatch intensifies. Experiments demonstrate that MaPO consistently outperforms DPO and DreamBooth across five key alignment tasks: safe generation, style transfer, cultural representation, personalization, and general-purpose alignment. Moreover, MaPO achieves 15% faster training and reduced GPU memory consumption.
📝 Abstract
Modern preference alignment methods, such as DPO, rely on divergence regularization toward a reference model for training stability, but this creates a fundamental problem we call "reference mismatch." In this paper, we investigate the negative impact of reference mismatch when aligning text-to-image (T2I) diffusion models, showing that larger reference mismatch hinders effective adaptation given the same amount of data, e.g., when learning new artistic styles or personalizing to specific objects. We demonstrate this phenomenon across T2I diffusion models and introduce margin-aware preference optimization (MaPO), a reference-agnostic approach that breaks free from this constraint. By directly optimizing the likelihood margin between preferred and dispreferred outputs under the Bradley–Terry model, without anchoring to a reference, MaPO casts diverse T2I tasks as unified pairwise preference optimization. We validate MaPO's versatility across five challenging domains: (1) safe generation, (2) style adaptation, (3) cultural representation, (4) personalization, and (5) general preference alignment. Our results reveal that MaPO's advantage grows dramatically with reference mismatch severity, outperforming both DPO and specialized methods like DreamBooth while reducing training time by 15%. MaPO thus emerges as a versatile and memory-efficient method for generic T2I adaptation tasks.
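To make the reference-free idea concrete, here is a minimal sketch of a Bradley–Terry margin loss of the kind described above. It is illustrative only: the function name, the temperature `beta`, and the use of scalar log-likelihoods in place of diffusion-model likelihood terms are assumptions, not the paper's exact objective. The key property it demonstrates is that, unlike DPO, no reference-model log-probabilities appear anywhere in the loss.

```python
import math

def margin_preference_loss(logp_preferred: float, logp_dispreferred: float,
                           beta: float = 1.0) -> float:
    """Illustrative reference-free Bradley-Terry margin loss.

    The loss depends only on the target model's own likelihood margin
    between the preferred and dispreferred sample; there is no anchoring
    to a frozen reference model. `beta` is a hypothetical temperature
    scaling the margin.
    """
    margin = logp_preferred - logp_dispreferred
    # -log sigmoid(beta * margin): small when the preferred sample is
    # assigned much higher likelihood than the dispreferred one.
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))
```

Note the contrast with DPO, whose logits are *differences of differences* against a reference model; here the gradient pushes directly on the margin itself, which is why the objective is unaffected by how far the initialization sits from any reference.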