More is Less: The Pitfalls of Multi-Model Synthetic Preference Data in DPO Safety Alignment

📅 2025-04-03

🤖 AI Summary

This work identifies a risk in using multi-model synthetic preference data for DPO-based safety alignment: although such data improves general capabilities, it exacerbates reward hacking and raises jailbreak attack success rates (ASR). Systematic experiments across the Llama, Mistral, and Qwen families point to the underlying mechanism: chosen and rejected responses in multi-model data are highly linearly separable, which lets models rely on superficial statistical cues rather than internalize safety constraints. In contrast, single-model self-generated preference data reduces ASR by up to 62% compared to multi-model configurations involving GPT-4o and consistently yields stronger safety robustness on jailbreak prompts. The study characterizes the link between multi-model preference data and safety degradation and positions single-model self-generation as the more reliable choice for safety-critical alignment.

📝 Abstract

Aligning large language models (LLMs) with human values is an increasingly critical step in post-training. Direct Preference Optimization (DPO) has emerged as a simple yet effective alternative to reinforcement learning from human feedback (RLHF). Synthetic preference data, with its low cost and high quality, enables effective alignment, whether it is generated by a single model or by multiple models. Our study reveals a striking, safety-specific phenomenon associated with DPO alignment: although multi-model generated data enhances performance on general tasks (ARC, HellaSwag, MMLU, TruthfulQA, WinoGrande) by providing diverse responses, it also tends to facilitate reward hacking during training. This can lead to a high attack success rate (ASR) when models encounter jailbreaking prompts. The issue is particularly pronounced when stronger models such as GPT-4o, or larger models in the same family, generate the chosen responses and these are paired with rejected responses self-generated by the target model, resulting in dramatically poorer safety outcomes. Furthermore, with respect to safety, using solely self-generated responses (single-model generation) for both chosen and rejected responses significantly outperforms configurations that incorporate responses from stronger models, whether used directly as chosen data or as part of a multi-model response pool. We demonstrate that multi-model preference data exhibits high linear separability between chosen and rejected responses, which allows models to exploit superficial cues rather than internalize robust safety constraints. Our experiments, conducted on models from the Llama, Mistral, and Qwen families, consistently validate these findings.
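
The linear separability mentioned in the abstract can be illustrated with a simple linear probe. The sketch below is only one plausible way to measure it, not the paper's stated procedure: it assumes each response has already been mapped to a fixed-length feature vector (for example, a pooled hidden state), and the function names `train_linear_probe` and `probe_accuracy` as well as the toy vectors are hypothetical. A probe accuracy near 1.0 would mean chosen and rejected responses can be told apart by a single hyperplane.

```python
import math


def _sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train_linear_probe(chosen, rejected, epochs=200, lr=0.5):
    """Fit logistic regression with full-batch gradient descent.

    chosen / rejected: lists of equal-length feature vectors (assumed inputs).
    Chosen responses get label 1, rejected responses get label 0.
    """
    data = [(x, 1.0) for x in chosen] + [(x, 0.0) for x in rejected]
    dim = len(data[0][0])
    w = [0.0] * dim
    b = 0.0
    n = len(data)
    for _ in range(epochs):
        grad_w = [0.0] * dim
        grad_b = 0.0
        for x, y in data:
            z = sum(wi * xi for wi, xi in zip(w, x)) + b
            err = _sigmoid(z) - y
            for i in range(dim):
                grad_w[i] += err * x[i]
            grad_b += err
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def probe_accuracy(chosen, rejected, w, b):
    """Fraction of responses the linear probe places on the correct side."""
    def score(x):
        return sum(wi * xi for wi, xi in zip(w, x)) + b

    correct = sum(1 for x in chosen if score(x) > 0)
    correct += sum(1 for x in rejected if score(x) <= 0)
    return correct / (len(chosen) + len(rejected))


# Toy, made-up features standing in for response embeddings.
chosen_feats = [[2.0, 0.1], [1.8, -0.2], [2.2, 0.3]]
rejected_feats = [[-2.0, 0.2], [-1.9, -0.1], [-2.1, 0.0]]
w, b = train_linear_probe(chosen_feats, rejected_feats)
print("probe accuracy:", probe_accuracy(chosen_feats, rejected_feats, w, b))
```

In practice the accuracy would be computed on held-out pairs rather than on the training pairs, so that the probe reflects structure in the data and not memorization.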

Problem

Research questions and friction points this paper is trying to address.

Multi-model synthetic data risks safety in DPO alignment

Stronger models in multi-model data increase attack success rate

Single-model data outperforms multi-model in safety outcomes
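
The attack success rate (ASR) used in these points is, in its usual form, the fraction of jailbreak prompts for which the model produces a non-refusing (harmful) response. The sketch below assumes that definition. The refusal-prefix check `looks_like_compliance` is a hypothetical stand-in for whatever judge the paper actually uses, and the example responses are made up.

```python
REFUSAL_PREFIXES = ("i'm sorry", "i cannot", "i can't", "sorry")


def looks_like_compliance(response):
    """Hypothetical judge: treat any response that does not open with a
    refusal phrase as a successful attack. Real evaluations typically use
    a stronger classifier or human review."""
    text = response.strip().lower()
    return not any(text.startswith(p) for p in REFUSAL_PREFIXES)


def attack_success_rate(responses, is_attack_success):
    """Fraction of jailbreak responses judged as successful attacks."""
    if not responses:
        raise ValueError("need at least one response")
    successes = sum(1 for r in responses if is_attack_success(r))
    return successes / len(responses)


# Made-up model outputs for four jailbreak prompts.
outputs = [
    "I'm sorry, I can't help with that.",
    "Sure, here is how you would do it...",
    "I cannot assist with this request.",
    "Sorry, but that is not something I can do.",
]
print("ASR:", attack_success_rate(outputs, looks_like_compliance))
```

A lower ASR indicates a safer model; the comparisons between single-model and multi-model data are made on this kind of ratio.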

Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses Direct Preference Optimization for alignment

Employs single-model synthetic preference data

Avoids multi-model data, whose high linear separability facilitates reward hacking
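
For readers unfamiliar with the objective behind these points, the standard DPO loss for one preference pair can be sketched as follows. This is the generic formulation from the DPO literature, not code from this paper: the inputs are summed token log-probabilities of the chosen and rejected responses under the policy and under a frozen reference model, and the function name `dpo_loss` and the default `beta=0.1` are assumptions made for illustration.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Per-pair DPO loss: -log(sigmoid(beta * (chosen_gain - rejected_gain))).

    Each *_logp argument is the summed log-probability of a full response.
    """
    # Implicit rewards: how much more the policy likes each response
    # than the reference model does, scaled by beta.
    chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)
    margin = chosen_reward - rejected_reward
    # -log(sigmoid(m)) == log(1 + exp(-m)), computed in a stable way.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Policy identical to the reference: margin is 0, loss equals log(2).
print(dpo_loss(-10.0, -12.0, -10.0, -12.0))
# Policy prefers the chosen response more than the reference does: lower loss.
print(dpo_loss(-8.0, -15.0, -10.0, -12.0))
```

The loss shrinks whenever the margin grows, regardless of why the model can tell the two responses apart. If chosen and rejected responses differ in easily detected surface features, as the paper argues is the case for multi-model data, the margin can be widened without learning the intended safety behavior.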
