Hummer: Towards Limited Competitive Preference Dataset

📅 2024-05-19

🏛️ arXiv.org

📈 Citations: 10

✨ Influential: 0

🤖 AI Summary

Existing preference datasets contain conflicts among their alignment objectives, which leaves models more vulnerable to jailbreak attacks and makes it hard to prioritize one objective in a downstream task without degrading the others. To address this, the authors propose Alignment Dimension Conflict (ADC), a novel statistical metric that quantifies the conflict between alignment objectives in a preference dataset. They further introduce Hummer, presented as the first preference dataset aimed at reducing competition between alignment objectives, together with its fine-grained variant Hummer-F. Hummer is built on UltraFeedback and refined with AI feedback from GPT-4. Alongside the datasets, they train two reward models, HummerRM and HummerRM-F, which use hybrid sampling to balance the alignment objectives. Experiments demonstrate that Hummer reduces intra-dataset alignment conflict and improves reward model robustness, cross-domain transferability, and resistance to jailbreaking. The work grounds multi-objective alignment modeling in conflict-aware dataset curation and decoupled objective learning.

📝 Abstract

Preference datasets are essential for incorporating human preferences into pre-trained language models and play a key role in the success of Reinforcement Learning from Human Feedback. However, these datasets often contain conflicting alignment objectives, which increases vulnerability to jailbreak attacks and makes it hard to adapt downstream tasks to prioritize specific alignment objectives without negatively impacting others. In this work, we introduce a novel statistical metric, Alignment Dimension Conflict, to quantify the degree of conflict within preference datasets. We then present Hummer and its fine-grained variant, Hummer-F, as pairwise preference datasets with reduced-conflict alignment objectives. Hummer is built on UltraFeedback and enhanced with AI feedback from GPT-4, making it the first preference dataset aimed at reducing the competition between alignment objectives. Furthermore, we develop reward models, HummerRM and HummerRM-F, which employ a hybrid sampling approach to balance diverse alignment objectives effectively. This sampling method positions HummerRM as an ideal model for domain-specific further fine-tuning and for reducing vulnerability to attacks.
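The abstract describes Alignment Dimension Conflict only as a statistical metric for the degree of conflict and does not give its formula here. The sketch below is therefore not the paper's definition. It illustrates one plausible way to score conflict, assuming we have per-dimension reward model accuracy before and after further fine-tuning on a single dimension. The function name `conflict_score`, the dimension names, and all numbers are hypothetical toy values chosen for illustration.

```python
from typing import Dict


def conflict_score(before: Dict[str, float], after: Dict[str, float], tuned: str) -> float:
    """Mean squared accuracy drop over the dimensions that were NOT fine-tuned.

    Hypothetical illustration only; this is not the paper's ADC formula.
    """
    others = [d for d in before if d != tuned]
    if not others:
        return 0.0
    # Count only degradations: gains on other dimensions contribute zero.
    drops = [min(after[d] - before[d], 0.0) for d in others]
    return sum(x * x for x in drops) / len(drops)


# Toy accuracies (made up): fine-tuning on "helpfulness" hurts "harmlessness".
before = {"helpfulness": 0.80, "harmlessness": 0.75, "honesty": 0.70}
after = {"helpfulness": 0.90, "harmlessness": 0.65, "honesty": 0.72}
score = conflict_score(before, after, tuned="helpfulness")
print(f"conflict score: {score:.4f}")
```

Under this toy measure, a dataset whose objectives compete would yield a larger score, because specializing on one dimension costs accuracy on the others, while a reduced-conflict dataset would keep the score near zero.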

Problem

Research questions and friction points this paper is trying to address.

Quantifying conflicts in preference datasets' alignment objectives

Developing reduced-conflict pairwise preference datasets

Creating reward models balancing diverse alignment objectives

Innovation

Methods, ideas, or system contributions that make the work stand out.

Introduced Alignment Dimension Conflict metric

Created Hummer datasets with reduced-conflict objectives

Developed hybrid sampling reward models
