DOG-DPO: Dynamic Optimization in Geometry for Safety Alignment

📅 2026-06-04

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing safety alignment methods for large language models often struggle to balance global safety objectives against dataset-specific residual risks when trained on multiple datasets, because the preference data is redundant and directional information is ignored. The authors propose a training-free data selection framework that, for the first time, models preference pairs as geometric directions in the model’s representation space. The method decomposes this space into a global anchor subspace and dataset-specific residual subspaces, which decouples shared alignment directions from dataset-specific ones and lets the selection cover both broadly and without redundancy. Using only 11% of the original preference data, the approach achieves performance comparable to full-data training across six safety benchmarks and two model backbones, significantly outperforming current baselines while maintaining high computational efficiency.

📝 Abstract

Safety alignment for large language models relies on preference data, but current pipelines often train on large, redundant datasets. Existing data selection methods typically score each preference pair independently, collapsing directional preference information into scalar quality or diversity scores. This sample-centric view is especially limiting in multi-dataset settings, where shared safety directions coexist with dataset-specific residual risks. We propose DOG-DPO, a training-free data selection framework that treats preference pairs as structured geometric signals. DOG-DPO first represents each preference pair as a direction in model representation space. It then decomposes multi-dataset preference geometry into a global anchor subspace and dataset-specific residual subspaces. Finally, it selects subsets by maximizing diversity-based coverage, encouraging broad, non-redundant coverage of alignment directions before DPO training. Across six safety benchmarks and two model backbones, DOG-DPO achieves a strong utility-robustness trade-off using only 11% of the preference pairs. It recovers most of the safety gains of full-data training while remaining entirely teacher-free, training-free, and substantially faster than representative selection baselines.
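The three stages described in the abstract (pair directions, subspace decomposition, diversity-based selection) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. The abstract does not specify how directions are computed, how the anchor subspace is estimated, or which coverage objective is maximized, so every concrete choice below is an assumption made for illustration: directions are normalized differences between chosen and rejected response embeddings, the anchor subspace is spanned by the top singular vectors of the pooled directions, and selection is a greedy farthest-point heuristic applied to the residuals only. All function names and the `anchor_dim` and `fraction` parameters are hypothetical.

```python
import numpy as np


def preference_directions(chosen, rejected):
    """Assumed direction: unit-norm difference of chosen and rejected embeddings."""
    d = chosen - rejected
    norms = np.linalg.norm(d, axis=1, keepdims=True)
    return d / np.clip(norms, 1e-12, None)


def global_anchor_basis(directions, k):
    """Assumed anchor subspace: top-k right singular vectors, shape (k, dim)."""
    _, _, vt = np.linalg.svd(directions, full_matrices=False)
    return vt[:k]


def residual_directions(directions, basis):
    """Subtract the component of each direction that lies in the anchor subspace."""
    return directions - directions @ basis.T @ basis


def greedy_diverse_subset(vectors, budget):
    """Farthest-point greedy selection under cosine distance (a stand-in objective)."""
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    unit = vectors / np.clip(norms, 1e-12, None)
    target = min(budget, len(vectors))
    first = int(np.argmax(norms[:, 0]))  # start from the largest-norm vector
    selected = [first]
    min_dist = 1.0 - unit @ unit[first]  # distance to the nearest selected vector
    min_dist[first] = -np.inf            # never pick the same index twice
    while len(selected) < target:
        nxt = int(np.argmax(min_dist))
        selected.append(nxt)
        min_dist = np.minimum(min_dist, 1.0 - unit @ unit[nxt])
        min_dist[nxt] = -np.inf
    return selected


def select_pairs(datasets, anchor_dim=4, fraction=0.11):
    """datasets maps a name to (chosen_emb, rejected_emb); returns name -> indices."""
    dirs = {name: preference_directions(c, r) for name, (c, r) in datasets.items()}
    pooled = np.vstack(list(dirs.values()))
    basis = global_anchor_basis(pooled, anchor_dim)
    picks = {}
    for name, d in dirs.items():
        budget = max(1, int(round(fraction * len(d))))
        picks[name] = greedy_diverse_subset(residual_directions(d, basis), budget)
    return picks
```

In this sketch the selected indices would then be used to build the reduced preference set for DPO training. Selecting only in the residual space is a simplification; a fuller version would presumably also spend part of the budget on coverage within the anchor subspace.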

Problem

Research questions and friction points this paper is trying to address.

safety alignment

preference data

data selection

multi-dataset

directional preference

Innovation

Methods, ideas, or system contributions that make the work stand out.

geometry-aware selection

preference direction

subspace decomposition