🤖 AI Summary
Existing preference optimization methods (e.g., DPO) are typically evaluated on a single response per prompt, which ignores the other plausible outputs in the hypothesis space and gives a biased, incomplete assessment. To address this, we propose HEAL, a framework that reformulates the evaluation of preference alignment as a re-ranking problem over the hypothesis space, measured by two complementary metrics: ranking accuracy and correlation with preference strength. We introduce UniHypoBench, a unified hypothesis-space benchmark, to systematically examine how mainstream methods suppress negative samples and capture proxy-model preferences. Experiments based on HEAL show that current methods capture the preferences provided by proxy models while also suppressing negative samples. The framework offers hypothesis-space analysis as a theoretical lens on alignment, together with practical diagnostic tools to guide the design of alignment methods that capture preferences more comprehensively.
📝 Abstract
Preference optimization methods like DPO have achieved remarkable performance in LLM alignment. However, the evaluation of these methods relies on a single response and overlooks the other potential outputs in the hypothesis space, which could also be generated in real-world applications. To address this issue, this paper presents a Hypothesis-based PrEference-aware AnaLysis Framework (HEAL), a novel evaluation paradigm that formulates preference alignment as a re-ranking process within hypothesis spaces. The framework incorporates two complementary metrics: ranking accuracy, which evaluates ordinal consistency, and preference strength correlation, which assesses continuous alignment. To support this framework, we develop UniHypoBench, a unified hypothesis benchmark constructed from diverse instruction-response pairs. Through extensive experiments based on HEAL, with a particular focus on the intrinsic mechanisms of preference learning, we demonstrate that current preference learning methods can effectively capture preferences provided by proxy models while simultaneously suppressing negative samples. These findings contribute to preference learning research in two ways. Theoretically, we introduce hypothesis space analysis as an innovative paradigm for understanding preference alignment. Practically, HEAL offers researchers robust diagnostic tools for refining preference optimization methods, while our empirical results identify promising directions for developing more advanced alignment algorithms capable of comprehensive preference capture.
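
The two metrics named in the abstract can be illustrated with a minimal sketch. The abstract does not give their exact definitions, so everything below is an assumption made for illustration: ranking accuracy is taken as the fraction of hypothesis pairs that the policy orders the same way as the proxy model, and preference strength correlation is taken as the Pearson correlation between pairwise score gaps. The function names and the example scores are hypothetical and are not taken from the paper.

```python
from itertools import combinations
from math import sqrt


def ranking_accuracy(policy_scores, proxy_scores):
    """Fraction of hypothesis pairs ordered the same way by policy and proxy.

    Assumed definition (not from the paper): pairwise ordinal agreement.
    """
    agree = 0
    total = 0
    for i, j in combinations(range(len(proxy_scores)), 2):
        proxy_gap = proxy_scores[i] - proxy_scores[j]
        if proxy_gap == 0:
            continue  # the proxy expresses no preference for this pair
        total += 1
        policy_gap = policy_scores[i] - policy_scores[j]
        if policy_gap * proxy_gap > 0:
            agree += 1
    return agree / total if total else float("nan")


def preference_strength_correlation(policy_scores, proxy_scores):
    """Pearson correlation between pairwise policy gaps and proxy gaps.

    Assumed definition (not from the paper): a continuous counterpart
    to ranking accuracy that also rewards matching the size of each gap.
    """
    pairs = list(combinations(range(len(proxy_scores)), 2))
    x = [policy_scores[i] - policy_scores[j] for i, j in pairs]
    y = [proxy_scores[i] - proxy_scores[j] for i, j in pairs]
    if not pairs:
        return float("nan")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return float("nan")  # correlation is undefined for constant gaps
    return cov / sqrt(var_x * var_y)


# Hypothetical scores for four candidate responses to one instruction:
# policy log-likelihoods and proxy-model rewards (illustrative values only).
policy_scores = [-1.2, -2.5, -0.8, -3.1]
proxy_scores = [0.7, 0.2, 0.9, 0.4]

print(ranking_accuracy(policy_scores, proxy_scores))
print(preference_strength_correlation(policy_scores, proxy_scores))
```

Under these assumptions, a policy can score well on ranking accuracy while scoring poorly on the correlation metric if it orders hypotheses correctly but misjudges how strongly one is preferred over another, which is one reading of why the abstract calls the two metrics complementary.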