🤖 AI Summary
DPO relies on isolated, independently generated winning/losing response pairs, resulting in weak semantic correlation between them and limiting alignment performance. To address this, we propose BMC, the first framework to explicitly model fine-grained response associations in pairwise preference learning. BMC operates in two complementary ways: (1) it synthesizes semantically consistent pseudo-winning responses conditioned on reference winning responses to enhance signal coherence; and (2) it dynamically weights the loss at the token level using token-wise confidence scores derived from the policy model, enabling confidence-aware, token-level association learning. BMC is fully compatible with existing DPO variants and achieves significant improvements over strong baselines across QA, mathematical reasoning, and instruction-following tasks. Ablation studies and quantitative analysis confirm that gains stem directly from BMC’s enhanced capacity to model response correlations.
📝 Abstract
Direct preference optimization (DPO), a widely adopted offline preference optimization algorithm, aims to align large language models (LLMs) with human-desired behaviors using pairwise preference data. However, the winning and losing responses within pairwise data are typically generated in isolation, leading to weak correlations between them as well as suboptimal alignment performance. To address this issue, we propose an effective framework for Bridging and Modeling Correlations in pairwise data, named BMC. Firstly, we increase the consistency and informativeness of the pairwise preference signals through targeted modifications, synthesizing a pseudo-winning response by improving the losing response with the winning response as a reference. Secondly, we identify that DPO alone is insufficient to model these correlations and capture nuanced variations. Therefore, we propose learning token-level correlations by dynamically leveraging the policy model's confidence during training. Comprehensive experiments on QA, math, and instruction-following tasks demonstrate the effectiveness of our approach, which significantly surpasses competitive baselines, including DPO. Additionally, our in-depth quantitative analysis reveals the reasons behind our method's superior performance over DPO and showcases its versatility when applied to other DPO variants. We release our repository at https://github.com/YJiangcm/BMC.
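To make the token-level weighting idea concrete, below is a minimal sketch of a confidence-weighted DPO-style loss in plain Python. It assumes per-token log-probabilities of each response under the policy and reference models are already available; the specific weighting scheme shown (normalized `exp` of the policy's token log-probs) is an illustrative stand-in, not the paper's exact formulation.

```python
import math


def _sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def _confidence_weights(policy_logps: list[float]) -> list[float]:
    # Hypothetical weighting: token confidence = exp(log-prob) under the
    # policy, rescaled so the weights average to 1 over the sequence.
    conf = [math.exp(lp) for lp in policy_logps]
    total = sum(conf)
    return [len(conf) * c / total for c in conf]


def confidence_weighted_dpo_loss(
    policy_win: list[float], ref_win: list[float],
    policy_lose: list[float], ref_lose: list[float],
    beta: float = 0.1,
) -> float:
    """DPO loss with token-level weights from policy confidence (a sketch).

    Each argument is a list of per-token log-probabilities for one response.
    """
    w_win = _confidence_weights(policy_win)
    w_lose = _confidence_weights(policy_lose)
    # Weighted sum of per-token log-ratios log pi(t) - log pi_ref(t).
    margin_win = sum(w * (p - r) for w, p, r in zip(w_win, policy_win, ref_win))
    margin_lose = sum(w * (p - r) for w, p, r in zip(w_lose, policy_lose, ref_lose))
    # Standard DPO objective: -log sigma(beta * (winner margin - loser margin)).
    return -math.log(_sigmoid(beta * (margin_win - margin_lose)))
```

When the policy matches the reference on both responses, both margins are zero and the loss reduces to `log 2`; as the policy assigns relatively more probability to the winning response, the loss decreases. In practice this would be computed on batched tensors (e.g., in PyTorch) rather than Python lists.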