🤖 AI Summary
This work addresses the challenge of gradient dilution in conventional sequence-level preference optimization for structured generation tasks, where preference and rejected samples often differ only at critical schema tokens. To enable precise alignment at these decisive points, the authors propose a token-level preference optimization framework that integrates a confusion-aware negative sample construction strategy with a confidence-gated adaptive margin mechanism, specifically targeting ontological decision errors. The approach is compatible with mainstream large language model architectures such as Llama and Qwen. Evaluated on the SciERC benchmark, it achieves an 11.59% absolute improvement in key semantic label and relation linking metrics over strong baselines, surpassing the current state-of-the-art by 14.71%, while also enhancing text grounding capabilities.
📝 Abstract
Direct Preference Optimization is an offline post-SFT method for aligning language models from preference pairs, with strong results in instruction following and summarization. However, DPO's sequence-level implicit reward can be brittle for token-critical structured prediction settings such as medical annotation, which often exhibit (i) low-separation preference pairs, where chosen and rejected completions differ by minimal edit distance (often 1-3 tokens), and (ii) token-importance skew, where sparse semantic tokens (hierarchical labels and evidence Spans) carry disproportionate task importance relative to high-frequency structural tokens (JSON scaffolding). In this regime, standard DPO suffers from margin collapse (insufficient log-probability separation between near-identical preferences), likelihood squeezing (the margin objective shifts the absolute likelihoods of both completions together), and gradient dilution, where uniform sequence-level weighting diffuses learning signal across shared scaffolding while rare, confusable label tokens receive weak, noisy updates. We introduce Token-Adaptive Barrier Preference Optimization (TAB-PO), which augments DPO with token-weighted, reference-adjusted advantages that prioritize high-value semantic tokens, and a conditional token-level barrier that regularizes under-confident tokens balancing SFT-anchored likelihood and preference-driven separation in low-separation, importance-skewed regimes. We evaluate TAB-PO on medical communication annotation, a task requiring joint prediction of hierarchical labels and evidence Spans from patient-provider messages. TAB-PO achieves a ~ 4% relative improvement in micro-F1 over SFT and consistently outperforms recent preference-optimization baselines.