AI Summary
Reward models are prone to relying on superficial shortcut cues during training, leading to fragile preference modeling. To address this issue, this work proposes DynaCF, a novel framework that introduces, for the first time, a dynamic counterfactual sensitivity measurement mechanism. During training, DynaCF continuously evaluates the sensitivity of each sample pair to counterfactual perturbations that preserve semantic content and dynamically down-weights highly sensitive samples in the Bradley-Terry objective. This approach enables online identification and suppression of shortcut learning by integrating counterfactual perturbation generation, dynamic reweighting, and real-time sensitivity tracking. Extensive experiments demonstrate that DynaCF significantly enhances the robustness of reward models and effectively mitigates their dependence on spurious correlations.
Abstract
Reward models trained from pairwise preferences often exploit superficial shortcut cues rather than learning true response quality. We propose DynaCF, a dynamic reweighting framework for mitigating shortcut learning in reward model training. Unlike static shortcut heuristics, DynaCF measures shortcut sensitivity online during optimization by applying semantics-preserving counterfactual perturbations and tracking the resulting margin shifts and preference flips under the current model. Samples with higher shortcut sensitivity are dynamically downweighted in the Bradley-Terry objective, encouraging the model to rely less on superficial patterns and more on task-relevant preference signals. Extensive experiments show that DynaCF consistently improves robustness in preference modeling.
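The reweighting idea can be illustrated with a minimal sketch. The abstract does not give the exact sensitivity score or weighting function, so everything specific below is an assumption made for illustration: the score combines the mean absolute margin shift with the preference-flip rate, the weight is `exp(-s / temperature)`, and the names `sensitivity`, `sample_weight`, and `reweighted_bt_loss` are hypothetical. A margin here means the reward of the chosen response minus the reward of the rejected response.

```python
import math


def bt_loss(margin):
    """Bradley-Terry negative log-likelihood for one pair: -log(sigmoid(margin))."""
    # Numerically stable softplus(-margin).
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


def sensitivity(margin, perturbed_margins, flip_weight=1.0):
    """Assumed score: mean absolute margin shift plus weighted flip rate.

    perturbed_margins holds the margins of the same pair after
    semantics-preserving counterfactual perturbations.
    """
    if not perturbed_margins:
        return 0.0
    n = len(perturbed_margins)
    shift = sum(abs(m - margin) for m in perturbed_margins) / n
    # A flip occurs when the perturbed margin disagrees in sign with the original.
    flips = sum(1 for m in perturbed_margins if (m > 0) != (margin > 0)) / n
    return shift + flip_weight * flips


def sample_weight(score, temperature=1.0):
    """Assumed weighting: decreasing in the score, equal to 1 when the score is 0."""
    return math.exp(-score / temperature)


def reweighted_bt_loss(margins, perturbed, temperature=1.0):
    """Weighted mean of per-pair Bradley-Terry losses.

    In a real training loop the weights would be treated as constants
    (no gradient flows through them) and recomputed under the current model.
    """
    weights = [
        sample_weight(sensitivity(m, p), temperature)
        for m, p in zip(margins, perturbed)
    ]
    total = sum(weights)
    return sum(w * bt_loss(m) for w, m in zip(weights, margins)) / total
```

With this choice, a pair whose margin is unchanged by every perturbation keeps weight 1, while a pair whose preference flips under perturbation contributes less to the objective. The temperature controls how aggressively sensitive pairs are suppressed.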